Abstract: A vital step in river provenance analysis is identifying and quantifying the components of sand and sediment. Traditional manual counting is not only time consuming and laborious but also yields data of uneven quality. Because these data are generated by different laboratories using different processing standards, they often cannot be compared. Although automatic identification through machine learning can potentially relieve geologists of such tedious work, it requires a large number of microscopic images for training. To facilitate data disclosure and sharing, the authors hereby publish a photomicrograph dataset of sand grains obtained from the Yarlung Tsangpo, Tibet, China. The dataset comprises 8,734 tagged clastic particle images and corresponding coordinate information files, 1,878 sand microscope images, 120 numbered base maps, and two tables for sand composition identification. We hope that it will provide a good foundation for training automatic sand component identification. Furthermore, it provides a reference for the identification of detrital components in other river sands.
Keywords: sand grains; photomicrograph; sedimentology; machine learning; Yarlung Tsangpo; river sand
|Title||Photomicrograph dataset of sand grains from the Yarlung Tsangpo in Tibet|
|Data corresponding author||Hu Xiumian (firstname.lastname@example.org)|
|Data authors||Dong Xiaolong, Hu Xiumian, Lai Wen|
|Time range||Modern river sand samples were collected in June 2016; Polarized photomicrographs of thin sections were taken in 2019.|
|Geographical scope||The sampling site is located at the trunk river of Yarlung Tsangpo in Xigaze, Tibet; GPS: 29°19′13.5″N & 88°51′28.4″E.|
|Polarized microscope resolution||4908 × 3264 pixels|
|Data volume||10.3 GB|
|Data format||*.xml; *.jpg; *.xls|
|Data service system||<https://dx.doi.org/10.11922/sciencedb.j00001.00035>|
|Source of funding||The Second Tibetan Plateau Scientific Expedition and Research Program, Ministry of Science and Technology, China (Grant No. 2019QZKK0204).|
|Dataset composition||The dataset includes three data files: “photomicrographs for labeled single grain.zip”, “labeled base map.zip”, and “information table of single grain.xls”. (1) “Photomicrographs for labeled single grain.zip” stores the coordinates of all sand grains (*.xml) and their 1876 polarized photomicrographs (*.jpg), with a data volume of 9.49 GB; (2) “Labeled base map.zip” stores the serial numbers of the grains and the corresponding photomicrograph fields of view, comprising 120 photos with a data volume of 911 MB; (3) “Information table of single grain.xls” contains the data sheets for the identification of sand grains in the thin sections, with a data volume of 162 KB.|
The composition and content of detritus in sand or sandstone are a solid foundation for judging the source of samples. In the traditional approach, samples are cut into standard thin sections, and more than 400 grains are counted under a polarizing microscope using the Gazzi–Dickinson method. However, this point-by-point recognition is not only time consuming but also laborious; furthermore, it is easily influenced by the subjective knowledge and experience of the operator. As a result, datasets obtained by different laboratories are difficult to compare. Accordingly, a new method is required that will free geologists from tedious and time-consuming detrital counting and thus improve their work efficiency.
In recent years, computer-aided methods based on machine learning have been applied to the automatic identification of coal components, ore minerals, and heavy minerals, not only to reduce the heavy workloads of geologists but also to improve identification accuracy and make data from different laboratories comparable. First, the photomicrograph classification method based on machine learning extracts photomicrograph features, such as color, cleavage, structure, and shape, and constructs a representation of the geological photomicrographs in feature space. Then, a machine learning algorithm learns the differences between the features of different grain types, and a feature classifier is constructed to automatically identify, classify, and count detritus based on photomicrographs.
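The two-step workflow described above (feature extraction, then a classifier) can be illustrated with a deliberately simple sketch. It assumes a mean-color feature and a nearest-centroid classifier, neither of which is taken from the cited studies, and the pixel values and labels are invented toy data.

```python
import math


def mean_color(pixels):
    """Feature vector: the mean (R, G, B) over the pixels of one grain crop."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def fit_centroids(features, labels):
    """Learn one centroid per class by averaging that class's feature vectors."""
    grouped = {}
    for vec, label in zip(features, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(3))
        for label, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Invented toy crops: two bright "Qm" grains and two dark "Opaque" grains.
crops = [
    [(230, 230, 225), (240, 238, 236)],
    [(220, 225, 221), (235, 232, 230)],
    [(20, 18, 22), (30, 28, 25)],
    [(15, 15, 15), (25, 22, 20)],
]
labels = ["Qm", "Qm", "Opaque", "Opaque"]

centroids = fit_centroids([mean_color(c) for c in crops], labels)
print(predict(centroids, mean_color([(210, 215, 212)])))  # Qm
```

A practical system would use far richer features (texture, shape, paired plane- and crossed-polarized views) and a stronger classifier; the sketch only shows how labeled grains feed the two steps.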
Automatic identification of detrital components from photomicrographs requires a large number of labeled photomicrographs for training in the early stage. However, such data remain scarce, and many researchers struggle to find a published, labeled photomicrograph dataset suitable for deep learning. Following the principles of data sharing and open use, this study collates and shares photomicrographs that were individually photographed and labeled at considerable cost in time and effort.
The river sand sample 16A063 (Fig. 1) was collected from the active river bars of the Yarlung Tsangpo River in June 2016. The sampling standard follows existing literature[5-8]. The sand sample weighed roughly 2 kg and was divided into two parts: 16A063-1 and 16A063-2. Sample 16A063-1 underwent wet sieving to obtain a grain size of 63–2,000 μm, while sample 16A063-2 underwent wet sieving to obtain a grain size of 63–500 μm. The samples were sent to Hebei Langfang Chengxin Geological Service Co. Ltd., where they were cut into standard thin sections. The grains of samples 16A063-1 and 16A063-2 were mounted in blue epoxy resin and colorless epoxy resin, respectively.
Fig. 1 Simplified geological map of the Himalayan Range and the southern Tibetan Plateau with sample locations (after ). MBT: Main Boundary Thrust; STDZ: South Tibet Detachment Zone; GKT: Gyirong–Kangmar Thrust; YTSZ: Yarlung–Tsangpo Suture Zone; LMF: Luobadui Milashan Fault; SNMZ: Shiquanhe NamTso Mélange Zone
To exclude uneven edges, a rectangular area was marked on the thin section. Then, the base photomicrographs were taken with a standard polarizing microscope (Nikon ECLIPSE LV 100POL, ×10 eyepiece); plane- and crossed-polarized photos were taken for each field of view. Small overlaps were retained between adjacent photomicrographs to ensure they could be completely stitched together later on. According to the sand size, a 2.5-fold objective lens was selected for the base photomicrography of the 16A063-1 thin section, and a 10-fold objective lens was selected for the single-grain photomicrography. The base photomicrographs of sample 16A063-2 were taken with a 5-fold objective lens, and the single-grain photomicrographs were taken with a 20-fold objective lens. Thin section photography and data collection were performed according to the “Special Subject of Rock Photomicrograph” standard. The field of view for each single-grain photomicrograph was framed on the base map so that the position of each photomicrograph could be quickly identified.
After the polarized photomicrographs were collected, grains were individually identified according to the 17 grain types. The identification results were marked on the base map, and the labeled grains were connected by broken lines, as shown in Fig. 2. In the figure, each inflection point in the broken line represents a grain, and the grains are numbered sequentially at intervals of 10. Each grain was simultaneously numbered in an Excel table to facilitate future grain marking. The single-grain photomicrograph files were opened with LabelImg, an open-source annotation tool (download website: http://tzutalin.github.io/labelImg/), and each grain in each photomicrograph was labeled to obtain the labeled sample dataset.
This dataset comprises three parts: a data folder, a labeled base map folder, and a sand information folder. It has 8732 sand grains of different types and 1996 photomicrographs of sand thin sections, including 1876 single-grain photomicrographs and 120 labeled base maps. Sand grains are classified into six categories and 17 subgroups (Table 1). The classification standard follows existing literature. For quartz, only single-crystal quartz and polycrystalline quartz are distinguished, and feldspar is divided into plagioclase and potassium feldspar. The number of grains of each type is shown in Table 2.
Table 1 Classification of sand grains
|Q||Q = Qm + Qp|
|F||F = P + K|
|Lv||Volcanic rock fragments; Lv = Lvf + Lvm + Lvi|
|Lvf||Acid-intermediate volcanic rock fragments|
|Lvm||Mafic volcanic rock fragments|
|Lvi||Intrusive rock fragments|
|Ls||Sedimentary rock fragments; Ls = Lsc + Lsm + Lss + Cht|
|Lsm||Mudstone or shale grain|
|Lss||Sandstone & siltstone|
|Lm||Metamorphic rock fragments; Lm = Lml + Lmp + Lms + Lmu + Lmc|
|Lmu||Metamorphic rock fragments of ultramafic rocks|
|Lmc||Lithic grains of marble|
|Others||Heavy minerals; opaque; unidentified|
Table 2 Number of sand grains of each type
|16A063-1||Others = HM(187) + Opaque(7) + Unidentified(2)|
|16A063-2||Others = HM(338) + Opaque(163) + Unidentified(5)|
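The group formulas of the classification can be applied mechanically to per-subgroup counts. The following is a minimal sketch, assuming the counts are held in a plain dictionary keyed by the abbreviations used above; the function name `group_totals` is hypothetical, and only the "Others" counts reported for the two samples are used as inputs.

```python
# Group definitions copied from the classification formulas above.
GROUPS = {
    "Q": ["Qm", "Qp"],
    "F": ["P", "K"],
    "Lv": ["Lvf", "Lvm", "Lvi"],
    "Ls": ["Lsc", "Lsm", "Lss", "Cht"],
    "Lm": ["Lml", "Lmp", "Lms", "Lmu", "Lmc"],
    "Others": ["HM", "Opaque", "Unidentified"],
}


def group_totals(counts):
    """Sum subgroup counts into the six main groups; missing codes count as 0."""
    return {
        group: sum(counts.get(code, 0) for code in members)
        for group, members in GROUPS.items()
    }


# The "Others" counts below are the ones reported for each sample.
others_1 = {"HM": 187, "Opaque": 7, "Unidentified": 2}
others_2 = {"HM": 338, "Opaque": 163, "Unidentified": 5}

print(group_totals(others_1)["Others"])  # 196
print(group_totals(others_2)["Others"])  # 506
```

The same function can be applied to the full per-grain labels in the information table once they are tallied by abbreviation.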
3.1 Labeled dataset of single detritus
The dataset information was saved in a data folder containing a photomicrograph folder, an annotation folder, and a predefined classes file that lists the annotation categories. The annotation folder contains annotation files that correspond one to one with the photomicrographs in the photomicrograph folder (Fig. 3). This folder organization can be easily read by a computer.
The photomicrographs were labeled with the software LabelImg. Each sand photomicrograph was opened with LabelImg, and the grain positions and categories were manually labeled. Because the grain positions in the plane- and crossed-polarized photomicrographs of the same field of view correspond one to one, only the plane-polarized photomicrographs required labeling. The computer can automatically extract the grain positions in the crossed-polarized photomicrograph from the position coordinates of the plane-polarized photomicrograph. The tag information was saved in the annotation file in xml format. Notepad++ (V7.8.8, on Windows) was used to read the coordinate file of each labeled grain in the annotation folder. To display the labeled positions when a photomicrograph is opened with LabelImg, the name of the opened picture folder must correspond to the name of the xml folder in which the tags were saved (two red box positions in Fig. 4A).
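The xml annotations can also be read programmatically. The following is a minimal sketch, assuming the files use the Pascal VOC layout that LabelImg writes by default (`object`, `name`, and `bndbox` elements); the sample annotation, its file name, labels, and coordinates are invented for illustration and should be checked against the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout; all values are invented.
SAMPLE_XML = """
<annotation>
  <filename>a1-.jpg</filename>
  <size><width>4908</width><height>3264</height><depth>3</depth></size>
  <object>
    <name>Qm</name>
    <bndbox><xmin>120</xmin><ymin>340</ymin><xmax>560</xmax><ymax>800</ymax></bndbox>
  </object>
  <object>
    <name>P</name>
    <bndbox><xmin>1500</xmin><ymin>200</ymin><xmax>1900</xmax><ymax>650</ymax></bndbox>
  </object>
</annotation>
"""


def read_grains(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples from one annotation."""
    root = ET.fromstring(xml_text.strip())
    grains = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        box = obj.find("bndbox")
        coords = tuple(
            int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        grains.append((label, coords))
    return grains


grains = read_grains(SAMPLE_XML)
for label, box in grains:
    print(label, box)
```

Because the plane- and crossed-polarized photomicrographs of one field of view share grain positions, the boxes returned for a plane-polarized image can be reused to crop the matching crossed-polarized image.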
The photomicrograph folder of the sand-photo dataset comprises 1876 single-grain polarized photomicrographs. Each single-grain field of view contains one plane-polarized photomicrograph and one corresponding crossed-polarized photomicrograph. The micrographs are named in the style “a1−” and “a1+”, where “a1” is the position of the corresponding field of view on the base photomicrograph, “−” represents a plane-polarized photo, and “+” represents a crossed-polarized photo (Fig. 4B). The color of the photomicrographs is consistent with naked-eye observations under a polarizing microscope. The resolution of each micrograph is 4908 × 3264 pixels, and the format is JPG.
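The naming convention makes it straightforward to pair the two photomicrographs of each field of view. The following is a minimal sketch; whether the archive uses the Unicode minus sign or an ASCII hyphen for plane-polarized files is an assumption to verify, so both are accepted, and the file listing and the function name `pair_fields` are hypothetical.

```python
import re

# Accept the Unicode minus sign (U+2212) as well as an ASCII hyphen for
# plane-polarized files; "+" marks crossed-polarized files.
NAME_PATTERN = re.compile(
    r"^(?P<field>.+?)(?P<mode>[+\-\u2212])\.jpe?g$", re.IGNORECASE
)


def pair_fields(filenames):
    """Map each field of view to its plane- and crossed-polarized file names."""
    pairs = {}
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # not a single-grain photomicrograph name
        kind = "crossed" if match.group("mode") == "+" else "plane"
        pairs.setdefault(match.group("field"), {})[kind] = name
    return pairs


# Hypothetical file listing used only to exercise the function.
files = ["a1-.jpg", "a1+.jpg", "a2\u2212.jpg", "a2+.jpg", "notes.txt"]
pairs = pair_fields(files)
print(pairs["a1"])
```

Fields of view that end up with only one of the two keys can be flagged as incomplete before training.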
3.2 Labeled base map
The labeled base map folder contains 120 labeled photomicrographs. Files named “a*” show the fields of view of the single-grain photomicrographs taken under the microscope (Fig. 5A). Files named “a*−1” correspond to the “a*” fields of view; in these files, the identified grains are connected with a broken line and numbered sequentially at intervals of 10 (Fig. 5B).
3.3 Information table of single-sand dataset
The sand information table contains the grain identification results for the two thin sections of samples 16A063-1 and 16A063-2. The numbers in the information table are consistent with the numbering sequence of the base map “a*−1” (Fig. 5B). The identification result of each grain is recorded in the sand information sheet in abbreviated form, following the order on the base map. In section 16A063-1, deeply etched plagioclase (P) and potassium feldspar (K) are labeled as P1 and K1, respectively, to distinguish them from the unetched grains. The grain compositions for the two samples are shown in Fig. 6.
The thickness of the thin sections meets the national and international standards. Quartz grains observed in the same batch of rock thin sections exhibit first-order interference colors, indicating that the thickness of the thin sections meets the national standard of 0.03 mm. The micrographs are sharp and show no color deviation. During microscopy, automatic exposure and automatic white balance were used to ensure that the colors in the photos were as consistent as possible with naked-eye observations. All micrographs were taken at the highest resolution of the photography system, 4908 × 3264 pixels, and uniformly saved in JPG format. Therefore, the quality and clarity of the micrographs are reliable. A scale was added to each photo to facilitate future grain size measurements and roundness and area calculations.
Sand grains were identified after discussion among several researchers to ensure accuracy.
The proposed dataset provides a large number of labeled sand photomicrographs and labeled coordinate files. Each grain in each micrograph is labeled, and the coordinate value and corresponding grain type can be easily obtained. At the same time, during the labeling process, the visual field position of the grain photomicrograph was labeled on the base map and the grains were numbered to ensure that the position and type of each labeled grain can be traced.
The disadvantage of this dataset is that the classes are not balanced. Some grain types, such as quartz, are very abundant, whereas others, such as metamorphic rock fragments, are very scarce (Fig. 6). Consequently, the accuracy of machine-learning-based photomicrograph recognition is likely to be uneven across grain types, and further additions are required to narrow the gap in grain numbers between types. Because the stage was moved manually during photography, a slight deviation is present between the camera and base views, but this does not affect fast positioning. Some grains are not labeled on the base map; however, the coordinate position and grain type of every grain in each picture were labeled with LabelImg to facilitate computer reading.
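Until more grains of the rare types are added, one common mitigation during training is to weight each class inversely to its frequency. The following is a minimal sketch of the widely used "balanced" heuristic, weight = n_samples / (n_classes × class_count); it is not a method used by the authors, the function name is hypothetical, and the counts are invented rather than taken from this dataset.

```python
def balanced_class_weights(counts):
    """Return n_samples / (n_classes * class_count) for every grain type."""
    if any(n <= 0 for n in counts.values()):
        raise ValueError("every grain type needs at least one labeled grain")
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Invented counts for illustration only; they are not the counts in this dataset.
counts = {"Qm": 600, "P": 300, "Lmu": 100}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these weights, every class contributes the same total weight to the training loss, so scarce grain types are not overwhelmed by abundant ones.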
This dataset contains a large number of labeled single-grain photomicrographs and coordinate files, thereby constituting an important database for the automatic identification of minerals and rock fragments in sand using machine learning technology. The many identified single-grain micrographs can also be used as reference plates for identification. The classification of sand can provide a reference standard for follow-up river sand research and improve the comparability of detritus data obtained by different laboratories. The identification characteristics of modern river sand can be used as a reference for the identification of sandstone composition and thus help us understand the characteristics of ancient sandstone.
The three files in this dataset are closely related, and their contents cross-reference one another. Please pay attention to the points outlined below when using the dataset.
(1) The thin sections in the dataset are all stored by the research group of Professor Xiumian Hu of Nanjing University. If the micrographs provided do not meet the needs of future research, the corresponding author can be contacted to apply for further use.
(2) When using the data, the three files should be downloaded together to ensure that the location information and grain type can be easily identified. When using the annotation files, LabelImg and the coordinate-file reader Notepad++ should be installed in advance to read the grain photomicrographs and coordinates. When LabelImg is used to open the photomicrograph files, the annotation save directory should be changed to the folder corresponding to the photomicrograph folder so that the labeled frames are displayed. If you have any questions regarding the usage of this dataset, please contact the corresponding author of this article.
(3) The single-grain photos can be used as a standard for the identification of river sand detritus, and some detrital grains with typical structures can be directly used for teaching and book publishing.
As this dataset is under research, we hereby apply for the protection of this dataset for 3 years. During the protection period, readers can log in to the website https://dx.doi.org/10.11922/sciencedb.j00001.00044 to download part of the dataset for understanding and reference. After the protection period, readers can log in to the official website of the scientific data repository to download and use the data (https://dx.doi.org/10.11922/sciencedb.j00001.00035).
The authors would like to thank Dr. Chao Li and Dr. Anlin Ma for their useful discussion regarding thin-section identification and Ronghua Guo for collecting the field samples.
INGERSOLL R V. The effect of grain size on detrital modes; a test of the Gazzi-Dickinson point-counting method. Journal of Sedimentary Research, 1984, 54(1): 103-116.
SONG X Z, ZHANG Q. Automatic photomicrograph recognition system and key technologies of maceral group. Journal of China Coal Society, 2019, 44(10): 3085-3097. doi:10.13225/j.cnki.jccs.2019.1103.
XU S T, ZHOU Y Z. Artificial intelligence identification of ore minerals under microscope based on deep learning algorithm. Acta Petrologica Sinica, 2018, 34(11): 3244-3252.
HAO H Z, GUO R H, GU Q, et al. Machine learning application to automatically classify heavy minerals in river sand by using SEM/EDS data. Minerals Engineering, 2019, 147. https://doi.org/10.1016/j.mineng.2019.105899.
GARZANTI E. Petrographic classification of sand and sandstone. Earth-Science Reviews, 2019, 192:545-563.
GARZANTI E, VEZZOLI G, ANDÒ S, et al. Petrology of Indus River sands: a key to interpret erosion history of the Western Himalayan Syntaxis. Earth and Planetary Science Letters, 2005, 229(3-4): 287-302.
GARZANTI E, VEZZOLI G, ANDÒ S, et al. Sand petrology and focused erosion in collision orogens: the Brahmaputra case. Earth and Planetary Science Letters, 2004, 220(1): 157-174.
GARZANTI E, LIMONTA M, VEZZOLI G, et al. Petrology and multimineral fingerprinting of modern sand generated from a dissected magmatic arc (Lhasa River, Tibet)// Ingersoll R V, Lawton T F, Graham S A. Tectonics, Sedimentary Basins, and Provenance: A Celebration of William R. Dickinson’s Career. The Geological Society of America, 2018: 197-221.
HU X M, LAI W, XU Y W, et al. Standards for taking and information collecting of digital photomicrograph of sedimentary rock. China Scientific Data, 2020. (2020-03-02). DOI: 10.11922/csdata.2020.0008.zh.
GUO R H, HU X M, GARZANTI E, et al. How faithfully the geochronological and geochemical signatures of detrital zircon, titanite, rutile and monazite record magmatic and metamorphic events? A case study from the Himalaya and Tibet. Earth-Science Reviews, 2020. https://doi.org/10.1016/j.earscirev.2020.103082.
DONG X L, HU X M, LAI W. A photomicrograph dataset of sand grains from the Yarlung Tsangpo, Tibet. Science Data Bank, 2020. (2020-07-15). DOI: 10.11922/sciencedb.j00001.00035.
How to cite this article
DONG X L, HU X M, LAI W. A photomicrograph dataset of sand grains from the Yarlung Tsangpo, Tibet. China Scientific Data, 2020, 5(3). (2020-07-15). DOI: 10.11922/csdata.2020.0051.zh.