Abstract: This dataset of remote sensing image samples of land cover types in China was obtained through manual interpretation of ten-meter-level Landsat 8 multispectral images and meter-level GF-1 and QuickBird multispectral images. It covers 31 provinces, municipalities and autonomous regions of China, excluding Hong Kong, Macao and Taiwan. The samples are drawn from summer and winter satellite images acquired over five years (2013 – 2017) and are labeled with the classification system of the ground object spectral library. The dataset offers a priori training and testing samples for land cover classification and provides sample data support for research on its applications. It can also guide the collection of similar sample points in other images. To make the data convenient to use, we adopted unified, standard data processing methods, sample collection rules and a quality control system in producing the land cover image data. The dataset has been made publicly available online.
Keywords: land cover type; China; remote sensing image sample; ten meter level multispectral data; meter level multispectral data
|English title||Remote sensing image sample dataset of land cover types in China|
|Corresponding author||Zhao Lijun (email@example.com)|
|Data author(s)||Zhao Lijun, Zheng Ke, Shi Lulu, Bai Yang, Tang Jiwen, Zhang Wei, Rao Mengbin, Zou Song, Li Yanyan|
|Time range||2013 – 2017|
|Geographical scope||31 provinces, municipalities or autonomous regions of China excluding Hong Kong, Macao and Taiwan|
|Spatial resolution||2.4–30 m||Data volume||647 MB (after decompression)|
|Data format||*.tif, *.jpg, *.txt, *.xml|
|Data service system||Website (e.g., http://www.sciencedb.cn/dataSet/handle/1)|
|Source(s) of funding||Basic Research Foundation of Science and Technology (2014FY210800)|
|Dataset/Database composition||The dataset consists of two compressed files, which contain two folders of meter-level sample data (GF1 and QuickBird) and one folder of ten-meter-level sample data (Landsat). Each folder contains several subfolders named after the sampling regions. Each subfolder holds compressed files that store data for the six major land cover types in the sampling region: soil, water body, rock and mineral, vegetation, snow and ice, and man-made objects. Each compressed file contains only samples of the same land cover type taken from the same remote sensing image. After decompression, each compressed file yields four types of data files: the original satellite image of the sample (*.tif), a sample image preview file (*.jpg), a text file of DN values of the different spectral bands (*.txt), and a metadata file (*.xml).|
Land cover is the composite of the material types and natural properties of the earth's surface, and its spatial distribution directly affects the surface cycles of matter and energy [1]. Land cover monitoring with remote sensing images is an important foundation for research on ecological and environmental change, land resource management and sustainable development, and it plays an important role in global resource monitoring and global change detection [2]. Computer-aided remote sensing image classification has become a main direction of development, and many classification methods have emerged, such as statistical pattern recognition, artificial-intelligence-based classification, combined remote sensing and GIS approaches, object-oriented classification, and multi-source information compound classification [3]. Remote sensing image classification assigns each image pixel or region to one of several categories. Specifically, characteristic parameters are selected by analyzing the spectral characteristics of the ground objects, the feature space is divided into non-overlapping subspaces, and each image pixel is assigned to one subspace, which yields the classification [4]. In land cover classification of remote sensing images, image samples can be used to build supervised classification models, providing a priori knowledge for the classification. At present, the National Fundamental Geographic Information Center has released the 2010 GlobeLand30 product, a global 30-meter land cover classification product that can be freely downloaded at http://www.webmap.cn/commres.do?method=globeIndex. The Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, has released land cover remote sensing monitoring products for China covering consecutive years.
Moreover, Tsinghua University has released the FROM-GLC2010 land cover classification products (available at http://data.ess.tsinghua.edu.cn/landsat_pathList_fromglc_0_1.html) as well as global validation sample points (available at http://data.ess.tsinghua.edu.cn/data/temp/GlobalLandCoverValidationSampleSet_v1.xlsx), but the training samples have not been published. The dataset in this paper was funded by the Ground Object Spectral Library Construction project (2014FY210800) under the Fundamental Science and Technology Major Project of the Ministry of Science and Technology of China. Besides this dataset, the Ground Object Spectral Library also includes a reflectance dataset of typical waters, a reflectance dataset of typical ground objects, a full-band spectral dataset of typical land covers, a time series reflectance dataset for crop growing seasons, a multi-scale reflectance dataset of forests and crops, a multi-frequency and multi-angle microwave black-body temperature dataset of forests, and a reflectance dataset of rock and mineral specimens in China. All these datasets will be published through the website http://126.96.36.199/spectrum/. All the other datasets in the project were obtained from ground field spectral measurements, whereas the dataset in this paper was generated by remote sensing image sampling, so as to support correlation analysis between image and ground spectra and to provide supplementary samples for remote sensing image classification. This dataset uses remote sensing images close to the areas covered by the other datasets in China, and the final image samples were obtained by manual interpretation and comparison. Unlike existing datasets such as GlobeLand30 and FROM-GLC [5, 6], this dataset is characterized by (1) being up to date (2013 – 2017), (2) containing image samples with higher spatial resolution (2.4 – 8 m), (3) a finer classification system (oriented to the ground object spectral library), and (4) completely free access (all samples are available at the website).
The dataset consists of remote sensing image samples at meter- and ten-meter-level spatial resolutions. The ten-meter-level data come mainly from 30-meter Landsat 8 images, and the meter-level data come mainly from 8-meter GF-1 images, supplemented by some 2.4-meter QuickBird images. All image data were acquired through the Geospatial Data Cloud platform (available at http://www.gscloud.cn/) or by purchase. For Landsat 8, L1T level products were used. According to USGS, these products have already been corrected using ground control points and digital elevation model (DEM) data, with a geometric correction accuracy of 12 m, which is less than 0.5 pixels. For GF-1, L1 level products without geometric information were selected first and then geometrically corrected using the RPC parameter files. For QuickBird, L2A level products were used, for which precise geometric correction had already been completed. All of these images therefore underwent precise geometric correction but no radiometric or atmospheric correction. Although the digital number (DN) values were kept, they cannot be applied directly to construct classification models. Geometric information is given priority here because, in practical applications, differences in the wave bands of different sensors can make pixel spectral information inconsistent, whereas training samples are easy to construct from geographic location and category information. The reflectance spectral features can then be obtained by applying radiometric and atmospheric correction to the images to be classified.
In terms of spatial scope, the ten-meter-level images cover 31 provinces, autonomous regions and municipalities in China except Hong Kong, Macao and Taiwan, with at least two scenes randomly selected for each administrative unit. The meter-level images cover Northeast China, North China, East China, Central China, South China, Southwest China and Northwest China, with at least two scenes randomly selected for each region. In terms of temporal scope, the two most distinct seasons, summer and winter, were considered, with winter ranging from December to March and summer from July to September. In terms of acquisition year, images from the most recent five years were used to guarantee the timeliness of the sample data. The final details of the sampled images are listed in Table 1.
|Satellite||Resolution level||Amounts (scene)||Spatial scope|
|Landsat 8||ten-meter||66||31 provinces, autonomous regions and municipalities in China except Hong Kong, Macao and Taiwan|
|GF-1||meter||14||Northeast China, North China, East China, Central China, South China, Southwest China and Northwest China|
|QuickBird 02||meter||1||Northeast China|
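The seasonal windows described above (winter from December to March, summer from July to September) can be expressed as a small lookup. This is only a sketch: the function name `season_of` is hypothetical, and it assumes the acquisition time is given as a `YYYYMMDDHHMMSS` string, the format used in the dataset's file names.

```python
from datetime import datetime

# Month ranges as stated in the text.
WINTER_MONTHS = {12, 1, 2, 3}
SUMMER_MONTHS = {7, 8, 9}


def season_of(acquired):
    """Return 'winter', 'summer', or None for a YYYYMMDDHHMMSS string.

    None means the acquisition month falls outside both sampling windows.
    """
    month = datetime.strptime(acquired, "%Y%m%d%H%M%S").month
    if month in WINTER_MONTHS:
        return "winter"
    if month in SUMMER_MONTHS:
        return "summer"
    return None


# Hypothetical acquisition times, for illustration only.
print(season_of("20150712031522"))
print(season_of("20160115025901"))
```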
Based on the preprocessed remote sensing images, image sampling was performed by manual interpretation. The category system strictly complies with the internal standard “Classification Coding Standard for Ground Object Spectral Library (Draft)”. The draft mainly follows existing national standards and widely recognized classification principles, with modifications made according to the practical requirements of the survey department and of remote sensing image classification studies. The land cover types fall into six top-level categories: vegetation, soil, rock and mineral, snow and ice, water body, and man-made objects. In the full classification system, vegetation has six sub-levels, snow and ice has five sub-levels, and soil, rock and mineral, water body, and man-made objects each have four sub-levels. This classification system was simplified according to how well remote sensing images can distinguish land covers, and the simplified system is shown in Table 2.
|1st level||2nd level||3rd level||4th level|
|12||Urban green||1201||Manmade green|
|3||Rock and mineral||31||Rock|
|4||Snow and ice||41||Ice||412||Lake ice|
|5||Water body||51||River||511||Perennial river|
|53||Lake||531||Perennial lake/pond||53101||Perennial lake|
|6||Man-made objects||61||Drainage||6101||Man-made ditch|
|62||Settlement and infrastructure||6201||Settlement|
|6202||Industrial mining and facilities|
During manual interpretation and sampling, existing thematic products and historical data were consulted to ensure the accuracy of the sample category labels. These include single-category thematic classification products for vegetation and water bodies, Google Earth high-resolution historical images, and ground sample measurements collected by other projects of the project team. For image samples at both resolution levels (ten-meter and meter level), the number of samples for each sub-category in each scene was kept between 50 and 700, and the sample size was 7 pixels by 7 pixels. Edge pixels were avoided during sampling. Following the project's data storage specifications, each sample is recorded in a standardized, unified data organization and storage format. Each sample corresponds to four files, as shown in Table 3.
|File name||File content|
|image_<top class>_<sensor type>_<acquired time>_AXXX.tif||Raw image of a sample point|
|view_<top class>_<sensor type>_<acquired time>_AXXX.jpg||Preview image of a sample point|
|pixel_<top class>_<sub class>_<acquired time>_AXXX.txt||DN values of each band of raw image|
|pixel_<top class>_<sub class>_<acquired time>_BXXX.xml||Metadata description information of a sample point|
In Table 3, <top class>, <sub class>, <sensor type> and <acquired time> follow uniform naming rules. <top class> is the top-level category and is limited to vege, soil, rock, snow, water, and manmade, which correspond to vegetation, soil, rock and mineral, snow and ice, water body, and man-made objects in Table 2. <sub class> is the abbreviated English name of the sub-category, with no more than 15 characters. <sensor type> is the sensor name and is limited to OLI, PMS, and QuickBird, which correspond to the Landsat 8, GF-1 and QuickBird 02 satellites. <acquired time> is the acquisition time, recording the year, month, day, hour, minute and second in the format YYYYMMDDHHMMSS. XXX is the document number, ranging from 001 to 999. A and B are document identifiers: A marks the supporting documents of a sampling point, and B marks the metadata document.
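The naming rules above lend themselves to a regular-expression check. The following is a minimal sketch for the image_ and view_ names only; the function name `parse_sample_name` and the example file names are hypothetical, and the pattern assumes the fields appear exactly as listed in Table 3 with no extra separators.

```python
import re
from datetime import datetime

# Controlled vocabularies stated in the naming rules.
TOP_CLASSES = ("vege", "soil", "rock", "snow", "water", "manmade")
SENSORS = ("OLI", "PMS", "QuickBird")

# <prefix>_<top class>_<sensor type>_<acquired time>_<A|B><XXX>.<ext>
NAME_RE = re.compile(
    r"^(?P<prefix>image|view)_"
    r"(?P<top_class>" + "|".join(TOP_CLASSES) + r")_"
    r"(?P<sensor>" + "|".join(SENSORS) + r")_"
    r"(?P<acquired>\d{14})_"
    r"(?P<doc_id>[AB])(?P<number>\d{3})"
    r"\.(?P<ext>tif|jpg)$"
)


def parse_sample_name(filename):
    """Split an image_/view_ file name into its fields, or return None."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    # Raises ValueError if the 14 digits are not a valid timestamp.
    info["acquired"] = datetime.strptime(info["acquired"], "%Y%m%d%H%M%S")
    info["number"] = int(info["number"])
    return info


# Hypothetical file name, for illustration only.
info = parse_sample_name("image_water_OLI_20150712031522_A001.tif")
print(info["top_class"], info["sensor"], info["acquired"].year, info["number"])
```

A name with an unknown category or sensor is rejected (the function returns None), which makes the same pattern usable as a simple naming check during quality control.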
The national land cover remote sensing image sample dataset includes two spatial resolution levels: ten-meter level and meter level. A total of 118,324 samples were collected from ten-meter-level images, with 58,317 in summer and 60,007 in winter. Here, provinces, autonomous regions and municipalities were taken as the sampling units, and each unit has at least two temporal phases. A total of 29,551 image samples were collected at the meter level, including 15,792 samples in summer and 13,759 samples in winter. Here, the large national divisions were taken as the sampling units, and each division has at least two temporal phases. The spatial distribution of sampling points is shown in Figure 1. Figure 2 and Figure 3 show the composition of ten-meter-level image samples and meter-level image samples, respectively.
This dataset is organized and stored as images plus description documents. It contains image data and text data, as shown in Table 3. Figure 4 illustrates the list of storage files corresponding to one sample point in the category of agricultural land.
In Figure 4, the TIF file is the original image file of 7 pixels by 7 pixels. The JPG file is the corresponding preview image of 7 pixels by 7 pixels. The TXT file stores the DN values of the central pixel of the sample point in two tab-separated columns, with the wavelength in the first column and the DN value in the second. The XML file is a metadata description file, which is formatted and stored as shown in Figure 5.
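The two-column TXT layout described above can be read with a few lines of code. This is a sketch under stated assumptions: the function name `parse_dn_text` is hypothetical, the file is assumed to have no header line, and the sample values are made up for illustration (the text does not give the wavelength unit).

```python
def parse_dn_text(text):
    """Parse tab-separated 'wavelength<TAB>DN' lines into (wavelength, dn) pairs."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(fields)}")
        wavelength, dn = fields
        records.append((float(wavelength), float(dn)))
    return records


# Made-up content for illustration only.
sample = "443\t9876\n483\t9012\n561\t8450\n"
for wavelength, dn in parse_dn_text(sample):
    print(wavelength, dn)
```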
For remote sensing image sample data, quality control includes checking and collating the sampled image data, checking single sample data points, generating and storing supporting files of sample points, and compiling, checking and storing metadata. The sampled image data are examined for obvious mistakes in image projection information, band number, storage integrity and file format. The check of single sample data points mainly aims at correcting or eliminating samples with wrong class labels. Sample point supporting files and metadata files are examined for file naming, file format, standardized field naming and completeness of content.
To construct the remote sensing image sample dataset, a complete quality control process (Figure 6) was established at the data entry stage to ensure the correctness, integrity and consistency of the data to be stored in the database. During data collection and storage, the original remote sensing image data and the image sample data were arranged and formatted in a unified manner. At the same time, a series of quality control methods, such as correctness checks and data consistency checks, were adopted to ensure data quality. The supporting documents and metadata files of image sample points follow the data standard format formulated by the project (including image file name, longitude and latitude, sensor type, observation time, spatial resolution, spectral type, personnel information, etc.). To reduce the errors caused by manual entry, all metadata information was filled in automatically by a program that reads the original image data and the sampling point data.
To evaluate the quality of the dataset quantitatively, we took the classification results of the remote sensing images used in sample collection as the evaluation object. The sample points in each image were randomly divided into training samples and test samples. The training samples were used to train a support vector machine (SVM) classifier, and the test samples were used to evaluate the classification accuracy in terms of overall classification accuracy and the Kappa coefficient. We randomly selected samples from different regions and different phases for evaluation (see Table 4). The average classification accuracy is 81.17%, and the average Kappa coefficient is 0.78. These results indicate that the data quality is generally good.
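The two measures named above can be computed from the true and predicted labels of the test samples. The sketch below uses the standard definitions of overall accuracy and Cohen's Kappa; the function name `accuracy_and_kappa` and the labels in the example are illustrative and are not taken from Table 4.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (overall accuracy, Cohen's Kappa) for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of test samples classified correctly.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of (row total * column total) / n^2.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return observed, float("nan")  # Kappa is undefined with a single class
    return observed, (observed - expected) / (1.0 - expected)


# Illustrative labels only.
y_true = ["water"] * 4 + ["soil"] * 4 + ["vege"] * 2
y_pred = ["water", "water", "water", "soil",
          "soil", "soil", "soil", "soil",
          "vege", "water"]
accuracy, kappa = accuracy_and_kappa(y_true, y_pred)
print(f"overall accuracy = {accuracy:.2f}, kappa = {kappa:.3f}")
```

Kappa discounts the agreement expected by chance, so it is lower than the overall accuracy whenever the class totals make some agreement likely by chance.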
This dataset is the original storage file of the Ground Object Spectral Library (GOSPEL) database. Users can retrieve the information of sample points within a search area through the portal website (http://188.8.131.52/spectrum/), or read the spatial information of sample points from the XML files of this dataset by batch parsing them with a program. The dataset can provide training and testing sample data for research on remote sensing image classification algorithms. The positions of sample points in the image to be classified can be obtained by converting between the geographic coordinates and the pixel coordinates of the image. Classification algorithms such as Maximum Likelihood Classification (MLC), SVM and Convolutional Neural Network (CNN) can then be used to complete the classification. The flow chart is shown in Figure 7. Since the dataset does not cover every region of the country, the image to be classified may not contain any image sample points. In that case, researchers are advised to query the image sample points of adjacent areas and use them as a reference sample set to guide the collection of similar sample points in the image to be classified, with which the classification task can be completed.
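The conversion from geographic coordinates to pixel coordinates mentioned above can be sketched as follows. This assumes the six-parameter affine geotransform convention used by GDAL, and that the sample point and the image share the same coordinate reference system; the function name `geo_to_pixel` and the numbers in the example are illustrative.

```python
import math


def geo_to_pixel(gt, x, y):
    """Map a geographic coordinate (x, y) to the (col, row) of the pixel containing it.

    gt follows the GDAL convention:
        x = gt[0] + col * gt[1] + row * gt[2]
        y = gt[3] + col * gt[4] + row * gt[5]
    """
    det = gt[1] * gt[5] - gt[2] * gt[4]
    if det == 0:
        raise ValueError("geotransform is not invertible")
    dx = x - gt[0]
    dy = y - gt[3]
    # Invert the 2x2 affine part to recover fractional pixel coordinates.
    col = (gt[5] * dx - gt[2] * dy) / det
    row = (gt[1] * dy - gt[4] * dx) / det
    return math.floor(col), math.floor(row)


# Illustrative north-up image with 30 m pixels.
geotransform = (500000.0, 30.0, 0.0, 4400000.0, 0.0, -30.0)
print(geo_to_pixel(geotransform, 500045.0, 4399925.0))
```

A result outside the image bounds means the image to be classified does not contain that sample point, which is the case in which samples from adjacent areas would be consulted.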
On the other hand, the dataset can be used in conjunction with the ground measured spectral data. By matching other ground measured spectral datasets provided by the GOSPEL platform on spatial position (same and near position), users can study and analyze the correlation between the ground measured spectrum and the image spectrum. It not only provides the data basis for the research of related algorithms, but also makes it possible to use the measured spectral data directly to guide the automatic acquisition of image samples.
Qiao W, Guo R, Liu Y, et al. Methods and related issues concerning cropland features extracting from remote sensing images for land cover. Standardization of Surveying and Mapping 29 (2013): 21 – 23.
Zhang W, Zheng K, Tang P, et al. Land cover classification with features extracted by deep convolutional neural network. Journal of Image and Graphics 22 (2017): 1144 – 1153.
Shi Z, Ma Y, Wang Y, et al. Review on the classification methods of land use/cover based on remote sensing image. Chinese Agricultural Science Bulletin 28 (2012): 273 – 278.
Wang K, Qi H. Researching on the assorting methods of land utility and overburden remote sense. Shanxi Architecture 34 (2008): 353 – 354.
Gong P, Wang J, Yu L, et al. Finer resolution observation and monitoring of global land cover: first mapping results with Landsat TM and ETM+ data. International Journal of Remote Sensing 34 (2013): 2607 – 2654.
1. Zhao L, Zheng K, Shi L, Bai Y, Tang J, Zhang W, Rao M, Zou S & Li Y. Remote sensing image sample dataset of land cover types in China. Science Data Bank, DOI: 10.11922/sciencedb.663 (2019).
How to cite this article
Zhao L, Zheng K, Shi L, Bai Y, Tang J, Zhang W, Rao M, Zou S & Li Y. Remote sensing image sample dataset of land cover types in China. China Scientific Data 4 (2019). DOI: 10.11922/csdata.2018.0058.zh