Abstract: In response to the Belt and Road Initiative and the CAS's call for Chinese scientists to "Go Out/Global", China and Myanmar have cooperated extensively in the field of biodiversity. To support the botanical research of the Southeast Asian Biodiversity Research Institute of Chinese Academy of Sciences, the informatization work team of the Kunming Institute of Botany CAS constructed a data set and developed an information service platform of plant biodiversity in Myanmar. In this data set, we have systematically filtered and integrated most of the published biodiversity data and information scattered across different platforms in the world. The data set contains plant lists, descriptions of biological characters, specimen records and referenced literature. It covers about 15 thousand plant species in Myanmar, with 457.3 thousand data records in total. This integrated data set is expected to support botanical research in this region.
Keywords: Myanmar; biodiversity; plant; heterogeneous data integration
|Title||A data set of plant biodiversity in Myanmar|
|Data authors||He Yanbiao, Zhuang Huifu, Wang Yuhua|
|Data corresponding author||Wang Yuhua(firstname.lastname@example.org)|
|Time range||1800–2017|
|Geographical scope||Myanmar and its surrounding areas|
|Data volume||2GB, with 457,300 data entries in total|
|Data service system||<http://184.108.40.206>; <http://www.sciencedb.cn/dataSet/handle/499>|
|Sources of funding||China Basic Research Program (No. 2013FY112600)|
|Dataset composition||This data set consists of eight subsets on Myanmar plants obtained from the following eight sources: (1) Biodiversity Heritage Library (BHL) (2,921,736 data entries); (2) Scientific Database of China Plant Species (2,013,806 data entries); (3) Seed Plants of China (15,842 data entries); (4) Smithsonian Institution (14,473 data entries); (5) Global Biodiversity Information Facility (141,764 data entries); (6) Flora of Yunnan (153,396 data entries); (7) Chinese Flora (English edition) (5,254 data entries); (8) Tropicos database (14,589 data entries). We verified and cleaned the above eight subsets against written records of "distribution in Myanmar", which yielded 457.3 thousand data entries. These data were then integrated to generate the data set of plant biodiversity in Myanmar.|
Located in Southeast Asia, the Republic of the Union of Myanmar (Myanmar) is not only one of the most important biodiversity hotspots in the world, but also a global hotspot for biodiversity conservation and research.1 Myanmar has a long history of biodiversity research, which has produced a large body of literature and scientific data.2 However, there has been no complete scientific list or data set of plants in Myanmar. Information is widely scattered across different platforms, which hinders biodiversity conservation and sustainable resource utilization. Under the Belt and Road Initiative and the CAS's call for scientists to "Go Out/Global", China and Myanmar have cooperated extensively in the field of biodiversity. The Southeast Asian Biodiversity Research Institute of Chinese Academy of Sciences was established to provide scientific and technological support for Myanmar's environmental protection and sustainable plant resource utilization. To support the institute's botanical research, the informatization work team of the Kunming Institute of Botany CAS systematically integrated and analyzed published biodiversity data scattered across different platforms. The source data of this data set come from specimen records, historical literature, and published floras. The source data were filtered, correlated and integrated with reference to floristic distribution to form this data set of plant diversity in Myanmar. It provides data support for the conservation, research and sustainable utilization of plant diversity in Myanmar.
2.1 Data source selection
With its rich plant diversity, Myanmar has always been a focus of international botanical research. Taking floras and continuous floristic distribution into account, the project team extensively collected records of plants and specimens from Myanmar and its surrounding countries and regions, from sources such as Flora of China (Chinese edition),3 Flora of China (English edition),4 and Flora of Yunnan.5 Only plants whose distribution in Myanmar was confirmed by published written records were included in the plant directory of Myanmar. Sources of the data set include Scientific Database of China Plant Species, iFlora, Seed Plants of China (CD-ROM), Smithsonian Institution's Plants of Myanmar Checklist, eFloras, Global Biodiversity Information Facility, and Biodiversity Heritage Library.6–12
2.2 Data filtering and integration
The Scientific Database of China Plant Species contains both Chinese and English editions of Flora of China and Flora of Yunnan, from which 11,000 records of species were collected.
The Plants of Myanmar Checklist provides high-value scientific and technological information. We used a web spider to capture data from the target site. By extracting lists of data index URLs, then downloading and parsing the webpage data, we collected 11,000 data entries of Myanmar plants.
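The first step of such a crawl, extracting a URL list from an index page, can be sketched as follows. This is only an illustration under stated assumptions: the markup in `SAMPLE_INDEX`, the `record.cfm?id=...` paths, the base URL `http://example.org/checklist/`, and the names `LinkCollector` and `extract_record_urls` are hypothetical, since the text does not describe the structure of the target site. Downloading is left out so that the sketch runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on an index page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


# Hypothetical index page; the real checklist markup is an assumption.
SAMPLE_INDEX = """
<ul>
  <li><a href="record.cfm?id=1">Cardiocrinum giganteum</a></li>
  <li><a href="record.cfm?id=2">Tectona grandis</a></li>
</ul>
"""


def extract_record_urls(index_html, base_url):
    """Return the record URLs listed on one index page, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(index_html)
    return collector.links


record_urls = extract_record_urls(SAMPLE_INDEX, "http://example.org/checklist/")
```

Each URL in `record_urls` would then be downloaded and parsed in turn; a production spider would also need rate limiting and error handling, which are omitted here.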
Global Biodiversity Information Facility is an open data research infrastructure jointly funded by governments around the world, aiming to provide the public with information about all types of life on Earth. Data downloaded from GBIF were in text format, which could be directly entered into the database through a special import tool. The text importer of PLSQL Developer was used to import the data of Myanmar, through which more than 120,000 data entries were integrated. Detailed data sources are shown in Table 1.
|Data source||Specification||Data entries|
|Biodiversity Heritage Library||Records of plants with distribution in Myanmar||2,921,736|
|Scientific Database of China Plant Species||Records of plants with distribution in Myanmar||2,013,806|
|Seed Plants of China||Records of plants with distribution in Myanmar||15,842|
|Plants of Myanmar Checklist||List of Myanmar plants by Smithsonian Institution||14,473|
|Global Biodiversity Information Facility||Records of plant distribution in Myanmar||141,764|
|Flora of Yunnan||Records of plants with distribution in Myanmar||153,396|
|eFloras||Records of plants with distribution in Myanmar||5,254|
|Tropicos||Records of plants with distribution in Myanmar||14,589|
2.3 Plant data indexing and plant list building
The Myanmar data parsed from structured text and semi-structured HTML contained some outliers, special tags, etc., which needed further filtering. Outlier data were singled out for manual checking. Three subsets were formed from the information extracted from Scientific Database of China Plant Species, Plants of Myanmar Checklist, and GBIF. The subsets were then merged on the plants' Latin name (genus, specific epithet and, where present, variety epithet). Records sharing the same Latin name were merged, with all sources annotated. If no Latin name was provided in the original subset, a Latin name was added. The multi-source data were integrated to form a relatively comprehensive Plant Reference List of Myanmar.
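The merge step described above can be illustrated with a minimal sketch that keys every record on its Latin name and annotates each name with all contributing sources. The function name `merge_by_latin_name` and the sample names are assumptions made for illustration; the actual merge was carried out in the project database and is not documented at this level of detail.

```python
from collections import defaultdict


def merge_by_latin_name(subsets):
    """Merge subsets on Latin name and annotate each name with its sources.

    subsets maps a source name to an iterable of Latin names.
    Returns a dict mapping each Latin name to a sorted list of sources.
    """
    merged = defaultdict(set)
    for source, names in subsets.items():
        for name in names:
            key = " ".join(name.split())  # collapse stray whitespace
            if key:
                merged[key].add(source)
    return {name: sorted(sources) for name, sources in sorted(merged.items())}


# Illustrative input only; real subsets hold thousands of names.
subsets = {
    "Scientific Database of China Plant Species": [
        "Cardiocrinum giganteum",
        "Tectona grandis",
    ],
    "Plants of Myanmar Checklist": ["Tectona grandis"],
    "GBIF": ["Cardiocrinum  giganteum"],  # note the double space
}

reference_list = merge_by_latin_name(subsets)
```

Here the GBIF spelling with two spaces collapses onto the same key as the single-space spelling, so `reference_list` holds two names, each annotated with two sources.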
2.4 Data integration and dataset building
Rules of the data integration included:
(1) Latin name as the thread of data integration
Based on the above Reference List, we used the Latin name as the thread to correlate BHL historical data, GBIF geographical data, species description data from the floras and economic utilization data, and thereby constructed a comprehensive biodiversity data set.
(2) Latin name as the thread of synonym correlation
Data of different classification systems which used a synonym of the species' Latin name were treated as follows:
A database was built to correlate the species' scientific name with its synonyms, mainly based on the Scientific Database of China Plant Species and eFloras.org.
A synonym could not be used directly to associate data of a given species; rather, the synonym was first translated into the species' accepted Latin name, which was then used to correlate data of the same species.
This dataset was integrated at species level, and the integration was conducted through Latin name-synonym correlation, thereby mitigating the impact of the different taxonomic systems used by the data sources.
While the current Latin name-synonym relational database covers only part of the plants, we are continuing to build a more comprehensive one, based on Species 2000, EOL, uBio, and other resources. This will maximize the level of data integration, that is, associate as much relevant data from different data sources as possible.
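The synonym rule can be sketched as a lookup that maps each incoming name to its accepted Latin name before records are grouped. The table `SYNONYMS`, the functions `resolve_name` and `group_by_accepted_name`, and the record layout are hypothetical; the project's relational database is only described in outline.

```python
# Illustrative synonym table: synonym -> accepted Latin name.
SYNONYMS = {"Lilium giganteum": "Cardiocrinum giganteum"}


def resolve_name(name, synonyms):
    """Translate a synonym into the accepted name; other names pass through."""
    return synonyms.get(name, name)


def group_by_accepted_name(records, synonyms):
    """Group records from any source under the accepted Latin name."""
    grouped = {}
    for record in records:
        accepted = resolve_name(record["name"], synonyms)
        grouped.setdefault(accepted, []).append(record)
    return grouped


records = [
    {"name": "Lilium giganteum", "source": "BHL"},
    {"name": "Cardiocrinum giganteum", "source": "GBIF"},
]

grouped = group_by_accepted_name(records, SYNONYMS)
```

Names absent from the table pass through unchanged, which mirrors the partial coverage noted above: records filed under an unlisted synonym remain unlinked until the table is extended.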
(3) Latin name normalization
The key to data integration was to unify the plants' Latin names across all subsets. Different subsets adopted different Latin name formats: some placed two spaces between the genus name and the specific epithet whilst others placed one; some included the authors' names whilst others did not; and where authors' names were given, some were spelled out in full whilst others were abbreviated. The integration therefore simplified and unified the Latin name format, retaining only the genus name, the specific epithet and any lower-level (infraspecific) epithet, so as to improve association and matching accuracy.
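A minimal normalization sketch along these lines is given below. It is a heuristic written for illustration, not the project's actual routine: the function `normalize_latin_name` and the set `RANK_MARKERS` are assumed names, and the rank marker "f." (forma) is deliberately left out because it is ambiguous with the author abbreviation "f." (filius).

```python
import re

# Rank markers recognized by this sketch; "f." is omitted on purpose.
RANK_MARKERS = {"var.", "subsp.", "ssp."}


def normalize_latin_name(raw):
    """Reduce a raw name to 'Genus epithet [rank infraspecific-epithet]'.

    Extra whitespace is collapsed and author citations are dropped.
    """
    tokens = raw.split()  # split() also collapses repeated spaces
    if len(tokens) < 2:
        return " ".join(tokens)
    parts = [tokens[0].capitalize(), tokens[1].lower()]
    for i in range(2, len(tokens) - 1):
        marker = tokens[i].lower()
        following = tokens[i + 1]
        # Keep the first rank marker that is followed by a lowercase epithet.
        if marker in RANK_MARKERS and re.fullmatch(r"[a-z-]+", following):
            parts += [marker, following]
            break
    return " ".join(parts)


with_author = normalize_latin_name("Cardiocrinum  giganteum (Wall.) Makino")
with_variety = normalize_latin_name(
    "Cardiocrinum giganteum var. yunnanense (Leichtlin ex Elwes) Stearn"
)
```

With this rule, the double-spaced, author-bearing form and the plain form of a name reduce to the same string, which is what allows records from different subsets to be matched.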
After integration and association, the data set contains 15,000 entries of Myanmar plant species data, 17,000 entries of species description data, 141,700 entries of GBIF biodiversity data and nearly 300,000 entries of BHL references. It has a total amount of about 2 GB. The data integration framework is shown in Figure 1.
Figure 1 Framework of Myanmar plant data integration
Note: All the collected data sources underwent a process of Latin name standardization. Through the plants' Latin name, data of different sources/types were correlated and integrated, including BHL historical data, GBIF geographic data, plant species description data and economic utilization data. Intelligent parsing and manual indexing were then performed on the multi-source data, and data fields were extracted from each source to optimize data search services.
2.5 Development of the information service platform
To facilitate data search and download, an information platform was developed, providing simple, easy-to-use and efficient intelligent search services (Figure 2). It handles the multiplicity of data fields and types that results from data merging and streamlines the search process, thereby facilitating data (re)use. The platform also allows the dataset to be updated in a timely manner as users provide feedback and new knowledge emerges. Through an intelligent search window, the program returns a list of possible data types and search results based on user input. The platform supports data retrieval, browsing and download/export (registration required). For external data sources (e.g., BHL, GBIF and eFloras), the platform provides data links, through which users can obtain detailed information on the respective platforms.
Take "Cardiocrinum giganteum (大百合)" as an example. The results page presents a list of data sources, including Flora of China (Chinese edition), Flora of China (English edition), Seed Plants of China, and related data links to BHL and GBIF. Clicking an item directs users to a detailed information page. The following aspects of Cardiocrinum giganteum have been integrated: species classification information, including the taxonomic information of the species, its Chinese and Latin names, and the source of the information; the English description of the species, mainly from Flora of China; and the Chinese description of the species, including its habitat, altitude of distribution, distribution in China and overseas, place of origin, specimen information and so on. Detailed data of the sample are given in Figure 3.
This dataset is an application based on published data sources, so the quality control mainly involved data content, structure and anomaly checking.2 The process mainly consisted of the following steps:
First, data anomaly checking: this is to check whether the parsed data still carry any special marks, such as HTML markup.
Second, data field analysis: this is to find out whether the parsed fields are consistent with the source data, so as to minimize data field loss.
Third, random inspection: this is to find any other possible data problems.
For anomaly data checking, data with erroneously parsed fields were fully investigated (through a combination of program inspection and manual inspection). Both anomaly data and erroneously parsed data have been removed from the final dataset. The integrated data set maintains a high level of consistency with the source data. To protect data copyright, all external data indexed by the platform are presented to users through links.
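The program-inspection part of the anomaly check can be sketched as a scan for residual HTML tags or character entities in parsed fields. The regular expressions, the function `find_markup_anomalies` and the sample records are assumptions for illustration; the text does not specify the checks that were actually run.

```python
import re

# Patterns for leftover markup: tags such as <i> or </td>, and entities
# such as &nbsp; or &#160;.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
ENTITY_RE = re.compile(r"&(?:[A-Za-z]+|#\d+);")


def find_markup_anomalies(records):
    """Return (record index, field names) for records with leftover markup."""
    flagged = []
    for index, record in enumerate(records):
        bad_fields = sorted(
            field
            for field, value in record.items()
            if isinstance(value, str)
            and (TAG_RE.search(value) or ENTITY_RE.search(value))
        )
        if bad_fields:
            flagged.append((index, bad_fields))
    return flagged


# Placeholder records; the field names and values are hypothetical.
records = [
    {"name": "Cardiocrinum giganteum", "description": "plain text"},
    {"name": "<i>Cardiocrinum giganteum</i>", "description": "text&nbsp;here"},
]

flagged = find_markup_anomalies(records)
```

Flagged records would then go to manual inspection before being corrected or removed, in line with the combined program and manual checking described above.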
Random sampling statistics show that the data are of high quality. After filtering, standardization and integration, the data can achieve a high degree of correlation, and they show a consistency rate of over 95% with source data.
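One way to estimate such a consistency rate is to draw a random sample of integrated records and compare each with its source record. The sketch below assumes records are keyed by an identifier shared between the integrated data and the source; the function `sample_consistency_rate`, the seed and the sample data are hypothetical, and the sampling procedure actually used is not described in the text.

```python
import random


def sample_consistency_rate(integrated, source, sample_size, seed=0):
    """Estimate the share of sampled integrated records equal to the source.

    integrated and source map a shared record key to the record content.
    """
    keys = sorted(integrated)
    if not keys or sample_size <= 0:
        raise ValueError("need at least one record and a positive sample size")
    rng = random.Random(seed)  # fixed seed keeps the inspection reproducible
    sample = rng.sample(keys, min(sample_size, len(keys)))
    matches = sum(1 for key in sample if integrated[key] == source.get(key))
    return matches / len(sample)


# Placeholder data: three of four integrated records match their source.
source = {1: "a", 2: "b", 3: "c", 4: "d"}
integrated = {1: "a", 2: "b", 3: "c", 4: "x"}

rate = sample_consistency_rate(integrated, source, sample_size=10)
```

Because the requested sample size exceeds the number of records here, every record is checked and `rate` is 0.75; on real data the estimate carries sampling error that shrinks as the sample grows.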
As biodiversity data have accumulated rapidly in recent years, the biggest difficulty for researchers is obtaining integrated data quickly and accurately. Large amounts of data are scattered across discrete, single-function platforms with heterogeneous storage and inconsistent standards. How to integrate multi-source scientific data and build information services that meet the needs of scientific research will therefore be a focus of future research on scientific databases. By integrating professional databases and online information, this data set provides plant diversity data for Myanmar, covering historical literature data (BHL), geographical distribution data (GBIF, Seed Plants of China), species description data (eFloras.org, Flora of China, Flora of Yunnan), and economic use data and a conservation rank directory (Scientific Database of China Plant Species). The dataset can effectively support botanical collection, investigation and research, as well as resource development and utilization, in Myanmar.
At present, China has few comprehensive scientific datasets and information service platforms for biodiversity hotspots. The construction of this dataset and the integration of multiple data sources can serve as a reference for constructing other regional or large-scale biodiversity datasets.
The information platform is located at: <http://220.127.116.11>, with a copy of the dataset available at: <http://www.sciencedb.cn/dataSet/handle/499>. Refer to Figure 3 for webpage features. To download the data sets, click the column "Resource Download (资源下载)" or visit: <http://18.104.22.168/Data/DataBaseList>. Follow-up work includes improving the platform's data analysis capabilities, with the aim of building a highly functional information service platform for plant diversity in Myanmar.
We express our deepest gratitude to Dr. Yang Xuefei from the Southeast Asian Biodiversity Research Institute of Chinese Academy of Sciences for his constructive suggestions in the data collection process.
Mon MS, Mizoue N, Htun NZ et al. Factors affecting deforestation and forest degradation in selectively logged production forest: A case study in Myanmar. Forest Ecology and Management 267 (2012): 190–198.
Turnell S. Myanmar's fifty-year authoritarian trap. Journal of International Affairs 65 (2011): 79–92.
Flora of China Editorial Committee. Flora of China. Beijing: Science Press and Missouri Botanical Garden Press, 2013.
He Y, Zhuang H & Wang Y. China Plants Database. Available: <http://db.kib.ac.cn> [Accessed September 30, 2017].
Zhuang H, Wang Y & Wang Y. iFlora. Available: <http://www.iflora.cn> [Accessed September 30, 2017].
Biodiversity Heritage Library (BHL). Available: <http://www.biodiversitylibrary.org> [Accessed September 30, 2017].
Smithsonian Institution. The List of Myanmar's Plant. Available: <http://botany.si.edu/myanmar/checklistnames.cfm> [Accessed September 30, 2017].
1. He Y, Zhuang H & Wang Y. A data set of plant diversity in Myanmar. Science Data Bank. DOI: 10.11922/sciencedb.499