Digitization: A Boost to Circulation

In the last two posts, I've discussed how herbarium specimens have circulated since they were first created, and how specimens sometimes get stuck in a limbo of uncurated collections.  Now I want to discuss how circulation has changed thanks to the massive digitization projects of the 21st century.  This is a familiar story to those in the herbarium world, but I'll quickly review it for those who aren't lucky enough to hang around herbaria.  The upshot of digitization is that now everyone can hang around them, at least virtually.

Digitization is very much tied to the development of computer technologies, but also to a globalization that has brought an awareness that the planet we live on is a shared asset and a shared responsibility.  Over the years, a number of international conferences and agreements have articulated this vision and made it actionable.  The Convention on Biological Diversity, which entered into force in 1993, gave each nation sovereignty over its biological wealth, which implies knowledge of that wealth.  This led to the 2002 Global Strategy for Plant Conservation, whose goals, updated over the years, include an online flora of all known plants, the best way to make the information widely available.  While this goal has yet to be met in full, there have been significant advances toward it.

In the early 21st century, the Andrew W. Mellon Foundation spearheaded the digitization of type specimens so that researchers around the world could access the plants that were used in describing species.  Because of the way many specimens circulated—collected in the species-rich tropics and transported to botanist-rich Europe and North America—researchers in developing nations did not have ready access to these materials.  Botanical literature was also relatively unavailable, so the project digitized many publications of the past as well.  This project morphed into the portal JSTOR Global Plants and also set the stage for other large-scale digitization projects such as ADBC (Advancing Digitization of Biodiversity Collections) in the United States, with its massive iDigBio portal, and what ultimately became DiSSCo (Distributed System of Scientific Collections) in Europe.  Meanwhile, GBIF (Global Biodiversity Information Facility) has aggregated data from these projects and others worldwide to create the largest portal for natural history collection and observational data.

While an amazing achievement, digitization has not totally solved access problems for those in developing nations, who often do not have the hardware, software, and internet connections needed to make good use of these resources.  Still, digitization has broadened availability in other ways.  It was difficult for those not involved in botanical research to visit herbaria, if for no other reason than specimens' fragility: each use opens the possibility of damage.  This is not a problem with a digital collection, so students can study specimens on the web, as can curious gardeners and artists looking for new forms of inspiration, leading to greater plant awareness, a positive counter to what has been called plant blindness.

Digital collections have already had a major impact on the ways specimens are used in research (Heberling et al., 2021).  For phenological work, botanists can now search GBIF for a particular species and, by checking each specimen's flowering or fruiting status against the date it was collected, see if there is a pattern of change in the dates over a period of 100 years or more.  They may check hundreds or even thousands of specimens, something that wouldn't be feasible with physical examination.  Niche or species distribution modeling, determining areas that might provide suitable habitat for a species based on what is known about its range, is another area where digital specimens are pivotal: geographic coordinate data on where plants were collected are used to create a model of the environmental conditions that meet a species' habitat requirements.  This research is helpful in identifying possible collection areas and also where a species might be able to grow as the climate changes.
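
For readers curious about what such a search looks like in practice, here is a minimal sketch in Python, assuming the open-source pygbif client for GBIF's public API.  The species name is an arbitrary illustration, and since flowering or fruiting status is not a standard GBIF field (in real studies it is scored from the specimen images themselves), collection day-of-year simply stands in for flowering date.

```python
# A minimal sketch of a phenology query, assuming the pygbif client
# (pip install pygbif).  The species name is just an illustration.
from datetime import datetime
from pygbif import occurrences as occ
import numpy as np

results = occ.search(scientificName="Trillium grandiflorum",
                     basisOfRecord="PRESERVED_SPECIMEN",
                     limit=300)["results"]

years, doys = [], []
for rec in results:
    date = rec.get("eventDate", "")[:10]        # e.g. "1904-05-17"
    try:
        dt = datetime.strptime(date, "%Y-%m-%d")
    except ValueError:
        continue                                # skip undated specimens
    years.append(dt.year)
    doys.append(dt.timetuple().tm_yday)         # day of year collected

# A simple linear trend: a negative slope would hint at earlier
# collection (and perhaps earlier flowering) over time.
slope, _ = np.polyfit(years, doys, 1)
print(f"{len(years)} dated specimens, trend = {slope:.3f} days/year")
```

Real studies add many safeguards this sketch skips, from filtering out mislabeled records to controlling for latitude and elevation, but the basic move is the same: turn thousands of collection dates into a trend.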

There’s also an increase in the use of artificial intelligence (AI) tools to recognize traits like leaf shape and even to identify species.  This work requires a great deal of computing power and sophisticated neural network techniques, so it’s costly in both technology and human input.  However, the field is advancing rapidly in exciting ways.  Botanists foresee being able to rapidly analyze large numbers of specimens and at least sort them into families or genera, if not species.  At the moment, though, even the identification of leaf shapes is still in its infancy, and when deep learning techniques are tested on specimen identification, this is done on carefully selected specimen sets.  Still, the increasing frequency with which AI projects are presented at conferences on digital specimens suggests that these tools will soon become widely used in biodiversity research.
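
To give a flavor of what these projects involve, here is a sketch of the standard transfer-learning recipe, assuming PyTorch and torchvision and a hypothetical folder of specimen images sorted into subfolders by family.  This is generic image classification, not any specific published herbarium pipeline.

```python
# A sketch of sorting specimen images by family with transfer learning.
# Assumes a hypothetical layout: specimens/<family>/<image>.jpg
import torch
from torch import nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],   # ImageNet statistics
                         [0.229, 0.224, 0.225]),
])
data = datasets.ImageFolder("specimens", transform=tfm)
loader = torch.utils.data.DataLoader(data, batch_size=16, shuffle=True)

# Start from an ImageNet-pretrained network; retrain only the last layer.
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, len(data.classes))

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
model.train()
for images, labels in loader:                     # one training pass
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```

Even this toy version hints at the costs mentioned above: the pretrained network, the labeled training set, and the compute for repeated passes over the data all have to come from somewhere.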

I should add that there are obviously many research areas where digital specimens cannot possibly replace the real thing.  There is no DNA in a data file.  Specimens have proved to be goldmines for those working on plant genetics.  As sequencing techniques become more sophisticated, even the short, degraded DNA fragments found in specimens hundreds of years old can provide substantial information on a plant’s relationship to other species.  But this isn’t the only reason why physical specimens need to be retained.  They can give clues on chemical changes in plants under siege from herbivores (Zangerl & Berenbaum, 2005), and more than one entomologist has found new insect species hidden away on plant specimens (Whitehead, 1976).  Each specimen is unique: a particular plant collected at a particular place and time, and therefore irreplaceable.

References

Heberling, J. M., Miller, J. T., Noesgaard, D., Weingart, S. B., & Schigel, D. (2021). Data integration enables global biodiversity synthesis. Proceedings of the National Academy of Sciences, 118(6), e2018093118. https://doi.org/10.1073/pnas.2018093118

Whitehead, D. R. (1976). Collecting Beetles in Exotic Places: The Herbarium. The Coleopterists Bulletin, 30(3), 249–250.

Zangerl, A., & Berenbaum, M. (2005). Increase in toxicity of an invasive weed after reassociation with its coevolved herbivore. Proceedings of the National Academy of Sciences, 102(43), 15529–15532. https://doi.org/10.1073/pnas.0507805102

On the Road, Learning about Herbaria: Digitization


I recently went north, to Yale University, for the third annual Digital Data in Biodiversity Research Conference, sponsored by iDigBio, the NSF-funded project to digitize natural history specimens.  I attended the first of these conferences two years ago at the University of Michigan (see earlier post).  Both were fascinating and informative, but also different from each other: the focus of attention in this field has moved beyond digitizing collections to using digitized collections.  This seems a healthy trend, but as Katherine LeVan of the National Ecological Observatory Network (NEON) mentioned, only 6% of insect collections have been even partially digitized, and Anna Monfils of Central Michigan University noted that iDigBio has information from just 624 of the 1,600 natural history collections in the United States.  Admittedly, it’s mostly small collections that aren’t represented, but Monfils went on to show that smaller collections hold larger-than-expected numbers of local specimens, providing finer-grained information on biodiversity.

Despite the caveat about coverage, the results of the NSF funding are impressive and are leading to an explosion in the use of these data.  It is difficult to keep up with the number of publications employing herbarium specimens as sources of information for studies on phenological changes, tracking invasive species, and monitoring herbivore damage.  While the earlier conference included sessions on using data for niche modeling, the meeting at Yale also had presentations on how to integrate such data with other kinds of information.  Integration was definitely a major theme, and two large-scale projects are front and center in this work.  Nico Franz of Arizona State University is principal investigator in NEON, a massive NSF-funded project that includes 22 observatories collecting ecological data, including specimens, and then using that data in studies on environmental change.  Franz noted that while other projects might collect data over short periods of time, NEON plans for the long term and for building strong communities sharing and using that data.

Another large-scale project, one headed by Yale professor Walter Jetz, is called Map of Life (MOL).  Here again, integration is central to this endeavor, which invites researchers to upload their biodiversity data and also to take advantage of the wealth of data and tools available through its portal.  As the name implies, biogeography is an important focus, and users can search for distribution maps for species and create species lists for particular areas.  As with many digital projects, this one still has a long way to go in terms of living up to its name, which implies a much broader species representation than is now available.  In a session led by MOL developers, it became clear that the issue of how different kinds of data can be integrated is still extremely fraught.  Even databases for different groups of organisms, vertebrates versus invertebrates for example, are difficult to integrate because important data fields are not consistent: what is essential in one field might not be noteworthy at all in another, or might be handled in a different way.  Progress is being made, but as Roderick Page of the University of Glasgow noted, even linking to scientific literature is hardly a trivial task, to say nothing of more sophisticated linking.

While this may seem discouraging, there were also many bright points in the presentations.  The massive Global Biodiversity Information Facility (GBIF) has, as I write, 1,330,535,865 occurrence records, that is, data on specimens and observations.  Last year, GBIF launched an impressive new website, and it often adds new features.  While the tools available through GBIF are not as sophisticated as those of some other portals, it is still an incredible resource, since iDigBio data is fed into GBIF along with data from projects around the world.  For example, data from the A.C. Moore Herbarium at the University of South Carolina, Columbia, where I volunteer, which was fed into SERNEC and iDigBio, is now also available in GBIF, so researchers worldwide can access data on this collection that is particularly rich in South Carolina plants.  This was not an easy undertaking—nothing in the digital world is—and it’s important to always keep that in mind as developers have flights of fancy about what could be possible in the future.
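
Anyone curious about how that running total has changed since I wrote this can ask GBIF directly.  Here is a one-line sketch, assuming GBIF's public occurrence API at api.gbif.org:

```python
# Fetch GBIF's current occurrence-record count from its public API.
import requests

count = requests.get("https://api.gbif.org/v1/occurrence/count").json()
print(f"GBIF occurrence records right now: {count:,}")
```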

Another conference highlight for me involved the use of sophisticated neural network software, such as that coming out of the Center for Brain Science at Harvard University.  James Hanken, Professor of Zoology and Director of the Museum of Comparative Zoology at Harvard, reported on a project to scan slides of embryological sections and then use the neural network software to create 3-D reconstructions of the embryos.  Caroline Strömberg of the University of Washington discussed a project to build a 3-D index of shapes for phytoliths, microfossils from grass leaves that can be more accurate for identifying species than pollen grains.  Her lab has studied 200 species and has quantified their 3-D shapes, even printing them in 3-D to literally get a feel for them.  They used this information in a study of phytoliths from a dinosaur digestive tract suggesting that grasses are older than previously thought.  Others have questioned these results, so Strömberg’s group is now refining the identification process, measuring more points on the phytolith surface.  Reporting on another paleontological study, Rose Aubery of the University of Illinois described image analysis done with Surangi W. Punyasena on plant fossil cuticle specimens to obtain taxonomic information about ancient ecosystems.  What all these presentations had in common was the use of massive computational power to analyze 3-D images.  At the first conference, reports of 3-D imaging were impressive, but now it is the analysis that has taken center stage.  This is a good sign: all that data is proving valuable.

Digitizing Collections


The Digital Data in Biodiversity Research Conference in Ann Arbor, Michigan, was cosponsored by the University of Michigan and the iDigBio project, which deals with the digitization of natural history collections at non-government institutions in the United States. iDigBio is a 10-year project now in its sixth year. As Larry Page, its director, noted, it is designed to provide the infrastructure necessary to store and distribute the results of natural history specimen digitization efforts and also to offer training and tools to support these projects. In addition, it aims to encourage the development of a community to further this work and to ensure that these electronic resources are maintained and upgraded in the future. That is obviously a tall order, and just how tall became clearer during the two-day conference.

The first general sessions set the stage, with Maureen Kearney of the Smithsonian arguing for the importance of “liberating” data from the paper silos where they have been kept and also for including paleobiological information to provide a longer view. Pam Soltis of the Florida Museum of Natural History at the University of Florida discussed the difficulties of linking heterogeneous data, for example, information on specimens, genomics, and phylogeny. Yes, there are data sets dealing with each for many species, but the challenge is to make it all available through one portal. Issues include locating disparate data and dealing with its patchiness and format differences. There are also the vagaries of taxonomic names, and the problem of getting these systems to talk to each other. Progress is being made, particularly in the automation of some phases, such as recording label data using optical character recognition (OCR) systems, but this work takes a great deal of time and money, and it’s never finished, as maintenance is a key issue.
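
As a taste of that label-capture automation, here is a minimal sketch using the open-source Tesseract engine via pytesseract; the speakers did not name a specific system, so this stands in for whatever OCR tool a real pipeline uses, and the image filename is hypothetical.

```python
# A minimal OCR pass over a specimen label image, assuming Tesseract
# is installed along with the pytesseract and Pillow packages.
from PIL import Image
import pytesseract

sheet = Image.open("specimen_sheet.jpg")   # hypothetical scanned sheet
raw_text = pytesseract.image_to_string(sheet)
print(raw_text)
# The raw text still has to be parsed into collector, date, locality,
# and so on, and handwritten labels remain much harder than printed ones.
```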

Next came Donald Hobern, executive secretary of GBIF, the Global Biodiversity Information Facility, to which the US contributes data not only on specimens but also on species occurrences. From the GBIF portal, researchers can create species checklists for particular areas and also access data on particular taxa. The GBIF network has over 700 million georeferenced occurrence records, making it a massive resource. Organizationally, it is divided into geographic nodes, with each node responsible for inputting and maintaining its data. In the afternoon, I attended the session on the North American node, which includes contributions from Canada and the United States. There Hobern spoke again, outlining the network’s three main goals. The first is to remove obstacles to collaboration in the sharing and use of biodiversity data, in other words, to provide tools that allow for uploading and maintaining data in a usable form. Second is to organize evidence of the recorded occurrence of any species in time and space, that is, users should be able to access data on species occurrences worldwide or within a particular geographic area and timeframe. Finally, GBIF aims to support the development of a global virtual natural history collection. In one sense, this goal has already been met because there is so much data in GBIF from so many areas, but it is hardly complete in terms of extent or data richness. In order to function at such a large scale, GBIF can only provide limited information on each occurrence. However, the infrastructure that GBIF has created and is continuing to develop is a firm foundation for a richer and more robust information system in the future. An indication of this is its Science Review 2017, an annual review of the scientific articles published over the past year using GBIF data, along with a bibliography of the 438 peer-reviewed articles involved.
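
Hobern's second goal, occurrences within an area and timeframe, can already be exercised from the public API. Here is a sketch assuming the pygbif client, with an arbitrary bounding box and date range chosen purely for illustration:

```python
# A sketch of an area-and-timeframe query against GBIF, assuming the
# pygbif client.  The polygon (WKT) and years are arbitrary examples.
from pygbif import occurrences as occ

wkt = "POLYGON((-81.2 33.8, -80.8 33.8, -80.8 34.2, -81.2 34.2, -81.2 33.8))"
hits = occ.search(geometry=wkt, year="1900,2000",
                  basisOfRecord="PRESERVED_SPECIMEN", limit=300)["results"]

species = sorted({h["species"] for h in hits if "species" in h})
print(f"{len(species)} species in this box, e.g. {species[:5]}")
```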

The next speaker presented still another acronym, or really two. Gerald “Stinger” Guala of the US Geological Survey is director of both BISON (Biodiversity Information Serving Our Nation) and ITIS (Integrated Taxonomic Information System). BISON provides access to 375 million US occurrence records, including 275 million in GBIF. However, for some US records, more data are available than what’s in GBIF. Essentially, BISON is a clearinghouse for US government information on natural history collections. It cleans the data, formats it, takes quality control measures, and allows for data discovery. One of its major services is providing checklists at the local, state, and national levels; a user can draw a map around an area and get a species checklist for it. Datasets on particular areas or species are also downloadable. ITIS is more limited in scope; its aim is to provide stable nomenclature. It is linked to the Catalogue of Life, a worldwide database that publishes an annual checklist with over 1.7 million species. The biggest difficulty for the latter, as discussed by its director Tom Orrell of the Smithsonian, is how to deal with synonyms. This is a tough problem for all taxonomy and for all biodiversity projects, as noted by Stephen Garnett and Les Christidis (2017) in a recent Nature article on how “taxonomy anarchy” impedes conservation efforts. To put it simply: it’s difficult to enforce regulations on an endangered species if its name changes.
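
Since the synonym problem keeps coming up, here is a sketch of what a programmatic check might look like, assuming ITIS's public JSON web services; I have not verified every response field, so treat the field names here as assumptions rather than documented fact.

```python
# A sketch of synonym resolution against ITIS's JSON web services
# (https://www.itis.gov/ws_description.html).  Response field names
# are my assumptions about that API, not verified documentation.
import requests

BASE = "https://www.itis.gov/ITISWebService/jsonservice"

def accepted_name(name):
    """Look up a scientific name and return what ITIS currently accepts."""
    hits = requests.get(f"{BASE}/searchByScientificName",
                        params={"srchKey": name}).json().get("scientificNames")
    if not hits or hits[0] is None:
        return None                                  # name not found
    tsn = hits[0]["tsn"]
    accepted = requests.get(f"{BASE}/getAcceptedNamesFromTSN",
                            params={"tsn": tsn}).json().get("acceptedNames")
    # An empty result means the queried name is itself the accepted one.
    return accepted[0]["acceptedName"] if accepted and accepted[0] else name

print(accepted_name("Bison bison"))
```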

These presentations were followed by two about Canadian projects: James Macklin spoke on CBIF, Canada’s GBIF node, and Anne Bruneau on Canadensys, which aims to provide richer information on species than that available in GBIF. Jon Coddington of the Global Genome Biodiversity Network (GGBN) then brought up a whole different set of issues, namely those involved in storing genetic information, both sequences and specimen data. And Martin Kalfatovic, program director of the Biodiversity Heritage Library (BHL), discussed its role in providing links to relevant literature. In all, this was a mind-bending session that helped me see the differences among the many portals I have come across as I try to educate myself botanically and technologically. In the next post, I’ll discuss some even more ambitious projects that move into the 3-D realm.

Reference

Garnett, S. T., & Christidis, L. (2017). Taxonomy anarchy hampers conservation. Nature, 546(7656), 25–27. https://doi.org/10.1038/546025a