Digitization: A Boost to Circulation

In the last two (1,2) posts, I’ve discussed how herbarium specimens have circulated since they were first created, and also how sometimes specimens get stuck in a limbo of uncurated collections.  Now I want to discuss how circulation has changed thanks to the massive digitization projects of the 21st century.  This is a familiar story to those in the herbarium world, but I’ll quickly review it for those who aren’t lucky enough to hang around herbaria.  The upshot of digitization is that now everyone can hang around them, at least virtually.

Digitization is very much tied to the development of computer technologies, but also to globalization that has brought an awareness that the planet we live on is a shared asset and a shared responsibility.  Over the years there have been a number of international conferences and agreements that articulated this vision and made it actionable.  The 1993 international  Convention on Biological Diversity gave each nation sovereignty over its biological wealth, which implies knowledge of that wealth.  This led to the 2002 Global Strategy for Plant Conservation with later updates and goals including a global flora of all known plants with online access, the best way to make the information widely available.  While this goal has yet to be met in full, there have been significant advances toward it.  

In the early 21st century, the Andrew W. Mellon Foundation spearheaded the digitization of type specimens so that researchers around the world could access the plants that were used in describing species.  Because of the way many specimens circulated (collected in the species-rich tropics and transported to botanist-rich Europe and North America), researchers in developing nations did not have ready access to these materials.  Botanical literature was also relatively unavailable, so the project digitized many publications of the past as well.  This project morphed into the portal JSTOR Global Plants and also set the stage for other large-scale digitization projects such as ADBC (Advancing Digitization of Biodiversity Collections) in the United States with its massive iDigBio portal and what ultimately became DiSSCo (Distributed System of Scientific Collections) in Europe.  Meanwhile GBIF (Global Biodiversity Information Facility) has aggregated data from these projects and others worldwide to create the largest portal for natural history collections along with observational data.

While an amazing achievement, digitization has not totally solved access problems for those in developing nations.  They often do not have the hardware, software, and internet connections to make good use of these resources.  Still, digitization has broadened availability in other ways.  It was difficult for those not involved in botanical research to visit herbaria, if for no other reason than specimens’ fragility; each use opens the possibility of damage.  This is not a problem with a digital collection, so students can study specimens on the web as can curious gardeners and artists looking for new forms of inspiration, leading to greater plant awareness, a positive counter to what has been called plant blindness. 

Digital collections have already had a major impact on the ways specimens are used in research (Heberling et al., 2021).  For phenological work, botanists can now search GBIF for a particular species and, by checking each specimen’s flowering or fruiting status against the date it was collected, see whether there is a pattern of change in the dates over a period of 100 years or more.  They may check hundreds or even thousands of specimens, something that wouldn’t be practical with physical examination.  Niche or species distribution modeling, determining areas that might provide suitable habitat for a species based on what is known about its range, is another area where digital specimens are pivotal:  geographic coordinate data on where plants were collected are used to create a model of the environmental conditions that meet a species’ habitat requirements.  This research is helpful in identifying possible collection areas and also where a species might be able to grow as the climate changes.
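For readers who like to see an idea in code, here is a minimal sketch of the phenological comparison described above: fit a straight line to flowering day-of-year against collection year and read off the slope.  Everything in it is an assumption for illustration.  The function name `flowering_trend`, the record layout, and the three sample dates are made up, and a real study would pull records from a portal and use more careful statistics than a simple least-squares line.

```python
from datetime import date


def flowering_trend(records):
    """Least-squares slope of flowering day-of-year against collection year.

    `records` is an iterable of (collection_date, is_flowering) pairs
    (a hypothetical layout); only specimens marked as flowering are used.
    Returns the slope in days per year; a negative value means flowering
    specimens tend to be collected earlier in the year as time goes on.
    """
    points = [(d.year, d.timetuple().tm_yday) for d, flowering in records if flowering]
    if len(points) < 2:
        raise ValueError("need at least two flowering specimens")
    n = len(points)
    mean_year = sum(x for x, _ in points) / n
    mean_day = sum(y for _, y in points) / n
    sxx = sum((x - mean_year) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("specimens must span more than one year")
    sxy = sum((x - mean_year) * (y - mean_day) for x, y in points)
    return sxy / sxx


# Made-up example records, not real specimen data.
specimens = [
    (date(1901, 5, 20), True),
    (date(1951, 5, 15), True),
    (date(2001, 5, 10), True),
    (date(1975, 8, 1), False),  # in fruit, so ignored here
]
slope = flowering_trend(specimens)
print(f"{slope:.2f} days per year")
```

With these invented dates the line slopes downward by about a tenth of a day per year, that is, roughly ten days earlier over a century.  The point is only to show the shape of the calculation; the biology lies in deciding which specimens count as "flowering" and in accounting for collector bias.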

There’s also an increase in the use of artificial intelligence (AI) tools to recognize traits like leaf shape and even to identify species.  This work requires a great deal of computer power and sophisticated neural networking techniques, so it’s costly in both technology and human input.  However, the field is advancing rapidly in exciting ways.  Botanists foresee being able to rapidly analyze large numbers of specimens and at least sort them into families or genera, if not species.  At the moment, though, even the identification of leaf shapes is still in its infancy.  When deep learning AI techniques are tested in identifying specimens, this is done on carefully selected specimen sets.  Still, the increasing frequency with which AI projects are presented at conferences on digital specimens suggests that these tools will soon become widely used in biodiversity research.

I should add that there are obviously many research areas where digital specimens cannot possibly replace the real thing.  There is no DNA in a data file.  Specimens have proved to be goldmines for those working on plant genetics.  As sequencing techniques become more sophisticated, even the rather short degraded DNA fragments found in specimens, hundreds if not thousands of years old, can provide substantial information on a plant’s relationship to other species.  But this isn’t the only reason why physical specimens need to be retained.  They can give clues on chemical changes in plants under siege from herbivores (Zangerl & Berenbaum, 2005), and more than one entomologist has found new insect species hidden away on plant specimens (Whitehead, 1976).  Each specimen is unique:  a particular plant collected at a particular place and time, and therefore irreplaceable.

References

Heberling, J. M., Miller, J. T., Noesgaard, D., Weingart, S. B., & Schigel, D. (2021). Data integration enables global biodiversity synthesis. Proceedings of the National Academy of Sciences, 118(6). https://doi.org/10.1073/pnas.2018093118

Whitehead, D. R. (1976). Collecting Beetles in Exotic Places: The Herbarium. The Coleopterists Bulletin, 30(3), 249–250.

Zangerl, A., & Berenbaum, M. (2005). Increase in toxicity of an invasive weed after reassociation with its coevolved herbivore. Proceedings of the National Academy of Sciences, 102(43), 15529–15532. https://doi.org/10.1073/pnas.0507805102

On the Road, Learning about Herbaria: Education and Citizen Science

BLUE Port: Biodiversity Literacy in Undergraduate Education

In the last post, I described sessions I attended at the Digital Data Biodiversity Research Conference at Yale University.  Besides presentations on portals that integrate various kinds of data and on projects to create and analyze 3-D images of specimens, there was an emphasis on education.  Now that so much specimen data and other biodiversity information is available digitally, one of the major goals of iDigBio, the National Resource for Advancing Digitization of Biodiversity Collections (ADBC) funded by the National Science Foundation, is to have this data used widely.  This requires education, both of the present research community and of its future members.  For several years, iDigBio has been holding workshops and conferences, like the one at Yale.  These have resulted in a major upswing in the number of studies and publications employing biodiversity data.  Now that many professionals are trained in how to access and analyze the available information, it’s time to leverage this knowledge.  The task is to help these experts teach the next generation.

As every teacher realizes, knowing something is very different from teaching about it.  The subject matter has to be analyzed and organized; ways into the basics have to be found; a learning structure has to be created.  For many years, I was involved with the BioQUEST Curriculum Consortium and attended a number of workshops dealing with using genomic data in teaching genetics and bioinformatics.  The portals for gene sequence data are extremely powerful, but they were built for researchers who committed a great deal of time to learning to use them effectively.  Teachers, and even more so students, do not have the time, the technical support, nor the expertise to make effective use of these portals.  That’s where BioQUEST and other initiatives came into play.  At the workshops I attended, we learned enough about the available resources to “tame” them, to download data and present it to students in a way they could understand and use.  We became part of an education community committed to bringing students into the genetic sequencing research space in a way that would make sense for them.

Now the same kinds of initiatives are being developed for biodiversity research using powerful tools like iDigBio, GBIF, NEON, and MOL discussed at the conference (see last post).  Anna Monfils of Central Michigan University is the principal investigator for an NSF-funded project called BLUE: Biodiversity Literacy in Undergraduate Education that includes participation from BioQUEST.  Monfils and members of her team led a lively session at the conference on the question of what biodiversity literacy means and how to achieve it.  As the conversation developed, it became clear that these are not easy issues to resolve.  However, the BLUE project is a great first step in defining what a biology student needs to have in terms of conceptual understanding and technical skill to tackle the vast ocean of biodiversity data now available to them.  What didn’t arise as strongly is an issue that is dear to my heart:  how do you make biodiversity data understandable and accessible to students who are not majoring in biology or environmental science?  One of iDigBio’s aims has been to broaden the community of biodiversity data users, and non-scientists make up a huge audience.  Taming data for them is very different than for those interested in science, but everyone encounters organisms in their lives every day, so why not make it easier to learn more about them?

One way into such learning is through an area that has burgeoned in the last few years and that had a larger presence at the conference than in the past:  citizen science.  The field has many different aspects from political advocacy to volunteer data entry.  Examples of the latter include the development of portals such as Notes from Nature, where many institutions with natural history collections post well-defined projects such as digitizing specimen data.  The Smithsonian has an online transcription center where notebooks, journals, and letters are posted.  All these sites have sophisticated digital architectures that allow data managers to have confidence in the input, such as by having the same data entered by more than one user and then compared.  Many of those involved have commented on how fast the projects are completed.  Sometimes thousands of individuals participate, with a number being very committed and doing a great deal of data input.  In cases like this, citizen science is another name for unpaid help or volunteering.  With an increasing number of retirees looking for something interesting to do, these projects are very attractive because there is no commute involved and fascinating things to learn.
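The double-entry check mentioned above, where the same label is keyed by more than one volunteer and the results are compared, can be sketched in a few lines.  This is only an illustration of the general idea and not a description of how Notes from Nature or the Smithsonian transcription center actually reconcile entries.  The function name `consensus`, the `min_agree` threshold, and the sample strings are all assumptions.

```python
from collections import Counter


def consensus(transcriptions, min_agree=2):
    """Return the agreed value for one label field, or None if there is no agreement.

    Entries are compared after collapsing whitespace and ignoring case, so
    trivial typing differences do not count as disagreement. A value is
    accepted only if at least `min_agree` volunteers typed it and no other
    value is tied with it.
    """
    normalized = [" ".join(t.split()).casefold() for t in transcriptions if t.strip()]
    if not normalized:
        return None
    counts = Counter(normalized).most_common(2)
    top_value, top_count = counts[0]
    if top_count < min_agree:
        return None  # nobody agrees yet; send the label to another volunteer
    if len(counts) > 1 and counts[1][1] == top_count:
        return None  # tie between two readings; needs a human reviewer
    return top_value


# Made-up volunteer entries for a single label field.
entries = ["Platanthera psycodes", "platanthera  psycodes ", "Platanthera psychodes"]
print(consensus(entries))
```

Here two of the three entries match once spacing and capitalization are ignored, so that reading wins and the third is treated as a typo.  Anything without a clear majority comes back as `None`, which a data manager could route to a further volunteer or to a curator.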

Still another type of citizen science work is done by those who use portals such as iNaturalist to record field observations and phenological information.  These data ultimately are uploaded into GBIF, a global biodiversity portal, and the citizen science input has grown to the point where it is having a significant impact on biodiversity research.  Walter Jetz of Yale University, principal investigator for the Map of Life (MOL) project, commented on the importance of citizen science several times in his presentation.  Not surprisingly, this is particularly true in ornithological research where amateurs have always been especially welcomed by the scientific community.

On the Road, Learning about Herbaria: Digitization

iDigBio Portal

I recently went north, to Yale University, for the third annual Digital Data Biodiversity Research Conference, sponsored by iDigBio, the NSF-sponsored project to digitize natural history specimens.  I attended the first of these conferences two years ago at the University of Michigan (see earlier post).  Both were fascinating and informative, but also different from each other, in that the focus of attention in this field has moved beyond digitizing collections to using digitized collections.  This seems a healthy trend, but as Katherine LeVan of the National Ecological Observatory Network (NEON) mentioned, only 6% of insect collections have been even partially digitized, and Anna Monfils of Central Michigan University noted that iDigBio has information from 624 of 1600 natural history collections in the United States.  Admittedly, it’s mostly small collections that aren’t represented, but Monfils went on to show that smaller collections hold larger than expected numbers of local specimens, providing finer grained information on biodiversity.

Despite the caveat about coverage, the results of the NSF funding are impressive and are leading to an explosion in the use of this data.  It is difficult to keep up with the number of publications employing herbarium specimens as sources of information for studies on phenological changes, tracking invasive species, and monitoring herbivore damage.  While the earlier conference included sessions on using data for niche modeling, the meeting at Yale also had presentations on how to integrate such data with other kinds of information.  Integration was definitely a major theme, and two large-scale projects are front and center in this work.  Nico Franz of Arizona State University is principal investigator in NEON, a massive NSF-funded project that includes 22 observatories collecting ecological data, including specimens, and then using that data in studies on environmental change.  Franz noted that while other projects might collect data over short periods of time, NEON plans for the long term and for building strong communities sharing and using that data.

Another large-scale project, one headed by Yale professor Walter Jetz, is called Map of Life (MOL).  Here again, integration is central to this endeavor that invites researchers to upload their biodiversity data and also to take advantage of the wealth of data and tools available through its portal.  As the name implies, biogeography is an important focus, and users can search for distribution maps for species and create species lists for particular areas.  As with many digital projects, this one still has a long way to go in terms of living up to its name, which implies a much broader species representation than is now available.  In a session led by MOL developers, it became clear that the issue of how different kinds of data can be integrated is still extremely fraught.  Even databases for different groups of organisms, vertebrates versus invertebrates for example, are difficult to integrate because important data fields are not consistent:  what is essential in one field might not be noteworthy at all in another or might be handled in a different way.  Progress is being made, but as Roderick Page of the University of Glasgow notes, even linking to scientific literature is hardly a trivial task, to say nothing of more sophisticated linking.

While this may seem discouraging, there were also many bright points in the presentations.  The massive Global Biodiversity Information Facility (GBIF) has, as I write, 1,330,535,865 occurrence records, that is, data on specimens and observations.  Last year, GBIF launched an impressive new website and often adds new features.  While the tools available through GBIF are not as sophisticated as with some other portals, it is still an incredible resource since iDigBio data is fed into GBIF as well as data from projects around the world.  For example, data from the University of South Carolina, Columbia A.C. Moore Herbarium where I volunteer, which was fed into SERNEC and iDigBio, is now also available in GBIF, so researchers worldwide can access data on this collection that is particularly rich in South Carolina plants.  This was not an easy undertaking (nothing in the digital world is), and it’s important to always keep that in mind as developers have flights of fancy about what could be possible in the future.

Another conference highlight for me involved the use of sophisticated neural network software, such as that coming out of the Center for Brain Science at Harvard University.  James Hanken, Professor of Zoology and Director of the Museum of Comparative Zoology at Harvard, reported on a project to scan slides of embryological sections and then use the neural network software to create 3-D reconstructions of the embryos.  Caroline Strömberg of the University of Washington discussed a project to build a 3-D index of shapes for phytoliths, microfossils from grass leaves that can be more accurate for identifying species than pollen grains.  Her lab has studied 200 species and has quantified 3-D shapes, even printing them in 3-D to literally get a feel for them.  They used this information in a study of phytoliths from a dinosaur digestive tract suggesting that grasses are older than previously thought.  Others have questioned these results, so Strömberg’s group is now refining the identification process, measuring more points on the phytolith surface.  Reporting on another paleontological study, Rose Aubery of the University of Illinois described image analysis done with Surangi W. Punyasena on plant fossil cuticle specimens to obtain taxonomic information about ancient ecosystems.  What all these presentations had in common was the use of massive computational power to analyze 3-D images.  At the first conference, reports of 3-D imaging were impressive, but now it is the analysis that has taken center stage.  This is a good sign:  all that data is proving valuable.

Herbaria and More


Platanthera psycodes collected in 1838, University of Michigan Herbarium

Since the cosponsor of the Digital Data in Biodiversity Research Conference (see last post), along with iDigBio, was the University of Michigan, it’s not surprising that there was a trip to its Research Museums Center south of the main campus. Along with a reception, there were tours of the various zoological, paleontological, archaeological, and botanical collections housed there. Naturally, I went on the herbarium tour offered by three botanists who work with the collection: Christopher Dick the director, Richard Rabeler collection manager, and Anton Reznicek curator. As with many large plant collections, the UM staff can only estimate its size: about 1.8 million specimens. Digitization efforts are making for more accurate estimates, but also unearthing more specimens. The herbarium is actively involved in iDigBio and the national digitization effort with more than 560,000 of its sheets digitized. Since the Research Museums Center is a converted warehouse, the herbarium has room to grow, a valuable resource for the future. The herbarium is strong in Michigan plants, including collections from 1837-1838 for a survey of natural resources made at the time Michigan gained statehood (see figure above). Many of these have habitat information, making them valuable in environmental change studies. There is also a large collection amassed by Harley Harris Bartlett, who led the UM Department of Botany from 1927-1944. He used his considerable wealth to fund explorations in the tropics, and so there are a significant number of specimens from these areas, including many wood specimens from Sumatra that are now being digitized. They are particularly important because of the dramatic changes logging has wrought in Sumatra and also because the labels include the names of the trees recorded in the indigenous language.

On my way home from Michigan, I took a rather roundabout route so I could visit the Cornell University herbarium. I wanted to go there primarily because Cornell was the long-time home campus for the botanist/horticulturalist/agronomist, Liberty Hyde Bailey (1858-1954). Bailey had incredible energy and drive during his entire life and became the first dean of Cornell’s agricultural college (Dorf, 1956). He served on Theodore Roosevelt’s National Commission on Country Life which recommended the formation of the 4-H movement, agricultural extension programs, and rural electrification. Bailey retired from the deanship in 1913 at age 55 and spent the next 35 years writing on agricultural and horticultural topics as well as studying the taxonomy of palms on which he published extensively.

As the herbarium’s collections manager Anna Stalter explained, Bailey left his extensive herbarium and book collection to Cornell, which explains why a third of the specimens are cultivated plants. This makes it interesting horticulturally, and associated materials increase its value. Bailey collected seed and plant catalogues from the late nineteenth century through the first half of the twentieth. The herbarium librarian Peter Fraissinet pulled out a selection of these fascinating historical documents. He also showed me an extensive card catalogue maintained for almost 60 years by Bailey’s daughter and assistant Ethel Zoe Bailey. There was a card for each plant variety, noting the catalogue and years it was offered. Researchers interested in heirloom plants and plant lineages still consult it. Fraissinet also showed me some rare volumes Bailey had collected, including the oldest book in the library, an Italian translation from 1575 of Nicolas Monardes’s work on Mexican plants that includes the first known image of tobacco (see figure below). I’d read about this treasure, but it was a thrill to see it as well as a German translation of Pietro Matthioli’s commentaries from 1678.


Tobacco plant pictured in Monardes’s 16th-century work on Mexican plants. Bailey Hortorium Library, Cornell University

Obviously one day was not enough time to even glance at most Cornell botanical treasures, but I did get to see a few of the massive palm pods Bailey collected and was also introduced to a totally different aspect of botany at the Cornell herbarium, its plant anatomy slide collection. Curator Kevin Nixon and senior research associate Maria Gandolfo are heading the NSF-funded project called CUPAC: Cornell University Plant Anatomy Collection. The goal is to digitize the information on 200,000 slides and to image a significant portion of them, at least one from each set of serial sections, often using more than one power of magnification. They also hope to include slides from other institutions’ collections as a way to preserve and make broadly accessible a valuable research tool for future botanists. There are already many images available on their website, and in the future they hope to link the records to the relevant literature. So this is yet another government-funded digital asset available to all researchers, and also I might add, to artists as well since many of the images are stunning and include not only slides but peels of fossil plant structures.

When I left the herbarium I walked through the Cornell Botanic Gardens where living collections complement the horticultural specimens in the herbarium. It’s wonderful to have the two resources so close to each other. And close by is the plant pathology herbarium, still another treasure, but one I had to leave for the future. As my father always said on road trips: “You have to leave something for next time.” On this trip, I had seen a lot, from living plant collections, to personal collections representing place (see post), to herbaria, and the digital future (see 1, 2). I can’t wait to get on the road again.

Reference

Dorf, P. (1956). Liberty Hyde Bailey: An Informal Biography. Ithaca, NY: Cornell University Press.

Digitizing Collections


The Digital Data in Biodiversity Research Conference at Ann Arbor, Michigan was cosponsored by the University of Michigan and the iDigBio project, which deals with the digitization of natural history collections at non-government institutions in the United States. iDigBio is a 10-year project now in its sixth year. As Larry Page its director noted, it is designed to provide the infrastructure necessary to store and distribute the results of natural history specimen digitization efforts and also offer training and tools to support these projects. In addition, it aims to encourage development of a community to further this work and to ensure that these electronic resources are maintained and upgraded in the future. That is obviously a tall order, and just how tall became clearer during the two-day conference.

The first general sessions set the stage with Maureen Kearney of the Smithsonian arguing for the importance of “liberating” data from the paper silos where they have been kept and also for including paleobiological information to provide a longer view. Pam Soltis of the Florida Museum of Natural History at the University of Florida discussed the difficulties of linking heterogeneous data, for example, information on specimens, genomics, and phylogeny. Yes, there are data sets dealing with each for many species, but the challenge is to make it all available through one portal. Issues include locating disparate data and dealing with its patchiness and with format differences. There are also vagaries of taxonomic names and of finding ways to get these systems to talk to each other. Progress is being made, particularly in the automation of some phases, such as recording label data using optical recognition systems, but this work takes a great deal of time and money, and it’s never finished, as maintenance is a key issue.

Next came Donald Hobern, executive secretary of GBIF, the Global Biodiversity Information Facility to which the US contributes data in the form of information not only on specimens but on species occurrences. From the GBIF portal, researchers can create species checklists for particular areas and also access data on particular taxa. The GBIF network has over 700 million georeferenced occurrence records, making it a massive resource. Organizationally, it is divided into geographic nodes, with each node responsible for inputting and maintaining its data. In the afternoon, I attended the session on the North American node, which includes contributions from Canada and the United States. There Hobern spoke again, outlining the network’s three main goals. The first is to remove obstacles to collaboration in the sharing and use of biodiversity data, in other words, to provide tools that allow for uploading and maintaining data in a usable form. Second is to organize evidence of recorded occurrence of any species in time and space, that is, users should be able to access data on species occurrences worldwide or within a particular geographic area and timeframe. Finally, GBIF aims to support the development of a global virtual natural history collection. In one sense, this goal has already been met because there is so much data in GBIF from so many areas, but it is hardly complete in terms of extent or data richness. In order to function at such a large scale, GBIF can only provide limited information on each occurrence. However, the infrastructure that GBIF has created and is continuing to develop is a firm foundation for a richer and more robust information system in the future. An indication of this is in Science Review 2017, its annual review of the scientific articles published over the past year using GBIF data. Along with this is a bibliography of these 438 peer-reviewed articles.

The next speaker presented still another acronym, or really two. Gerald “Stinger” Guala of the US Geological Survey is director of both BISON (Biodiversity Information Serving Our Nation) and ITIS (Integrated Taxonomic Information System). BISON provides access to 375 million US occurrence records, including 275 million in GBIF. However, for some US records, BISON holds more data than what’s in GBIF. Essentially, BISON is a clearinghouse for US government information on natural history collections. It cleans the data, formats it, takes quality control measures, and allows for data discovery. One of its major services is providing checklists at the local, state and national levels; a user can draw a map around an area and get a species checklist for it. Datasets on particular areas or species are also downloadable. ITIS is more limited in scope; its aim is to provide stable nomenclature. It is linked to the Catalogue of Life, a worldwide database that publishes an annual checklist with over 1.7 million species. The biggest difficulty for the latter, as discussed by its director Tom Orrell of the Smithsonian, is how to deal with synonyms. This is a tough problem for all taxonomy and for all biodiversity projects, as noted by Stephen Garnett and Les Christidis (2017) in a recent Nature article on how “taxonomic anarchy” impedes conservation efforts. To put it simply: it’s difficult to enforce regulations on an endangered species if its name changes.
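To see why synonyms matter so much for a checklist, consider a minimal sketch of the bookkeeping involved. It is not how ITIS or the Catalogue of Life work internally; the function `group_by_accepted_name`, the lookup table, and the names "Aus bus", "Xus bus", and "Aus cus" are placeholders invented for the example.

```python
def group_by_accepted_name(records, synonyms):
    """Group occurrence records under their accepted names.

    `records` is an iterable of (name_on_label, locality) pairs.
    `synonyms` maps a synonym to its accepted name (a hypothetical,
    hand-made table); names missing from it are treated as accepted.
    """
    grouped = {}
    for name, locality in records:
        accepted = synonyms.get(name, name)
        grouped.setdefault(accepted, []).append(locality)
    return grouped


# Placeholder names and localities, not real taxa or real records.
synonym_table = {"Xus bus": "Aus bus"}
occurrences = [
    ("Aus bus", "site 1"),
    ("Xus bus", "site 2"),  # older label using the synonym
    ("Aus cus", "site 3"),
]
checklist = group_by_accepted_name(occurrences, synonym_table)
print(sorted(checklist))
```

Without the lookup table, the two labels for the same plant would show up as two species with one record each. With it, they are counted together. The hard part in practice is the table itself, since taxonomists do not always agree on which name is the accepted one.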

These presentations were followed by two about Canadian projects; James Macklin spoke on CBIF, Canada’s GBIF node, and Anne Bruneau on Canadensys, which aims to provide richer information on species than that available in GBIF. Jon Coddington of the Global Genome Biodiversity Network (GGBN) then brought up a whole different set of issues, namely those involved in storing genetic information, both sequences and specimen data. And Martin Kalfatovic the program director of the Biodiversity Heritage Library (BHL) discussed its role in providing links to relevant literature. In all, this was a mind-bending session that helped me see the differences among the many portals I have come across as I try to educate myself botanically and technologically. In the next post, I’ll discuss some even more ambitious projects that move into the 3D realm.

Reference

Garnett, S. T., & Christidis, L. (2017). Taxonomy anarchy hampers conservation. Nature, 546, 25.