Outreach: Community Science

Homepage for Notes from Nature

In the last 15 years there has been nothing short of an explosion in community science projects in the herbarium world.  Terminology can be tricky here.  What is now called “community science” is also known by the somewhat more familiar “citizen science.”  In many cases, both terms also refer to plain old volunteering.  In the past, being a volunteer in an herbarium often meant mounting specimens, but with the dawn of the digital age, imaging sheets and inputting label data have become crucial activities.  While there may be hundreds or even thousands of specimens to mount, there are millions of specimens requiring digitization. 

When I applied to volunteer at New York Botanical Garden about ten years ago, they had their quota of professional and volunteer mounters, but they always needed more digitizers.  I began by learning to photograph specimens and eventually moved on to label transcription.  That was when the garden was just getting its toes wet with more formal community science projects, and the woman I worked with was writing a guide to digitization for new volunteers.  This led to the garden’s participation in the Notes from Nature initiative, part of Zooniverse, a platform for projects pairing volunteers with professional researchers.  Within Zooniverse, Notes from Nature projects are designed to transcribe natural history records; to date the result has been over a million and a half completed records.

There are other such endeavors, including the Australia-based DigiVol, DoeDat in Belgium, Les Herbonautes in France, and Herbaria at Home in Britain.  In the United States, a major spur to community science in natural history is the NSF-sponsored WeDigBio, which organizes events such as transcription days where herbaria invite the public to help digitize specimens.  These events have had a major impact on herbarium work because some volunteers become so interested that they continue to contribute long after the event.  Many colleges and universities participate, with some students becoming regular volunteers or even student workers.

Young people are particularly important in community science programs because through these activities they may develop a life-long interest in the natural world.  The other large community science population includes those at the other end of the age spectrum who are looking for interesting endeavors after retirement.  Many in this category have always had an interest in nature, and these projects give them a new way to learn more about the living world—and maybe pick up some computer skills as well.  They also come with a skill that younger volunteers often lack:  the ability to read the cursive handwriting on many older specimen labels.  These are the very specimens of particular interest now, since they are used in longitudinal studies of global warming and environmental change more generally.

As to computer skills, the latest projects sometimes involve AI, and I want to focus here on an article I read recently about community science participation in a machine learning study (Guralnick et al., 2024).  Over the past ten years, millions of specimens have been imaged, but label transcription lags behind because of difficulties in using optical character recognition (OCR) in this process.  It is not just a matter of transcribing the words and numbers accurately; it also involves inserting them into the correct fields, that is, being able to put the species name in one field and the collector name in another.  It gets even trickier with associated taxa and locality information.
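To make the field-assignment problem concrete, here is a minimal sketch of the kind of parsing a transcription pipeline has to do.  The label text, field names, and pattern-matching rules below are all invented for illustration; they are far simpler than anything in the actual study, which works with real OCR output.

```python
import re

# Hypothetical raw text from a specimen label (invented for illustration;
# real labels are far messier, and handwriting defeats OCR entirely).
raw = """Carex pensylvanica Lam.
Dry oak woods, associated with Quercus alba.
Dutchess County, New York
Coll. J. Smith  No. 1234   12 May 1927"""

def parse_label(text):
    """Naively sort the lines of a transcribed label into guessed fields."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    record = {"scientificName": lines[0]}  # assume the first line is the name
    for ln in lines[1:]:
        m = re.search(r"Coll\.\s*(.+?)\s+No\.\s*(\d+)", ln)
        if m:  # a "Coll. ... No. ..." line holds collector and number
            record["recordedBy"] = m.group(1)
            record["recordNumber"] = m.group(2)
        elif "associated" in ln.lower():
            record["associatedTaxa"] = ln
        else:
            record.setdefault("locality", ln)
    return record
```

Even this toy version shows why the task is hard: a single misread character or an unexpected label layout sends text into the wrong field, which is exactly where human checking comes in.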

AI is much touted, but making AI happen is hardly easy.  Computers have to be trained, and that takes a lot of human input.  This report describes two human-in-the-loop processes, with the humans working in a Notes from Nature framework designed especially for the project.  In the first, volunteers worked on a training set for a program that recognizes where labels are positioned on a sheet as well as the difference between typed and cursive labels.  The trainers drew boxes around the labels and then input whether each was typed, handwritten, or a combination of both.  Results from the initial training set led to refinements, and the second set achieved a much higher success rate: 95% correct when tested on specimens.  The other program entailed developing an OCR pipeline that prepares the label data for machine reading and puts it through an OCR tool; the output is then further refined and corrected.  To do this work, a test set of labels was needed that itself didn’t contain common errors like misspellings or extra spaces.  Two volunteers transcribed the label information and then crosschecked each other’s work.
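That crosschecking step can be pictured as a simple field-by-field comparison of the two volunteers’ transcriptions, flagging any disagreement for review.  This toy function and the sample records are my own illustration, not the project’s actual workflow.

```python
def crosscheck(t1, t2):
    """Compare two independent transcriptions field by field and return
    the fields on which the transcribers disagree, for human review."""
    return {k: (t1.get(k), t2.get(k))
            for k in set(t1) | set(t2)
            if t1.get(k) != t2.get(k)}

# Invented sample records: the second contains an OCR-style confusion
# ("rn" read as "m", here reversed) that the crosscheck catches.
a = {"scientificName": "Carex pensylvanica", "recordedBy": "J. Smith"}
b = {"scientificName": "Carex pensylvanica", "recordedBy": "J. Srnith"}
```

Only records on which both transcribers independently agree would enter the error-free test set; everything flagged goes back to a human.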

Based on the results with this training set, further refinements were made in the OCR system to improve its output.  For still further improvement, the output was fed into an OCR correction tool in a Notes from Nature project.  Volunteers compared the label information with a box containing the OCR output, which they corrected as needed.  The results of this work were then used to further refine the system.  The many tasks involved in creating and using this system give some idea of just how difficult it is to employ AI with very heterogeneous inputs.  It also suggests how much time and human involvement is entailed, and why, in a field like natural history collections, where financial resources are so minimal, community science is so important.
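One simple way to gauge how much correction the OCR output still needs, and so whether a refinement actually helped, is a string-similarity check between the raw output and the volunteer-corrected text.  This function is only a sketch of the idea, with an arbitrary threshold; the study measured its system’s performance in its own terms.

```python
import difflib

def correction_needed(ocr_text, corrected_text, threshold=0.9):
    """Return True if the OCR output differs substantially from the
    volunteer-corrected version (similarity ratio below the threshold)."""
    ratio = difflib.SequenceMatcher(None, ocr_text, corrected_text).ratio()
    return ratio < threshold
```

Tracking the fraction of labels that still trip this check before and after a refinement gives a rough, automatable measure of progress between rounds of human correction.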

Reference

Guralnick, R., LaFrance, R., Denslow, M., Blickhan, S., Bouslog, M., Miller, S., Yost, J., Best, J., Paul, D. L., Ellwood, E., Gilbert, E., & Allen, J. (2024). Humans in the loop: Community science and machine learning synergies for overcoming herbarium digitization bottlenecks. Applications in Plant Sciences, 12(1), e11560. https://doi.org/10.1002/aps3.11560
