Specimen Labels: Digitization


The two specimens of Xanthisma spinulosum on the right are from the 1874 Custer Expedition to the Black Hills [NYBG Herbarium]

I’ve become somewhat familiar with herbarium sheets through my visits to herbaria in several countries. However, I’ve seen the most sheets at the New York Botanical Garden (NYBG), where I volunteer for their digitization efforts. I’ve photographed specimens, transcribed labels, and created skeleton records for newly mounted specimens before they are photographed. Each process has taught me something different. I’ve photographed thousands of Asteraceae specimens, most from the western United States and collected over the past 150 years. As I put each sheet into the lightbox, I glanced at the label and got to know the names of collectors such as John Merle Coulter, Bassett Maguire, and Per Axel Rydberg. Some of these names were associated with NYBG; others were tied to Western collections, but trades brought these plants East, as did NYBG’s acquisition of “orphan collections” when other institutions decided to rid themselves of space-taking herbaria. That’s how three Wabash College specimens of Xanthisma spinulosum (on the same sheet) ended up at NYBG, two from a General Custer expedition to the Black Hills and another from the California collectors Sara and John Lemmon.

When I input label information from 3700 Arnica labels at the garden, I got to do more than glance at labels. They provided me an introduction to people whom I then looked up on the web: Connie and John Taylor of Oklahoma State University, whose labels included the names of their three children as collectors; Aven Nelson, one of the important early botanists of Wyoming; Marcus Jones, a controversial plant collector and mining engineer; and Rupert Barneby, a long-time botanist at NYBG. Each had their own style of labeling, in part determined by the age in which they lived and what was expected in terms of essential information. Transcription is labor intensive, and in order to get as many labels as possible into the database, where researchers can access them, not all information is recorded. For example, the location is recorded, but data on the habitat, such as nearby plants, are not. The rationale is that searches are usually by locale, so the habitat information is less likely to be investigated and can be added later.

There is a way to transcribe labels without human input, namely with optical character recognition (OCR) software, but this doesn’t work well, to say the least, with cursive handwriting. Even for typed labels, OCR cannot be relied upon to place all the data into correct fields. Inputting label information doesn’t just mean typing it out; it must be typed into the proper areas of the software program, the correct “fields,” so it can be accessed: collector name in one field, location in another, date of collection in a third, with each in proper format. OCR software, while it can “learn” to identify certain types of information, is hardly infallible, and data entries created this way need to be checked by a human. I’ve found that the OCR input needs so much editing that it is easier to type the information directly from the label. As more data becomes available electronically, and therefore is relied upon more by researchers, accuracy becomes more and more crucial. It is nice to be able to find out what specimens were collected in a particular area at a particular time without sifting through thousands of sheets, but only if the data being searched is reliable.
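To make the idea of “fields” and “proper format” concrete, here is a minimal Python sketch of what a structured label record might look like. It is only an illustration: the field names, the `parse_label` helper, and the choice of a YYYY-MM-DD date format are my own assumptions, not the schema or software used at NYBG.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class LabelRecord:
    # Hypothetical field names, chosen for illustration only.
    collector: str
    locality: str
    collected_on: date


def parse_label(collector: str, locality: str, date_text: str) -> LabelRecord:
    """Build a record, rejecting empty fields and dates not in YYYY-MM-DD form."""
    collector = collector.strip()
    locality = locality.strip()
    if not collector or not locality:
        raise ValueError("collector and locality are required")
    # strptime raises ValueError if the text does not match the assumed format.
    collected_on = datetime.strptime(date_text.strip(), "%Y-%m-%d").date()
    return LabelRecord(collector, locality, collected_on)
```

A made-up call such as `parse_label("A. Collector", "Example County", "1900-06-15")` returns a record with each value in its own field, while a date typed as `"June 1900"` is rejected. That kind of check is easy for software to make; deciding which words on a label belong in which field is the part that still needs a human.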

Now at NYBG I am creating records for newly mounted specimens and seeing a much wider variety of label styles from all over the world, with labels written in Spanish, Portuguese, and French, yet largely decipherable because I know what is expected to be on a label and because the Latin name is always an anchor. Going forward, the aim is that all specimens in the NYBG collection will be photographed and the label information digitized—the ultimate goal for all collections in the US and in other countries as well. This will make the information in herbaria and other natural history collections more widely available and accessible for uses that were not even anticipated when many of these specimens were collected.


Arnica

As a volunteer at the New York Botanical Garden, I have been spending three hours a week inputting label data from herbarium specimens that have already been imaged.  I call up the record and image from a database, and then input the data into the relevant fields, including name of collector, date of collection, and location.  This seems relatively straightforward, and often it is, especially now that I’ve been doing it for some time.  This is definitely volunteer-level work; little expertise is required.  If I get to a label where the handwriting is difficult to decipher or the geolocation data is given in an odd format, I can just turn to Mari Roberts, the data entry specialist I work with.  The cursive of a hundred years ago is different from that of today.  Mari is often able to decipher words I have trouble with because she has seen such curls and whirls many times before. However, there have been a few times when even Mari is stumped, and then I put the specimen in the “Pending Review” file and go on to the next label.  Some guru deals with the problem later.

I have gotten to the point that I rarely need to ask for help, and can even do my hours at home because all this information is available on a server.  What I am working on is part of a large digitization project funded by the NSF to document natural history collections.  I am inputting information for the Tri-Trophic Thematic Collection Network, which focuses on the interactions among plants, specifically the Asteraceae or sunflower family of North America, the insects that feed on them, and the wasps that parasitize these insects, making for three-way ecological interactions.  Having information about all three available online means that researchers will be able to draw on this data for any number of studies, including how the occurrence of a particular plant or insect species might change over time.

When I began working with Mari last April, she assigned me the genus Arnica, for which there were over 3000 unprocessed specimens.  The first few times that I put in my paltry three hours, I was lucky if I completed 50 records in that time.  I eventually got up to 75 or so, which Mari says is the standard that they all try to meet, about 25 per hour.  This is definitely an average.  Sometimes I hit labels that are tough, for example, having a great deal of data to input–several collectors, a long locality description, or complex georeferencing data.  I guess I am getting the hang of it too because I ask for help less frequently.  And then there is Greenland.  Arnica grows in Greenland and I puzzled through a number of these labels, trying to identify place names using Google and various gazetteers.  Then Greenland disappeared without my even realizing it, until sometime later Mari told me that she had gone in and taken those Arnica records out so I wouldn’t have to wrestle with them.  That shows how thoughtful and helpful she is, but I was a little disappointed–this was my one opportunity to learn about Greenland geography.

As to Arnica itself, I can’t say I’ve learned a great deal about it, but here are some things I’ve discovered just from the sheets.  It is a plant of high places: altitude measurements are often included, and readings of 5,000-10,000 feet are the norm.  It is also a plant of the Western United States and Canada:  Nevada, Utah, California, Wyoming, Alberta, Saskatchewan, British Columbia, Washington, Alaska, and Oregon are the areas usually on the labels.  However, it also grows in the White Mountains of New Hampshire and on the Gaspé Peninsula in Quebec.  There are hundreds of specimens for most of the species, a real treasure trove of information gathered over a period of at least 140 years.  Some collectors’ names recur many times, and the system is designed to store these so they don’t have to be retyped each time.  There is a similar system for finding duplicates that have already been entered, again saving time.  When I come upon some duplicates, my input rate can rise to 35 or more per hour.  But then I get a real puzzler that reminds me why humans have to be involved in this process.  I am happy to say that I’ve completed all 3280 Arnica labels and can’t wait to see what Mari has in store for me next.