
Using Crowdsourcing for Digitization

Mark Twain’s Tom Sawyer getting help to whitewash Aunt Polly’s picket fence.

In his book “The Adventures of Tom Sawyer,” Mark Twain (Samuel Clemens) provides a useful metaphor for crowdsourcing and digitization. Tom turns a boring chore (whitewashing) into something desirable for others to do. What makes the story timeless is that more than 144 years later people are still trying to get others to do the work for them. In the realm of digital humanities, transcribing handwritten documents or identifying and vectorizing shapes is a tedious and time-consuming task. A project with tens of thousands of documents used to take decades to scan, transcribe, and digitize. However, organizations are turning to crowdsourcing as a way to reduce costs and speed up the process while creating a community of interest in the project.

People have been participating in crowdsourcing efforts for years without even knowing it. For example, sites that use photo identification or text entry to verify that a visitor is not a robot leverage the crowd to accomplish some simple digitization tasks. But the bigger trend worthy of review is that organizations are purposefully creating interfaces that permit the public to contribute to the digitization process.

Our crowdsourcing assignment was to review the pros and cons of leveraging public participation in digital collections. It would appear that a growing number of institutions are outsourcing some digitization functions to a public community instead of depending solely on employees. This makes sense considering the sheer drudgery and cost of trying to digitize large collections. “Papers of the War Department 1784-1800” is a perfect example of how to leverage crowdsourcing to perform correction, transcription, and contextualization. Currently hosted at the Center for History & New Media at George Mason University, the project’s goal is to restore and make accessible this historic collection of over 42,000 U.S. military records once thought lost in a tragic fire.

Public users accessing the “Papers” site are invited to search, review, and transcribe the remaining handwritten documents that need to be digitized. The interface is quite simple and permits users to read a scanned image of the document while transcribing it. In very little time I was able to get familiar with the interface and start transcribing. My only problem had to do with retraining myself to read cursive handwriting from over two hundred years ago.

New York Public Library’s Building Inspector site.

Another example of using crowdsourcing is the New York Public Library’s “Building Inspector” site. This project is a little more whimsical and fun. I think Tom Sawyer would not have had such a hard time getting his friends to participate. In an attempt to gain insight from old New York City inspection maps, the Building Inspector site invites users to assist in the vectorization, or shape discovery, of old building outlines. In a rather simple but addictive process, users have only to look at an outline of a building and determine whether it is correct or needs to be fixed. The reason that humans are better than a machine at reviewing the outlines is our ability to quickly determine if something looks right. It is a rather mindless activity that even elementary students could participate in. It is a good example of how humans can still do the work of a machine if only enough of them are willing to provide the time necessary to complete the task.

Of the two examples, Building Inspector is probably a better prototype for the future of crowdsourced digitization projects. Reviewing building outlines is much more suitable for engaging a bigger crowd. Papers of the War Department requires a more scholarly effort. Transcribing is much more intensive work, and I would even say a specialty. The site tries to mitigate that problem by offering various degrees of difficulty for the contributor.

It is clear from this assignment that “contributory” projects are here to stay and will only increase in number. As online technologies and interfaces improve, more of the public will be able to “interact” with or access digital and/or physical artifacts provided by institutions (e.g., providing notes on museums’ objects; tagging items in galleries’ digital collections). If the interface and tasks are made to be engaging and interesting, the contributors will come back often and the value of the crowdsourcing will be self-evident. However, if the tasks that the participants are asked to perform are too difficult, then the number of contributors will be limited. As with Tom Sawyer, it is not enough to just get someone else to do the work. Eventually, they will want to see what is in it for them.


Everything but the kitchen sink.

Does a picture really speak a thousand words? Some of the author’s kitchen condiments.

The assignment was to take still images and videos of some of the objects in my kitchen. It was part of a lesson module to teach my graduate class about digitization. We take it for granted today, but every time you use your smartphone to take a picture or video you are “digitizing” something. This process captures a significant amount of information that, for the most part, can easily be shared. However, it soon becomes evident that the information an image or video conveys is subjective and lies in the eye of the beholder.

French copper pots

For example, the image above of the French copper pots does not do justice to the wonderful meals that they have been used to prepare over the years. It also does not convey the tedious work required to shine them up. A cooking video would be more appropriate. So how you capture or digitize an item is really part of the story you want to tell.

According to Melissa Terras in “Digitisation and Digital Resources in the Humanities” (2008), anything that is visible is capable of being digitized. She believes that this process “creates the core content for a digital resource.” Marlene Manoff, in her 2006 abstract “The Materiality of Digital Collections: Theoretical and Historical Perspectives,” points out that every object has “material characteristics.” These characteristics, like size, color, shape, and texture, usually can be captured in a digital image. But some characteristics, like smell or sound, cannot be captured with an image. While sound can be captured with a video, smell-o-vision is still not a reality. As a result, the textual content that accompanies an image plays an important role in establishing the object’s informative value. By supplementing the obvious visual data, more detailed characteristics of an object can be preserved.

Manoff believes that the end result of all this digitization is a “content management” process, one that has become more personalized as everyone must now learn how to manage their own digital collections. In the past, managing captured multimedia depended on the manual entry of associated descriptive text or metadata. However, the growing utility of cognitive services such as facial recognition and transcription is rapidly automating this process.

Paul Conway, in his 2009 paper “Building Meaning in Digitized Photographs,” points out that more and more cultural heritage organizations are recognizing the need to leverage Image Digital Archives (IDA). He adds that the digitization process is adding value to these organizations’ archives by creating “digital” surrogates. But Conway also acknowledges that a digital file of an archived photographic print is still just a representation of the original artifact. Well-known institutions like the Library of Congress have become the standard in providing researchers access to a variety of file formats and image resolutions. This gives researchers the ability to choose the level of detail they require.

French recipe for chocolate cake

So, as a lesson learned, the “kitchen digitization” project was useful in determining the broad scope of things that can be digitized, and in what format. One of the assigned items to digitize was a recipe. Shown above is my wife’s family recipe for “Gateau au Chocolat,” or chocolate cake. I’m not sure it is a secret, but tasting this cake for the first time probably had something to do with our getting married. But in regard to digitization, just a digital image of the recipe may not have been enough. For those who do not understand French but are eager to taste this wonderful cake, a translation would be helpful.

Finally, in the field of digital humanities, the creation of these so-called “digitized representations” is part of a bigger debate. Are these new surrogates worthy of their own study, and should everyone be encouraged to use them as unique objects in their own right? For me, the issue comes down to accessibility. Having online access to countless digitized photographs is a tremendous resource, especially in this time of COVID-19. There is no doubt that in an age of social distancing the need to capture cultural heritage archives and make them public is becoming very compelling, especially if these institutions want to maintain their relevance.