This is a cautionary tale.
Demography and the Imperial Public Sphere draws on newspaper and magazine material from a variety of sources. At the beginning of the project, nearly all of my material was derived from manual transcriptions of photocopies, microfilm printouts, or originals consulted at local and national libraries. As time progressed and my search parameters widened, I came to rely upon digital images of newspaper pages, such as those provided by the British Library's 19th Century Newspapers or the National Library of Australia's Trove database. When using the latter, a part-OCR, part-manual transcription of the image can be easily obtained via a panel on the left-hand side of the screen.
Although using this transcription requires the manual deletion of line breaks (which are irrelevant to my study) and the correction of common OCR errors, processing these images is significantly faster than taking full, manual transcriptions from the British Library's images. Although the latter have also undergone OCR analysis in order to create searchable text, this information is not readily accessible to end users; I must instead transcribe the digital images in the same way I would photocopies or originals. The same is true of Readex's Archive of Americana, which features US newspapers.
I had accepted this.
Last month, I began the long and arduous process of packing the deceptively large number of books I had somehow managed to accumulate during my teaching fellowship. As I lifted up my copy of the Chicago Manual of Style, a CD fell from the pages, having evidently been used as a makeshift bookmark. It was one of the numerous software-bundle CDs that I had inherited along with my desk. About to toss it back into the drawer from whence it had come, I noticed that one of the programmes listed was ABBYY Finereader. Wait a moment, I thought, wasn't that OCR software?
Momentarily dispensing with my packing, I popped the disc into my laptop. After a moment of delightful whirring, the programme was installed. I opened one of my many yet-to-be-transcribed images and copied it into ABBYY. Another moment of delightful whirring and I had before me a surprisingly accurate transcription of the piece.
There were, of course, several mistakes, but considering the rather poor quality of this scanned microfilm printout, I was duly impressed. Although this is an inexpensive version of ABBYY (currently retailing for £5 on Amazon and free with many scanner software bundles), it has already saved me a considerable amount of time.
Why had I not turned to OCR software sooner? For a variety of reasons. First, I had believed that reliable OCR software would be prohibitively expensive for a scholar without allocated research funding. Second, I had serious doubts about the time-saving potential of any OCR process, considering my average typing speed and the unavoidable necessity of carefully checking each transcription in order to determine dissemination pathways. What I have found, however, is that correcting an OCR transcription is almost always quicker than transcribing manually, even with relatively poor images. More importantly, the time originally spent transcribing is now spent verifying the accuracy of each transcription, making my final data far more accurate without lengthening the data collection process.
The moral of the story? Well, it is not to run out and buy ABBYY Finereader, though the above link will earn me an impressive 5p in commission. Instead, it is to occasionally remind yourself that not all digital tools are prohibitively expensive--they might even be free, hiding away in your mystery CD drawer--and that even imperfect tools can improve your overall efficiency (and help stave off carpal tunnel syndrome).
*Image Courtesy of lyzadanger