Tuesday, April 07, 2009

Google Books Conversation

Picking up the Google Books conversation again, after Marie's excellent post, I have recently come across several other interesting criticisms of digitization projects. These are aspects I had not considered.

1. Paul Duguid, in First Monday vol. 12, no. 8 (Aug. 6, 2007), writes an article, "Inheritance and Loss? A Brief Survey of Google Books." Duguid notes that the size of the Google Book project, together with its stated goal of creating the library of the future, is increasingly shutting other digitization projects out of funding and making them seem superfluous. Duguid compares two methods of quality control: innovation and inheritance. Innovative techniques, such as the multi-editing of Wikipedia, could not have existed before the Internet. By contrast, inheritance relies on a recognized name brand, an earned reputation for quality, such as that of the New York Times. It seems to me that the two types of quality control may sometimes complement each other, as we hope will happen as traditional news outlets learn to incorporate Twitter feeds. But they can also be destructive of each other.

Duguid notes several sources that comment on the poor quality of the Google Book scanning, from Wikipedia links to examples of poor scans to bloggers' comments. Duguid then evaluates the quality of one particular book, Tristram Shandy, which he had previously used to evaluate the quality of the ASCII text in Project Gutenberg. Where Project Gutenberg had serious problems with the Greek footnotes, marbled endpapers and blank pages in the book, Duguid expected the outright scanning of the Google Book project to sail through Tristram Shandy. Instead, he finds all kinds of problems with the scanning. Pages are cut off at the left, the right, the top or the bottom; the text is blurry; or pages are even skipped entirely. The Google software that organizes the materials misreads lists of illustrations and points to them as tables of contents. And the copies chosen are problematic: who decided which edition or copy to scan? Duguid comments that, through its Google Book Project, Google tried to paste the innovative quality control of its technology over the inherited quality control of Ivy League research libraries. This essay is his evaluation of the outcome.

Duguid concludes:

The Google Books Project is no doubt an important, in many ways invaluable, project. It is also, on the brief evidence given here, a highly problematic one. Relying on the power of its search tools, Google has ignored elemental metadata, such as volume numbers. The quality of its scanning (and so we may presume its searching) is at times completely inadequate [14]. The editions offered (by search or by sale) are, at best, regrettable. Curiously, this suggests to me that it may be Google’s technicians, and not librarians, who are the great romanticisers of the book. Google Books takes books as a storehouse of wisdom to be opened up with new tools. They fail to see what librarians know: books can be obtuse, obdurate, even obnoxious things. As a group, they don’t submit equally to a standard shelf, a standard scanner, or a standard ontology. Nor are their constraints overcome by scraping the text and developing search algorithms. Such strategies can undoubtedly be helpful, but in trying to do away with fairly simple constraints (like volumes), these strategies underestimate how a book’s rigidities are often simultaneously resources deeply implicated in the ways in which authors and publishers sought to create the content, meaning, and significance that Google now seeks to liberate. Even with some of the best search and scanning technology in the world behind you, it is unwise to ignore the bookish character of books. More generally, transferring any complex communicative artifacts between generations of technology is always likely to be more problematic than automatic.

Finally, with regard to inheritance as a strategy for quality assurance, the question of quality in Google Book’s Library Project reminds us that the newer form is always in danger of a kind of patricide, destroying in the process the resources it hopes to inherit. This remains a puzzle, for example, for Google News. In its free provision of news, it risks undermining the income stream that allows the sources on which Google News relies for quality to survive. It may even be true, in a lesser way, for Google Books. Google relies here for quality assurance on the reputation of the grand libraries it has corralled for its project. Harvard and Stanford libraries certainly do not have their reputations enhanced by the dubious quality of Tristram Shandy, labeled with their name in the Google database. And Tristram Shandy is not alone. With each badly scanned page or badly catalogued book, Google threatens not only its own reputation for quality and technological sophistication, but also those of the institutions that have allied themselves to the project. The Google Book Project’s Tristram Shandy may be, as Sterne said ruefully about his marbled page, the “motley emblem” of its work.

2. Johanna Drucker, in the Chronicle of Higher Education for April 3, 2009, writes "Blind Spots: Humanists must plan their digital future." In arguing that humanities faculty must take responsibility for helping to design the research portals of the future, she points to Duguid's essay. Taking off from his points, Drucker notes:

For instance, a number of us who had an opportunity to be part of town meetings reviewing the recommendations for dealing with the shortage of space at the Library of Congress some years ago came up against a basic issue: What version of a work should be digitized as representative of a work? Is Leo Tolstoy's original Russian text of War and Peace sufficient or irrelevant for future generations? Will those generations prefer access to the Louise and Aylmer Maude translation? Or to the more recent translation by Anthony Briggs? Should we digitize the sanitized version of Mark Twain's classics, purged of language now offensive to readers, or the originals that allow the historical distance of culture and vocabulary to register?

In a similar vein, what if another library's only version of Euclid is a copy of Stephen Thomas Hawtrey's An Introduction to the Elements of Euclid used by Bertrand Russell's brother Frank to introduce the future philosopher to its mysteries of mathematics? If the copy is tattered, missing pages, or has other marks showing it was used by a childishly eager Russell, should that bit of history be put aside for the benefits of scanning a clean, new copy of a 10th-grade geometry textbook? In the 2007 article "Inheritance and Loss? A Brief Survey of Google Books," Paul Duguid mordantly observed the shortfalls in Google's plans to digitize library books. He emphasized that the intellectual tasks of vetting editions and assessing scholarly value for generations to come have to be taken into account from the very design of the project, not reverse engineered later.

The technician might suggest that the cleanest, clearest copy that is most legible in OCR (optical character recognition) and automated search technologies provides the best return on the digitizing investment. With those criteria, selection is guided by technical requirements and constraints designed into the system. That is not "just" an issue of selection but a fundamental feature of functionality and capability, in other words, the design of the digital environment. The typographical features of the long "s" or radical experiments in graphic layout used by the 20th-century avant-garde — markers of their time and place of production — can be quickly sacrificed in a choice of "legible" (i.e., standardized) fonts. The migration of typeset texts into ASCII streams has been an issue of contention in literary and biographical studies since the advent of the Internet. Such debates underscore the fact that properties of texts are informational, not incidental, to many scholarly projects.

These comments go beyond the Google Book Project itself and raise larger questions about digitization projects in general. In building the digital library of the future, scholars need to be involved. The selection of editions, copies, translations, and fonts all means something to the scholars who study these texts. Librarians also need to be involved. Metadata means something to us, and needs to be handled with more care and detail than is currently the practice. In the rush to scoop up huge swathes of culture, projects like Google Books, or perhaps even Project Gutenberg or the Open Content Alliance, may be making haste without reflecting carefully on policy decisions. It will be difficult, perhaps impossible, to go back after the fact and add, correct or redo work to make the digital library of the future more usable. Should we not have a conversation now about which features we will need, and which we will miss if they are left out?
