Friday, April 03, 2009

The Perfect Search

The April 2009 issue of the ABA Journal has a fascinating article about a group trying to improve search protocols and technology for electronic discovery. As a few high-profile examples, like the Microsoft e-mails and the Big Tobacco e-mails, provided exquisite "gotcha" moments on the stand, lawyers began to see more than ever the value of mining the huge data dumps of electronic discovery. But the problems are huge. When I was in library school in the mid-1980s I read about a study done on a large trial documents database by Blair and Maron (An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System, by David C. Blair and M.E. Maron, in "Communications of the Association for Computing Machinery," 1985). I was amazed to read in the ABA Journal article, In Search of the Perfect Search, by Jason Krause, that the Blair and Maron article is still the only research in the area. Krause does a nice job of reporting on it:

Blair and Maron asked lawyers and paralegals to use keyword searches to find documents on a given topic among 40,000 documents comprising 350,000 pages. Legal researchers estimated they would find 75 percent of relevant documents, but the research showed only 20 percent had been found.
The current work reported in the ABA article began three years ago, and includes experienced trial attorneys and information scientists. Text Retrieval Conference Legal Track is the project name. The link will take readers to a site at the University of Maryland. There is also a project home page, called Text REtrieval Conference Homepage (TREC), which is hosted at the National Institute of Standards and Technology, a federal government site. (The NIST guys had been working on searching large data sites for about 15 years before being approached by litigator-researcher Jason Baron in 2006. They were delighted to have a real-world problem to add to their mix.) This site simply says its purpose is "to encourage research in information retrieval from large text collections." The efforts are not limited to legal information; they have different teams working on different tasks. The site has a call for participation, which I intend to follow up on, and I hope other librarians will as well. I was struck, all through the article, that librarians do not seem to be involved in any part of the research. I wondered, in fact, whether firm librarians are involved in searching for electronic discovery items. Are there any readers out there with experience who care to respond?

Back to the ABA article, which I found both fascinating and frustrating. The guys working on the problems of e-discovery retrieval had come to the view that the malleability of language was the crux of the problem. Words mean different things in different contexts, and they are often ambiguous. The difficulties are amplified by human error, such as typos, keying errors in transcribing, or errors in translation. Then, when print material is scanned, optical character recognition is another source of error, and if handwriting is involved, you can imagine the extra levels of error introduced when you try to locate material through an electronic search of the database. Add to these problems with text the issues raised by video, sound, and items such as PDF files or spreadsheets that do not move easily into a database of text.

Judges have to OK the electronic discovery method, and for that to happen, the attorney applying for it must be able to defend the method. At this point, keyword searches are the default. That is the one method everybody at least thinks they understand. Apparently, the two sides can negotiate the terms of the discovery methodology, but it seems to be rarely done. Negotiations can include detail down to the terms to be searched. The ABA article includes a link to a sample negotiation form in PDF provided by the TREC team.

The parts of the article that most frustrated me were the discussions that compared "keyword" searching to Boolean and "advanced" searching. I suppose they are referring to various "off the shelf" products that are marketed to lawyers specifically for e-discovery searching, and thus can't easily be compared to the varieties of search options I am aware of in the legal databases I use as a librarian. But the terminology just drove me wild. Nevertheless, I am fascinated by the results and conclusions they have from the various types of searches available:
Baron says those last-century results [that is, Maron and Blair's results in their 1985 article, noted above] are stunningly similar to results from the first two years of the TREC Legal Track more than two decades later. Legal Track showed Boolean keyword searches using commands such as and, or and within so many words across a range of different hypothetical topics found only between 22 and 57 percent of all relevant documents cumulatively retrieved through a variety of alternative search methods. But the Boolean search was no better or worse than other more sophisticated search methods tested, and it still represents the current standard.

Keyword searches done thoughtfully can return a viable amount of documents. E-discovery consultant [Bill] Speros recalled an insurance case in which lawyers needed documents pertaining to young people. Rather than just search for that term and some synonyms, he used words like mother, father, dad and words for activities correlated with children like baseball and football.

“If I can do it in a thoughtful way,” he says, “I can get better results than some fancy new search technology.”

Notes Phoenix litigator [George] Paul, “The top people in the world have been working on this problem; they’ve had years and years and years and thousands of times more computing power since the days of Blair and Maron, and yet there’s been no material advance. How come we have 10,000 times more computing power than we did years ago and see no more advance?”

Still, something very important did come out of those earlier tests: While almost every test found roughly 20 percent of potentially relevant documents, each different type of search basically found different documents. When testers threw different combinations of search technologies at the database, they were able to find roughly 78 percent of the total number of relevant documents.

Baron believes these paradoxical and confounding findings can be reconciled if “lawyers come to realize that to improve the results of searching, one needs to use a variety of available search methods and tools. No one off-the-shelf method will solve all of your e-discovery issues.”

Baron and the legal track team are trying to create a credible process and protocol to improve digital searches. It won’t exactly be the perfect search; no one expects that. The researchers are using all the computing power and search techniques they can muster to try to crack the problem.

Here’s where the tobacco litigation archive comes in. Legal Track is using the nearly 7 million publicly available documents from the master settlement agreement database, a collection of tobacco documents produced in relation to several state lawsuits against the industry. That database was chosen because it contains a wide spectrum of types of documents.

At that target cache, TREC Legal Track is aiming 13 hypothetical legal complaints (PDF). Written like normal legal documents, they contain all the information included in real-world complaints for fictional tobacco-related lawsuits, such as campaign finance violations, class actions, antitrust investigations, securities litigation, patent infringement and wrongful death suits. The most important part is the search terms these hypotheticals lay out.

Baron says the Legal Track team has had fun dreaming up hypotheticals on subjects ranging from the music of Bob Dylan and Joan Baez to research on pigeon deaths. “Basically, anything you can think of has been contained in some subset of documents that were gathered together for purposes of the prior tobacco litigation, and we have taken full advantage,” he says. (snip)

Experienced legal minds are hard to get on a volunteer basis. “As you’d expect, a lot of these people are busy with their day jobs,” Hedin says. “And it’s not like having people on staff where you can call a meeting anytime you need to.”

So far the TREC Legal Track research has identified a couple of practices that improve on the baseline keyword search. To start, lawyers need to work with opposing counsel to identify good search terms and to negotiate proposed Boolean search strings.

And it is important to use sampling: testing to see whether the search engines are finding documents known to be relevant. That means deploying what e-discovery experts call iterative feedback loops. These involve a team of lawyers and other in-house experts conducting searches in stages, and conferring with counsel and experts from the opposing party to determine whether the process is working.

Experts say that when litigators set up a search, they should identify the data types and then prove that the search tool they’re using works with those data types.

“Judges don’t want to get into a fight about tools, but want to hear a reasonable plan,” says Jessen, the e-discovery firm founder who is a volunteer in Legal Track. “This is not about perfection, but did you set up, enable and audit a process in good faith?”
According to Krause's article, two things have really given the TREC Legal Track project credibility:
1. The Sedona Conference
The Sedona Conference® Cooperation Proclamation calls on trial lawyers, in-house counsel, and judges to rethink the contentious practices that have grown up around civil discovery and refocus litigation toward the substantive resolution of legal disputes. There is an annual conference on e-discovery, and a series of working group think tanks in various areas on "tipping point" issues. The Conference also publishes a journal and a series of occasional publications. The Journal published an article, "The Sedona Conference: Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery." Interest from this prestigious group is important to the TREC Legal Track group's future. They became interested in 2006, as the group began, and influenced changes in the Federal Rules of Civil Procedure that affect e-discovery.

2. Federal judges: through the Sedona Conference, the changes in the Federal Rules on e-discovery, and decisions such as Victor Stanley, Inc. v. Creative Pipe, Inc., they have put pressure on attorneys to use experts on e-discovery. That, sadly, drives the costs up.

Why, with all the advanced technology, all the advances in text searching, are the results of the various options still the same as what Blair and Maron found in 1985? It is fascinating that each of the various technologies finds a different 20% of the documents. This is something I want to know more about!
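
A toy sketch in Python may help show why combining methods matters. It is only an illustration under made-up assumptions: the document numbers, the three method names, and the amount of overlap are all hypothetical, and none of it is TREC data. Recall here means the fraction of the relevant documents that a search actually finds. Each pretend method finds 20 percent, but because they find mostly different documents, their union does much better.

```python
def recall(found, relevant):
    """Fraction of the relevant documents that a search actually retrieved."""
    return len(found & relevant) / len(relevant)


# Hypothetical collection: 100 relevant documents, numbered 0 through 99.
relevant = set(range(100))

# Three hypothetical search methods. Each retrieves 20 relevant documents,
# and each overlaps only slightly with the others (assumed numbers).
keyword_hits = set(range(0, 20))
boolean_hits = set(range(15, 35))
concept_hits = set(range(30, 50))

for name, hits in [("keyword", keyword_hits),
                   ("boolean", boolean_hits),
                   ("concept", concept_hits)]:
    print(name, recall(hits, relevant))

# Combining the methods means taking the union of what each one found.
combined = keyword_hits | boolean_hits | concept_hits
print("combined", recall(combined, relevant))
```

With these invented numbers each method alone has a recall of 0.2, while the combined set reaches 0.5. The less the methods overlap, the more the union gains, which is at least consistent with the pattern the article reports.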

The decoration is The Executive of the Future, a 1952 painting by Boris Artzybasheff. I thought it illustrated our overly optimistic idea that we would be one with the search technology by now.
