Keyword Development For eDiscovery

Monday, August 30, 2010 - 01:00

In the not-so-distant past, the development of keywords for eDiscovery was relatively straightforward. Attorneys would lock themselves in a room with a yellow pad and thesaurus and start guessing words that might be contained within a collection of data, which could indicate a given document would be responsive. The benefit of this approach was that it allowed both parties to quickly narrow the data to a much more manageable set of information, and thus cut down on the time and cost associated with analysis and review.

While there is little debate that keywords remain a critical part of eDiscovery because they are conceptually easy to understand and cull data based on what the issue is about, the process by which they have traditionally been developed is coming under increased scrutiny from the courts. Numerous concerns are being raised about keywords' integrity and accuracy because the old selection process is manual and based on only a limited understanding of the millions of documents in question. Moreover, it is open to interpretation, which can lead to an under-inclusive or overly inclusive set of information - that is, the number of keywords chosen may be artificially high or low and thus impact how many documents are ultimately set aside for further review. To wit:

This case is just the latest example of lawyers designing keyword searches in the dark, by the seat of the pants, without adequate (indeed, here, apparently without any) discussion with those who wrote the emails.

William A. Gross Constr. Assocs. v. Am. Mfrs. Mut. Ins. Co., 2009 U.S. Dist. LEXIS 22903 (S.D.N.Y. Mar. 19, 2009)

In this case, the Defendants have failed to demonstrate that the keyword search they performed on the text-searchable ESI was reasonable. Defendants neither identified the keywords selected nor the qualifications of the persons who selected them to design a proper search; they failed to demonstrate that there was quality-assurance testing; and when their production was challenged by the Plaintiff, they failed to carry their burden of explaining what they had done and why it was sufficient.

Smith v. Life Investors Insurance Company of America, Dist. Court, WD Pennsylvania 2009

In short, the method by which keywords are developed must evolve along with court expectations, or organizations will be at risk of increasing consequences that manifest themselves in the form of fines or sanctions.

The Early Role Of Technology In Keyword Development

Understanding the risk of standing pat, many forward-thinking organizations have looked to technology for assistance in developing keywords. They have leaned on functionality such as clustering and "documents as heuristics," for example, to improve their results.

These efforts undoubtedly yielded quicker results, but they also had a number of underlying challenges. Most notably, what was taking place under the hood was generally a mystery to any but the most technical attorney. This lack of transparency affected the repeatability of the process, and therefore raised concerns about how it might stand up in court. There are many recent anecdotal examples of judges pointing to a lead attorney and saying, "I want you to tell me exactly how your technology selected these keywords - not your technical expert."

To understand more about why the defensibility of these advanced technologies can come into question, let's look deeper at an example. When clustering technology was first introduced to eDiscovery, many assumed it would dramatically change the industry. This artificial intelligence would help group documents with similar characteristics, so readers could gain efficiencies by looking at a stack of what was presumed to be similar content - and therefore, one could assume, hold similar keywords.

How "clustered" documents are grouped together, however, is based on advanced mathematical theory and broad characterizations of an entire document. If, for example, a document is 90 percent about fantasy football and 10 percent about price-fixing, it would be bucketed in the fantasy football cluster because it is broadly about this topic, and the underlying mathematics isn't smart enough to prioritize the more important concept that an attorney might be searching for in a given case (price-fixing in this example). Clustering, therefore, isn't an ideal approach to developing keywords because it's not granular enough, it's difficult to control and explain, and (most notably) it can yield results that make it easy to miss critical information.
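The fantasy football example can be illustrated with a toy sketch. This is not a real clustering algorithm: it assumes two hypothetical topic vocabularies (`TOPIC_TERMS`) and simply labels a document with whichever topic accounts for most of its matching words, which is enough to show how a minority topic disappears once a document receives a single label.

```python
import re
from collections import Counter

# Hypothetical topic vocabularies, invented for illustration only.
TOPIC_TERMS = {
    "fantasy_football": {"draft", "quarterback", "league", "roster", "touchdown"},
    "price_fixing": {"competitor", "price", "pricing", "agree"},
}

def topic_counts(text):
    """Count how many words in the text belong to each topic vocabulary."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for topic, terms in TOPIC_TERMS.items():
            if word in terms:
                counts[topic] += 1
    return counts

def dominant_topic(text):
    """Give the document one label: the topic with the most matching words."""
    counts = topic_counts(text)
    return max(counts, key=counts.get)

doc = ("League draft tonight: my roster needs a quarterback, and the league "
       "draft rewards every touchdown. Also, our competitor called.")

print(dominant_topic(doc))                 # the single label the document receives
print(topic_counts(doc)["price_fixing"])   # the minority topic is present but hidden
```

A reviewer who opens only the stack that matters to the case would never see this document, even though it mentions a competitor.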

The key lesson we've learned so far about developing keywords is that the solution can't be all human guesswork, but it also can't be handed off entirely to technology to do the heavy lifting.

A Smarter Approach To Keyword Development

A better path for developing keywords is to leverage a mix of a number of critical elements - human intuition and understanding of the case, an understanding of the language used in the documents under consideration, and sound technology - rather than rely on any one aspect alone.

Our recommendation is to first determine if your case is right for keywords and then to start identifying the characteristics of the documents you are looking for within the collection. The latter is the most important step in developing keywords, and often is the most difficult. To do so, take a methodical approach by first outlining the critical issues of the case, and then develop a logical expression - composed of a handful of key concepts - that represents each issue. For instance, a key issue of a case might be price-fixing, and a logical expression for price-fixing might be "meeting - competitor - price" in which you'd be focused on any documents that might be about a meeting with competitors where price is discussed.
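One way to picture a logical expression such as "meeting - competitor - price" is as a set of concepts that must all appear in a document, where each concept can be satisfied by any of its synonyms. The sketch below rests on that reading, which is an assumption, and the synonym lists and the names `PRICE_FIXING`, `tokens`, and `matches` are hypothetical placeholders rather than terms from any actual case.

```python
import re

# Hypothetical logical expression for the price-fixing issue:
# each concept maps to the words accepted as synonyms for it.
PRICE_FIXING = {
    "meeting": {"meeting", "met", "lunch", "call"},
    "competitor": {"competitor", "competitors", "rival"},
    "price": {"price", "prices", "pricing", "rate"},
}

def tokens(text):
    """Return the set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(text, expression):
    """True only if every concept has at least one synonym in the text."""
    words = tokens(text)
    return all(words & synonyms for synonyms in expression.values())

print(matches("Lunch with our rival to talk pricing", PRICE_FIXING))
print(matches("Meeting about the fantasy league draft", PRICE_FIXING))
```

Requiring every concept keeps the search tied to the issue: a document that merely mentions a meeting does not qualify unless a competitor and a price term appear as well.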

Once this framework is complete, building a list of keywords becomes much easier because the roadmap is already laid out in front of you - you've, in effect, turned a fill-in-the-blank test into a multiple choice exercise. Your job then is simply to evaluate if each word in the collection is a synonym for any part of the logical expressions you previously developed.

But how do you know what words are in the collection, and how can you quickly evaluate all of them? This is the stage where technology can be highly effective. By tapping directly into the document collection and providing a simple workflow for evaluating each word, technology takes you the last ten yards and enables you to quickly narrow your search based on the actual language of the case.
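As a rough illustration of "tapping directly into the document collection," the sketch below lists every word in a small made-up collection along with the number of documents that contain it, so a reviewer can walk the list and decide whether each word belongs to a concept. The function name `vocabulary` and the sample documents are assumptions for illustration; this is not a description of any particular product.

```python
import re
from collections import Counter

def vocabulary(documents):
    """Map each word in the collection to the number of documents containing it."""
    doc_freq = Counter()
    for text in documents:
        # Use a set so a word is counted once per document.
        doc_freq.update(set(re.findall(r"[a-z]+", text.lower())))
    return doc_freq

# Made-up sample collection.
docs = [
    "Met with the rival about pricing.",
    "Pricing memo for the board.",
    "Fantasy league draft tonight.",
]

vocab = vocabulary(docs)

# Present the most widespread words first, alphabetically within ties.
ranked = sorted(vocab.items(), key=lambda item: (-item[1], item[0]))
for word, count in ranked:
    print(f"{word}: {count}")
```

Because the list is generated from the collection itself, the reviewer evaluates words that actually occur rather than guessing which ones might.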

Such an approach is much more structured and leads to greater defensibility and less ambiguity than alternative approaches to developing keywords.

About Anagram Keyword Development

Recently, RenewData launched an exciting new offering explicitly for developing keywords. The Anagram Keyword Development service is composed of two main parts: 1) a consulting service to help determine the key issues and logical expressions of a case; and 2) a technology that digs into the language of the case and helps users quickly reach a consensus on what types of documents are required for a case.

The key to the Anagram technology is its simplicity. Instead of requiring attorneys to guess possible keywords from millions of documents, Anagram allows legal professionals to choose keywords from a pre-organized list containing every word in the document collection.

This mechanical, repeatable methodology is optimally defensible because every word in the collection has been considered and because it keeps a record of steps taken to ensure what you have done stands up to even the toughest legal scrutiny. It not only identifies relevant keywords, but it also then tags and fingerprints each potentially relevant document in the collection, which can then be exported into virtually any review platform for greater inspection. You can learn more about Anagram at


There is no question that the methodology for developing keywords has to evolve. The old yellow pad and thesaurus approach is no longer up to the task, as fewer judges are willing to accept an ad hoc approach to this important process. But the answer is not to move from one extreme to another - from a purely human-driven process to an entirely technology-driven approach. A middle ground that leverages the best of human knowledge and technology intelligence has to be attained in order to achieve optimal results.

To that end, it's important to first explore what keywords are ultimately designed to accomplish. Fundamentally, they are used to find documents within a collection that have a high enough probability that they might be about the case in question. Breaking this down to the simplest form, that means you need to have a keen understanding of the case and of the language contained within the documents at hand in order to build these search terms. Technology can certainly facilitate this process by tapping into the collection and making it easier to perform analysis, but it must be directly tied to the issues of the case in order to deliver on its promise.

Richard Cohen, Esq. is President of RenewData, a leading provider of eDiscovery, archiving and review-acceleration solutions. He has more than 30 years of management and legal experience - as a corporate executive, consultant, trial attorney and general counsel. Prior to joining RenewData, Mr. Cohen served as Senior Vice President & Managing Director of Legal Services for a large legal services organization focusing on complex litigation matters and bankruptcies. He has also held many other senior-level positions serving the legal industry and has served as COO of Integrex Professional Claims Services and as General Counsel at Ohio Power and Columbus Southern Power Companies.

Mr. Cohen has served as the President of the American Corporate Counsel Association in Ohio, Editor-in-Chief of the Electricity Journal, and Editor of the Energy Information Bulletin, as well as in leadership positions in Bar associations throughout the country.

Mr. Cohen is a member of the Corporate Counsel Advisory Board for The Metropolitan Corporate Counsel and a recipient of the Distinguished Legal Service Award conferred by Corporate Legal Times. He holds a bachelor's degree from the State University of New York at Buffalo and a Juris Doctor from the University of Akron School of Law.

Please email the author at with questions about this article.