In civil litigation or regulatory inquiries, parties often must engage in electronic discovery, also known as eDiscovery: the process of identifying, preserving, collecting, reviewing, and producing to the requesting party electronically stored information that is potentially relevant to the matter. The goal is to discover (i.e., find) potentially relevant documents to produce, while identifying privileged and other sensitive documents to withhold from opposing counsel.
One part of that process, document review, is the most expensive, time-consuming, and error-prone activity within eDiscovery. In the past, lawyers reviewed hard-copy documents one by one. Today, with document volumes often in the hundreds of thousands, if not millions, of documents per matter, setting eyes on every document is no longer feasible. Modern approaches to eDiscovery document review run the gamut from purely human-driven processes, such as keyword search followed by linear review, to predominantly AI-driven approaches using various forms of supervised machine learning, including technology-assisted review.
What are modern machine learning approaches to eDiscovery document review?
The objective of machine learning approaches is to minimize human review while maximizing effectiveness. That is, one should have to look at fewer documents in a collection to find the relevant ones, minimizing the cost and wasted effort of eDiscovery document review.
One of the most efficient approaches in recent years to technology-assisted review, also known as TAR, involves a combined human-machine approach known as continuous active learning (CAL). Back in the early days of TAR, there were debates in the community about how one should seed the TAR process. Among other benefits, CAL-based TAR reviews showed that this question was moot: as long as you start with even a single positive seed document, the continuous learning approach catches up to approaches that started with more positive seeds, or with seeds that were more or less biased in nature. By 70-80% recall, the review is just as effective no matter the starting point. Continuousness overcomes initial conditions.
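To make the CAL workflow concrete, here is a minimal sketch of a continuous active learning loop using scikit-learn. It is an illustration only: the collection, the single positive seed, the batch size, and the `reviewer` callback (standing in for the human reviewer) are all hypothetical, and production TAR platforms use more sophisticated ranking and stopping criteria.

```python
# Minimal sketch of a continuous active learning (CAL) review loop.
# `documents` is a hypothetical list of document texts; `reviewer` stands in
# for the human reviewer and returns 1 (relevant) or 0 (not) for a doc index.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def cal_review(documents, seed_idx, reviewer, batch_size=100, max_rounds=50):
    X = TfidfVectorizer().fit_transform(documents)
    labeled = {seed_idx: 1}                      # start from a single positive seed
    for _ in range(max_rounds):
        idx = np.fromiter(labeled, dtype=int)
        y = np.array([labeled[i] for i in idx])
        if len(set(y)) < 2:
            # With only positive judgments so far, rank by similarity to their centroid.
            centroid = np.asarray(X[idx].mean(axis=0)).ravel()
            scores = X.dot(centroid)
        else:
            # Retrain on every judgment made so far, then re-rank the whole collection.
            model = LogisticRegression(max_iter=1000).fit(X[idx], y)
            scores = model.predict_proba(X)[:, 1]
        batch = [i for i in np.argsort(-scores) if i not in labeled][:batch_size]
        if not batch:                            # everything has been reviewed
            break
        for i in batch:                          # human reviews the next batch
            labeled[i] = reviewer(i)
    return labeled
```

In this sketch, every round retrains on all judgments made so far before re-ranking; that repeated retraining is the "continuous" part that tends to wash out differences in the initial seed set.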
A more recent TAR model uses another seeding approach, one that does not rely on human assessment of the review collection or on random sampling, but rather is based on artificial intelligence (AI) methods and derived from documents outside the collection. This flavor of TAR is known as a “portable model” or “reusable model” – that is, an AI model pre-trained on data from one or more prior matters or related datasets and applied to another. Portable models take a pure AI approach, reusing human knowledge from prior matters in the cold-start seeding process.
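As a rough illustration of the idea, the sketch below trains a simple relevance classifier on judgments from a hypothetical prior matter and applies it, with no human judgments at all, to rank a new collection for cold-start seeding. The documents, labels, and pipeline are placeholders; real portable models are built on far larger corpora.

```python
# Rough sketch of a "portable" (reusable) model: train on labeled documents
# from a prior matter, then apply it cold to a brand-new collection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical prior-matter judgments: 1 = responsive, 0 = not responsive.
prior_docs = [
    "draft consulting agreement with indemnification clause ...",
    "cafeteria menu for the week of March 3 ...",
]
prior_labels = [1, 0]

portable_model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000),
).fit(prior_docs, prior_labels)

# New matter: no human judgments yet. The portable model's scores pick the
# first documents to send to reviewers (the cold-start seeds).
new_docs = ["amendment to the consulting agreement ...", "holiday party RSVP ..."]
scores = portable_model.predict_proba(new_docs)[:, 1]
seed_order = scores.argsort()[::-1]   # review highest-scoring documents first
```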
Are portable models, the latest hype in eDiscovery machine learning techniques, worth considering over the continuous active learning technology-assisted review model?
Recognizing risk in portable models
Portable models come with risks and rewards in eDiscovery document review. First, let’s look at the risks: namely, data privacy and security considerations.
Under data protection laws, data collection and reuse are coming under increasing scrutiny as regulators seek to minimize data collection and maximize privacy and security. As you can imagine, there are rights and obligations around the use of the original data that goes into training portable models, and there is a strong need to understand (and assess) the potential risks of porting those models. This is an important issue, since cross-border marketplaces for this technology exist today, opening up the potential for data leakage.
Here’s what experts have to say on the subject of portable models and data privacy laws:
“Naturally, most companies that invest in building an ML model are looking for a return on their investment…Data protection laws can run counter to these objectives by imposing an array of requirements and restrictions on the processing of various types of data, particularly to the extent they include personal information. The interplay between these competing considerations can lead to interesting results, especially when a number of different parties have a stake in the outcome.” — Brittany Bacon, Tyler Maddry, and Anna Pateraki. 2020. Training a Machine Learning Model Using Customer Proprietary Data: Navigating Key IP and Data Protection Considerations. Pratt’s Privacy and Cybersecurity Law Report 6, 8 (Oct. 2020), 233–244.
Further, because datasets can be very large and may come from a range of sources, they can sometimes contain sensitive data (including personally identifiable information, or PII), raising the possibility that a model trained on such data could inadvertently expose sensitive information in its output.
A second very important challenge with portable models is the possibility that data will be leaked, or exposed, through a type of attack known as a “membership inference attack.” In such an attack, an adversary probes a “black box,” non-transparent model to determine or recreate the examples used to train it. The obvious risk is that a model could expose specifics of the data on which it was trained, especially if a model trained on sensitive or private information is made publicly available. Related model-extraction attacks can go even further, effectively stealing the entire model.
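To illustrate the mechanics, the sketch below shows the simplest confidence-threshold form of a membership inference attack against a black-box model. The threshold and probe documents are hypothetical; published attacks (for example, shadow-model attacks) are considerably more sophisticated, but the principle of querying a model and examining its outputs is the same.

```python
# Simplistic confidence-threshold membership inference. Intuition: a model is
# often more confident on examples it was trained on, so unusually high
# confidence on a probe document suggests it (or something very similar)
# was part of the training set.
import numpy as np

def membership_inference(black_box_predict_proba, probe_texts, threshold=0.95):
    """Flag probe documents the attacker guesses were training members.

    black_box_predict_proba: the attacker's only access to the target model,
    a query interface returning class probabilities (hence "black box").
    threshold: hypothetical confidence cutoff calibrated by the attacker.
    """
    probs = black_box_predict_proba(probe_texts)   # shape (n_probes, n_classes)
    confidence = np.max(probs, axis=1)             # top-class confidence per probe
    return confidence >= threshold                 # True = guessed "member"

# Usage against the hypothetical portable model from the earlier sketch:
# suspected = membership_inference(portable_model.predict_proba, probe_docs)
```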
Given the risks of portable models, the question is: what are the rewards? Do they have sustained advantages over human-augmented TAR approaches like continuous active learning?
Evaluating risk versus reward
In light of the potential dangers of portable models, OpenText lead data scientist Dr. Jeremy Pickens led a peer-reviewed study exploring the risks and rewards of portable models compared with continuous active learning. Do portable models help, or are they hype?
In terms of reward, various aspects of portable models were evaluated relative to an appropriate baseline. A traditional review strategy such as linear review is not the appropriate baseline against which to compare portable models. Why not? Because we already have better ways of doing things. For portable models (or any technology) to show value, they need to improve upon current best approaches, i.e., strong baselines, not just upon traditional, inefficient approaches.
One of many possible strong baselines is human effort (e.g., Boolean keyword-based human searching) combined with TAR based on continuous active learning. We therefore set a small number of humans to the task of searching for just a few minutes each and use the documents that they find (approximately 50 for each case tested) as the foundation for the comparisons.
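As a crude illustration of that baseline, the sketch below selects seed documents with a simple Boolean (OR) keyword match, capped at roughly 50 hits. The query terms are hypothetical placeholders for the few minutes of human searching used in the study.

```python
# Crude Boolean keyword seeding: take the first ~50 documents matching any of
# a handful of hypothetical search terms as the seed set for a CAL review.
def keyword_seeds(documents, terms=("merger", "acquisition", "due diligence"), cap=50):
    hits = [i for i, doc in enumerate(documents)
            if any(term in doc.lower() for term in terms)]
    return hits[:cap]
```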
Two claims have been made at various points in the industry: (1) that portable models will jump-start your review (allow you to start faster), and (2) that this jump start will get you to a high recall target quickly (allow you to finish faster). The study therefore looks at these two primary questions (a sketch of the corresponding metrics follows the list):
- How many more responsive documents does the portable model find initially, relative to a human searcher baseline?
- How much more quickly does one get to a target recall point (e.g., 80% recall) after having started from a portable model-seeded point, versus from a human searcher-seeded point?
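The sketch below shows, under simplifying assumptions, how those two questions translate into metrics: the number of responsive documents found in the first batch reviewed, and the review depth needed to reach a target recall, given a ranked review order and ground-truth relevance labels. The toy ranking at the end is purely illustrative.

```python
# Hypothetical helpers for the two study metrics, given documents in ranked
# review order with ground-truth labels (1 = responsive, 0 = not responsive).
import numpy as np

def responsive_in_first_batch(ranked_labels, batch_size=50):
    """Question 1: responsive documents found in the first ~50 reviewed."""
    return int(np.sum(ranked_labels[:batch_size]))

def docs_to_reach_recall(ranked_labels, target_recall=0.8):
    """Question 2: how many documents must be reviewed, in ranked order,
    before the target recall (e.g., 80% of all responsive docs) is hit."""
    needed = target_recall * np.sum(ranked_labels)
    cumulative = np.cumsum(ranked_labels)
    return int(np.argmax(cumulative >= needed) + 1)

# Toy example: 1s mark responsive documents in review order.
ranking = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
print(responsive_in_first_batch(ranking, batch_size=5))   # -> 3
print(docs_to_reach_recall(ranking, target_recall=0.8))   # -> 7
```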
We note that this study goes one step further. It sets up an evaluation framework in which portable models are given the strongest possible chance to show value by being trained on an “other” collection that is statistically identical to the target collection. Portable models in the wild will rarely be such a perfect fit, but seeing how well portable models intentionally constructed in this manner perform is a useful way of studying the upper bound on their effectiveness.
Which review approach is most efficient?
The study concludes that portable models are not entirely without value: they offer large, significant review-efficiency advantages relative to linear review, and even relative to randomly seeded CAL review workflows in extremely low-richness environments where random positive document encounters are rare. However, linear review and random seeding are weak, improper baselines, because better ways of doing things already exist.
On the first question: we find that portable models are able to find, on average, a small handful (three or four) more responsive documents than the human searcher in the first approximately 50 documents reviewed. Again, this is perfectly fit portable models versus the Boolean keyword-search baseline. The difference is statistically significant, meaning that while keyword searching did manage to beat portable models in a few cases, the reverse was more often true. Still, the magnitude of the difference was not large.
However, for the second question, we find this initial slight advantage is not sustained in the long run: at 80% recall, a human-seeded CAL process beats the one-shot portable model ranking by a large margin (thousands or even tens of thousands of documents) every time. Fairly standard human seeding combined with continuous iteration far outpaces a static model learned on a different collection – even when that other collection is almost identical to the target collection.
The second question has another aspect: What if the portable model-selected seeds are used to seed a CAL process? Would that reach high recall faster than the human keyword-seeded CAL process? That is, does the combination of portability and continuousness overtake the alternative? The answer is no.
Our experiments, again backed by this peer-reviewed research, show that when the CAL process is initiated using the portably selected seeds, there is no significant, sustained improvement over human seeding. In approximately half of the cases, the portable model-seeded CAL approach hits the 80% recall target 2-3% sooner (2-3% fewer reviewed documents); in the other half, the human-seeded CAL approach is 2-3% better.
A bird in the hand is worth two in the bush
In essence, continuous learning that is focused specifically on the target collection of interest (the current matter) more than offsets the transferability or reuse or portability of historical data. As the saying goes, a bird in the hand (a judgment on the current matter) is worth two in the bush (old judgments from previous matters).
Many claims are being made about portable models, and little empirical evidence has been offered to back them up, certainly not relative to an appropriate, strong baseline. That situation may or may not change in the future, but we note that there is a long history in the industry of making hyperbolic claims that ultimately lead to disillusionment and disappointment. By studying these and other claims, we take a more reasoned, rational approach to building our review platform. Such studies are especially important when the technology in question, in this case portable models, carries significant risk in the form of intellectual property rights violations, data leakage via membership inference attacks, and data privacy and security issues.
To learn more about the specific risks and rewards of portable models compared to CAL, and to delve into the results of the latest research, watch the ACEDS webinar recording, “Portable Models for eDiscovery: Help or Hype?”
Published December 16, 2022.