Alexis Clark is a data scientist at Recommind, a San Francisco–based leader in e-discovery, enterprise search and contract analysis. Below, she provides an overview of Recommind’s key techniques for fact-finding investigations: see the data, then read the data; using analytics; and proving a negative. More information is available in Recommind’s new e-book, “TAI: Technology Assisted Investigation.”
See the Data, Then Read the Data:
MCC: You’ve written that visualizations map data patterns – when people were talking, who was talking to whom. Can you walk us through how that works, particularly how a user can see the key dates and players in an investigation? Plus, what is it about visualizations that the brain finds so appealing?
Clark: Visualizations are a quick delivery system for information. The brain sees patterns and interprets what they mean much more quickly than is possible with words. This is really central for investigating. The human thought process is fast. We interpret what we see very quickly, especially data in a chart format. Reading through a very long spreadsheet or thousands upon thousands of documents is too slow a process for finding patterns.
Visualizations give an aggregate view of information at a glance. A user can easily see patterns emerge in a conceptual group display, date histogram or communication diagram. In this way, users see the data and use what they see to focus the next level of effort they expend.
Axcelerate features powerful visualizations. What I like best is that these visualizations are dynamic and work with the search and investigation thought process. Some other applications will create reports, but it’s definitely a pause in your thought process to create reports. Software should work with the investigation thought process and not create a lot of burdensome steps. It should be a fluid process when you’re investigating data.
MCC: You note that there may be a piece of evidence that can serve as a launching point in an investigation. What are examples of such launching points?
Clark: At the beginning of an investigation, there is often a lot that isn’t understood. But there is usually some piece of evidence that can serve as a launching point – something directly related to the topic or claim that is spurring the investigation to take place. That launching point might be a document, an email or a pattern of communication. Using the context around this launching point, the investigator can start digging into the data to find evidence that supports or refutes the claim.
One key analytical tool is Hypergraph. Hypergraph shows patterns and communication at two levels: First, you can see how information is moving within and among organizations.
Using myself as an example, I send a lot of email internally through Recommind, but I also interact quite a lot with our clients. With Hypergraph’s domain view, the pattern of information flow outside of the company is presented graphically. Depending on the focus of the investigation, this kind of traffic might be very useful.
Second, Hypergraph can also show the flow of communication at an email address level. That lets you see patterns of communication between individuals. Hypergraph updates dynamically to show communication patterns in every search you run.
You can see the patterns every time you apply a new filter or search term, and you can check the Hypergraph visualization for any that stand out. To ground that a little, you can search for a key topic and then check Hypergraph to see where the communication about that topic is happening. Say it’s between Joe and Bill, where Joe is within the corporate entity and Bill is external. Depending on your investigation, that might be a meaningful pattern.
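The two levels of communication view described above can be sketched in a few lines. This is only an illustration of the idea, not Hypergraph's implementation: the message records, the addresses and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical message records: (sender, [recipients]). Addresses are made up.
messages = [
    ("joe@acme.example", ["bill@partner.example"]),
    ("joe@acme.example", ["bill@partner.example", "ann@acme.example"]),
    ("ann@acme.example", ["joe@acme.example"]),
]


def domain_of(address):
    """Return the part of an email address after the '@', lowercased."""
    return address.rsplit("@", 1)[-1].lower()


def communication_counts(messages):
    """Count sender -> recipient pairs at the address level and the domain level."""
    by_address = Counter()
    by_domain = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            by_address[(sender, recipient)] += 1
            by_domain[(domain_of(sender), domain_of(recipient))] += 1
    return by_address, by_domain


by_address, by_domain = communication_counts(messages)
```

Re-running `communication_counts` on the results of each new search or filter mirrors the "updates dynamically" behavior: the heaviest pairs in `by_address.most_common()` are the Joe-and-Bill style patterns worth a closer look.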
MCC: “Machine learning” is a provocative phrase. How does this technology help locate a larger body of evidence or discern a trend?
Clark: Machine learning goes beyond simple keyword search. It enables you to find a set of documents about a particular topic and feed them into the engine. It then builds a model that incorporates a great deal more context than a keyword search can by including the interrelationships and placements of words and phrases across the documents used for training. The system then compares each document in the data population against the model. One way to think about it is that machine learning is providing a subject matter sweep of a large data population.
Let’s say I’ve found 25 or 50 documents about a due diligence process, and I want to find all the email and all the documents in my million-document database that are about due diligence. I can use machine learning to do a sweep of that large data set, and it will return documents that are topically similar to the data I fed into it. It’s essentially “find more like this.” The machine learning system will scan the rest of the data to find similar documents – documents that are similar on a topical level. You’re essentially doing a sweep based on the subject matter that lets you collect all of the documents on a particular topic much more quickly than you’d be able to otherwise.
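A rough way to picture "find more like this" is to build a simple model from the seed documents and score every other document against it. The sketch below uses plain word counts and cosine similarity as a stand-in. It is an assumption for illustration only and is much cruder than the context-aware model Clark describes; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_more_like_this(seed_docs, population, top_n=10):
    """Score each document in the population against the pooled seed documents."""
    model = Counter()
    for doc in seed_docs:
        model.update(term_counts(doc))
    scored = [(cosine(model, term_counts(doc)), doc) for doc in population]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

With 25 or 50 due diligence documents as `seed_docs` and the full database as `population`, the top of the returned list is the "subject matter sweep"; a human still has to review what comes back.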
MCC: Can you describe the roles of information filters – like date ranges, authors, senders, domains and document types – in narrowing results?
Clark: The way that you use analytics is very much dependent on what it is that you’re looking for. Let’s say that you want to find out whether someone attended a meeting. The first step may be to filter down to a targeted date range. Let’s say you know this meeting probably happened in June. The planning process was probably a little earlier than that. You might filter the data population down to that targeted date range. Then you might want to know who was planning the meeting or who the likely attendees were, filtering down to authors and senders: likely attendees, their assistants and other people who might be involved in the event planning. If different organizations are involved, you might filter by domain to home in on a starter set, a foundation of documents that are likely to contain the answers that you’re looking for.
This kind of focused reduction of data not only increases the effectiveness of the visualization, it also reduces the overall effort it takes to find the answers. Additionally, you learn a lot along the way. You might learn that someone’s assistant changed or they had a switch in personnel halfway down the line. You might learn about additional people that are involved and then become interested in understanding their roles. Narrowing down your focus and finding a foundation of pertinent documents is a good first step.
That said, part of this multistage analysis process is being careful not to narrow down too much and cut significant things. Machine learning can really help here. If you find a few documents on the topic and you want to make sure there’s nothing that you’ve missed, you can run predictive coding to do that broader subject matter sweep. I think of it as narrowing your focus, then backing out of it to make sure that it’s correct, and then narrowing down the focus again. It’s like broad to narrow and then narrow to broad. We talk about that a lot in our e-book, “TAI: Technology Assisted Investigation.” It’s repetitive and iterative. You start by using what you know and then use the data to guide you in finding out what you don’t know.
But it’s not machine-driven exclusively. The investigator’s human input and evaluation are critical to this process. It’s about assisting the person, not about replacing the person.
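The date, sender and domain filters from the meeting example can be sketched as a single narrowing function. The `Email` record and `filter_emails` are hypothetical names used only for illustration; they are not part of any product described here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Email:
    # Hypothetical minimal record; real collections carry far more metadata.
    sent: date
    sender: str
    subject: str


def filter_emails(emails, start=None, end=None, senders=None, domains=None):
    """Keep emails that pass every filter supplied (None means 'no filter')."""
    kept = []
    for email in emails:
        if start is not None and email.sent < start:
            continue
        if end is not None and email.sent > end:
            continue
        if senders is not None and email.sender not in senders:
            continue
        if domains is not None and email.sender.rsplit("@", 1)[-1] not in domains:
            continue
        kept.append(email)
    return kept
```

Because every filter is optional, the same function supports the "narrow, back out, narrow again" loop: drop one argument to widen the set, then add it back once the broader sweep has been checked.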
MCC: Context is obviously key for analyzing data. What kinds of tools can help you understand the context around the data being analyzed?
Clark: Yes, context is vital, and it can be derived from metadata, phrase analysis and visualizations. You find the foundation documents and start with what you know. Let’s say that you composed a search that isolates specific key players and a particular topic. Using that as a foundation, you can use Hypergraph to see patterns of communication, along with date visualizations to see how these conversations spread across time.
Phrase analysis is especially powerful here. Our phrase analysis exposes two-to-four-word common phrases in the data population. Once you have that foundation search result, you can use the phrases to do a mass spot check without reading the actual documents. Let’s say you have 3,000 documents in your results; you can look at the phrases to see what these folks are commonly saying. If you’re interested in price-fixing, you might be interested in all the different ways that price is used in these data populations, such as “price volatility” or “price manipulation.” Phrases provide context for that keyword from the actual documents. It can help you gauge the quality of the search results. You see how that language is actually being used by the key players. Phrases also help with identifying jargon and code words – seeing past the literal. You can see and learn more by this mass spot check of the documents. You’d have to read so many documents to find that information otherwise.
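The "mass spot check" can be pictured as counting every two-to-four-word phrase in a result set and, optionally, keeping only those that contain a keyword such as "price". This is a minimal sketch under that assumption; `common_phrases` is a hypothetical name, and the sketch ignores sentence boundaries and does no filtering of uninformative words.

```python
import re
from collections import Counter


def common_phrases(documents, min_words=2, max_words=4, keyword=None):
    """Count word sequences of min_words..max_words across all documents.

    If keyword is given, only phrases containing that exact word are counted.
    """
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z']+", text.lower())
        for n in range(min_words, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if keyword is None or keyword in gram:
                    counts[" ".join(gram)] += 1
    return counts
```

Calling `common_phrases(docs, keyword="price").most_common(20)` on a 3,000-document result set gives a quick read on how the key players actually use the word, without opening the documents.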
Date visualization maps out patterns over time, answering questions such as “When were people talking about a topic?” The chart shows you how those conversations lie across time, which can give you cues about the information you’re searching for. Sometimes you will see dips in communication for weekends and holidays, which might be expected. You might also see a product code name in use until the product is released, after which the language changes depending on the production or development phase.
There are also times when words or topics disappear from the data. This point of “going silent” could indicate something significant: for example, active discussion giving way to a simple “call me about this.” These are indicators that you can see in the patterns of communication. If everyone goes silent about a topic, you would see this in the date visualization. This should stand out to an investigator, to someone who’s interested in what’s been discussed and what’s been communicated: “Hey, wait, why did everyone stop talking about this?”
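A date histogram and a simple "going silent" check can be sketched as follows. The seven-day threshold and the function names are assumptions chosen for illustration; weekends and holidays would need a longer or calendar-aware threshold.

```python
from collections import Counter
from datetime import date


def daily_histogram(sent_dates):
    """Count how many messages about a topic were sent on each day."""
    return Counter(sent_dates)


def silent_gaps(sent_dates, min_gap_days=7):
    """Return (last_seen, next_seen) pairs with at least min_gap_days
    message-free days between them."""
    ordered = sorted(set(sent_dates))
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        quiet_days = (later - earlier).days - 1
        if quiet_days >= min_gap_days:
            gaps.append((earlier, later))
    return gaps
```

Note that this only finds silences that end. A topic that stops and never resumes shows up instead as a last date well before the end of the collection period, which is worth checking separately.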
Proving a Negative:
MCC: Sometimes you need to know if something didn’t happen. Take us through a situation where the job was to prove a negative.
Clark: Step 1 is to look for gaps. Let’s imagine a team is investigating whether a person of interest was aware of a wrongdoing. We need to make sure the data that’s been collected is actually going to support our conclusions. Was data collected from all potential parties of interest? Is the collection time frame correct? Is there a sufficient data population to serve as the basis of this investigation? In this scenario, Axcelerate’s Hypergraph visualization for communication, the date histogram and statistics about the collection process itself will be especially helpful.
Moving past this initial analysis, the team would search for examples and patterns in communication – finding those launch points we discussed earlier. They might gather all the data around this wrongdoing, then see if the communication patterns indicate that the person of interest is aware of or included in the conversation. Do they work with the people involved in the wrongdoing? What communication takes place among the group that might be party to wrongdoing?
Step 2 is to prove a different positive. You can see how the patterns and the data could do that, too. If the person of interest has very little or very superficial communication with the parties that are potentially part of the wrongdoing, then that trend, that statistic, is an important piece of your proof. You’re proving a different positive. The individual didn’t work with the group at all, or we’ve found these emails that talk about how they’ve colluded amongst themselves and excluded others from this particular event. The data serves as the guide here, and the thought process is critical to this kind of effort.
Step 3 is to document the process. If your searches aren’t returning any evidence, then the quality of the search process and what you have searched for becomes really important. If it ends up looking like you’re going to need to prove a negative, keeping track of what you’ve looked for and how you’ve looked for it is important because that process becomes your proof, or an argument that proves that negative. It’s all about tracking all the angles that you evaluated for that particular topic.
Proving that something didn’t happen is a difficult task. It’s really important to have an approach that documents the process and incorporates strategic statistical sampling to surround the investigation with really sound methodology. It’s also hard to know when to stop. That ultimately comes down to a cost-benefit analysis – comparing the diminishing likelihood of finding additional valuable content against the effort involved. This is something that the legal team needs to keep in mind as they’re embarking on this adventure.
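One common way to put a number on "we looked and found nothing" is to review a random sample and compute an upper bound on how much could still be hiding. The sketch below is an illustration, not a method described in the interview. It assumes a simple random sample from a large population, and the function names are hypothetical.

```python
import random


def draw_sample(doc_ids, sample_size, seed=0):
    """Draw a reproducible simple random sample of document IDs."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), sample_size)


def zero_hit_upper_bound(sample_size, confidence=0.95):
    """Upper bound on prevalence when a sample contains zero relevant documents.

    If the true prevalence is p, the chance of seeing no hits in n draws is
    (1 - p) ** n. Solving (1 - p) ** n = 1 - confidence for p gives the bound.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / sample_size)
```

For example, reviewing 300 randomly drawn documents and finding nothing relevant supports a statement that, at 95 percent confidence, fewer than about 1 percent of the population is relevant. Recording the seed and sample size alongside the search log keeps the result repeatable, and the way the bound shrinks with larger samples feeds directly into the cost-benefit question of when to stop.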
MCC: How do you envision the future of technology for fact investigations? Is it on track to replace or override human effort?
Clark: This is a fascinating field, and it’s evolving rapidly. I think of the advent of technology-assisted review as a natural progression; it wasn’t really all that long ago that folks were reviewing paper documents in warehouses. Data has become more and more electronic, and with the explosion of big data, we’re seeing new technologies introduced. I think that we’re going to continue to see these technologies develop at a quick pace over time.
But the human element is very important. The technology amplifies what the person, the investigator, the human is doing. The insights and the focus that the individuals contribute are the most critical aspects to the success of these projects. The technology just makes it more efficient, more informed, and perhaps more conclusive and repeatable.
That’s an important fact for the legal industry to recognize: Despite all the great technology at play here, there’s still very much a role for Columbo.
Alexis Clark, Data Scientist at Recommind. 415-394-7899
Published March 31, 2016.