Big Data: The Elephant In The E-Discovery Room

We have been hearing plenty about Big Data lately. The massive amount of phone and Internet metadata analyzed by the NSA certainly comes to mind. A little less well-known, but still a great example, was Netflix’s analysis of Big Data when taking the gamble on “House of Cards,” its first original show, which was estimated to cost approximately $100 million.[1]

But what is Big Data? According to Wikipedia, “Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”[2] It is relatively simple to see just how large a Big Data collection can be. Sticking with Netflix as an example, below are the data points that they are collecting:[3]

  • More than 25 million users
  • About 30 million plays per day (and it tracks every time you rewind, fast forward, or pause a movie)
  • More than 2 billion hours of streaming video watched during the last three months of 2011 alone
  • About 4 million ratings per day
  • About 3 million searches per day
  • Geo-location data
  • Device information
  • Time of day and week (it can now verify that users watch more TV shows during the week and more movies during the weekend)
  • Metadata from third parties, such as Nielsen
  • Social media data from Facebook and Twitter

The key here is not so much that these are large, complex data sets (corporations have had massive data sets for some time now), but that there is deep value in the analyses across these huge data sets. These analyses are challenging and can be very slow when utilizing conventional database technologies. Thus, new technologies have emerged to handle the storage and relational analysis of Big Data. From an e-discovery perspective, little is known about these Big Data technologies. How will we preserve and collect the potentially relevant data? What tools can be used to process and review for privilege and responsiveness? What will production of Big Data ESI include? There are many questions to ask, yet few answers to them today.

As part of the litigation discovery phase, an examination of potentially relevant data locations is typically performed. Depending on the type of matter, the ESI that resides in Big Data could be potentially relevant. As an example, a medical device manufacturer may record telemetry from its devices implanted in the human body. (This type of information gathering from mobile devices is not at all uncommon in the realm of Big Data.) This telemetry can include such information as device ID, date and time information, utilization metrics, and even user profile information. The aggregated data could be potentially relevant to a claim that the devices do not perform as advertised or to the device standards. Equally compelling examples could be identified in the retail industry or the financial industry, which are both enormous proponents of the collection and analysis of Big Data.
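To make the telemetry example concrete, here is a minimal sketch in Python of how such records might be narrowed to the devices and time frame at issue in a matter. The record layout, field names, and sample values are all hypothetical; they simply mirror the kinds of information listed above and do not describe any real manufacturer's system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TelemetryRecord:
    # Hypothetical fields mirroring the data types named in the text.
    device_id: str
    recorded_at: datetime
    utilization_hours: float
    user_profile_id: str


def potentially_relevant(records, device_ids, start, end):
    """Return records for the named devices that fall inside the time frame."""
    return [
        r for r in records
        if r.device_id in device_ids and start <= r.recorded_at <= end
    ]


# Illustrative sample data only.
records = [
    TelemetryRecord("DEV-001", datetime(2012, 3, 1, 8, 0), 4.5, "user-17"),
    TelemetryRecord("DEV-002", datetime(2012, 6, 15, 9, 30), 2.0, "user-42"),
    TelemetryRecord("DEV-001", datetime(2013, 1, 10, 14, 0), 6.1, "user-17"),
]

hits = potentially_relevant(
    records,
    device_ids={"DEV-001"},
    start=datetime(2012, 1, 1),
    end=datetime(2012, 12, 31, 23, 59, 59),
)
```

In practice the filter criteria would come from the scope of the claim, and the query would run inside whatever Big Data platform holds the telemetry rather than over an in-memory list.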

So how does one go about preservation and collection from Big Data? The good news is that Big Data is no different from the ESI data sets that we have been preserving and collecting for many years now. The same principles of preservation and collection that are applied today to emails, electronic documents, and structured data sets (like general ledgers or employee time entry data) will apply to Big Data. Additionally, the information in Big Data repositories is likely not purged very often, because patterns in the historical data are analyzed to come up with future scenarios. While the storage architecture may vary for different time periods of data (online vs. near-line vs. offline), there is a good probability that the available data will cover the relevant time frame of the matter. A word of caution, though: while storage is cheap today, it is not free. Big Data will most likely not remain available indefinitely.

However, there is a lot more Big Data to preserve and collect, and the architects of Big Data technologies are likely not designing access to their systems with e-discovery preservation and collection in mind. After all, the purpose of Big Data is to use historical analytics to help generate predictive analytics that can be applied to future scenarios, not to support e-discovery or legal holds. Email offers a precedent. Initial versions of Microsoft Exchange had no mechanism for implementing legal holds, but email systems have evolved over time to better accommodate preservation and collection (e.g., the legal hold functionality implemented in Microsoft Exchange 2010). I suspect that Big Data systems will do the same as Big Data becomes increasingly relevant for discovery. I also suspect that there will be a host of third-party technologies designed specifically to handle e-discovery of Big Data, just as there is a host of third-party technologies designed specifically to handle e-discovery of emails.

One should consider whether a Big Data repository is the system of record for potentially relevant data. This same consideration is applied to ESI collections today – we do not typically consider a collection of email from a mobile device as the primary email collection because the mobile device is typically not the system of record for emails. Rather, these mobile devices typically synchronize to an email server, which is usually the system of record for the emails. Perhaps there are components of Big Data that need to be preserved and collected because that content only resides in the Big Data system. This is very similar to text messages on a mobile device that are not synchronized to a text message server. In that scenario, the mobile device is the system of record for those text messages.

There is more good news when you consider the history of e-discovery technologies. Over time, these technologies have been developed to process and search large data volumes. For example, ten years ago, processing and searching emails from an entire corporate email server was a pretty significant endeavor. Today, there are tools that can process, index, and search entire corporate email servers containing terabytes of information with relative ease. These tools evolved and were developed because legal matters required that large data volumes of potentially relevant emails be processed and searched during the discovery phase. While the data volumes handled today may be minute when compared to Big Data, the technologies will continue to evolve so that they can handle potentially relevant Big Data. In fact, today’s applications have already incorporated one aspect of Big Data analytical systems – the ability to predict document reviewer decisions using technology-assisted review. This type of forward-looking analysis is one component that Netflix utilized when deciding to move forward with “House of Cards,” which was a completely new business territory for the company.[4]

What about the review and production of Big Data ESI? This is a much more challenging issue, especially when you consider that Big Data is an aggregate of many different types of ESI. With traditional ESI review, one methodology is to segregate the structured data from the unstructured data and use different technologies and processes to review and produce each format. Emails and documents are reviewed using a standard document review technology. Databases are reviewed via queries and reports run against the underlying data. With Big Data, some of it may be structured while other components of the data set are unstructured. We could potentially still segregate the structured data from the unstructured data, but will that break the meaningful relationships within Big Data, much like breaking a parent email from its attachment may change the meaning of the two documents when reviewed separately? It is likely that as Big Data evolves and becomes more in scope for discovery purposes, there will be an evolution of the Big Data analytical systems to generate reports and data purviews specifically designed for e-discovery. We see this evolution today with the enhancement packages available for SAP Customer Relationship Management that enable e-discovery reports and legal holds within the SAP system.[5]
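One way to picture the segregation concern is the sketch below, in Python. It splits items into structured and unstructured sets for separate review workflows while keeping a shared identifier on each item so that related items can be brought back together. The `family_id` field, the item layout, and the sample data are assumptions made for illustration, loosely modeled on how email families tie a parent message to its attachments.

```python
from collections import defaultdict

# Hypothetical items; family_id ties related records together.
items = [
    {"id": "A1", "family_id": "F1", "kind": "unstructured", "desc": "customer email"},
    {"id": "A2", "family_id": "F1", "kind": "structured", "desc": "order rows"},
    {"id": "B1", "family_id": "F2", "kind": "unstructured", "desc": "support chat"},
]


def segregate(items):
    """Split items by kind for separate review, keeping family_id on each item."""
    by_kind = defaultdict(list)
    for item in items:
        by_kind[item["kind"]].append(item)
    return by_kind


def regroup(items):
    """Rebuild families so related items can be reviewed or produced together."""
    families = defaultdict(list)
    for item in items:
        families[item["family_id"]].append(item["id"])
    return dict(families)


by_kind = segregate(items)
families = regroup(by_kind["structured"] + by_kind["unstructured"])
```

The design point is simply that segregation need not destroy relationships if a linking key survives the split. Whether a given Big Data system exposes such a key is an open question.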

As for the production format, the parties already need to come to agreements during Rule 26(f) conferences. These agreements will be even more crucial when dealing with Big Data, as an endeavor to TIFF all of the documents in a Big Data set will likely not go over well in any matter. There may be a hybrid production format for Big Data, where some information is produced natively and other data sets are produced via TIFF images. We see that today quite often where ESI productions of unstructured data include TIFF images for most documents and native production of spreadsheets and presentations. The native files produced can include dynamic content that does not typically render well in TIFF images.
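The hybrid approach described above can be sketched as a simple rule in Python. The extension list here is an assumption chosen for illustration; in a real matter the parties would agree on which file types are produced natively.

```python
from pathlib import PurePath

# Assumed list of spreadsheet and presentation extensions; not a standard.
NATIVE_EXTENSIONS = {".xls", ".xlsx", ".csv", ".ppt", ".pptx"}


def production_format(filename):
    """Return "native" for file types that render poorly as images, else "tiff"."""
    suffix = PurePath(filename).suffix.lower()
    return "native" if suffix in NATIVE_EXTENSIONS else "tiff"


# Illustrative file names only.
plan = {name: production_format(name)
        for name in ["Forecast.XLSX", "memo.docx", "deck.pptx"]}
```

Under these assumptions, the spreadsheet and the presentation would be produced natively and the memo as TIFF images.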

Every day, the use of Big Data is becoming more prevalent in corporations. Those same corporations are relying on Big Data analytics to drive strategies and decisions, and those same corporations are involved in litigation. It is only a matter of time before the strategies or decisions derived from a corporation’s Big Data become potentially relevant to some type of litigation. While we may not need to address these issues today, as Big Data does not yet appear to play a significant role in e-discovery, we will need to discuss and address these issues in the very near future. Personally, I am looking forward to the seminal e-discovery case where the primary issues revolve around Big Data ESI. We are not there yet, but we cannot be too far away.
