Big Data or Little Data: Do We Have To Choose?

A storm is brewing in the legal world: Should we find ways to better manage our data, enabling us to get rid of data as soon as it has expired; or should we keep all data for as long as possible, relying on storage and search improvements to meet our operational and legal compliance requirements?

There are some good arguments on each side.


Organizations retain a lot of stale, useless data and getting rid of it just makes sense. Only about 1/3 of retained data has business or compliance value (that tracks closely to what we see with most customers).

There are operational costs to retaining data – production and data protection storage, licensing, personnel time – that vary with the volume of data under management.  So getting rid of data on a regular basis can help to save or at least postpone expenditures as the volume of organizational data increases.

When old data is deleted, searches for new and useful information are easier and more effective.  The data we browse through or review in a search result set is more recent, so the spreadsheet or PowerPoint stored in the departmental directory is easier to find, newer and more likely to be useful.  Also, many applications have built-in search tools that work well to a certain point, but with too much volume start to fail, crash or just take a long time to return results.  With less data, we have better and faster searches.

A smaller volume of data is also easier to move and manipulate.  Many applications and data sets are moving to the cloud.  Again, the cost of this process is based largely on the volume of data. Bandwidth limitations (the data has to get into the cloud) can also create issues and delays if a lot of data needs to move back and forth.

Another good argument for deletion is based on eDiscovery, where it costs about $18,000 to review a gigabyte of documents. Although new tools are starting to change this paradigm, an organization holding onto more data will incur a larger expense for eDiscovery than one that limits stale data.  Some lawyers also believe that the extra data gives the other party in litigation more opportunity to support its theories.
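As a rough illustration of the review-cost argument, the following Python sketch applies the $18,000-per-gigabyte figure to a hypothetical matter. The 30 GB volume is invented for the example, and the sketch assumes that the stale two-thirds of the data would otherwise be swept into review, which will not hold in every case.

```python
# Illustrative arithmetic only; the collected volume below is hypothetical.
REVIEW_COST_PER_GB = 18_000  # USD per gigabyte, the figure quoted in the article
USEFUL_FRACTION = 1 / 3      # share of retained data with business or compliance value


def review_cost(gigabytes: float, cost_per_gb: float = REVIEW_COST_PER_GB) -> float:
    """Return the estimated cost in USD of reviewing `gigabytes` of documents."""
    return gigabytes * cost_per_gb


collected_gb = 30  # hypothetical volume collected for one matter

# Cost if everything retained ends up in the review set.
cost_keep_all = review_cost(collected_gb)

# Cost if only the useful third had been retained (assumption: stale data
# would otherwise have been collected and reviewed).
cost_after_cleanup = review_cost(collected_gb * USEFUL_FRACTION)

print(f"Review cost with stale data retained: ${cost_keep_all:,.0f}")
print(f"Review cost after stale data removed: ${cost_after_cleanup:,.0f}")
print(f"Difference: ${cost_keep_all - cost_after_cleanup:,.0f}")
```

The per-gigabyte figure dominates the result, so any change in review technology that lowers it would shrink the gap accordingly.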

Finally, compliance and privacy are moving in a direction where holding expired (out-of-date or incorrect) data is becoming more of a liability, and might even violate certain laws.  One of the principles of European privacy law is that data should be deleted once the purpose for which it was collected has been fulfilled.  (Roughly the same principle exists in the US, although as part of fair information principles, which are merely advisory.)  Even so, lost or stolen data sets containing specific types of information, such as financial or health details, credit card numbers or Social Security numbers, can result in fines, civil liability or an embarrassing (and possibly expensive) notification procedure.  Why keep it if it’s just a liability?

Keep It All!

The flip side to the argument is that all of this data – even stale data – may have value, if it can just be properly leveraged.  With new “Big Data” initiatives and tools, many organizations are leveraging troves of data to better understand or predict trends, determine optimal inventory levels and buying patterns or even identify risky employee behaviors.

Operationally, storage costs continue to drop on a per-gigabyte basis, so while there is still an expense to keep data, many argue that it’s low compared to the potential value of the data. Storage and data protection platforms and software also continue to improve, so it takes less time to manage and protect all of this information.

Compliance requirements can be met with better search and retrieval tools.  The promise of predictive coding (which is a “Big Data” tool) is such that the costs of review in eDiscovery should significantly decrease as the technology gets better and acceptance increases within the courts and with litigators.   There’s even a good argument that a framework in which “stale” data is routinely deleted generates a risk of spoliation that is equal to or greater than the incremental costs of holding the data.  And many lawyers will argue that it’s difficult for them to defend an organization in a case when a lot of the email and documents were deleted before the case began.

Keep Or Delete?

There will always be costs and risks in holding on to too much data, especially as the laws both in the US and internationally continue to change on issues of privacy and security.  Yet we are only beginning to see new tools and algorithms that help us to search, manage and derive information from larger data sets.  So consider borrowing from both sides:

  • Good information governance – mapping your data sources (and maybe your data flows), knowing what information is being stored and where, how it’s being protected and secured, which retention policies apply and the available search tools, etc. – just makes sense. And you don’t have to actually delete the data unless you want to.
  • Some data sets are far more likely to be useful or to provide insights into your business than others. For example, manufacturing, order systems or financial data will probably yield more insights than old email messages.
  • Beware of “dark data” – offline or legacy system data where there is usually little insight into the information. Dark data can be a real compliance and eDiscovery problem and – at least while it’s “dark” – does not provide any value to the business from an analysis standpoint.

Whether your organization prefers to retain data longer or to get rid of it at the earliest opportunity, it pays to provide some governance around your information.  With knowledge of what’s there, you can make some informed decisions and get the best of both approaches.

About the Author: Jim Shook

James D. Shook, Esq., CIPP/US
Director, Compliance Practice, Global Technology Office, Dell EMC

Jim helps Dell EMC’s customers understand and efficiently meet the legal and regulatory obligations for their data, focusing on cybersecurity, privacy, retention and electronic discovery. Along with an undergraduate degree in Computer Science, he is an experienced commercial litigator and a former general counsel to technology companies. Jim publishes and speaks frequently about meeting challenges created by the intersection of law and technology, and has been an active member of The Sedona Conference’s working groups on electronic information (WG1) since 2004 and data security and privacy (WG11) since 2015.