Data lakes for data science

Big data can be challenging for an enterprise organization because it affects data scientists, application developers, and infrastructure managers differently. Each of these specialists has different needs when it comes to analytics frameworks and storage infrastructure.

A data lake is a storage strategy to collect data in its native format in a shared storage infrastructure, making data available to different analytics applications, teams, and devices over common protocols. The notion of an EMC Isilon data lake sets the stage for a discussion of the kind of architecture that best supports the enterprise data science program pipeline and the newer, highly scalable big data tools. You can find this discussion in a new white paper, Data Lakes for Data Science: Integrating Analytics Tools with Shared Infrastructure for Big Data.

This blog post highlights the impact that data science has on an enterprise organization, and the considerations for decision makers to keep in mind about analytics frameworks and storage infrastructure. For details about data lake solutions and examples, refer to the white paper.

The impact of data science on the enterprise

Implementing an enterprise data science program to analyze big data involves two overarching, interrelated requirements:

  1. The flexibility to use the analytics tool that works best for the dataset on hand.
  2. The flexibility to use the analytics tool that best serves your analytical objectives.

Several aspects of the data science pipeline highlight these requirements:

  1. When you begin to collect data to solve a problem, you might not know the characteristics of the dataset, and those characteristics might influence the analytics framework that you select.
  2. When you have a dataset, but have not yet identified a problem to solve or an objective to fulfill, you might not know which analytics tool or method will best serve your purpose.

Analytics frameworks

With the traditional solution of the data warehouse and business intelligence system (DW/BI), these requirements are well known, as the following passage from Margy Ross and Ralph Kimball’s book, “The Data Warehouse Toolkit,” illustrates:

“The DW/BI system must adapt to change. User needs, business conditions, data, and technology are all subject to change. The DW/BI system must be designed to handle this inevitable change gracefully so that it doesn’t invalidate existing data or applications. Existing data and applications should not be changed or disrupted when the business community asks new questions or new data is added to the warehouse.”

However, unknown business problems and varying datasets demand a flexible approach to choosing the analytics framework that will work best for a given project or situation.

In particular, one change that DW/BI systems have difficulty adapting to is the demands of big data. In the face of new business requirements to collect and analyze large sets of unstructured data, DW/BI systems have become barriers to change. Why?

Because a data warehouse or relational database management system (RDBMS) cannot scale to handle the volume and velocity of big data, and it does not satisfy some key requirements of a big data program, such as handling unstructured data. The schema-on-write requirements of an RDBMS impede the storage of a variety of data; a data lake, by contrast, stores data in its native format and applies a schema only when the data is read.
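To make that contrast concrete, here is a minimal, illustrative sketch of schema-on-write versus schema-on-read. It is not taken from the white paper; the table, file paths, and column names are hypothetical.

```python
# Illustrative sketch only: table, paths, and columns are hypothetical.
import sqlite3
from pyspark.sql import SparkSession

# Schema-on-write (RDBMS style): the structure must be declared up front,
# and every incoming record has to fit it before it can be stored.
db = sqlite3.connect("warehouse.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS clickstream (
        user_id INTEGER,
        url     TEXT,
        ts      TEXT
    )
""")

# Schema-on-read (data lake style): raw, semi-structured JSON is kept in
# its native format; a schema is inferred only when the data is read.
spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()
raw_events = spark.read.json("/datalake/raw/clickstream/")
raw_events.printSchema()
```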

Indeed, the sheer variety of data requires a variety of tools—and different tools are likely to be used during the different phases of the data science pipeline. Common tools include Python, the statistical computing language R, and visualization software, such as Tableau. But the framework that many businesses are rapidly adopting is Apache Hadoop.

Analytics tools such as Apache Hadoop, Apache Hive, and Apache Spark underpin the data science pipeline. At each stage of the workflow, data scientists are cleaning their data, extracting aspects of it, aggregating it, exploring it, modeling it, sampling it, testing it, and analyzing it. Such work brings many use cases, and each use case demands the tool that best fits the task. During the stages of the pipeline, different tools, such as Apache Hive and Apache Spark, may be put to use.
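As a rough illustration of how a few of these stages might look in practice, the following PySpark sketch ingests, cleans, aggregates, and samples a dataset stored in a shared data lake. The dataset, paths, and column names are assumptions made for the example, not details from the white paper.

```python
# Hedged sketch of a few pipeline stages; paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline-stages").getOrCreate()

# Ingest: read raw CSV files from the shared data lake in their native format.
events = spark.read.option("header", True).csv("/datalake/raw/sensor_events/")

# Clean: drop malformed rows and cast types.
cleaned = (events
           .dropna(subset=["device_id", "reading"])
           .withColumn("reading", F.col("reading").cast("double")))

# Aggregate: summarize readings per device for exploration and modeling.
summary = cleaned.groupBy("device_id").agg(
    F.avg("reading").alias("avg_reading"),
    F.count("*").alias("n_samples"),
)

# Sample: pull a small fraction into pandas for local exploration in Python
# or visualization in a tool such as Tableau.
sample_df = summary.sample(fraction=0.1).toPandas()
```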

Storage infrastructure and the data lake

The infrastructure of any data storage system must support data access over multiple protocols so that many tools running on different operating systems, whether on a compute cluster or a user’s workstation, can access the stored data.
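As a simple illustration, the sketch below assumes that the shared storage exposes the same directory both over NFS to a user's workstation and over HDFS to a compute cluster, as a multi-protocol data lake can. The paths and file names are hypothetical.

```python
# Hedged sketch of multi-protocol access to one dataset; paths are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession

# On a user's workstation: the file is read over NFS as a local path.
local_view = pd.read_csv("/mnt/datalake/curated/sales_2015.csv")

# On the compute cluster: the same file is read over HDFS by Spark workers.
spark = SparkSession.builder.appName("multi-protocol").getOrCreate()
cluster_view = spark.read.option("header", True).csv(
    "hdfs://datalake/curated/sales_2015.csv"
)
```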

The flexibility of a data lake empowers the IT infrastructure to serve the rapidly changing needs of the business, the data scientists, and the big data tools. If the storage solution is flexible enough to support many big data activities, it can yield a sizable return on the investment.

For more information, including examples of data science studies conducted in enterprise environments, read the white paper, “Data Lakes for Data Science: Integrating Analytics Tools with Shared Infrastructure for Big Data.”

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.


About the Author: Steve Hoenisch