This is a great time to be a data scientist: we are a bit like rock stars, with fans always trying to catch some private time with us. While there is no clear definition of what a data scientist is (see related blog or view diagram of DS skillset), our take on this role is quite simple:
- Work with stakeholders to elevate high-impact, business-related questions
- Find the means to answer these questions
This blog aggregates our collective experiences as members of EMC’s Corporate IT Data-Science-as-a-Service (DSaaS) team. Our team has been active since 2012, providing Data Science (DS) services to different business units as part of EMC IT’s transformation to an agile and innovative IT-as-a-Service model.
Although we aimed for a technical blog, we thought that the first post should provide a broader context to the DSaaS offering and it will, therefore, be dedicated to the process of innovating and driving data science projects in the corporate environment.
There’s much to be said about deploying a data science team, but more important is how the team interacts with what we simply refer to as “the business.” Let’s make this absolutely clear: the business, not the data, is the main driver for what we do (the data is the raw material from which we carve our answers and prototypes).
In this post we want to focus on what we think are the essential steps of a successful, business-driven data science project cycle.
Step 1: Define the business problem
This phase is all about listening. You have to understand the customer’s internal processes, their data and how it flows. It’s where we query the executive sponsoring the project on their vision and align their expectations with what we do (surprising to some, math is not magic).
It’s also a good time to suggest alternate or additional objectives which may have been overlooked in the past because they seemed too complicated. Remember: this is just the right time to share all the exciting new possibilities data science can offer our customers.
Two crucial points that should never be left out of this stage:
- ROI: making sure that we are really tackling the right problems!
- Evaluation: spending enough time on designing the evaluation process and metrics. In most cases an algorithm’s output is not identical to the answers we are pursuing, so a proper, well-defined evaluation metric is especially important.
Step 2: Exploratory data analysis
Get some data and play with it. This is where you should really spend time just looking at data. We try anything and everything. Some of us like using box plots while others prefer plotting all variables into files (so what if we have 25,000 plots? We don’t really have to examine everything, and with a preview window we can easily go over a few hundred or a few thousand…).
If the data is textual, we spend a good few hours or days evaluating it. Sometimes we can transform data into a graph, and sometimes it’s better to look at its projection (PCA, t-SNE). Spending more time getting to know the data can mean the difference between success and failure, so it is important to be patient and wait before you run off to try all those cool machine learning packages.
We also invite technical folks from the business side to join us at this stage. Remember that we are solving a business problem, so someone who is familiar enough with the setting can help put the data into a broader context:
- Is the data helpful?
- Is it distracting us (e.g., repetitions and known noise in the data may point us in the wrong direction)?
- What is it like in terms of quality? Go beyond just looking for missing values and try to measure other quality properties. For example, verify that a time series signal is not missing any data points. What percentage of data is missing?
- Can we get other relevant data sources? This point in particular can be challenging because there is always more data.
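As a rough illustration of the quality questions above, here is a minimal sketch that measures the percentage of missing values and looks for gaps in a time series. The function names, the hourly sampling interval and the sample readings are all assumptions made up for this example, not part of any project described here.

```python
from datetime import datetime, timedelta


def missing_percentage(values):
    """Percentage of entries that are None (0.0 for an empty list)."""
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)


def find_gaps(timestamps, step):
    """Return (previous, next) pairs whose spacing exceeds `step`."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > step]


# Hypothetical hourly KPI readings: the 02:00 point never arrived,
# and the 03:00 point arrived without a value.
start = datetime(2015, 1, 1)
readings = {
    start: 10.0,
    start + timedelta(hours=1): 12.0,
    start + timedelta(hours=3): None,
    start + timedelta(hours=4): 11.0,
}

pct = missing_percentage(list(readings.values()))
gaps = find_gaps(list(readings), timedelta(hours=1))
print(f"{pct:.1f}% of values are missing; {len(gaps)} gap(s) found")
```

Note that the two checks catch different problems: a missing value is visible in the data, while a missing data point only shows up when you compare timestamps against the expected sampling interval.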
Step 3: Data Preparation
In this stage you transform and hammer the data into the shape needed for different data models. This is by far the most time-consuming part (people who don’t drive data projects are usually surprised to learn that this is where we spend more than 80 percent of our time in a project).
Here we go even deeper in our exploration of the data and spend enough time to make sure we do a good job cleaning it (tip: simply using the sample’s mean to replace missing values may work, but results are usually better if you can leverage domain knowledge).
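To make the tip concrete, here is a small sketch that compares filling gaps with the overall sample mean against filling them with the mean of a domain-defined group. The field names (`tier`, `latency_ms`) and the numbers are hypothetical, and grouping is only one of many ways domain knowledge could be used.

```python
from statistics import mean


def impute_global_mean(rows, key):
    """Fill missing values with the mean of all observed values."""
    fill = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


def impute_group_mean(rows, key, group_key):
    """Fill missing values with the mean of the row's own group.

    Falls back to the overall mean when a group has no observed values.
    """
    observed = {}
    for r in rows:
        if r[key] is not None:
            observed.setdefault(r[group_key], []).append(r[key])
    group_means = {g: mean(vals) for g, vals in observed.items()}
    overall = mean(v for vals in observed.values() for v in vals)
    filled = []
    for r in rows:
        value = r[key]
        if value is None:
            value = group_means.get(r[group_key], overall)
        filled.append({**r, key: value})
    return filled


# Hypothetical measurements; one reading is missing for the "ssd" tier.
rows = [
    {"tier": "ssd", "latency_ms": 1.0},
    {"tier": "ssd", "latency_ms": 3.0},
    {"tier": "ssd", "latency_ms": None},
    {"tier": "disk", "latency_ms": 10.0},
    {"tier": "disk", "latency_ms": 14.0},
]

global_filled = impute_global_mean(rows, "latency_ms")
group_filled = impute_group_mean(rows, "latency_ms", "tier")
print(global_filled[2]["latency_ms"], group_filled[2]["latency_ms"])
```

In this toy data the global mean drags the missing value toward the slower tier, while the group mean keeps it in line with its peers.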
These transformations can be as simple as a normalization or Z-score calculation, but can quickly become extremely complicated and require specialized algorithms. For example, at this stage we sometimes try to derive metadata based on the temporal behavior of a key performance indicator (KPI), or run a clustering algorithm as part of a broader pre-processing phase.
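For the simple end of that spectrum, a Z-score calculation can be sketched in a few lines. This version uses the population standard deviation, which is an assumption on our part; the sample values are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant signal carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Illustrative data with mean 5 and population standard deviation 2.
scores = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scores)
```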
Step 4: Modeling and Evaluation
This is what people actually think we do all day… In fact, we see more and more software packages trying to automate this step (while ignoring most of the other parts).
While the arsenal of statistical tools and machine learning procedures is rapidly growing these days, the real benefit of having a data scientist use these tools is the ability to inject domain knowledge into a model (not to mention devise new models).
Work carried out in this stage should be seen through the lens of the evaluation metric we defined in Step 1. An elegant algorithm is not what we aim for: we seek an answer to a problem, and we use the evaluation metric to measure the quality of the answer we provide.
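As one example of what such a metric might look like, the sketch below scores a model’s flagged items against items that business experts confirmed as relevant, using precision and recall. The choice of precision and recall and the case identifiers are assumptions for illustration only; the right metric depends on the business question agreed on in Step 1.

```python
def precision_recall(flagged, relevant):
    """Precision and recall of flagged items against confirmed relevant ones."""
    flagged, relevant = set(flagged), set(relevant)
    hits = len(flagged & relevant)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers: what the model flagged vs. what experts confirmed.
flagged = {"case-1", "case-2", "case-3", "case-4"}
relevant = {"case-2", "case-4", "case-7"}

precision, recall = precision_recall(flagged, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Whether a missed case or a false alarm is more costly is a business decision, which is why the metric is best fixed together with the customer before modeling starts.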
As a side note, one should also remember that this is a technically challenging phase where we adapt an abstract model to a real data set and have to think about scale (up and out), fault tolerance (sometimes) and Big Data constraints.
Step 5: Insight and Deliverables
This is the final stage in a DS project cycle. Here all the pieces are glued together and handed over (a few slides, Proof-of-Concept code or even an agile prototype). Much of this part is about telling a good story about what was done. Ideally, none of this should be new to our customers; it should be their tool for propagating our joint success story to their managers and peers.
Last but not least: if the first phase was mostly about listening, this is where we discuss matters with our customers, covering possible extensions and improvements and putting together a “lessons learned” document.
So, what are our main takeaways from all this?
- A DS project cycle is a five-headed beast: defining a problem, exploring the data, preparing it, modeling and evaluating, and finally delivering an insight or a data product.
- We work with “the business” (complementing it with our skills) and for “the business” (solving its problem). This is a team effort!
In our next posts, we will present some concrete examples of this process, delving into the business use case, model and technical realization.
For more insights into data science, read these blogs: “Big Data Wilderness: Finding Your Way Starts with Asking the Right Questions” and “Don’t Fear Big Data: Leverage the Right People and Take it Easy.”