Great Data Scientists Don’t Just Think Outside the Box, They Redefine the Box

Imagine you wanted to determine how much solar energy could be generated from adding solar cells to a particular house. This is what Google’s Project Sunroof does with Deep Learning. Enter an address and Google uses a Deep Learning framework to estimate how much money you could save in energy costs with solar cells over 20 years (see Figure 1).

Figure 1: Google Project Sunroof

It’s a very cool application of Deep Learning. But let’s assume there “might” be an even better way to estimate solar energy savings. For example, suppose you want to use Deep Learning to estimate how much solar energy you could generate with solar panels on the Golden Gate Bridge (that probably wouldn’t be a very popular decision in San Francisco). The obvious approach would be to analyze several photos of the Golden Gate Bridge and estimate how clear the skies are based upon cloud coverage.

However, instead of estimating the potential solar energy generation based upon “cloud coverage,” what if we wanted to use “sunlight reflection” to generate the solar energy estimate (see Figure 2)?

Figure 2: Determining Best Predictive Variables for the Golden Gate Bridge

Or maybe you want to test another metric based upon the “sharpness of the shadows” generated by the bridge? Or another metric based upon how many people in the photo are wearing sunglasses? Or yet another metric based upon…

How do you know which of these variables – clouds or reflection or shadows or sunglasses or anything else – is the better predictor of solar energy generation? You try them all!
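
As a rough illustration of “try them all,” the sketch below scores each candidate variable by the strength of its linear relationship with measured solar output and ranks the candidates. Everything in it is an assumption made for illustration: the variable names, the made-up numbers, and the choice of Pearson correlation as the scoring rule (a real project would more likely compare full models on held-out data).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_predictors(candidates, target):
    """Return (name, |correlation|) pairs, strongest candidate first."""
    scores = {
        name: abs(pearson(values, target))
        for name, values in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up numbers purely for illustration (one value per photo).
solar_output = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "cloud_coverage": [0.9, 0.7, 0.6, 0.3, 0.1],
    "shadow_sharpness": [0.1, 0.2, 0.3, 0.4, 0.5],
    "sunglasses_count": [3, 1, 4, 1, 5],
}

ranking = rank_predictors(candidates, solar_output)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The point is not the scoring rule itself but the habit: put every candidate through the same test and let the data say which one predicts best.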

This thought process highlights an important behavioral trait of the best data scientists: they have the imagination not just to “think outside the box” but to actually redefine the box when trying to find variables and metrics that might be better predictors of performance.

The word “might” is a powerful enabler. “Might” is used to say or indicate that something is possible. It’s a data scientist’s most important concept, because “might” gives the data scientist the license to explore, be wrong, learn and try again.

“It Can’t Be Done” Is Not a Data Scientist Term

Andrew Ng, artificial intelligence visionary and fearless leader for many of us, wrote a recent article titled, “What Artificial Intelligence Can and Can’t Do Right Now.” In the article, Andrew states the following:

“Surprisingly, despite AI’s breadth of impact, the types of it being deployed are still extremely limited. Almost all of AI’s recent progress is through one type, in which some input data (A) is used to quickly generate some simple response (B). For example:”

Figure 3: What Machine Learning Can Do

While the use cases are limited today, the creativity with which data scientists are leveraging Big Data and existing Machine Learning and Deep Learning technologies is staggering. Let me give you one example of how data scientists from one of our Services teams at Dell EMC are thinking outside the box to uncover new ways to help our customers avoid issues in their IT environment and create a more effortless support experience.

Predicting Hard Drive Failures

Let’s say that you are capturing more than 260 different pieces of telemetry data several times a minute for the life of a device. Most of these 260+ variables have incomplete or sparse data, the collection timing doesn’t always line up neatly, and getting time continuity across the devices is a major challenge.

If you were using a traditional Machine Learning algorithm, the data science team would have to spend an overwhelming amount of time 1) feature engineering new variables based on domain knowledge, and 2) using trial-and-error to determine which combinations of variables should even be included in the Machine Learning model.

Instead, our Dell EMC Services data scientists used a patent-pending approach to Deep Learning to “pixelate” the data. They turned the 260+ variables into device performance “images.” Then, once they had created these “images,” the team leveraged a recurrent neural network to find “shapes” and repeatable patterns in seemingly random pixels (see Figure 4).

Figure 4: Pixelating Telemetry Data
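
The details of the team’s patent-pending method are not described here, so the following is only a minimal sketch of what “pixelating” sparse telemetry could look like. It assumes readings arrive as (timestamp, variable, value) tuples, buckets them into fixed time windows, and scales each variable to a pixel intensity. The function name `pixelate`, the variable names, and the choice to reserve 0 for missing data are all hypothetical.

```python
def pixelate(readings, variables, start, bucket_seconds, n_buckets, missing=0):
    """Turn sparse (timestamp, variable, value) readings into a 2-D grid.

    Rows are variables, columns are fixed-width time buckets, and each cell
    holds the bucket's mean value scaled to 1..255. Cells with no readings
    get `missing`, so gaps in collection stay visible in the "image".
    """
    sums = {v: [0.0] * n_buckets for v in variables}
    counts = {v: [0] * n_buckets for v in variables}
    for timestamp, variable, value in readings:
        col = int((timestamp - start) // bucket_seconds)
        if variable in sums and 0 <= col < n_buckets:
            sums[variable][col] += value
            counts[variable][col] += 1

    image = []
    for v in variables:
        means = [s / c if c else None for s, c in zip(sums[v], counts[v])]
        present = [m for m in means if m is not None]
        lo, hi = (min(present), max(present)) if present else (0.0, 0.0)
        span = hi - lo
        row = []
        for m in means:
            if m is None:
                row.append(missing)   # no data collected in this window
            elif span == 0:
                row.append(128)       # constant signal: use a mid-gray pixel
            else:
                row.append(1 + round(254 * (m - lo) / span))
        image.append(row)
    return image


# Made-up readings: timestamps in seconds, collected at irregular times.
readings = [
    (0, "temp_c", 30.0), (5, "temp_c", 34.0), (70, "temp_c", 40.0),
    (130, "temp_c", 50.0), (10, "read_errors", 0), (125, "read_errors", 4),
]
image = pixelate(readings, ["temp_c", "read_errors"],
                 start=0, bucket_seconds=60, n_buckets=3)
print(image)  # [[1, 114, 255], [1, 0, 255]]
```

Bucketing by time window is one way to cope with collection timing that does not line up across variables, and keeping a dedicated “missing” value avoids hand-engineering a fill-in for sparse data.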

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. RNNs can use their internal memory to process arbitrary sequences of inputs, which typically makes RNNs ideal for handwriting or speech recognition. In this case, however, instead of trying to decipher handwriting into words, the data science team used the RNN to decipher the seemingly random pixels into a prediction of the state of the device (see Figure 5).

Figure 5: Using RNNs to Identify Shapes and Patterns Buried in the Telemetry Data
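
To make the “internal memory” idea concrete, here is a minimal sketch of a simple (Elman-style) RNN forward pass in plain Python. It is not the team’s model: the weights are random and untrained, the layer sizes are arbitrary, and the input “image” is made up, so the score it produces means nothing. It only shows how a hidden state is carried from one column of pixels to the next before a single prediction is made.

```python
import math
import random


def make_rnn(n_inputs, n_hidden, seed=0):
    """Create small random (untrained) weights for a one-layer RNN."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "w_in": rand_matrix(n_hidden, n_inputs),    # input -> hidden
        "w_rec": rand_matrix(n_hidden, n_hidden),   # hidden -> hidden (the cycle)
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
    }


def rnn_forward(params, sequence):
    """Feed input vectors in order and return a score in (0, 1)."""
    n_hidden = len(params["w_rec"])
    hidden = [0.0] * n_hidden  # the network's memory starts empty
    for x in sequence:
        # New memory depends on the current input AND the previous memory.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(params["w_in"][j], x))
                + sum(w * hi for w, hi in zip(params["w_rec"][j], hidden))
            )
            for j in range(n_hidden)
        ]
    logit = sum(w * h for w, h in zip(params["w_out"], hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # squash to a probability-like score


# Made-up image: rows are variables, columns are time steps, values in 0..1.
image = [[0.0, 0.45, 1.0],
         [0.0, 0.0, 1.0]]
columns = [list(col) for col in zip(*image)]  # one input vector per time step

params = make_rnn(n_inputs=2, n_hidden=4)
score = rnn_forward(params, columns)
print(f"untrained device-state score: {score:.3f}")
```

In practice the weights would be learned from labeled device histories, and a production system would likely use a deep learning framework rather than hand-written loops.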

I love this example because the team didn’t feel constrained to try to fit the square peg into the round “Machine Learning” hole. Instead, they used Deep Learning in a different context to decipher seemingly random pixels into a prediction of the health of a device. The data scientists didn’t wait until someone developed a better Machine Learning algorithm. Instead, they looked at the wide variety of Machine Learning and Deep Learning tools and algorithms available to them, and applied them to a different, but related use case. If we can predict the health of a device and the potential problems that could occur with that device, then we can also help customers prevent those problems, significantly enhancing their support experience and positively impacting their environment.


One of a data scientist’s most important characteristics is that they refuse to take “it can’t be done” as an answer. They are willing to try different variables and metrics, and different types of advanced analytic algorithms, to see if there is another way to predict performance.

By the way, I included this image just because I thought it was cool. This graphic measures the activity between different IT systems. Just like with data science, this image shows there’s no lack of variables to consider when building your Machine Learning and Deep Learning models!

Want more information on how Dell EMC Services uses data science?

Check out the “Decoding Customer DNA with Data Science” blog by Doug Schmitt, President, Dell EMC Global Services, and watch for the upcoming podcasts “A Conversation with Two Data Geeks” to hear directly from the data scientists behind our transformative technologies.

About the Author: Michael Shepherd

Michael is a Distinguished Engineer and recognized technical evangelist who speaks globally on the impact of emerging technologies. With 25 years of experience in technology, backed by 14 years of growing up in Asia, he currently leads AI Research for Dell Technologies Services and serves on the Pan Dell Patent Committee. Michael’s responsibilities include engaging with external researchers and collaborating internally across the Chief Technology Offices to envision and drive transformation as we prepare for the Age of AI. As Augmented Intelligence improves the efficiency with which humans and machines work together, Michael focuses on “the possibilities” with Machine Intelligence and provides a vision for how Data Scientists in Dell Technologies Services can help drive human progress and better outcomes for businesses and humanity. Michael’s experience as a sole proprietor and subsequent 20+ years at Dell in multiple organizations give him a unique perspective on Dell’s entire product lifecycle. He serves on the MSBA advisory council for the University of Texas McCombs School of Business and has been granted thirteen hardware and software patents in eight countries. When he isn’t consumed by Machine Intelligence, he can be found hiking with family and friends somewhere off the beaten path and out of service.