5 essential skills for big data scientists

By Brian T. Horowitz, Editor and Contributing Writer

The role of data scientists is becoming more prominent, with Harvard Business Review calling it the “sexiest job of the 21st century” and the White House naming its first U.S. chief data scientist.

A couple of years ago, data scientist jobs were limited to tech companies in Silicon Valley for the most part, according to Adam Flugel, executive recruiter for data science at Burtch Works.

“Now it’s not nearly as limited to those spaces,” Flugel told Power More.

Today, all kinds of companies are hiring people with backgrounds in data mining, machine learning, programming and basic statistics. This is especially true at retail and consumer-packaging companies, Flugel noted.

“What would differentiate a data scientist is the ability to work with unstructured data, real time streams and other complex sources,” he said.

There’s a growing need for data scientists. By 2020, there will be about 10 times the amount of data there is today, according to IDC.

Data scientists will be needed to comb through genomic data in health care to help physicians recommend treatments for patients with cancer. In retail, data scientists will help stores determine which products to place on shelves based on customer preference. In finance, they can help predict the performance of funds.

No matter whether the title is big data scientist, data architect or data analyst, the focus of the position is to gather and make sense of data to ultimately make better business decisions, John Reed, senior executive director of Robert Half Technology, told Power More.

The role also requires the person to place the data in a reportable format and to use business intelligence tools, Reed said.

Here are five skills data scientists will need. 

1. Experience with Apache Hadoop

The ability to build and set up Apache Hadoop clusters is important for a data scientist. Hadoop is an open-source framework written in Java that enables storage and processing of large data sets. Hadoop is known as the standard in big data for storing, processing and analyzing hundreds of petabytes of data. The framework allows users to gain insights from both structured and unstructured data. 

Merkle, a customer-relationship marketing agency, uses Hadoop to provide its clients with a unified view of customer data.

“We had to come up with a computing model that was way more scalable than what we had,” Shawn Streett, vice president of managed hosting for Merkle, said in a case study. “Hadoop was the logical target for us.”

Analytics software built with a Hadoop-ready library can allow users to create dashboards of scenarios, such as what shoppers might purchase at various times of the year or how air quality might affect asthma patients.

Core components of the Hadoop stack include the Hadoop Distributed File System and MapReduce, a programming model that uses a parallel, distributed algorithm on a cluster to process and generate large data sets.
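To make the MapReduce model more concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop itself; the function names and the sample documents are hypothetical, chosen only to show the map, shuffle and reduce steps that a cluster would run in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # On a real cluster, each document (or block of data) would be
    # mapped on a different node; here the loop runs sequentially.
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical input, for illustration only.
documents = ["big data big insights", "data scientists use big data"]
counts = word_count(documents)
print(counts)
```

Because each map call looks at only one document and each reduce call looks at only one key, the work can be split across many machines, which is the property that lets the model scale to large data sets.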

2. Ability to pull data from multiple databases

Data scientists in various industries, including health care, finance and retail, need to pull data from multiple databases, Flugel noted.

“The more types of databases you can work with, the more hands-on experience you have, the more effective a data scientist you’ll be because you can work with more data sources,” Flugel said. 

Examples include MongoDB, a cross-platform document-oriented database, and Apache Cassandra, an open-source distributed database management system. Both are NoSQL databases, a category that is becoming a more important part of big data and Web applications because searches can be less constrained. NoSQL databases allow data scientists to store and retrieve data without the tabular methods of relational databases.

Health care organizations use MongoDB to analyze lab results, which are structured data, and physician notes, a type of unstructured data, to provide a 360-degree view of the patient.
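As a rough illustration of the document-oriented idea, the sketch below keeps structured lab results and free-text physician notes together in one record per patient. It uses plain Python dictionaries rather than the MongoDB API, and every field name, value and helper function is a hypothetical stand-in.

```python
# Hypothetical patient documents. Unlike rows in a relational table,
# documents in the same collection need not share the same fields.
patients = [
    {
        "patient_id": "p-001",
        "lab_results": [{"test": "glucose", "value": 5.4, "unit": "mmol/L"}],
        "physician_notes": "Patient reports mild fatigue; follow up in two weeks.",
    },
    {
        "patient_id": "p-002",
        "lab_results": [
            {"test": "glucose", "value": 7.9, "unit": "mmol/L"},
            {"test": "hba1c", "value": 6.8, "unit": "%"},
        ],
        # No physician notes recorded yet for this patient.
    },
]


def find_patients(records, test_name, min_value):
    """Return IDs of patients with a result for test_name at or above min_value."""
    return [
        rec["patient_id"]
        for rec in records
        if any(
            r["test"] == test_name and r["value"] >= min_value
            for r in rec.get("lab_results", [])
        )
    ]


def notes_mentioning(records, keyword):
    """Return IDs of patients whose free-text notes contain the keyword."""
    return [
        rec["patient_id"]
        for rec in records
        if keyword.lower() in rec.get("physician_notes", "").lower()
    ]


high_glucose = find_patients(patients, "glucose", 7.0)
fatigue = notes_mentioning(patients, "fatigue")
print(high_glucose, fatigue)
```

The same record answers both a structured query (a numeric threshold on a lab value) and an unstructured one (a keyword in the notes), which is the kind of combined view the health care example describes.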

3. Knowledge of predictive modeling

Data scientists need to be able to make predictions from data models and gain business value from the data. They should also be able to work with dashboards and draw conclusions from the data, Reed said. 

“You have to know when the interesting becomes the sublime,” said Tony Baer, principal analyst at Ovum.

Predictive analytics is valuable in industries such as retail, in which stores can gain insight into customer preferences to try to gain an advantage over competing stores, Reed said.

Real estate agents could also use big data analytics to gain insight on the type of homes bought during the previous summer.
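One of the simplest predictive models is a straight line fitted to past observations. The sketch below fits such a line by ordinary least squares in plain Python and uses it to forecast a new value. The retail figures are invented for illustration and are deliberately exactly linear so the result is easy to check; real data would be noisier and would call for validation on held-out data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: weekly promotion spend (thousands of dollars)
# against units sold. Constructed so that units = 2 * spend + 10.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [12.0, 14.0, 16.0, 18.0, 20.0]

model = fit_line(spend, units)
forecast = predict(model, 6.0)
print(model, forecast)
```

In practice a data scientist would more likely reach for a statistics or machine learning library in R or Python, but the underlying step is the same: learn parameters from historical data, then apply them to cases not yet seen.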

4. Familiarity with Apache Spark in-memory stack

Data scientists will need to know big data analytics platforms such as Apache Spark, which is incorporated in products such as the Dell In-Memory Appliance for Cloudera Enterprise.

“Time will tell, but it seems like the buzz around Hadoop is slowly dropping off while Spark gets more and more attention,” Flugel said.

Spark can run on the Hadoop Distributed File System, although its programming paradigm differs from that of Hadoop's MapReduce, he said.

“It seems like Spark is gaining momentum, and I can see it becoming the next big thing you hear about when it comes to data science infrastructure,” Flugel said.

“I think it will probably overcome MapReduce in a lot of ways in terms of prevalence,” Flugel added, noting that Spark is faster than the two-step MapReduce for many applications.

MapReduce is disk-based, while Spark runs in memory and offers more potential for real-time processing of data.
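The difference between the two approaches can be sketched with a toy two-stage pipeline in plain Python. This is not Spark or Hadoop code; the function names and input values are hypothetical, and the only point is to show where the intermediate result lives between stages.

```python
import json
import os
import tempfile


def square_all(values):
    """Stage 1: transform every input value."""
    return [v * v for v in values]


def total(values):
    """Stage 2: aggregate the transformed values."""
    return sum(values)


def run_disk_based(values):
    """Write the stage 1 output to disk, then read it back for stage 2."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.json")
        with open(path, "w") as f:
            json.dump(square_all(values), f)
        with open(path) as f:
            intermediate = json.load(f)
    return total(intermediate)


def run_in_memory(values):
    """Keep the stage 1 output in memory and pass it straight to stage 2."""
    intermediate = square_all(values)
    return total(intermediate)


# Hypothetical input values.
values = [1, 2, 3, 4]
disk_result = run_disk_based(values)
memory_result = run_in_memory(values)
print(disk_result, memory_result)
```

Both versions return the same answer. The disk-based version pays for an extra write and read between stages, a cost that grows with the number of stages and the size of the data, which is one reason in-memory processing suits iterative and near-real-time workloads.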

5. Programming and coding

The ability to write code is the “most basic, universal skill” for data scientists, according to Harvard Business Review.

Employers will see it as a plus if data scientists can code, Flugel said.

Potential data scientists should also be able to program in R or Python, Baer said. R, a programming language for statistical computing, was created by researchers Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand. Python, which is used to structure data and run applications that pull data from websites, features several packages and libraries for predictive analytics as well as machine learning, Flugel noted.

Candidates with a background in mathematics and statistics may still be hired even if they lack experience with specific products such as a Hadoop application, Reed said.

Demand for data scientists exceeds the number of candidates available, according to experts.

“We’re never going to have enough true data scientists out there,” Baer said.

With all of the skills required for data scientists, it still comes down to analytics.

“The core of data science is still the analytics, the ability to create and work with predictive models, to really pull out meaningful predictions, and get business value out of the data,” Flugel said.

About the Author: Power More