Data Science Lessons: Insights from an Agricultural Proof of Concept

Agriculture has come a long way from ancient times through the industrial revolution to the current digital era. In 2017, modern agricultural organizations have access to increasingly large amounts of data collected from soil quality measurements, weather sensors, GPS-guided machinery, and more. According to a recent USDA survey, more than 60 percent of corn and soybean crops are monitored by data collection devices (source). However, there is still a substantial gap between the potential of this data and what happens in reality. Despite having the data, many companies lack the capability to process and analyze it effectively and to efficiently build informative models that support data-driven decisions.

That’s where guidance from data service providers, such as Virtustream, can help. Virtustream provides data management expertise, tools and data science consulting to enable customers across different industries to get value from their data resources.

Our data science team in Dell IT recently initiated a Data-Science-as-a-Service Proof of Concept (PoC) as part of a Virtustream service engagement with a large company that plants crops on thousands of farms across the US. Virtustream has enabled the company to become more data-driven by harnessing its large amounts of data and by developing and implementing applications that enable more scalable, faster, and more accurate operations – operations that couldn’t be executed with existing tools. Our PoC sought to demonstrate the speed and efficiency of those analytics applications.

Our goal in this PoC was to enable fast and automated execution of seasonal yield predictions for each field, in terms of tons per acre, using Virtustream cloud services. Such a prediction has significant business value for the company, as it enables efficient resource allocation among the different fields – in other words, maximal crop yield at minimal cost.

In a three-week sprint, we delivered a model that is at least as accurate as, and much faster than, the existing model (which runs on a single laptop). The model calculated the predicted yield of a given field based on internal data sources (soil type, fertilizers, plant ages, etc.) and external public sources such as satellite images and historical weather data. This short-term engagement highlighted four points that we find worth sharing with readers:

Data Science as a part of a bigger solution

Cloud and storage services serve as tools to achieve what the customer really wants – business value. A key way to illustrate that business value to the customer is via Data Science PoC outputs that provide concrete proof of what can be achieved. We ask the customer to share a dataset (or a fraction of it), and after several days we deliver concrete evidence of the potential ROI of the proposed solution. In this PoC, we showed that, using the proposed Virtustream cloud infrastructure, a prediction model that used to take several days to create can be built in less than 30 minutes.

There is a fine line between Small Data and Big Data

‘Big Data’ has become one of the most widespread buzzwords of this decade. But what is Big Data? Is it a 2GB dataset, 100GB, or 1TB? From our perspective, the answer is very clear: any data processing task that requires a scalable programming paradigm to be completed in a reasonable amount of time is in the “Big Data zone.” Whenever possible, we prefer to develop with standard tools (such as simple R or Python scripting), as this usually results in shorter, clearer code that requires minimal configuration effort. But what happens when you start with relatively simple procedures that take several seconds and, after three days of development, find yourself waiting more than an hour for one procedure to finish?

When things get rough, we want to shift our code to the Big Data zone as quickly as possible. In this PoC, we started developing our solution with standard Python libraries. When we wanted to integrate image processing data into our model, it became clear that we needed to shift the solution to the Big Data zone. Because we were working in a highly scalable environment, this shift was easy: we quickly configured our environment to run the existing code in Big Data mode (using PySpark), allowing more complicated procedures to be integrated into it.
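As a rough illustration of that shift – not the actual PoC code, and with hypothetical column names and values – the sketch below computes a single per-field feature (average seasonal temperature) first with pandas and then with the equivalent PySpark aggregation. The logic is the same; the PySpark version simply scales out across a cluster once the data no longer fits on one machine.

```python
# A minimal sketch, not the PoC code: the same per-field aggregation
# (average seasonal temperature) expressed with pandas and with PySpark.
# Column names and values here are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

weather_pd = pd.DataFrame({
    "field_id":    [1, 1, 2, 2],
    "temperature": [21.5, 24.0, 18.2, 19.8],
})

# Small-data version: fine while the weather table fits in memory.
features_pd = (weather_pd.groupby("field_id")["temperature"]
               .mean()
               .rename("avg_season_temp")
               .reset_index())

# Big-data version: the same aggregation in PySpark.
spark = SparkSession.builder.appName("yield-features").getOrCreate()
weather_sdf = spark.createDataFrame(weather_pd)
features_sdf = (weather_sdf.groupBy("field_id")
                .agg(F.avg("temperature").alias("avg_season_temp")))
features_sdf.show()
```

Keeping feature logic expressed as simple, declarative aggregations like this is one reason such a move between the two modes can be nearly painless.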

[Figure: Data Science – developed architecture]

Focus on the business problem, not on the dataset

“I have a lot of data. Let’s do Data Science!” may be the first sentence that initiates an engagement around a data science project. But carrying this approach into the rest of the project is likely to result in burnout. The first two questions that should be asked before initiating any data science customer engagement are, “What is the business problem?” and “Can we solve this problem with the available data sources?” Answering these two questions is often complicated. Sometimes it requires a three-to-five-day workshop that includes brainstorming and data analysis sessions with the customer. However, it is a prerequisite for a successful project. Every data source should then be considered in light of the business problem it helps to solve.

The business goal that the company presented in this PoC was very clear: predict farm yield in terms of tons per acre. Given this business problem, we were able to search for the relevant datasets and available knowledge, whether internal from the customer (soil types, previous years’ yields, planted species, etc.) or external (public weather datasets, satellite images of the farms).

Developing in pair mode

Delivering a complete solution in three weeks leaves no time for distractions and redundancies. In this PoC, we were working as a pair, so the risk of duplicated work or disconnected silos (pun unintended: not to be confused with grain silos) was even higher. To mitigate this risk, it was necessary to deliberately modularize the work. It was clear from a very early stage of the engagement that there were two main tasks in the PoC: building the model pipeline and processing the satellite images.

Building a designated model pipeline is a complicated task in which features are engineered from the given data (for example: number of rainy days in the season, average temperature during the season, previous year’s yield, etc.) and then fed into a model-competition module that chooses the best machine-learning model for the given problem (e.g., Random Forest, Neural Network, Linear Regression). Satellite image outputs, in turn, serve as crucial features in the model, as the Normalized Difference Vegetation Index (NDVI) and the Normalized Difference Water Index (NDWI) are highly indicative of yield. Converting an image into single numbers such as NDVI and NDWI is a non-trivial task in its own right, so we decided to separate it from the rest of the development process and combine its outputs into the developed model when ready.
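As a rough sketch of what a model-competition step can look like – not the PoC implementation, and using stand-in data in place of the real feature matrix and yield vector – the candidate model families named above can be compared with cross-validation and the winner refit on all the data:

```python
# Illustrative model-competition sketch; the feature matrix X and yield
# vector y are stand-ins, not the customer's data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

candidates = {
    "random_forest":     RandomForestRegressor(n_estimators=200, random_state=0),
    "linear_regression": LinearRegression(),
    "neural_network":    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000),
}

# Score each candidate by cross-validated mean absolute error and keep the best.
scores = {
    name: -cross_val_score(model, X, y,
                           scoring="neg_mean_absolute_error", cv=5).mean()
    for name, model in candidates.items()
}
best_name = min(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)
print(best_name, scores)
```

Similarly, once the relevant satellite bands have been extracted for a field, the indices themselves follow standard definitions: NDVI = (NIR - Red) / (NIR + Red) and, in McFeeters’ formulation, NDWI = (Green - NIR) / (Green + NIR). The band arrays below are hypothetical; the non-trivial work in practice is extracting clean, per-field bands from the imagery in the first place.

```python
# Minimal sketch of turning per-pixel band arrays into single model features.
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + 1e-10)  # epsilon avoids division by zero

def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Normalized Difference Water Index (McFeeters): (Green - NIR) / (Green + NIR)."""
    return (green - nir) / (green + nir + 1e-10)

# Hypothetical per-pixel reflectance arrays for one field.
rng = np.random.default_rng(0)
nir_band, red_band = rng.random((2, 64, 64))

# Collapse the field's image to a single feature, e.g. the mean index value.
field_ndvi = float(ndvi(nir_band, red_band).mean())
print(field_ndvi)
```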

To summarize, in each of our engagements we learn different lessons that sharpen our analytics skills. We share these lessons because we feel it is highly important to give readers a true sense of data science work.

You are more than welcome to contact our Dell IT team (Data Science as a Service) with any questions and issues.

About the Author: Omer Sagi