2.3 – Building Certainty Into Your Data Pipeline

Today’s data pipelines are wildly sophisticated and still somewhat uncertain for many companies, but through careful organization and control, you can uncover an impressive amount of unforeseen value in that data.

In this episode:

  • What is Data Management? (1:28)
  • Two primary ways to process data (4:10)
  • Which is better: batch processing or stream processing? (7:44)
  • The growing pains of scaling a data program (8:51)
  • Data outliers and their impact on cancer drug development (11:50)
  • How Dell Technologies is building certainty into emerging technologies (12:41)
  • How data silos create analytical impediments (13:36)
  • Leveraging enterprise knowledge graphs (14:05)
  • The importance of having a data management strategy (15:52)

This week, Jon meets again with Vish Nandlall from Dell Technologies’ Office of the CTO to dive deep on the topic of Data Management, discussing the myriad reasons why businesses should develop a clear data pipeline strategy, some of the typical roadblocks that can clog those pipelines, and how Dell Technologies is helping customers observe data management from a 360-degree view.

Guest List

  • Jon Hyde leads the Technology Thought Leadership and Emerging Technologies Marketing team that drives and delivers an aligned vision and strategy from the Office of the CTO.
  • Vish Nandlall is the Vice President of Technology Strategy and Ecosystems at Dell Technologies and is responsible for developing strategies to sustain technology leadership across new and emerging areas.

Jon Hyde: Hello, and welcome back to The Next Horizon, a Dell Technologies podcast. I’m Jon Hyde, and together we’ll explore the implications of several major emerging technologies for business, society, and most importantly, for you. Hi, I’m Jon Hyde, and today I’m joined by Vish Nandlall. Vish, you lead the Technology Strategy and Ecosystems group in John Roese’s Office of the CTO here at Dell Technologies. I’m not 100% sure what that means. Could you explain what exactly it is that you and your team do?

Vish Nandlall: Sure. Technology Strategy and Ecosystems is really all about making sure that we don’t miss the next big technology inflection. The whole idea is, we want to be able to lead through our product portfolio with the latest innovations, to anticipate when technologies are going to be sufficiently mature to be adopted by customers, and to develop a point of view and a strategy around them so that our business units can take best advantage of those technologies. That’s really what my team is all about.

Jon Hyde: Got it. Okay, that makes a lot more sense. Okay. Thank you for that. Today we’re here to talk about data management and it seems like there’s a lot of different definitions of data management rolling around out there. In your words, in our context as a company, what do we mean when we say data management?

Vish Nandlall: It’s a very loaded term, data management. I like to think of it as an IT process, and the goal of that process is to organize and control your data. You want to make sure that it’s accessible, reliable, and timely whenever any of your users call upon it. What that means in reality is a set of tools that encompass the entire life cycle of a data asset, from the very initial creation of the data all the way to its final retirement. So you’re going to use a suite of capabilities that help you collect, validate, store, organize, protect, process, and otherwise maintain the data. And from a Dell Technologies perspective, we like to think of it across three broad pillars.

Vish Nandlall: There’s preservation of your data, which encompasses things like the data storage and persistence layers and data protection. There’s data activation, which brings a lens on how I process the data: how does the data make its way to a tool or a platform that allows me to compute on it? Finally, there’s data curation, where we have capabilities like data governance. How do I manage the data? How do I apply policy? How do I ensure my data is sufficiently private? All of those capabilities fall under that pillar. So we look at it through those three broad pillars, but as I said, it means many different things.
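
To make those pillars concrete, here is a minimal, hypothetical Python sketch of a data asset moving through that life cycle; none of these names come from an actual Dell product or API.

```python
# A minimal, hypothetical sketch of a data asset's life cycle across the
# three pillars; all names here are illustrative, not a real Dell API.
from dataclasses import dataclass, field
from enum import Enum

class Stage(Enum):
    CREATED = "created"      # initial creation of the data
    STORED = "stored"        # preservation: storage, persistence, protection
    ACTIVATED = "activated"  # activation: processed by a compute platform
    CURATED = "curated"      # curation: governance, policy, privacy
    RETIRED = "retired"      # final retirement

@dataclass
class DataAsset:
    name: str
    stage: Stage = Stage.CREATED
    policies: list[str] = field(default_factory=list)  # e.g. retention, privacy

asset = DataAsset("sales_events")
asset.policies.append("retain-7-years")  # curation: apply policy
asset.stage = Stage.STORED               # preservation: persist and protect
```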

Jon Hyde: Yeah. It feels like a life cycle, right? So we’re looking at origination of the data, curation of it, how we handle it and all the things that we need to think about when we start to handle these things that have a lot of value to our customers. And then how do we ensure that we have integrity of that data, secure it whether it’s for intellectual property rights, or whether it’s against malicious activity or whatever the case may be, and how do we actually start to gain value from it?

Jon Hyde: It really opens up this idea of the data pipeline, what that means, and how data in motion can have value while it’s in motion as well as once it gets to its end state. Now, can you talk about the importance of developing a truly comprehensive data management strategy, one that suits both the old world of data at rest and this new world of data in motion, since customers experience both of these at the same time?

Vish Nandlall: The evolution of data management has really tracked against these two trends, this notion of data at rest and data in motion. Fundamentally it culminates in these architectural concepts called the Lambda and Kappa architectures, but let’s not take a detour into that. Let’s really talk about the primary ways we need to process data. Some convenient terms are batch processing and stream processing, and both of these methods have some pretty unique advantages and disadvantages. It really depends on the use case. So if I think of batch processing, you think of data that’s being collected into batches and then fed into an analytic system. And the big thing here is that prior to being loaded into the analytic system, the data has to move through a database or a file system [inaudible 00:04:41].

Vish Nandlall: You could think of batch processing as ideal for situations where I have very large data sets and projects that require deep data analysis. It’s not so desirable for projects where you need speed or real-time results. So then we’ve got stream processing. Stream processing is all about speed and real-time analytics. Here you’re feeding your analytics system piece by piece, as soon as the data presents itself, and that allows you to produce key insights in near real time. Now, how would you pull these two things together? A perfect example is something like artificial intelligence or machine learning. If you think of the development phase of AI, you train a model using a batch process. You’re effectively taking data out of storage. You’re feeding it into the system.

Vish Nandlall: You’re qualifying it relative to a particular expected outcome. That error function then helps to retrain a set of weights until the model starts to converge. That’s the batch process. When you actually take that model and do inference, when you’re classifying, you’re actually deploying it, and you’d typically use a streaming pipeline of data. That streaming pipeline is there to predict an event of some kind, to provide a timely classification insight, i.e., someone is about to steal that loaf of bread off the grocery shelf, and some type of event needs to be triggered as a result. So you can see both of those techniques really coming together in one fairly recent, contemporary example.
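
As a concrete illustration of that pattern, here is a minimal sketch, assuming scikit-learn and purely synthetic data: the model is trained in batch from stored data, then deployed against a stream of arriving events.

```python
# A minimal sketch, assuming scikit-learn; all data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Batch phase: take historical data "out of storage" and train to convergence.
X_train = np.random.rand(1000, 4)            # stand-in for stored features
y_train = (X_train[:, 0] > 0.5).astype(int)  # stand-in for labeled outcomes
model = LogisticRegression().fit(X_train, y_train)

# Streaming phase: classify each arriving event and trigger a timely action.
def on_event(features: np.ndarray) -> None:
    if model.predict(features.reshape(1, -1))[0] == 1:
        print("alert: flagged event")  # e.g. the loaf-of-bread moment

on_event(np.random.rand(4))  # called once per event by the streaming pipeline
```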

Jon Hyde: Yeah. The one thing my mind really jumped to when you talked through this, and it makes really good sense in that context, is this idea of autonomous vehicles, right? You think about how that works in real time. You have the vehicles, which need to be able to make decisions in real time based on the data they have available within their own context. But then you extend that a little bit further to this idea of a swarm or hive mentality, where those vehicles can communicate with each other in near real time to deduce things like traffic pattern issues, or accidents, or just congestion.

Jon Hyde: And how that might be dealt with in the local ecosystem of those vehicles, there’s value there. But then that’s the near real time piece that you referred to. But then there’s the batch processing where you start to look at the anomalous things that happen in those hives and swarms. And you start to surface those anomalies back to the greater control system which can make better decisions or retrain the model, to make better decisions over time. And then pass that back to the individual end points to be able to make those decisions in a different way in the future.

Vish Nandlall: Exactly. There’s really no universally superior method. When you think of the two, they both have strengths and weaknesses, and it’s really going to depend on your project. I think when it comes to data processing, above everything else, flexibility is the most important factor in how you build your data teams and data infrastructure. Most projects will require different approaches. If you have a legacy set of equipment and you’re doing a cloud migration, chances are you’re using batch processing, because most legacy equipment doesn’t support streaming. There’s not a clear winner. The winners are the teams that can work well with both.
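
To ground that flexibility point, here is a minimal sketch of the same computation, averaging sensor readings, done both ways; the file format and event source are hypothetical stand-ins.

```python
import json

def batch_average(path: str) -> float:
    """Batch: collect the full data set first, then analyze in one pass."""
    with open(path) as f:
        values = [json.loads(line)["value"] for line in f]
    return sum(values) / len(values)

def stream_average(events):
    """Stream: update the result as each event presents itself."""
    count, total = 0, 0.0
    for event in events:      # e.g. a message-queue consumer
        count += 1
        total += event["value"]
        yield total / count   # a near-real-time insight after every event
```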

Jon Hyde: No, no. If you want an efficient pipeline, you need to do the pipe-cleaning processes. You need to go out and really understand the inefficiencies and the clogs that are going to keep your pipelines from being efficient. That’s always been a manual process in traditional IT infrastructures. But this idea of embedding machine intelligence, embedding better understanding of the pipeline and the outcomes you’re trying to drive, and creating a more automated pipe-cleaning kind of mentality, is really key to success for a lot of our customers. Are there any insights you can share on that?

Vish Nandlall: Yeah, you’re talking about something that’s super important. We’re seeing a lot of organizations hit the stage where the growing pains of scaling a data program start to become evident. And this is because a lot of people begin the journey with ad hoc data pipelines that depend on a few code-literate experts, so they’re not really repeatable at scale. What all customers have to be aware of as they start the journey is that you need to carefully select data infrastructure that can grow with your organization and help operationalize the process. And I think that’s the key word here. Operationalizing these processes requires something that we call a 360-degree view of your data pipeline. The old world of corporate performance dashboards potentially didn’t require that to the degree you do today. You now have these new ecosystems, analytics platforms and data science platforms, built across companies’ digital transformation projects.

Vish Nandlall: And these are the fundamental building blocks of how they’ll compete moving forward. So you need systems for monitoring: being able to understand when data quality is starting to degrade, how much CPU and storage capacity the commit log is consuming, what topology your data pipeline is traversing, and whether it’s doing that in a compliant way. Today’s data pipelines are incredibly sophisticated. They used to be unidirectional, batch-oriented affairs. Today they’re polydirectional: they span from the core to the edge, they’re typically streaming, they’re multi-cloud. Data pipelines just require a tremendous amount of care and feeding, and that visibility and that operationalization become incredibly important.
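
As a sketch of the kind of checks a 360-degree view automates, here is a hypothetical per-stage monitor for freshness, volume, and schema drift; the thresholds, stage names, and field names are illustrative only.

```python
import time

def check_stage(name: str, records: list[dict], last_updated: float,
                required_fields: set[str],
                max_staleness_s: float = 300.0) -> list[str]:
    """Return a list of issues detected on one pipeline stage."""
    issues = []
    if time.time() - last_updated > max_staleness_s:
        issues.append(f"{name}: data is stale")       # freshness degrading
    if not records:
        issues.append(f"{name}: no records arrived")  # volume anomaly
    for record in records[:100]:                      # sample for schema drift
        missing = required_fields - record.keys()
        if missing:
            issues.append(f"{name}: missing fields {missing}")
            break
    return issues

print(check_stage("ingest", [{"id": 1}], time.time(), {"id", "value"}))
```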

Jon Hyde: You brought up an interesting point that I do want to circle back to, which is this idea of looking at these data pipelines from a 360-degree point of view. The reality is, a lot of times when we start to leverage machine intelligence to help us scale these technologies to a point where we can really capture the value, the machine intelligence does things we may not necessarily anticipate. It could surface new data that we didn’t even anticipate was going to come out of these technologies. And when it does, our natural inclination is to correct that behavior. We go back and say, no, that’s not the data we were looking for, that’s not the outcome we expected. So that’s something for us to think about as we look at our customers and try to help them. Sometimes they may not understand the value of their data.

Vish Nandlall: It’s becoming incredibly important to understand those outlier events and to start asking the appropriate questions about your outliers, because very often they lead to real sea changes in the way you perceive your business. There’s a non-apocryphal story, oddly enough, this is an actual story, of some research into breast cancer, where a cohort was read as showing that a drug was 80% effective in a group of females. What they found through the data analysis was that the interpretation was completely wrong: the drug was 100% effective in 80% of the population. That simple conclusion, flipping an insight on its head, suddenly drives an incredibly different dividend from the research.
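
A toy version of that flipped insight, built on synthetic numbers, shows how the same data yields both readings:

```python
# Synthetic stand-in cohort: 80 complete responders, 20 non-responders.
cohort = [1.0] * 80 + [0.0] * 20

print(sum(cohort) / len(cohort))          # 0.8 -> read as "80% effective"

responders = [r for r in cohort if r == 1.0]
print(len(responders) / len(cohort))      # 0.8 -> 80% of the population...
print(sum(responders) / len(responders))  # 1.0 -> ...in whom it is 100% effective
```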

Jon Hyde: Yeah, absolutely. That’s a powerful thing to think about. So not everyone’s going to get that same type of value from data. But I think that when we think about how we evolve our data management strategy, and we think about how it impacts the work that we do from day-to-day, what are the insights that you draw from these kinds of events?

Vish Nandlall: So, within the Technology Strategy and Ecosystems team, we obviously deal with data at great scale. We’re trying to build certainty into technologies that are emerging. So with inherently uncertain technologies, how do I get certain about them? The best way is to go back to a data baseline that you can build, and that baseline is going to be constructed from a lot of weak signals that come from all of the assets Dell has. For instance, trouble tickets. Understanding the chronic and perennial problems that identify underserved customer needs, needs that can be met and addressed with some emerging technology, helps us create value propositions.

Vish Nandlall: But what we’ve found, when we undertake those types of analyses, is that we run into the same problems other enterprises have with their data assets. Data silos in particular, across the Dell Technologies landscape, are an impediment to a lot of the analysis we want to be able to do. What we’ve been trying to establish, I guess, or invent within the company, is to do a bit of drinking our own champagne: take some of the concepts we’ve been thinking through that could address those problems, and apply them to our own data management exercises. We’ve been looking at things like enterprise knowledge graphs, which have long lived in academia but are starting to see adoption across a lot of smaller enterprises. We’re using graph-based data management and analytics to combine multiple sources into an enterprise knowledge graph.

Vish Nandlall: Now, what we don’t want to do is follow the time-honored tradition of consolidating everything into a knowledge graph. So we’re actually trying to get the benefit of an EKG by integrating the data from multiple sources in an indirect way, a virtual way, meaning the graph represents the data and can pull data from any source, but it’s not consolidating the data. We’re creating an abstraction of the data. By providing that, and separating the use case from the data source, we can leave the data where it is and treat all of the Dell data sets as a single resource to drive all of our use cases. And this is helping us support much deeper, more sophisticated analysis into how we can take emerging technologies and apply them better to the products and services Dell is offering.
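
Here is a minimal sketch of that virtualization idea: a graph layer that binds entity types to live source connectors and resolves queries on demand instead of copying data in. All connector and entity names are hypothetical, and the stub sources stand in for real systems.

```python
class VirtualKnowledgeGraph:
    """An abstraction over many sources; the data stays where it lives."""

    def __init__(self):
        self.sources = {}  # entity type -> callable that queries the owning system

    def register(self, entity_type: str, fetch) -> None:
        """Bind an entity type to a live source; nothing is consolidated."""
        self.sources[entity_type] = fetch

    def query(self, entity_type: str, **filters):
        """Resolve a query against the source system at request time."""
        return self.sources[entity_type](**filters)

ekg = VirtualKnowledgeGraph()
ekg.register("trouble_ticket",
             lambda **f: [{"id": 101, "product": f.get("product")}])  # stub source
ekg.register("product", lambda **f: [{"name": "PowerEdge"}])          # stub source

# One logical resource spanning many physical sources:
tickets = ekg.query("trouble_ticket", product="PowerEdge")
```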

Jon Hyde: Yeah, I think that’s a really powerful thing for us to consider, especially when we talk with our customers. They’re bombarded with a host of challenges, whether it’s regulatory compliance, intellectual property, or governance. It could just be a fear of the technology, not understanding it, or not having the resources to really take these things on. But these emerging technology trends are what are going to help our customers achieve their business goals and create totally new business goals for themselves. So what advice do you have for customers who are hesitant in these areas?

Vish Nandlall: We’re definitely in a realm where not having a data management capability is going to significantly affect your ability to compete. The cornerstone of many data management projects is your analytics and your data science activities, and without that foundation you’re impeding them. If I think of the historical pain point of ETL and data integration, it was the need to move data between operational systems, and that typically had a pretty negative impact on the business. A lot of that movement would occur in batches rather than in real time, and it put a ticking clock on the freshness and relevance of the data that had just been integrated.

Vish Nandlall: That was reasonably sufficient in the past, when you had a limited number of people needing to leverage the data for your corporate performance dashboard or your first-generation BI use case. But today you really need to move at the pace of digital, and that means intuitive access to all the relevant data for an exploding number of end users, not just data producers but data consumers: the data scientists, the business analysts. This ability to do real-time integration and to view and access all that data is really the goal of most of these projects. So at the end of the day, if you’re going to really lean in on digital transformation, you need to make careful, calculated decisions about what your data infrastructure is going to look like.
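
To illustrate that ticking clock, here is a minimal sketch contrasting a once-a-day batch copy with an incremental, change-driven sync; the row shape and function names are hypothetical.

```python
import time

def nightly_batch_sync(source_rows: list[dict]) -> list[dict]:
    """Copy everything once a day: data can arrive up to ~24 hours stale."""
    return list(source_rows)

def incremental_sync(source_rows: list[dict], since: float) -> list[dict]:
    """Ship only rows changed after the last sync: freshness in seconds."""
    return [row for row in source_rows if row["updated_at"] > since]

rows = [{"id": 1, "updated_at": time.time()}]
print(incremental_sync(rows, since=time.time() - 60))  # only the fresh rows
```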

Jon Hyde: I think that’s a really salient point that a lot of our customers will take to heart, and then wonder what to do next. I think a lot of us are in the same boat when it comes to that. And the evolution of data management, how it’s going to impact things like data pipelines and the future of our technology, is a powerful thing for us to think about. So, Vish Nandlall, I want to thank you very much for your time and your insights today. Again, Vish Nandlall from Technology Strategy and Ecosystems in the Office of the CTO, under John Roese, here at Dell Technologies. Vish, thanks so much for your time today, and I look forward to talking again in the future.

Vish Nandlall: Thanks a lot, Jon. It was a pleasure.

Jon Hyde: For those of you who enjoyed this podcast, you can find it at www.delltechnologies.com/nexthorizon, along with future podcasts and other great content focused on emerging technologies. Thank you so much for listening, and be sure to subscribe. Until next time, I’m Jon Hyde, and this is The Next Horizon.