S3 #11 How to merge business analytics and data science in your organization

Guest:

Gaurav Rao

Gaurav is currently the Executive Vice President and General Manager of Machine Learning and AI at AtScale. He is responsible for defining and leading the business that extends the company’s semantic layer platform to address the rapidly expanding set of Enterprise AI and machine learning applications. Most recently, Gaurav served as VP of Product at Neural Magic, innovators in software acceleration for deep learning utilizing sparse model architectures. Previously, he served in several executive roles at IBM spanning product, engineering, and sales that were focused on taking cutting-edge data science, machine learning, and AI products and solutions to market, specializing in model training, serving, MLOps, and trusted AI in the context of driving business outcomes for enterprise applications. He is also an advisor to data and AI companies, including DataRobot, an AI Cloud leader.

Enterprises that merge business analytics and data science can deliver augmented analytics and create a semantic layer so that everyone in the organization understands the past and predicts the future. What does that mean, and what does it take to achieve the perfect balance between opportunity and risk? AtScale’s EVP & GM Gaurav Rao provides timely advice and industry-tested insights.

Keywords:

AI, ML, machine learning, artificial intelligence, data lakes, data management, data insights, data governance, policy, governance, data ethics

Season:

Episode number:

Duration:

32:18

Date Published:

June 23, 2022

[00:00:00] KRISTINA PODNAR, host: Enterprises that merge business analytics and data science can deliver augmented analytics and create a semantic layer so that everyone in the organization better understands the past and predicts the future.

[00:00:12] INTRO: Welcome to The Power of Digital Policy, a show that helps digital marketers, online communications directors, and others throughout the organization balance out risks and opportunities created by using digital channels. Here's your host, Kristina Podnar.

[00:00:30] KRISTINA: With me today is Gaurav Rao. He's the executive vice president and general manager of machine learning and AI at AtScale; he's responsible for defining and leading the business that extends the company's semantic layer platform to address the rapidly expanding set of enterprise AI and machine learning applications. Gaurav, I'm so excited to have you here today. I know this conversation's been in the making for quite some time. We're finally at

[00:00:55] GAURAV RAO, guest: Thank you, Kristina. And likewise, I'm very excited to have the conversation today.

[00:00:59] KRISTINA: So, you know, everybody wants to be using analytics these days. Everybody wants to be using AI. Before we jump down the rabbit hole, very, very deep, can you help us understand the types of analysis we typically see in organizations these days? And where does AI come in?

[00:01:16] GAURAV: It's a great question. And you know, AI is not necessarily a new concept. You know, we've been doing predictive, prescriptive types of analytics for a number of years. And in fact, what you're starting to see is data has always been a critical part of an organization. And as data has continued to grow, the volumes have continued to grow. The types of data have continued to grow. Whether that's real-time marketing data or real-time user behavioral data, it could be unstructured data like video and images. As the number and volumes of data types are growing, organizations have tried to figure out, well, what can I do better with it? What can I drive from an insight? What can I drive from an outcome, right? To better understand and use my data for mission-critical processes. And those processes could be better understanding a lead, right? If you're a sales manager, it could be net retention scores. It could be visual product recommendations in retail, as someone's navigating your site. All of these become data challenges. And I think in the past we've had traditional or classical approaches to solving some of these challenges with data; we've had operational data stores, mainframes, and other transactional systems in the past. And what we've started to see, I think in the last couple of years, is the rate and pace of innovation and state-of-the-art techniques have just exploded around data. And that's kind of where machine learning and AI have come into play. New techniques and new state-of-the-art algorithms allow users to probe and better understand their data.

[00:03:04] KRISTINA: And when you talk about users, my head automatically goes to the marketeer and the organization. It doesn't necessarily go to a data scientist or IT. I think what I hear the most is I'm a marketer; let me have access to the data. But that also seems a little bit dangerous sometimes, right? It's like, what are we looking for when we have data? How does this help us understand how we should structure data democratization and let people have data but also make sure that they're using the data correctly or that they even understand what they're looking at?

[00:03:37] GAURAV: Exactly. And, you know, what's ironic is most of the people we talk to who are coming to companies like AtScale and other folks in the data and AI space are business users who are saying, this is the problem I'm trying to solve. I'm trying to improve operational efficiency, improve or reduce customer churn. And it starts with some sort of a business problem and a business user. And I think what we've started to see from a skill standpoint in the industry is we've gotta simplify the tooling, right, in order to make machine learning, AI, and sort of the whole data process simpler so that even non-developers can understand it. Now, the challenge is a majority of the tools that help these business users are still being operated by data scientists and machine learning engineers. And one of the evolutions that we're seeing in the innovation of this tech space is no code, low code. And that's geared for a sales professional who's using Salesforce or some other CRM, or in marketing, right, the HubSpots, you know, of the world. We know that there's got to be simpler ways for them to understand their data and use their data. And because of these no code, low code techniques, we're seeing sort of two things happen. You're giving access to business users who still need access to raw data, right, in order to build these machine learning models that will fuel the decisions that these business users are making. And that often requires still working with the more technical developers, like the data scientists and data engineers and ML engineers, who are really responsible for organizing, collecting and normalizing the data that's flowing into the enterprise. It's their job to figure out how can I make the data that's coming in, regardless of the industry, more consumable for the tools that are now available for technical and non-technical users to get insight out of it. 
So now I have different levels of personas, different organizations, all sort of trying to collaborate with data, and what that introduces is challenges as well as issues. Challenges in how we share effectively, and issues in: are the right processes in place? Do we have governance? Do we have control? Do people who have access to the data, should they be accessing certain data sets? Are the right non-technical users reading and accessing the data that they technically should or shouldn't have access to? So, the democratization challenge really becomes one of access. And then also one of an evolution of a life cycle, because there's this constant back and forth between multiple personas that are now all interacting with different data in the enterprise.

[00:06:38] KRISTINA: I think you said governance twice, which made my heart go pitter-patter because I'm a little bit of a governance girl. And so, I'm very curious from your perspective, does governance need to happen ahead of data democratization in allowing everybody access to this data? Or should that be folded into the process itself? What do you see as best practices for the industry right now?

[00:06:58] GAURAV: A hundred percent agree with the latter. I think ethics and governance around machine learning and AI, to be honest, is not necessarily new. We've always had and placed importance on governance and ethics in data, and we've had longstanding domain practices around privacy and security that have gone back many, many years on all of these systems I mentioned earlier. I think what's new is, as machine learning and AI continue to innovate and advance, you have these no code, low code, auto ML platforms, which are designed to make technical and non-technical users' jobs and understanding data much, much simpler. So in the past, you used to have to collect large amounts of data. You'd have to have some data engineer sort of process it; data scientists then go build machine learning models. They then deploy these machine learning models. And then that end-user is sort of consuming. That governance process that I just mentioned was in place and is starting to become more prevalent, but it has to be something that people think of as a standard, right? This is not something they want to layer in later; they want to be incorporating proper best practices and stewards and policy-driven decisions. All of that needs to happen early on in the process and ideally leverage what they have, what they're trying to fold in now as some of the new tech, and some of the new challenges that get introduced with this new tech.

[00:08:34] KRISTINA: So, tell us a little bit more about that, because you also mentioned ethics, and ethics doesn't necessarily correlate to regulations and regulatory compliance. I set it aside as a separate category because you can be compliant with certain regulations and not be ethical, or conversely, you can be ethical and perhaps compliant, or you can be ethical and compliant in a super, uber kind of way. That's a competitive advantage. So, when we talk about ethics, what do you mean?

[00:09:05] GAURAV: Yeah. So, to me, I think ethics is an interesting challenge and opportunity. To me, it's a movement, and it may not be the first thing people think about when they think about ethics more generically. But the reason why I think it's a movement is it's an evolving set of frameworks. And it's also about societal best practices on how we're driving responsible use of data. And in this case, AI, right? So, to me, it has both cultural and technology-based implications. We kind of mentioned earlier, privacy, governance, and security. These are all things we've done in the past. And these have been important domain best practices we've established. And I think what's changing with machine learning and AI is ethics is becoming more of a concern because we've increased the rate and pace at which we can build and use AI, right? I have these automatic or low code products that effectively just take data and spit out a machine learning model, and that machine learning model is then making predictions and making decisions. Now we're starting to use AI more heavily in society, whether it's retail recommendations, which seems somewhat benign, to using GPS in your smartphone, which we all rely on pretty much every day. I don't think I can go anywhere anymore without using my GPS. So those are all location-based services, right? To using your Siri or Alexa, your smart home, NLP-based systems, to then more complex issues like underwriting, right? So as someone who's kind of going through the loan process myself in home buying, there's been a very large opportunity to improve the fairness and to remove biases that have been inherent in some of these systemic processes like underwriting. Are we consistently using machine learning to make the best decisions when it comes to approving or denying a loan? Have we historically been biased based on the data? It could be race, it could be age, right? It could be geographic location.
These are all the types of inherent challenges that are starting to come into machine learning and AI specifically, right? As the rate and pace of adoption around machine learning and AI in these industries are starting to rise. And healthcare is another; we see different types of sensitivity amongst race, sex, and also age, based on clinical data and the sort of data that we're starting to capture. And it's no surprise that this industry is also ripe, based on the data that they're collecting, for machine learning and AI. So, there also, the responsible use of AI becomes prevalent, and it's why we're starting to see more people have issues with what's being collected, how it's being collected, and then, more importantly, as you start spitting this data into some of these very powerful tools to build models that then make predictions and decisions: Is that a black box? Can I really understand what that machine learning model is doing? Do I know from soup to nuts that the right data was there to train the model, that the right person was hired with the right diversity in thinking to train that model? Do I have the checks and balances in place to check that the data was validated, that the model is continually making the right decisions once it's been deployed? These are all sort of the end-to-end challenges, and why I think ethics is becoming sort of front and center with AI.
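
A minimal sketch of the kind of bias check described in this answer, assuming loan decisions are available as (group, approved) pairs. The group labels, the sample decisions, and the `REVIEW_THRESHOLD` value are all made up for illustration and are not from the conversation; the metric shown (lowest group approval rate divided by the highest) is one simple fairness measure among many, not a complete fairness audit.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Return the approval rate per group from (group, approved) pairs."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, was_approved in decisions:
        totals[group] += 1
        if was_approved:
            approved[group] += 1
    return {group: approved[group] / totals[group] for group in totals}


def approval_rate_ratio(decisions):
    """Lowest group approval rate divided by the highest (1.0 means parity)."""
    rates = approval_rates(decisions)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved, so no group was favored
    return min(rates.values()) / highest


# Hypothetical decisions, invented purely for this example.
sample_decisions = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

# Hypothetical cutoff: a real value would come from your own policy or regulator.
REVIEW_THRESHOLD = 0.8

ratio = approval_rate_ratio(sample_decisions)
needs_review = ratio < REVIEW_THRESHOLD
print(f"approval rates: {approval_rates(sample_decisions)}")
print(f"ratio: {ratio:.2f}, needs review: {needs_review}")
```

A low ratio does not prove the model is unfair; it only flags that the outcomes differ enough between groups to warrant the kind of human review discussed above.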

[00:12:54] KRISTINA: I want to hide for a moment because it seems like a huge problem; there are so many challenges ahead of us. What is the thing that you advise organizations right now to do? Because honestly, I think for me, it can sound a little bit overwhelming in terms of how do I bake in ethical AI? How do I ensure that I'm dealing with responsible data governance? Where should organizations start? Where should individuals who are knee-deep in marketing right now start? They know they have data and need it, but where should they start? What should they do?

[00:13:25] GAURAV: To be honest, I would think the first place to start is to review the practices that you have in place. So hopefully, you already have maybe a base data governance strategy. Maybe you have security policies in place. Maybe you have data stewards. And now it's about how do I start to evolve those to meet the needs of the new tools and techniques that you're starting to build and the practices that you're starting to build in the business. And then, more importantly, I think we're starting to see ethics and AI ethicists as an example of roles. And I think this is starting to happen because they realize there's more than just the technology component. So, if we break it into technology and, let's say, culture for a second: on the technology side, I would argue let's continue to layer in the right policies and start with what you have. And ideally, you have something to start with. And then you're building from that. So that could mean more tools, right? So, there is now tooling that exists around what we call trusted AI, so these are the types of tools that will help ensure, as you're building and using and deploying, that the entire life cycle of machine learning and AI can be checked. There are different ways that you can validate your data sets. There are different ways to do different types of testing as you're making your machine learning models. And then, when you run these machine learning models in production, there are different ways to monitor them. We call it observability or ML ops. There's an entire practice around this now, which is good. We're evolving to meet the needs that are coming from machine learning and AI. And then on the culture side, I think this is where some of the most exciting things are starting to happen internally within companies; you see human resources understanding that ML and AI are not something that's going away. So, they're incorporating it, just like your business conduct guidelines that new employees go through.
I would urge people to start understanding and treating AI in the same vein: educate your teams on what AI is, and educate them on how they're using it. And then realizing that there's probably a skill set that they need to bring into the company. And it's not just a technical one. It's, do we need to have a governing board? Do we need an AI ethicist? Do we need someone that's going to stay up to date with the policies that are happening around the world, whether it's in the EU, Japan, or Australia, right? As more and more people realize, again, AI is not going away. How do we continue to use it and live with it, but in a safe, responsible way?

[00:16:07] KRISTINA: I'm so happy that you mentioned that because I think you probably saw the headline that Accenture will be bringing on board 150,000 new hires who will spend their first day of work in the metaverse. And my first thought was what we talk about at XRSI, which is who will own all that data associated with those individuals being onboarded through virtual reality. I mean, it's a lot of potential data points; things like, are you paying attention? Are your pupils maybe drifting off because it's a boring 101 onboarding experience? Is that something that's going to be held against you in your annual performance review? So you bring up a good point around needing to think about HR. So it's not just about who are we marketing to? It's not just about the consumer. It's really about data that's flowing all around us.

[00:16:52] GAURAV: Exactly. And I think it's everything from developers to managers. In prior jobs, we always used to say, AI is never going to replace people, but managers that are using AI will likely replace managers that don't. It's becoming a competitive advantage. So it becomes a pervasive opportunity across enterprises, across different groups, whether it's CSOs who are typically responsible for security but now have to understand how does that extend now to machine learning and AI? Maybe these organizations have a chief data officer, and what we're starting to see now, even more, is chief AI officers and centers of excellence around machine learning and AI starting to become common. There needs to be some sort of body that's focusing on not just governance but also extending it to sort of the responsible use of machine learning and AI. And I think that's one of those areas that becomes more systemic. And it's also a little bit broader than just the technology teams within an organization; it's pulling in HR, marketing, and really making sure that core to that company's DNA is a better understanding of how machine learning is going to help them, and making sure, I think, everybody's on the same page to use it responsibly.

[00:18:15] KRISTINA: And you worked with very large organizations, multinationals; in fact, one of the things that I'm always wondering is how do you achieve this at scale? So when I think about any AI and ML, it's all about the data. And yet we're encountering a lot of these data privacy laws worldwide now, including some that say, hey, you do need to localize that data. You can't export it, like under GDPR. You can't just take all EU citizen data and move it over to the United States into a cloud unless there are certain conditions that are met. China has its own data localization requirements, but we see a lot of that increase from Russia to Korea to South American countries as well. How do you deal with that? And what are you seeing amongst your clients right now who are not just dealing with these data localization challenges but are also in a regulated field? How should they respond? What are they doing?

[00:19:07] GAURAV: It's a great question. And it's one that's still continuing to evolve. I think what we're starting to see more, especially from the customers who are realizing that, one, there are regulations in place, and then two, sensitivities around some of the data. So maybe it's compliance, maybe it's HIPAA, SOC1, SOC2, if you're in banking and financials. I think a hybrid data strategy is becoming one that's more important. Recognizing that you're going to invest in multiple tools. It could be cloud. It could be on-prem. These could be virtual private clouds; having a better strategy in place that can talk to both an on-prem and cloud environment, I think, becomes critical because then you're sort of bifurcating and sort of routing workloads and thinking of it from a workload perspective, routing the workloads to the systems that will allow them to meet their business SLA in a way that can take advantage of locality, regulations, privacy. So I think that's one. And then I think the second area that I'm starting to see more customers invest in is sort of these centers of excellence, because what we start to see them do is they can sort of sit back, and their responsibility is more global; they have a broader sort of impact across the entire company versus one specific department that says, this is how I've been told I need to use my data. They're looking a little bit more holistically at here's the best path forward. Here's the end state that we should be looking at and then working backward, because I think that type of approach is a little bit better versus trying to have every individual team in a silo kind of work their way towards a modern data structure or a modern data stack, because you're going to get 15 different answers based on 15 different teams using 15 different products in different environments.
So, really, centralizing the data strategy, I would say, becomes just as important as having the right infrastructure in place to run it.
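
One way to picture the workload routing described in this answer is a small rule check that filters environments by data residency and deployment type. This is only a sketch under stated assumptions: the `Environment` and `Workload` classes, their fields, and the sample names and regions are all hypothetical, and real residency rules are far more detailed than a single region match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    region: str  # where the environment physically runs
    kind: str    # "cloud" or "on_prem" (hypothetical labels)


@dataclass(frozen=True)
class Workload:
    name: str
    data_region: str            # where the underlying data originates
    must_stay_in_region: bool   # simplified stand-in for a localization rule
    requires_on_prem: bool = False


def eligible_environments(workload, environments):
    """Return the environments that satisfy the workload's constraints."""
    eligible = []
    for env in environments:
        if workload.must_stay_in_region and env.region != workload.data_region:
            continue  # data may not leave its region
        if workload.requires_on_prem and env.kind != "on_prem":
            continue  # workload is restricted to on-prem systems
        eligible.append(env)
    return eligible


# Hypothetical hybrid estate and workloads, invented for this example.
environments = [
    Environment("eu-datacenter", "eu", "on_prem"),
    Environment("eu-cloud", "eu", "cloud"),
    Environment("us-cloud", "us", "cloud"),
]
churn_training = Workload("churn-training", "eu", must_stay_in_region=True)
claims_scoring = Workload("claims-scoring", "us", True, requires_on_prem=True)

churn_targets = [env.name for env in eligible_environments(churn_training, environments)]
claims_targets = [env.name for env in eligible_environments(claims_scoring, environments)]
print(churn_targets)
print(claims_targets)
```

An empty result, as for the second workload here, is itself useful: it shows a gap between the policy and the infrastructure that a central data strategy team would need to resolve.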

[00:21:16] KRISTINA: Do you see a trend in that context towards synthetic data? Or are we still, you think, a long way off from that?

[00:21:23] GAURAV: No. I think synthetic data is becoming a very prevalent way to help get to what I call applied AI. And to me, applied AI is all about running machine learning and AI at scale in the context of a business problem. And the reason why that becomes so important is some of the rate and pace of the innovations we're talking about. Machine learning models are getting bigger, which means more data is needed to train them, to retrain them. And as a data scientist, that poses two challenges. I need access to data and large amounts of it. And then I need compute. I need the physical infrastructure to power the training processes for these models. So where synthetic data becomes really advantageous is, one, it really gives end-users, from a time-to-value standpoint, the ability to access relevant, ideally industry-specific synthetic data, right? To start getting to experimenting and trialing faster. Two, they can sort of utilize the right amount of that data to then meet the needs of whatever infrastructure they have in place. So, the best example I could give here is if you're keeping up with the NLP world, there's DALL-E, which is a new NLP model that's trained to draw, effectively. Now that sounds great in practice, but this is a multibillion-parameter model. These big, big, big neural networks need a lot of infrastructure, GPUs, and cloud resources to run. So, if I have synthetic data sets, theoretically, I can use some of the new state-of-the-art techniques to retrain these massive, massive pre-trained models with smaller, relevant data sets for me, for my industry needs. So, I think synthetic data sets aren't just important. I think they become a very critical way to really democratize the use of some of these new, powerful models that are coming into the market.

[00:23:31] KRISTINA: And I realize that I just threw out the term synthetic data sort of willy-nilly and said, hey, is this a good idea? But what we didn't do is we didn't define synthetic data; help us understand in layman's terms. What is synthetic data? Obviously, great idea. Or it has a place, but what is it?

[00:23:47] GAURAV: Synthetic data to me is part of just this larger idea of pre-trained data sets, right? So, these are readily available data sets that you can download from anywhere. Some of them are free, some of them are proprietary, and some of these data sets are being created by machine learning and AI. So, some companies are making a business model out of generating relevant, labeled data sets. So for those data scientists, maybe you're in a smaller industry or a smaller company, and you can't afford to hire all these great data scientists to get data, train data, label it, relabel it. This becomes a challenge. So having a readily available data set to effectively grab and go, like load and go, is the concept behind synthetic data. And it's just a matter of how synthetic that data is: was machine learning and AI used to create it? And, you know, another example is in the machine learning and AI space in computer vision, which is all about utilizing images that are being created, whether from video, et cetera. ImageNet is a popular, open-source data set, right? Where many data scientists in machine learning and AI are using ImageNet as a way to train machine learning models, and it's becoming a standard. So synthetic data is becoming more normal in the process of machine learning and AI.

[00:25:11] KRISTINA: One of the takeaways from this conversation, Gaurav, is that we need a lot of tools. We need a lot more tools than I certainly have in my toolkit, especially when it comes to having the right tool for the right job. And so, I'm curious from your perspective, for anybody who's looking at purely solving their business problem, whether it's somebody in marketing or a digital ops person, not necessarily IT, I think, because even though IT is a great partner in the enterprise, what I hear a lot these days is IT is just too slow for me. IT, as it used to stand, is just a little bit too slow. So as folks are looking at toolkits and various technologies that they need to stand up and access their data, make sure they do it in an ethical way as you've described, maybe using synthetic data, tossing that in where it's appropriate, what types of tools should we consider putting into our toolkit, specifically technology, right? I mean, there's a lot of process and organizational aspects, like you said, digital policy, et cetera, but just thinking tool-wise, what should folks be thinking about? Or are we even mature enough yet to have a standardized tool set?

[00:26:16] GAURAV: I think it's a great question. I think there are certain, I would say, tenets to the machine learning and AI space that we're starting to see, like bias, drift, and anomalies; there are certain things that are becoming a little bit of a standard as a challenge. And I think as a result of that, there are tools out there that address how do I detect bias in a machine learning model? How do I check if my data set has drifted, maybe in my underwriting process? I have loan data from the last 15 years. Maybe I need to go back 20 because I need more historical data and more context. And then I think, as machine learning models are running, there are now tools that are telling me, hey, are the decisions and outcomes consistent with what you had previously predicted? Meaning I trained a model as a data scientist, thinking it's going to do A. Now in the wild, as it's making predictions, potentially on loans in the underwriting process, we're starting to see A prime or B. Is that deviance or delta between A, A prime, and B acceptable? Maybe it is. So I think we are starting to see some standards emerge in techniques that we're employing. And then, from a tooling standpoint, what I would urge companies to look at is check that you have some of the basics. So if you have an auto ML tool, check and see if there's bias detection or bias remediation. Is explainability a package that's supported? And then, as they're starting to move more from experimentation of AI and they're actually running it, I would really urge companies to look into ML ops and observability, because these are the tools and dashboards that let, you know, DevOps and ML ops users really understand how it is impacting the runtime, the process, the business application. How is my machine learning and AI impacting that SLA? And is that impact acceptable or not?
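
The drift question raised in this answer (is the delta between what the model saw in training and what it sees in the wild acceptable?) can be sketched with a population stability index, a commonly used drift score that compares how values fall into bins in two samples. This is a swapped-in illustration, not a method named in the conversation: the function names, the bin edges, the sample scores, and the `DRIFT_TOLERANCE` value are all assumptions made for the example.

```python
import math


def bin_shares(values, edges, floor=1e-6):
    """Share of values in each bin defined by ascending edges.

    Assumes values is non-empty. Empty bins are floored at a tiny
    share so the logarithm below is always defined.
    """
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[sum(value >= edge for edge in edges)] += 1
    return [max(count / len(values), floor) for count in counts]


def population_stability_index(baseline, live, edges):
    """Sum of (live - baseline) * ln(live / baseline) over the bins."""
    base = bin_shares(baseline, edges)
    current = bin_shares(live, edges)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, current))


# Hypothetical model scores, invented for this example.
training_scores = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
production_scores = [0.5, 0.6, 0.7, 0.8, 0.9, 0.9]
edges = [0.33, 0.66]

# Hypothetical tolerance: what counts as acceptable is a business decision.
DRIFT_TOLERANCE = 0.2

psi = population_stability_index(training_scores, production_scores, edges)
drifted = psi > DRIFT_TOLERANCE
print(f"PSI: {psi:.3f}, drifted: {drifted}")
```

The score is zero when the two samples fall into the bins in identical proportions and grows as they diverge, so "maybe it is" acceptable becomes a threshold that the business, not only the data scientist, has to agree on.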

[00:28:24] KRISTINA: So for those who are new to ML ops, thinking about observability, and probability, where should they go to find out more? What are they looking at in terms of getting themselves up to speed? Any recommendations?

[00:28:36] GAURAV: Yeah. So, first of all, the good news is, I mentioned there's lots of rate and pace of innovation in AI. And one of them is open-source. So for many companies, open-source is becoming a great way to learn, leverage, test, and use, especially in machine learning operations and machine learning observability; as the name kind of implies, you're detecting, you're looking, and you're comparing a lot of different things that are happening. It could be, how is my machine learning model running against the infrastructure I have in place? Am I scaling it correctly? Am I providing access? As we talked about earlier, are the right people, you know, leveraging the data in the proper way? And then what we're starting to see more of is lineage, traceability, auditability, publishing to data catalogs, as an example. All of these are becoming more and more prevalent, not just for the data needs that we've seen in the past, but now extending it to machine learning and AI as that life cycle becomes part of a critical process. So, I would tell users to check what's out there from the open-source tools, even basic things in PyTorch, TensorFlow, and Keras; some of your popular deep learning libraries have these capabilities; start there. And then there's now a number of companies that are focusing on these specific areas and majoring in one of them; certain companies will just major in ML ops or drift or features. So I would urge them to really understand, maybe in comparison to what they already have in place, like we talked about earlier, what's the delta? What are they missing from coverage and control and from a governance standpoint? And then, if some of these open-source techniques aren't enough, start to look at the companies out there that are focusing and majoring in these areas, because they now do exist, both from startups and then from the larger cloud players, and even the data warehouse and cloud database providers that they're likely running.

[00:30:45] KRISTINA: Well, I've gotten excited, Gaurav; thank you so much. This has been really, really insightful. I'm sure everybody who's listening has a lot of good information and insights to take away from this conversation. We're going to include the references that you mentioned in our resource notes. We'll get that teed up, but is there anything else that you would like to point people towards, or is there a place that they can follow you online on the things that you've been talking about with us today? A place to follow up with you?

[00:31:09] GAURAV: Absolutely. They are more than welcome to come check out what we're doing at AtScale. We feel like we play a very important part in this process of democratizing the data and serving it to both BI and AI users consistently and taking advantage of the governance strategy. So, I would definitely urge users to check out what we're doing in this space. And then also follow me on Medium. So, I am always writing different articles, whether it's about recommendation systems or the trends. Please don't hesitate to check out those areas and, of course, on LinkedIn and other social media avenues.

[00:31:43] KRISTINA: Great. Thanks so much; I really appreciate your time today. It was great to catch up finally. Appreciate the time.

[00:31:50] OUTRO: Thank you for joining the Power of Digital Policy; to sign up for our newsletter, get access to policy checklists, detailed information on policies, and other helpful resources, head over to the power of digital policy.com. If you get a moment, please leave a review on iTunes to help your digital colleagues find out about the podcast.
