S4 #2 Deep diving in the world's easiest and fastest lakehouse

Tomer Shiran

Tomer Shiran served as Dremio’s CEO for the first 4.5 years, overseeing the development of the company’s core technology and growing the team to 100 employees. Previously, he was the fourth employee and VP Product of MapR, a Big Data analytics pioneer. Tomer held numerous product management and engineering roles at IBM Research and Microsoft. He is the founder of two websites that have served millions of users and 100K+ paying customers. He holds an MS in Computer Engineering from Carnegie Mellon University and a BS in Computer Science from Technion – Israel Institute of Technology and is the author of numerous U.S. patents.

The responsibility for complying with data privacy laws and ensuring that data is secure falls on individual companies. If your data is copied all over the place, it's tough to secure it because each copy has its own different set of privileges and permissions. Consider the implications as your organization deep-dives into the world's most accessible and fastest lakehouse. In this episode, Tomer Shiran of Dremio talks lakehouses in the era of data privacy and privacy law compliance.

data privacy, data protection, data lakes, data lakehouse, data warehouses, data exchange
Episode number:
Date Published:
January 31, 2023

[00:00:00] KRISTINA PODNAR, host: Integrating every healthcare system or any system for that matter, is an ideal worth pursuing, but we're unlikely to achieve that level of integration for many industries in the coming decades. What's possible now is the adoption of data exchange and standardization, which can get us exceptional outcomes.

[00:00:18] INTRO: Welcome to The Power of Digital Policy, a show that helps digital marketers, online communications directors, and others throughout the organization balance out risks and opportunities created by using digital channels. Here's your host, Kristina Podnar.

[00:00:33] KRISTINA: Tomer Shiran recently stated that open data standards such as Apache Parquet and Apache Iceberg, as well as open data architectures such as the data lakehouse, are ushering in a new era in which data from disparate systems can be centralized and aggregated, breaking down silos that inhibit people and companies from having a 360-degree view of patients and other constituents. Sounds familiar, certainly, to me. So, I asked him to join us today and talk further about data exchanges and standardization. Tomer served as Dremio's CEO for the first four and a half years, overseeing the company's core technology development and growing the team to over a hundred employees. He's held numerous product management and engineering roles at IBM Research and Microsoft, and he's the author of numerous U.S. patents. Tomer, welcome to The Power of Digital Policy podcast. It's great to have you.

[00:01:26] TOMER SHIRAN, guest: Yeah, thanks for having me here.

[00:01:28] KRISTINA: I recently had this thing where I had to deal with an elderly parents' healthcare issue that involved several hospitals, a health plan, post-acute care centers, ambulatory care centers. It was a big mess. A lot of redundancy in terms of forms, questions, and doctors coordinating. I had a headache, but how big of a problem is this for the industry?

[00:01:50] TOMER: I think it's a huge problem. We're living now in a world where you have so many different systems and they don't talk to each other. And at the core of all this is data, right? Each system has its own data. We call these data silos. And because you can't bring them together, things are very difficult. In fact, an interesting example from yesterday: my daughter needed to pick up some MRI results. And the two systems, where she got the MRI and the doctor she was going to see, don't work together, and they told her to send a fax so she could pick up a CD. And she asked me what a fax is and what a CD is. And so there we are: the price of data silos.

[00:02:23] KRISTINA: Wow. And so, is it the case that silos are inherently bad or is it the case that we can't tie those silos together?

[00:02:29] TOMER: I think there's no going away from different systems having different storage of data. They're built by different companies. They're offered as different services out there. But at a minimum, we need to be able to stitch those things together, right? We need to have a way to analyze data at least across disparate sources. And that's what we're so focused on at Dremio: really providing companies and organizations with a platform that allows them to consolidate their data and do analysis across different data sources. And that's a first step, I think, toward this ideal of really unifying things.

[00:03:02] KRISTINA: So, you've said that open data architecture is one way of solving the data exchange problem, right? We have data everywhere and we all want access to data. So how mature is this concept, and what does the landscape look like in terms of standards and platform solutions?

[00:03:19] TOMER: Yeah. Things are changing. But at the core, or maybe the most foundational problem, is the need to have common ways to represent data, right? Because if different systems represent data differently, then that just isn't going to work. And historically, there were formats as simple as CSV files, right? Spreadsheets: I could open a spreadsheet and share it with you, and you could open that same spreadsheet, right? And so that ability to share data across different users and organizations, that was important. We've moved on in the last few decades; those simple formats aren't enough anymore. But we've had other things happen, such as the rise of the public cloud, which has provided these centralized services and places where data can be placed and accessed, because it's all in kind of one cloud, right? Or maybe a few clouds, but they're accessible to different systems, different services, and different technologies. And if there are standards in which the data is stored in these places, these clouds, then that makes it possible for data to be shared across organizations like that.

[00:04:16] KRISTINA: So which health data standards should we be thinking about? Is there a set standard? Are we sort of mature enough at this point? Or what should we be looking for when we're inside of organizations in terms of boosting interoperability? What's the golden key, if you will?

[00:04:32] TOMER: I think, first of all, it comes down to, okay, how are we physically storing the data? The open-source community, and we at Dremio, we operate with not just health data but any kind of data, right? Our customers span everything from the world's largest tech companies to the largest banks, healthcare companies, and insurance companies. And so we see a very broad range of use cases and needs that companies have. But there are a lot of commonalities here. And the open-source community has come up with some new standards in recent years on how to store data. So we have projects, for example, a project called Apache Parquet. And the idea there is to have a standard way to represent data in what we call columnar formats. And those are formats that are much faster for analysis. And more recently, a team at Netflix, actually, and folks at Apple and many of the large tech companies came up with a new standard called Apache Iceberg, which is a kind of common table standard. And that common table standard can be accessed and processed by a variety of different systems that do analytics. And that includes Dremio, which I'm a founder of. But other companies too: Amazon's entire cloud infrastructure can work with Iceberg; a company called Snowflake, which does cloud data warehouses, can work with Iceberg; Google's cloud can work with Iceberg. So you're starting to see this kind of standardization around a single format that's open source and that all these different technologies can work with. And I think that's the first step toward any industry being able to have that kind of interoperability and not get stuck in these data silos.
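The columnar idea Tomer describes can be sketched in a few lines of Python. This is only a toy illustration of why analytics engines prefer column layouts; real formats like Apache Parquet add compression, encodings, and an on-disk layout (via libraries such as pyarrow), and the record fields here are invented for the example.

```python
# Row storage: every record carries every field, so an analytic
# query has to scan data it does not need.
rows = [
    {"patient_id": 101, "visit_type": "MRI", "cost_usd": 1200.0},
    {"patient_id": 102, "visit_type": "checkup", "cost_usd": 150.0},
    {"patient_id": 103, "visit_type": "MRI", "cost_usd": 1150.0},
]

# Columnar layout: each column becomes one contiguous array.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# "Average cost" now touches exactly one array and can skip the
# patient_id and visit_type columns entirely -- the core reason
# columnar formats are much faster for analysis.
avg_cost = sum(columns["cost_usd"]) / len(columns["cost_usd"])
print(round(avg_cost, 2))  # → 833.33
```

In a real Parquet file the same property holds on disk: a reader can fetch only the column chunks a query references.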

[00:06:00] KRISTINA: So, just as a data policy person and a policy sherpa, I guess, I get excited when people start talking about standards, because I think that's the nirvana state. But we have these competing priorities, right? Because you're working on standards, the industry has the opportunity to work with standards, but we're also facing a time when data privacy laws are creeping up all over the world. We have siloed data privacy laws, if you will, instead of one tapestry. How do you deal with that? Or what's your advice in terms of competing data priorities for folks? We have governance, access, security, and data localization. Is it just a pipe dream, then, that we can standardize and have interoperability? Or is it possible only within specific countries?

[00:06:44] TOMER: Yeah, well, in general, I think that privacy and security are top of mind for every organization today, in part because there are laws, of course, that now regulate what you're allowed to do. And as you said, there are many different laws. Sure, there are some standards that are broader, like the European Union's GDPR, for example, and California's law, but every state starts having its own laws. And all of a sudden, as a company, you're forced to deal with all these, understand them all, and comply with 'em. And of course, as you start thinking about sharing data, that could potentially make things even more complicated for you as a company. But the responsibility for complying with these laws and making sure that data is secure falls on individual companies, right? It's not something that's shared across companies; each company has to make sure that they're in compliance. And a key, or maybe the first, ingredient to accomplish that is to minimize the number of data copies. The challenge that most companies have is that data has been copied all over the place for a number of different reasons, which we can talk about. But if your data is being copied all over the place, it is very difficult to secure, because each copy kind of has its own different set of privileges and permissions. And keeping those things in sync is very difficult. People move around, they move across departments, and they leave companies. And so if you have lots of copies, just keeping that in sync is impossible. And then also, even simple things: GDPR requires you to be able to delete a person's information. Well, if that data has been copied all over the place, and maybe it's on some analysts' laptops and data science machines, then you as an IT organization probably don't even know where that data lives.
And so now, trying to delete that person's records, good luck even knowing that you've done it successfully and that you've covered all the different copies. And so, for us as a company that provides a platform for data analytics in the cloud, one that eliminates the need for copies, that's a big, important advantage our customers get: they can create this infrastructure for their organization that eliminates the need to create lots of copies of data in a data warehouse, and that eliminates the need to create copies outside of the system as well. Things like BI extracts, Tableau extracts, Power BI imports, and cubes: all these things that people usually do to get higher performance for their queries. If you can eliminate that need, in our case through this autonomous query acceleration technology that we've developed, well, now you don't have copies, and it's a lot easier to secure the data and to comply with different regulations.

[00:08:59] KRISTINA: For those listeners who are focused heavily on marketing, or maybe working in digital ops, who don't touch things like Tableau, help me explain to them what we mean by a unified location for housing data, and by autonomous queries?

[00:09:14] TOMER: Yeah, sure. Well, a common thing that people or companies want to accomplish with a platform like Dremio is what we call customer 360, right? If you think about it as an organization, you have a bunch of customers that you're probably going to market to, right? And if you're a healthcare provider, maybe these are patients, right? Or somebody who's interacted with you, but they usually don't interact with you through one specific channel, right? They visit your website, they use your reservation system, they visit an office where they've gotten some service, right? They've done all sorts of interactions. And if all those interactions live in different systems, then it can be difficult as a marketing person, for example, to really understand what that person is doing and to tailor a specific campaign around it: I want to send an email to all the people that have been on our website and looked for appointment availability but have also already visited one of our providers, right? Those might be in different systems. And if you want to target that, you need to have this visibility across different systems. And so the ability to take data from all these different systems you might have and consolidate it into one place, in other words, put that data in a single place where you can run a query across it and ask questions like, who are all the people that have done X and Y and Z, right? That opens up more opportunities for marketing, customer service, and a variety of different use cases an organization might have. And one of the cool things that we've done at Dremio, for example, is to provide the ability to do this in an open approach, right? So historically, companies would maybe adopt a data warehouse and say, okay, let's move all this data into a data warehouse. And that project would probably take two or three years.
And by the time it was done, people probably wanted to do something different. And so we've made it so that they can basically take Dremio and connect to all these different systems directly, and even the data that they are physically moving into a single place is stored in an open format, so that it's not just for Dremio's use; you can use, like I said, Amazon's services, Google's services, and a variety of other technologies on the same data. So now you've built a vendor-agnostic data lakehouse.

[00:11:15] KRISTINA: It seems like that's the pipe dream. Every person that I talk to, first of all, says, oh my gosh, we were building data warehouses for years. Then we started building data lakes. We're still fighting each other and duking it out within the enterprise because everybody has data. There's the governance issue of how do we even make people put all the data in the same location. It sounds like what you're saying is that you don't necessarily have to do that. We can connect the different data sources and create a single source of data truth, if you will. Is that the case?

[00:11:42] TOMER: Yeah, exactly right. So, in the real world it's not as simple as, oh, I'm going to create one system and put all my data in there, right? That rarely works. Yes, for some data, that is the right approach and that's possible to do. And in other cases, maybe the data's in some system, say some Oracle database, where it's rapidly changing all the time, and it's hard to move it as one snapshot into a centralized system because it'll be outdated by the time it gets there. So for a system like that, you want to be able to work with it live. And so, yeah, what we've done is create that ability to both connect live to external sources, things like Oracle databases, SQL Server, and a variety of different databases, and also the ability to have data that's centralized in cloud data lake storage. And you've brought up the fact that there were data warehouses and there were data lakes, and I think in the world of cloud data warehouses there was kind of a comeback, because with the migrations to the cloud, people needed a warehouse for the cloud. Ultimately, the rise of the lakehouse is designed exactly to provide the best of both worlds, right? The name itself is lake and warehouse combined. And that's the idea behind the lakehouse.

[00:12:46] KRISTINA: And so historically, I've seen a trend in data management of centralizing responsibility for storing and curating an organization's information with the data engineering team. Is that the right placement in organizations? What are you seeing from your perspective as people look to deploy your tool? Ought ownership still reside with data engineers?

[00:13:10] TOMER: Yeah, that's interesting, and I think a hot question right now that a lot of companies are thinking about. You may have heard the term data mesh, which kind of came out of that question. And so yes, at its core, the data infrastructure itself, the systems, tools like Dremio that allow you to query any data at any time, right? Those are adopted by the data team and the data engineering team. They're responsible for selecting these types of platforms, making them available in the cloud, and things like that. But then, especially in a larger enterprise, the data team and the data engineers might not have knowledge of the actual data itself, right? That might be something the business, the different business units, understand a lot better than the infrastructure team or the data team. And so it's about finding that right balance where, yes, the data team can provide a system like Dremio, but then you have the business units managing their data, right? They're responsible for curating it; they're responsible for deciding who's allowed to access it. They're responsible for making sure that it's always fresh and up to date and things like that. And when people have questions, as they naturally do, what does this column mean? What does this field mean? Then the business unit is much more able to answer that than a centralized data team. And so that's the idea of a data mesh, right? It's really thinking of data as a product; that's a term you might hear, right? Business units can manage data like they manage their own products, right? A team manages the iPhone application for the health system, right? That's a product, right? Well, now they can start thinking about their data as a product in the exact same way and build those best practices around testing and data quality, right? Agile development, iterating quickly, doing work in isolation.
In fact, we as a company recently launched, in preview at this point, a new product called Dremio Arctic. It's a cloud service, and the idea was really to bring all these ideas from product development to the world of data. And so we call it data as code, really, or kind of providing a GitHub for data. Because so much has happened in the world of software development that makes developers so productive today, and none of that's happened in the world of data. And so we're really excited about bringing to data engineers and data scientists and data analysts all those capabilities that developers get to enjoy today when they're building applications and products.
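The "GitHub for data" idea, working on data in an isolated branch and publishing it atomically, can be sketched with a toy in-memory catalog. This is purely illustrative and not Dremio Arctic's actual API; the `branch`/`merge` helpers and the table contents are invented for the example.

```python
import copy

# Toy model of "data as code": a catalog of named tables with
# git-style branches, so changes happen in isolation, then merge.
catalog = {"main": {"patients": [{"id": 1, "name": "Ana"}]}}

def branch(cat, src, dst):
    # Create an isolated working copy of everything on `src`.
    cat[dst] = copy.deepcopy(cat[src])

def merge(cat, src, dst):
    # Publish the branch's state back to `dst` in one step.
    cat[dst] = copy.deepcopy(cat[src])

branch(catalog, "main", "etl-job")
catalog["etl-job"]["patients"].append({"id": 2, "name": "Ben"})

# Readers of "main" never see the half-finished work...
assert len(catalog["main"]["patients"]) == 1
merge(catalog, "etl-job", "main")
# ...until the job merges, at which point the change is visible.
assert len(catalog["main"]["patients"]) == 2
```

Real systems version table metadata rather than copying data, but the workflow, branch, change, test, merge, is the same shape as software development.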

[00:15:17] KRISTINA: That's super exciting. I'm wondering if you're also going to hand me a box that says, here's your business box as well. Because when I think about data scientists and data analysts, I get excited. Certainly, moving beyond the engineering team is exciting, but the business seems to be really busy, right? When I talk to marketers, they're like, okay, I've got so many things already on my plate. I must keep the lights on. Now you want me to also deal with data? It's one more thing to add to my list, and now you're asking me to start governing that data. That's kind of craziness. I don't have time for that. How do you deal with those people who don't necessarily understand the value? Because there's a little bit of a learning curve and a little bit of a change management effort here.

[00:15:56] TOMER: Sure. I mean, when I talk about the business unit, it's not one person, right? It's not one type of person. And so you might have folks who are not technical; maybe they're marketers, right? Or in some other area. But within these business units, you typically also want to have folks who are data analysts or data scientists, right? They're bridging the world of the technical folks, the data engineers, and the end users. And so they're sitting with the business unit; they develop a deep understanding of the data and the problems the business is trying to solve. And they help them, right? They are able to write SQL queries; they are able to set up Tableau dashboards. They are able to set up extracts of data for marketing purposes and things of that nature, right? And so I think that's really important in a company that wants to be data-driven: to have these kinds of folks, and you can call 'em data scientists or analysts or whatever, depending on the skills that you need, close to the business and really as its liaisons.

[00:16:53] KRISTINA: Yeah, I've been calling them power users, folks that maybe get some training certifications and understand how the data works, so that they're not going to make things go kaboom and blow up. But they aren't necessarily data scientists, either. So they're sitting with the business, they're embedded, they're power users who can bridge the business side with the actual data side. Is that a good way of thinking about it?

[00:17:13] TOMER: Yeah. Yeah. I've heard lots of different terms: power users, resident analysts. And, again, it depends on the skill set you need, right? If the marketing team wants to do very sophisticated work, then you might need more of a data science skill set. If all we're talking about is getting lists of people to market to, then maybe that's a simpler skill set. But yeah, at the end of the day, they have to understand the business, because there are also a lot of ways to take data and do the wrong thing, right? If you don't really understand that data, then you could easily make wrong decisions or interpret it incorrectly, or even just interpret it in different ways. And then all of a sudden, you're providing some report or some numbers to your management, and the next week you provide different numbers, and it turns out you were just calculating things differently. Right? What is a work week? Is it Monday through Sunday, or is it Sunday through Saturday, right? It could be simple things like that. You have to make sure you understand the data, and you have a consistent approach.

[00:18:09] KRISTINA: So that's something that I've been hearing from several different colleagues lately. They say, oh, everybody's screaming data democratization; self-service sounds great, right? But then it turns out that every time I hear that, I don't know, maybe a consumer has no income. I put a zero, because I've decided that zero is what I should be putting into the database, and you don't put anything in. And when we look at the average, it makes a difference whether you have a zero or you're not accounting for that person. Very different data interpretations. And so, thinking about it from a pure governance perspective, is there any kind of homework that you would advise people to do, some hygiene to get in place, before they dive in deep and get onto your platform?

[00:18:51] TOMER: Yeah. And I'd say the wrong way to address the challenge you described is to take away self-service. Because once you take away self-service, people are still out there trying to do their job, right? And they're going to work around the systems, and it's going to be even worse. And so, you can't get away from self-service, but you can have what I describe as governed self-service. And the way you do that is you have some sort of semantic layer, right? That allows the data team and the folks who understand the data to create a view of the data for different constituents, so they can take care of that curation of the data and create that consistency. But then from that point on, they still allow folks to ask their own questions and do their own analysis. And so it's about striking the right balance, and a semantic layer allows you to do that. And that's something we've invested a lot in, in our own product. And this is where Dremio is also different from other data lakehouses and data warehouses, in that we have a built-in semantic layer that allows the data team to expose data in a very elegant way. You can create different folders of data and a kind of hierarchy, and then you can use role-based access control and decide who can see which areas of the semantic layer for different purposes. And I think that's important.
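Governed self-service, as described above, boils down to two ingredients: curated views over the raw data, and role-based access control over those views. This toy Python sketch is illustrative only, not Dremio's semantic layer; the roles, view names, and fields are invented for the example.

```python
# Raw data the governed layer sits on top of. Note it contains a
# sensitive field (ssn) that most consumers should never see.
RAW = [
    {"patient": "Ana", "ssn": "111-22-3333", "visits": 3},
    {"patient": "Ben", "ssn": "444-55-6666", "visits": 1},
]

# Curated views: the data team decides what each constituency sees.
# Marketing gets engagement counts, with SSNs projected away.
VIEWS = {
    "marketing.engagement": lambda rows: [
        {"patient": r["patient"], "visits": r["visits"]} for r in rows
    ],
}

# Role-based access control: which roles may read which views.
ACL = {"marketing_analyst": {"marketing.engagement"}}

def query(role, view):
    """Self-service querying, but only through governed views."""
    if view not in ACL.get(role, set()):
        raise PermissionError(f"{role} cannot read {view}")
    return VIEWS[view](RAW)

print(query("marketing_analyst", "marketing.engagement"))
# → [{'patient': 'Ana', 'visits': 3}, {'patient': 'Ben', 'visits': 1}]
```

Analysts still ask their own questions, but only through views the data team curated, which is the balance Tomer calls governed self-service.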

[00:20:04] KRISTINA: So it sounds like something that I can do to get my organization up and running relatively quickly, but I still have some governance homework to get done.

[00:20:12] TOMER: Yeah. Yeah. And that balance is, I think, the only way to do it. Unless you say, we're not going to be data-driven. And I know a lot of parts of healthcare, I mentioned earlier my daughter sending a fax to get a CD, some parts of the system are still like that. But if we want to move away from that, then yes, we absolutely have to become more data-driven and more integrated.

[00:20:30] KRISTINA: I want to know where you found a fax machine. That's the question that I want to know!

[00:20:34] TOMER: I didn't; I searched online for services, and at the end of the day she thought it was kind of like an email where you put in a phone number instead of an address. But yeah, I guess on the other side it came out as a piece of paper from some machine.

[00:20:46] KRISTINA: Yeah. Craziness. Hey, tell me a little more about what you're doing in Dremio? What's the roadmap looking like? You've got this cool product right now, but you also have a roadmap for cool things coming. What are you guys up to?

[00:20:56] TOMER: Sure. Yeah. So, we've built the world's easiest and fastest lakehouse. That's our product today. It allows companies to analyze data, especially in the cloud, without having to put it into proprietary and expensive data warehouses. And so that's Dremio's product, and we're excited about all the things that are coming, such as Dremio Arctic, which is our service that provides data as code, or GitHub for data, really reinventing how data is managed so that data engineers and data scientists and analysts can work with the same level of productivity that developers have when they're building products. And so yeah, lots of things are going on. We primarily work with enterprises across regulated industries and others. So, as I mentioned, healthcare is one of them, but also insurance companies and financial services, and many of the world's largest tech companies are customers of Dremio. We do that both in the US and in Europe.

[00:21:47] KRISTINA: Interesting. In preparation for this conversation, one of the things I did was go back to HIMSS, just because I was thinking of healthcare, and I grabbed their definition of interoperability. They defined it as the ability to access, exchange, integrate, and cooperatively use data in a coordinated manner across organizational, regional, and national boundaries to provide timely and seamless portability of information and optimize the health of individuals and populations globally.

[00:22:12] TOMER: That sounds like a good idea.

[00:22:15] KRISTINA: Do you think it's possible to get to that point, not just in healthcare, but in other industries?

[00:22:21] TOMER: Yeah, I think the trend is definitely to get there. Open data architecture is an important ingredient in all of that. But it's not just the technology, of course; it's also how companies have to operate, and the human element is just as important in terms of achieving that goal. But what I see interacting with hundreds of companies, and large enterprises in particular, is that even internally, companies struggle to achieve what you just said. Let alone different parts of the industry and different companies collaborating and exchanging information and that kind of thing, which, I think, happens a little bit today, point to point, but not at a large scale. Even within organizations, becoming data-driven is really challenging. Right? And that, I think, is the first focus: how do you go from, okay, we have a thousand data silos within a company, to having data democratization within the organization, where anybody can ask any question at any time and get an answer? And that's what we're so focused on right now as a starting point.

[00:23:15] KRISTINA: Do those organizations need to clean up the house internally, you think, before they can go externally? Or is this something that can happen in parallel?

[00:23:22] TOMER: No, I think it can happen in parallel, and should happen in parallel. And I also wouldn't look at it as cleaning the house, because you don't want to start redoing everything that you have, either; things are working, and it's very difficult when you embark on these projects that take years. Often they don't get done, and the people who start working on them leave the company, and now you're stuck. And so I've seen a lot of projects, especially data warehouse projects, right? Like, it's a two-year project, a three-year project, and it never gets done. Right. And I am not a fan of those. I think the best thing companies can do is pick some new use cases, some things that they're trying to accomplish that they haven't done before. Maybe it's some kind of integration of data from different systems. Tackle that as a new project. And take advantage of what's available: if your organization has the ability to use the public cloud, then do this in the public cloud. Have some wins with these newer, more modern technologies like Dremio, right? Like these lakehouses. Have success there, and then bring more things onto this new modern platform. And I think that's a much better, more incremental approach that makes it easier to demonstrate results.

[00:24:23] KRISTINA: You're making my heart go pitter-patter, right? Because you're talking about incrementality. You're talking about choosing one or two use cases that make sense, proving the ground, and then building on top of that. It's everything that we always talk about, but it seems doable in this instance and probably the right approach because you're not investing years of time or tons of money, right? You're looking to get some small wins and change the culture as you go along with that and make sure that you're investing in the right places.


[00:28:41] TOMER: Yeah. First of all, with respect to Dremio, anyone can go online; we have a test drive, and they can get started there. But we also have a conference that we host every year called Subsurface. And we bring speakers from a variety of different industries, innovators from companies like Netflix and Apple and others. And so that's another good place where people can go to learn. And it's virtual, and it's free.

[00:29:01] KRISTINA: When is the Subsurface conference?

[00:29:04] TOMER: It's the first week of March.

[00:29:06] KRISTINA: All right, so we're coming up on that, and folks should sign up. Virtual, free, March 1st and 2nd sounds like a good place to be. I'm certainly gonna sign up and be there.

[00:29:15] TOMER: Awesome. Thank you for having me.

[00:29:18] KRISTINA: Thanks for coming by. It was great talking with you today. Take care.

[00:29:22] OUTRO: Thank you for joining The Power of Digital Policy. To sign up for our newsletter, get access to policy checklists, detailed information on policies, and other helpful resources, head over to powerofdigitalpolicy.com. If you get a moment, please leave a review on iTunes to help your digital colleagues find out about the podcast.
