
Adam Kamor
Adam Kamor, Co-founder and Head of Engineering at Tonic.ai, leads the development of synthetic data solutions that enable AI and development teams to unlock data safely, efficiently, and at scale. With a Ph.D. in Physics from Georgia Tech, Adam has dedicated his career to the intersection of data privacy, AI, and software engineering, having built developer tools, analytics platforms, and AI validation frameworks at companies such as Microsoft, Kabbage, and Tableau. He thrives on solving complex data challenges, transforming raw, unstructured enterprise data into high-quality fuel for AI & ML model training, to ultimately make life easier for developers, analysts, and AI teams.
AI isn’t being held back by model performance or prompt engineering - it’s being slowed by something far more fundamental: organizations can’t safely use their own data. In this episode, Kristina Podnar talks with Adam Kamor of Tonic.ai about why privacy has become the primary bottleneck for AI adoption, especially when it comes to working with messy, unstructured data like customer interactions and transcripts.
They explore how many companies are still relying on manual data redaction, an approach that is slow, expensive, and nearly impossible to scale. The conversation also highlights a key disconnect: leadership often feels confident about compliance, while the teams closest to the data see just how fragile those assumptions can be.
Kristina and Adam dig into what it takes to move forward, from building privacy-first systems to adopting more automated, “programmable” approaches to data sanitization. They also unpack common misconceptions around synthetic data and why it’s not a complete solution on its own.
The takeaway is clear: privacy isn’t just a constraint, it’s a capability. Organizations that operationalize it effectively are the ones that can actually move faster and turn AI from experimentation into real-world impact.
[00:00:00] INTRO: Welcome to The Power of Digital Policy, a show that helps digital marketers, online communications directors, and others throughout the organization balance out risks and opportunities created by using digital channels. Here's your host, Kristina Podnar.
[00:00:18] KRISTINA: Everyone's really talking about faster models, better prompts, more powerful AI. You're seeing the headlines as much as I am, but that's really not what's actually slowing companies down right now. The real constraint is much simpler and much harder to fix: they really can't use their own data. There's privacy, there's compliance risk. Those things just aren't guardrails anymore; they've become the gating factor for whether AI moves or whether it stalls. So the question isn't, are you investing in AI? It's, can your organization safely access and use the data required to make it work? Today, we're digging into that tension: what's breaking, what's changing, and what it's actually going to take to move things forward. So with us is Adam Kamor, co-founder and head of engineering at Tonic.ai, a company that's focused on synthetic data and privacy-safe data workflows. You'll find out what all of those things are and more as we welcome Adam.
[00:01:12] ADAM: Hi. Happy to be here today. Excited to talk about privacy and AI.
[00:01:17] KRISTINA: Awesome. Well, let's just cut right down to the chase and get into the conversation, Adam. Do you agree that privacy is now the bottleneck, not the guardrail? And tell us a little bit about what you're seeing actually happening inside of organizations when this issue kind of shows up.
[00:01:34] ADAM: I agree with what you're saying, that privacy is the main blocker at the moment. I'm not sure I fully believed it until we started asking questions of industry experts and insiders, but unequivocally, that is the problem now. It's not about model speed or performance, not even model accuracy. It's all about, how can I actually use my data in a safe, regulatory-friendly, and compliant way? And, you know, what do organizations do about it? There are a lot of ways to approach this problem, and I'm sure we'll get into those details as we go forward, but if you want just the TL;DR: while you're planning out what you're going to do and architecting these systems, do it from a privacy-first lens. Don't have privacy as an afterthought after everything's already been built.
[00:02:22] KRISTINA: You mentioned that you actually did a survey recently, and in your survey findings you described this real data bottleneck, especially with unstructured data. What does that look like operationally? Because we have a lot of listeners who are really in that operational layer.
[00:02:38] ADAM: Yes. It's not exclusive to unstructured data, but it's primarily unstructured data, and the reason is that it's just harder to work with than structured data. Structured data is typically easier to work with through a privacy lens because you generally know what data is in each column, whereas with unstructured data there could really be anything. Imagine a customer support chat transcript: you have no idea what's being said in it, right? You can't predict it ahead of time. People say crazy things. I've seen examples of people typing their credit card information into help forms just because they think it's going to be needed by the customer service rep, when really they shouldn't be doing those types of things. In terms of what companies do about this, today the majority are doing something that doesn't scale very well, and it's painful and expensive and slow: they are manually redacting and sanitizing data before they work with it. And it's that manual piece which is hard. It may be enough to get a POC or other small things out there and working, which many organizations have done, but redacting data at that scale in a manual way is not really viable.
[00:03:54] KRISTINA: And also, there's a problem there fundamentally, right? Because if I think about what you just said, manually redacting data is one thing, but even knowing that you have to go redact the data is a thing, right? How do organizations actually find what's in the unstructured data? Like, where's that needle in the haystack, and are you paying attention to the right haystack and the right needles?
[00:04:15] ADAM: The normal flow here is, there's some idea for a new use case that you're trying to solve, an internal tool, something externally facing, what have you, and you have an idea of what type of data is going to be needed for it. Let's say, for example, we are trying to fine-tune a large language model for some specific purpose. Maybe we're going to fine-tune it on customer support audio transcripts, right? I think that's a good example: they're very sensitive and they're also very complex. So when I say manual redaction, in that case what I mean is actual humans are going through and reviewing transcripts and removing the sensitive information. As a follow-up to that, maybe people could be using some models or other automated techniques to begin helping them. But regardless, it's a very labor-intensive process today to deal with that data, and it doesn't scale. I think that's where the main bottleneck is right now for AI systems, and more important than what I think, that's what a lot of survey respondents mentioned as well.
[00:05:19] KRISTINA: Why are people stuck in that mode? Is this really a tooling problem or is it a governance design problem at this point?
[00:05:27] ADAM: Oh, that's a good question. What do I think it is? It's probably both; it's never black and white, so I can't say, oh, it's definitely this. On the tooling side, which I'm definitely more familiar with, there are off-the-shelf open source tools for helping identify sensitive information, but all they typically do is identify. They don't help you remove it or synthesize it or sanitize it in any way. And most of the open source products out there are typically trained on open source data, like data from Common Crawl datasets, Wikipedia, Reddit forums, Twitter, et cetera. That is not always indicative of the sensitive data sitting behind corporate firewalls, and as a result the models won't necessarily perform very well. So there is definitely a tooling problem. Open source is not necessarily the best. Of course, I could go to OpenAI or Anthropic or Google's latest models and have the LLM do it, but then you hit other scale and cost considerations. These models are good, though not necessarily the best solution for identifying PII, and they are slow and very expensive. So yeah, I think that does indicate in several ways that there is a tooling problem. On the compliance side, here's something interesting that we found out. I would have said there isn't necessarily a compliance problem, because typically, in a given industry, the regulations on what you can and cannot do are known. Healthcare is a good example of this: healthcare has HIPAA, and HIPAA has very specific guidelines as to what you have to redact and not redact to make something safe. Not every industry is blessed with something as clear-cut as HIPAA, unfortunately, so in those industries each company is often making its own internal guidelines. And interestingly, what we found is that the closer you are to actually making the sausage or seeing how it's made, you know, the data engineers, the people that are really working in the internals of the data, the less confident you are in the compliance and regulatory posture within your company. Whereas the people that are further removed, like the leadership and executives, have a lot more confidence in it.
[00:07:39] KRISTINA: Oh, interesting.
[00:07:40] ADAM: Now, okay, who do I trust more? Both sides kind of have a different view of the problem. I bias myself more toward trusting the person that's on the ground, getting their hands dirty. So, you know, maybe there is a compliance and regulatory concern there that I don't know about.
[00:07:56] KRISTINA: That's interesting. I'd be concerned, I actually want that person in my boardroom at the next meeting, frankly. Right?
[00:08:02] ADAM: Yeah, maybe I don't want them there, but yes, that would be the right thing to do, I think.
[00:08:08] KRISTINA: So talk to me a little bit about what you're up to at Tonic, because it seems like everyone is positioning synthetic data as the answer at the moment. But isn't there a risk that it becomes a workaround for bad governance rather than fixing the underlying problem? Talk to me about that.
[00:08:25] ADAM: Right. I mean, your output, which is to say your sanitized data, is only as good as the rules you put forward for how it should be sanitized, and there definitely is some maturity to be had there. This is a relatively new frontier for a lot of companies, and understanding the privacy implications of data is very challenging. It's dependent on the data itself, the population that it covers, what the data's going to be used for, who is going to have access to the sanitized data, whether it can be joined against external data sets. All of these things come into play when trying to understand whether the guidelines or rules that I'm putting forward for how to deal with this are good or not. For example, going back to healthcare for a second, this is well understood in healthcare, and as a result most companies that are trying to work more freely with healthcare data take it through what's called an expert determination. This is where a third party typically comes in, someone with a mathematical or statistical background, usually a master's or a PhD in statistics or an equivalent field. They'll actually look at everything that I just mentioned, like the population, the use case, who can access it, all of these things. And then they will essentially provide an attestation letter saying, hey, this data that has gone through this process, which they then outline, is safe and adheres to the HIPAA guidelines, right? But that is a very carefully done, expertly given statistical analysis and review ...
[00:09:59] KRISTINA: ...and very expensive...
[00:10:00] ADAM: ...you know, not every company has that kind of skillset in their wheelhouse, and not every industry has frameworks in place for third parties to come in and do it. It doesn't mean, let's say I'm in finance, that I as a financial institution can't hire some third-party service with statisticians to come in and give an assessment. But what does that assessment mean? They're only assessing it against the risk level that I'm giving them and what my needs are. So it can be hard, and I think people sometimes complain about HIPAA, but HIPAA's actually great because it tells you exactly what you can and cannot do. And that clarity helps drive faster innovation in healthcare companies in that way.
[00:10:38] KRISTINA: At the beginning of this, in the warmup conversation you and I were having, I mentioned the Data and Trusted AI Alliance. One of the things that we talked a lot about at D&TA was how do you actually acquire data, whether it's from an internal organizational partner, from department to department, from line of business to line of business, or maybe from a third-party source. The challenge becomes really understanding: where's the data coming from? Can you trust it or not? Should you use it, and how can you use it? That was the core idea behind data provenance standards, which work in some scenarios, right? They don't necessarily scale to internet-level data sources. But how do you see the data world hashing this out at the moment? Is it really the case that we need to start to beef up the number of synthetic data sources that we're using? Do we need to lean into data provenance? Is there a holy grail that we're going for? Or what does good look like?
[00:11:36] ADAM: Your question, I think, is very relevant. There are a lot of companies right now that exist solely to sell data, primarily to other companies that are trying to train models. It's a big booming business. All of the large LLM foundries are purchasing large amounts of data. I can't speak to them specifically, but what I typically see, and I think this is more of a business decision, is you put the burden on the person that's selling you the data. It's on them to state, yeah, this data's safe, you're okay to use it. From a legal risk point of view, certainly that makes sense; that's how I think any company would want to do it. And I think one of the reasons you're paying this company is because they are putting their name to it and saying, yeah, we're good here, don't worry. I think that's a natural outcome of how companies operate. Is it good, though, for the people whose data is being sold, and what does good look like for them? That's definitely trickier. I don't think synthetic data is the ultimate answer. You suggested synthetic data can play a part here, and certainly it does play a part, but it's not the only part, and there are a few reasons for that. One, synthetic data isn't always good enough. It oftentimes can be, but not always. Two, okay, sure, the data is synthetic, but synthetic doesn't equal safe. Oftentimes synthetic data is safe, but being synthetic doesn't guarantee that it's safe. I can come up, for example, with an algorithm for generating synthetic data which absolutely will betray the privacy of individuals, right? It's not hard to do. And if you're not careful in how you generate synthetic data, you really don't have any guarantee that it's safe, because remember, synthetic data is generated by first training something on real data. So if the thing that's then generating synthetic data happens to regurgitate any of the original data, or if it samples from it in a way that's not reasonable, then it's not necessarily private, right? Because of that, there are other options as well, and actually, from a privacy point of view, I prefer more traditional de-identification and replacement. By that I mean: identify the PII, replace it with fake PII. The privacy of that is much easier to understand. It's not some black box algorithm; it's clear-cut what's happening. Give me a piece of data, I find the sensitive information, and I give you back a new piece of data where I've replaced all of it. That's easy to understand, and I think that's the best you can do, as long as you're accurately identifying all of the PII and also defining correctly what is PII and what is not. But provenance is cool too. I like that idea. I guess it would depend how it's done in practice, and I'm not super familiar with it.
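To make that identify-and-replace idea concrete, here is a minimal, hypothetical sketch in Python. It is not Tonic.ai's implementation: it covers only two identifier types with simple regexes and swaps in obviously fake placeholders, whereas a production pipeline would rely on far more robust detection (trained models or a vendor tool) and realistic replacement values.

```python
import re

# Hypothetical patterns covering just two identifier types. A real
# deployment would use trained detection models, not regexes alone.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def deidentify(text: str) -> str:
    """Find each detected identifier and replace it with a clearly fake stand-in."""
    for label, pattern in PATTERNS.items():
        # Swap the real value for a synthetic placeholder of the same type.
        text = pattern.sub(f"<FAKE_{label}>", text)
    return text

print(deidentify("Call me at 404-555-0182 or email jane.doe@example.com."))
# -> Call me at <FAKE_PHONE> or email <FAKE_EMAIL>.
```

The appeal Adam describes is visible even in this toy version: the transformation is transparent, so it is easy to reason about what privacy you do and do not get.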
[00:14:22] KRISTINA: We can kind of talk about that, 'cause that's probably a whole conversation in and of itself. But I think it's part of the solution, right? And I think we're living in this very complex ecosystem. It's sort of interesting: one of the things that we did find out in the process of creating the data provenance standards, which was helpful, was that privacy was only part of what I perceived to be a velocity problem. Do you see privacy as a velocity problem, as universally true in your experience, or mostly true for companies that haven't modernized their data stack?
[00:14:53] ADAM: It absolutely can limit velocity. It can even be the thing that defines the pace, not just the thing that slows it down, and the better you are at it, the faster the system moves in general. But look, it's not true for all organizations, right? For organizations that typically deal with non-sensitive customer data, or where the data itself is just simpler, privacy is less of a concern. It's going to be mostly in regulated industries where privacy's the big issue, and also in bigger companies. The larger the company and the more sensitive the industry, the more important privacy becomes: bigger companies are more risk averse, and with more sensitive data, obviously, you have to deal with it more.
[00:15:36] KRISTINA: But usually with the companies that I see out there, for example, you said, hey, if you have your HIPAA requirements in place, those are very straightforward. Then we understand at least where the bar is, what we have to do, and how we have to dance to the HIPAA requirements. So I tend to wonder more about the companies that aren't regulated, like if you're not in the FinTech sector or if you're not in the healthcare space. How do organizations flip this on its head so that the real issue is not privacy? Maybe is it that organizations don't know what they really have or what they need? And if you can flip that on its head, can you go faster and have better and faster innovation?
[00:16:10] ADAM: I see. So let me think of an industry with interesting unstructured data where there's not necessarily a privacy concern inherent in the data. Maybe a B2C company: typically it doesn't store credit cards, so they don't have credit card info; they'll have usernames, addresses, things like that, but beyond that it's not super sensitive. So maybe a B2C company is a reasonable example of this. I think their biggest issue then just becomes, okay, we're building AI systems, how do we operationalize them? And it gets back to the more standard things: is the model good enough? Is it fast? Is it working as well as we want it to? I believe that any company that, for example, is fine-tuning a model on the data of their customers has to first strip out any customer PII, whether it's super sensitive or not. That will certainly be true in places like Europe with GDPR, and I think in the US we'll see that as well, even though there might not be clear requirements on it. The reason I say it is actually pretty straightforward: these models are generative in nature. You don't want a model regurgitating the information of one individual to another, even if it's not necessarily super sensitive. It's just a very bad look and no one's going to appreciate it. But with that being said, not everyone is fine-tuning models on customer data, so for those customers there potentially isn't a privacy concern.
[00:17:42] KRISTINA: As you look across the landscape, who's getting this right right now, and what are they doing differently?
[00:17:47] ADAM: Well, I don't want to name names, because I don't want to pick favorites among our customers, but the shape of what it looks like to do this right typically follows a curve. No company is born being able to do this correctly, right? The general trend, the progression we see, is that companies first get comfortable and good at dealing with the privacy concerns of structured data. It's more straightforward, it's easier, and it helps build the muscle. It also lets you do things like decide that this type of data is sensitive and this type of data isn't. Like, yeah, we want to obfuscate zip codes, but we're okay keeping the first three digits; we just don't want the last two. Rules like this. Then you'll see a company transition into this space of, okay, well, now we have unstructured data, we want to use large language models like everyone else, unstructured data's important, let's start using it. They already have their rules; they've already built this muscle of what it means to work with sanitized or de-identified data, and now it's about removing and sanitizing the unstructured data. They know what they want it to look like, they know how to do it, but now they've got to actually do it, and that's when they get into this manual redaction phase. Then in the third phase, okay, they're doing that, now they want to scale things up, and they'll go to more automated approaches. We've started calling it internally "programmable privacy," where essentially you build automated systems that sanitize unstructured data, and you can tweak them as needed as you go to strike the right balance of privacy and data utility. At that point they're happy, and data's able to flow through systems in a safe way, maintain compliance, train models, et cetera.
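As a rough illustration of what one of those tunable rules might look like in code, here is a hypothetical sketch of the zip code example Adam mentions: keep the first three digits, mask the last two. It is not any vendor's actual implementation, just the kind of small, adjustable transformation a "programmable privacy" pipeline would chain together.

```python
import re

def mask_zip(match: re.Match) -> str:
    """Keep the first three digits of a ZIP code and mask the last two."""
    zip_code = match.group(0)
    return zip_code[:3] + "XX"

def apply_rules(text: str) -> str:
    # Only one rule is shown; a programmable pipeline would chain many such
    # rules, each tweakable to balance privacy against data utility.
    return re.sub(r"\b\d{5}\b", mask_zip, text)

print(apply_rules("Package shipped to 30332 and delayed near 98109."))
# -> Package shipped to 303XX and delayed near 981XX.
```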
[00:19:23] KRISTINA: Yeah. But you actually talk about the fact that a lot of organizations do get stuck in stage two of that model, right? Like it's really hard for them to kind of like jump into the third piece. Like what is the recipe missing?
[00:19:33] ADAM: I think at that point they've nailed the compliance and regulatory concerns internally once they're in phase two, right? So going from two to three is more about tooling. You asked earlier, is it a compliance issue or a tooling issue? Well, now I have a better answer for you. If they haven't nailed stage one, they have compliance and regulatory issues that they have to go deal with and figure out. Once they're comfortable in stages one and two, or as they're progressing from one to two, it then becomes a tooling problem. The open source models are not great at identifying PII, and typically they don't deal with any of the synthesis of it. So it comes down to either building your own techniques for identifying and then sanitizing PII, or choosing vendors. Obviously I'm partial to Tonic.ai, but there are many great vendors out there as well.
[00:20:25] KRISTINA: Got it. What do you think the biggest misconceptions are at the moment among leaders around this maturity for AI data access? 'Cause I'm assuming that a lot of this has to actually come from the top.
[00:20:39] ADAM: It does. We've always found this; we started the company eight years ago, and initially we were only dealing with structured data. We only got into unstructured data really as LLMs were becoming more common, because there was just such a need at that point. It's always been the case from my point of view, though, that privacy and compliance concerns come from the top. The rank and file, the IC engineers, are typically not thinking about it, and they need to think about it when they're building their systems, or what they build is going to get blocked and never get outside of their development machine or a little toy environment where they can play around and show it off on demo days and at hackathons. So it does come down to having an educated leadership. But look, if you can also educate folks below the leadership level, that's great as well. If your engineers are thinking about this as they're developing, they're going to be able to develop code and systems a lot faster. Doing these types of things as an afterthought is always painful. You end up typically having to re-architect a lot of what you've done, because you had assumptions about what the data was going to look like, but now it has to look different, and it can really be like starting from scratch. It's very painful and expensive.
[00:21:54] KRISTINA: And timely.
[00:21:55] ADAM: And timely. Well, time is money.
[00:21:58] KRISTINA: Exactly. So if you had to give a Fortune 500 executive, one directive tomorrow to unblock AI velocity, what would you tell them to do in the next 90 days?
[00:22:11] ADAM: In the next 90 days, I would have them button up their regulatory and compliance posture. They need to have very clear definitions of what they consider sensitive and what they don't, and it won't be black and white, like this is always sensitive and this never is; it's going to be more complicated than that. It's like, okay, if the data's used for model training, this is the set of identifiers that are sensitive, but if it's used just for normal dev and test, then it's these things over here. So they have to come up with these rules. That's the first thing. Those rules then need to be broadcast downward, so all of the different teams working on different features and systems are aware of them. At that point, they can have an internal centralized team handle all data de-identification moving forward, and that team can either build internally or purchase from a vendor. Or they can decide if they want to go with more of a decentralized model where each team is responsible for its own setup. I tend to think that in really large companies the centralized teams can sometimes do a better job, is what I have found. It can be too much of an ask for small teams to handle their own data de-identification needs, especially when they have to adhere to company-wide standards that they may not be super familiar with. So the executive is going to have to think about all of this and basically come up with their plan, and I think 90 days is perfectly adequate for something like that.
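One hedged sketch of how those use-case-dependent rules could be written down so they can be broadcast to teams; the use cases and identifier categories here are purely illustrative, not a recommended taxonomy.

```python
# Hypothetical policy table: which identifier types must be redacted depends
# on how the data will be used. Names and categories are illustrative only.
SENSITIVE_BY_USE_CASE = {
    "model_training": {"NAME", "EMAIL", "PHONE", "ADDRESS", "ACCOUNT_ID"},
    "dev_and_test": {"NAME", "EMAIL", "PHONE"},
}

def identifiers_to_redact(use_case: str) -> set:
    """Return the identifier types a team must strip for a given use case."""
    return SENSITIVE_BY_USE_CASE[use_case]

print(sorted(identifiers_to_redact("model_training")))
```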
[00:23:33] KRISTINA: We've spent years treating privacy as a constraint I think on innovation. What we're really seeing now is the opposite, right? Organizations that operationalize privacy correctly are the ones that actually move faster. AI doesn't fail because of a lack of ambition. Obviously, it fails because organizations can't safely use the data they already have. Adam, any other parting words that you would say to folks out there as they take on their AI journey and look at their data operations?
[00:24:00] ADAM: Yeah, absolutely. We're no longer at the point where what the models can do is the constraint on AI. The models are very good. Instead, the constraint is what types of data organizations can use with these AI models. So solving privacy-preserving access to real-world data will determine which organizations can scale their AI systems and which are always stuck in that experimentation, demo day, hackathon phase. That's really where I think organizations need to look over the next couple of years.
[00:24:33] KRISTINA: Great. Well thanks so much for the conversation, Adam. This has been a lot of fun.
[00:24:36] OUTRO: Thank you for joining the Power of Digital Policy; to sign up for our newsletter, get access to policy checklists, detailed information on policies, and other helpful resources, head over to the power of digital policy.com. If you get a moment, please leave a review on iTunes to help your digital colleagues find out about the podcast.