
#409: Privacy as Code with Fides Transcript

Recorded on Thursday, Mar 23, 2023.

00:00 We all know that privacy regulations are getting more strict and that many of our users no longer believe that privacy is dead. But for even medium-sized organizations, actually tracking how we are using personal information in our myriad of applications and services is very tricky and error-prone. On this episode, we have Thomas LaPiana from the Fides project here to discuss privacy in our applications and how the Fides project can enforce and track privacy requirements in your Python applications. This is Talk Python to Me, episode 409, recorded March 23rd, 2023. Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy. Follow me on Mastodon where I'm @mkennedy and follow the podcast using @talkpython, both on fosstodon.org. Be careful with impersonating accounts on other instances, there are many. Keep up with the show and listen to over seven years of past episodes at talkpython.fm. We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode. This episode is sponsored by Microsoft for Startups Founders Hub. Check them out at talkpython.fm/foundershub to get early support for your startup. And it's brought to you by Sentry. Don't let those errors go unnoticed. Use Sentry. Get started at talkpython.fm/sentry. Thomas, welcome to Talk Python to Me. Hey, thanks so much for having me. Yeah, it's great to have you here. I'm excited to talk about privacy. I feel like there was this period where everyone just gave up and decided privacy doesn't matter, either because it was a good trade-off for them at the time, or they decided it was, you know, trying to push a rock up a hill that was never going to make it to the top.

02:02 And so you just don't stress about it, but I feel, you know, like things are coming back a little bit and, you know, we all get to be semi-autonomous beings again.

02:09 Yeah.

02:10 There's definitely been that feeling, and I think it actually mirrors a little bit the way things are going with AI now, right,

02:16 where people feel like the genie's out of the bottle, how do we put it back?

02:19 But I think we've actually seen that happen successfully with privacy, where there was a long time when, you know, you talk to your parents about, hey, maybe don't use Facebook.

02:27 I know this happens to me at least, right? Personal anecdotes.

02:29 So, hey, maybe don't use Facebook.

02:30 You sell your data.

02:31 And the response is always like, well, who cares?

02:33 So I'm not doing anything bad anyway.

02:36 Why does it matter?

02:37 And I think we've seen very much a reversion to, hey, actually, maybe I don't want my insurance company to know everything about me and my family's medical history type of thing.

02:46 And people are starting to care about it again.

02:47 And somehow we're getting that genie back in the bottle, which is, which is great.

02:51 The internet used to be this thing on the side.

02:54 It was like a hobby or something you were interested in.

02:57 Like, oh, I'll go on the internet and I'll read some user forums or I'll search for some interesting thing that I might be interested in.

03:04 And now it's become all encompassing, right?

03:06 Tech and everything else are interwoven so much that I think people are starting to realize, like, oh, if all of these companies can buy, sell, and exchange so much information about me, then that might actually have a real effect on my regular life, my day-to-day life, right?

03:24 It's not just like, oh, I get weird ads in my hobby time when I fiddle with the screen. No, this is everything, right?

03:31 And so we're going to talk a little bit about the laws and the rules that are coming into place, a little bit of these, these changes, but mostly some platforms that you all are creating to allow companies, especially large companies with complex data intermingling, to abide by these laws and be good citizens of this new world that we're talking about?

03:54 - Yeah, absolutely.

03:55 And I think that's another thing we've seen as part of this shift of consumers caring about privacy is you also have individual engineers or individual contributors or managers or people within the organizations that regardless of what laws may require them to do, they also do care about building privacy-respecting software just as the right thing to do.

04:12 And I think we've, yeah, we've seen a kind of a general trend in that as well.

04:15 So that's been good to see.

04:17 - Yeah.

04:18 Well, I'm looking forward to exploring the ideas and then the platform as well.

04:21 Before we get to that though, let's start with your story.

04:24 How'd you get into programming, Python, privacy, all these things?

04:28 - So I actually studied politics in college, but my best friend was a computer science major.

04:33 And when I found out that in college, he was already freelancing, working at home and making way more money than I did in my part-time job.

04:41 I was like, hold on, I think this computer science thing might have a future.

04:44 So I was just kind of self-taught, and I ended up doing some data.

04:49 I got a data intelligence job right out of school despite having zero relevant experience or knowledge.

04:54 And I was told, like a week or two in, while I was working on this case that we had, that I had to pull stuff from an API and put it in a database.

05:02 And I had basically never really written a line of code before.

05:05 And I somehow ended up on Python, and somehow ended up with, you know, MySQL, and I made it work.

05:10 And just from there, just fell in love with the, really the problem solving aspect of coding and just creating value from basically nothing, right?

05:18 - Yeah, I can just see how that search goes.

05:20 You know, how do I easily pull data from an API?

05:23 Use requests in Python.

05:25 I go, okay, let's give this a try.
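To make that concrete, here is a rough sketch of what that kind of first "pull from an API, put it in a database" script might look like. The endpoint URL and field names are made up for illustration, and sqlite3 stands in for the MySQL the guest mentioned, just to keep it self-contained and runnable.

```python
# Minimal sketch: fetch JSON records from an API and load them into a local database.
# The URL and fields are hypothetical; swap in a real endpoint and schema.
import sqlite3
import requests

API_URL = "https://api.example.com/records"  # hypothetical endpoint

def sync_records(db_path: str = "records.db") -> int:
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    records = response.json()  # assume a list of {"id": ..., "name": ...} dicts

    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, name TEXT)")
    con.executemany(
        "INSERT OR REPLACE INTO records (id, name) VALUES (:id, :name)",
        records,
    )
    con.commit()
    con.close()
    return len(records)

if __name__ == "__main__":
    print(f"Synced {sync_records()} records")
```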

05:27 - It was probably something similar.

05:28 I think I had a friend, you know, on the Slack team at the time that was into Python or something.

05:35 And I think I just ended up on Python and it's been one of the best accidents of my life.

05:39 Now, however many years later, still working with Python daily.

05:42 - Yeah, excellent.

05:44 And now what are you doing for a job?

05:46 - Yeah, so I'm working at a company called Ethyca.

05:48 So we focus on privacy tooling for engineers specifically or in a broader sense these days, working on privacy tools in general that can be kind of a meeting point between engineers and compliance professionals, right?

06:03 So like the compliance team, lawyers, things of that nature at your company, trying to build a common ground for them to kind of build off of and work together in.

06:11 - Excellent, it sounds really fun.

06:12 And you've got a cool platform that you all have open sourced, and we're gonna talk about that in a minute.

06:18 But let's keep it high level for a moment.

06:21 I talked about the swinging pendulum where it went to like, "YOLO, I don't care.

06:27 "Internet's fun, it's free, it doesn't matter to you.

06:29 "Oh my gosh, it matters, I want my privacy back.

06:32 "I can't believe people are doing X, Y, and Z and not just showing me ads with this.

06:36 So we got the GDPR, obviously that made such a huge splash, and it made me personally scramble to meet the requirements.

06:46 And what I think is really interesting about that is those kinds of laws, and maybe I could get your thoughts on this.

06:52 I think it's a bit of a problem or a challenge with these kinds of laws, you can just see through the veil.

06:59 Like, okay, it talks about an internet company or something, but what they mean is those Facebook, Google, you know, like the five huge companies or something in the world.

07:10 Most of them on the West Coast of the US that are like the bullseye.

07:13 The sights are on them and these laws are trying to apply to them.

07:17 Yes.

07:18 In general, outside of it too, but it's those five or whatever that really, you know, were the catalyst for this.

07:25 Whereas, you know, small companies like me are like, oh, well I have to have a recorded history of an opt-in, not just an opt-in in principle, but I need the date and a time and I need a record.

07:37 So I got to go rewrite my systems so that I can have a record of opt-in for communications if I have it.
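As a hedged illustration of what that record keeping can look like, the sketch below stores each opt-in as an append-only row with a UTC timestamp and a source, rather than a single boolean flag. The table and column names are invented for the example, not taken from any particular system.

```python
# Toy consent log: one row per opt-in event, with when and where it happened.
import sqlite3
from datetime import datetime, timezone

def record_opt_in(con: sqlite3.Connection, email: str, channel: str, source: str) -> None:
    con.execute(
        """CREATE TABLE IF NOT EXISTS consent_log (
               email TEXT, channel TEXT, action TEXT, source TEXT, recorded_at TEXT)"""
    )
    con.execute(
        "INSERT INTO consent_log VALUES (?, ?, 'opt_in', ?, ?)",
        (email, channel, source, datetime.now(timezone.utc).isoformat()),
    )
    con.commit()

con = sqlite3.connect("consent.db")
record_opt_in(con, "jane@example.com", "newsletter", "signup_form")
```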

07:43 And when there's one or two of you at a company, that shuts the company down for weeks, whereas when there are 10 people at Google that have got to stop and go do that, Google just keeps going.

07:53 Right.

07:54 And so there's this tension of, I think, unintended harm that can come from these by asking for blanket equal compliance, when what the laws are really written for, and what people have in mind, are these mega trillion-dollar companies that have unlimited money to solve these problems.

08:15 How do you see that in the world?

08:17 - Yeah, it's been, so I think specifically with the CCPA, which is the California Consumer Privacy Act, they kind of noticed that and there is actually a cutoff, I believe, I wanna say something like revenue or valuation under 50 million.

08:30 So there is kind of a safety clause for smaller businesses, because like you said, when GDPR came in, it just went after everyone, right?

08:39 Irrespective of size or resources.

08:42 And it was actually more of a punishment for smaller companies.

08:45 Because like you said, if they come for you, Michael, and they say, "Hey, Talk Python, you're doing great.

08:51 "You've got all this data.

08:52 "All these people are buying courses, but you're not keeping track of consent or whatever.

08:55 And you're, you know, a one or two person team, so now you've got to stop for weeks. And Google and Facebook, they have the privilege and the ability to, yeah, they're going to hire privacy engineers and they're going to try to do things the right way. But if they get fined a few hundred million, it's just the cost of doing business, right? They are calculating these fines as part of their annual spend.

09:14 And that's just how they do it. Whereas for you, that could be, you know, something that ends your business or puts you in hot water or does something else. Or maybe you just don't even want to do business in the EU anymore, because it's too much of a hassle compared to what it was before. So I think, yes, GDPR really misstepped there and it did end up punishing a lot of smaller businesses. But I think they've learned from that, they're trying to iterate on it. I think cookie consent is a big one. They're now kind of revisiting that and saying, hold on, did we implement this the right way? Did everyone implement this the right way? And I think the CCPA is building on that in a really good way. And we can see there's also a lot of shared language. So I think even though GDPR was disruptive, and it probably hurt the people that it didn't mean to hurt, at least initially, it was still a pretty good base for us to work off of, just in terms of a general privacy framework, and will hopefully get us to a place that's a little bit more equitable in terms of who's being punished and who's actually dangerous.

10:08 Like you said, it was like a few outliers that brought about this requirement, right?

10:12 It was Facebook acquiring WhatsApp and then doing uncool things with access to both data sets.

10:16 So it was things like that, that this was designed to stop.

10:19 I think slowly we're getting to a place where it's being wielded more in that vein.

10:23 - Yeah, the US Supreme Court has not ruled yet on section 230, but who knows what that's gonna unleash.

10:29 That's a whole nother topic, we're not on that one.

10:31 But you know, I mean, there's still large waves like that could crest and crash and whatnot.

10:37 I do wanna come out and say explicitly, I'm not against the GDPR, and I'm not unhappy that I changed my systems around to comply to it.

10:46 Like it's really important to me.

10:48 I've talked to advertisers and told them, no, we're not inserting tracking pixels.

10:53 We're not inserting other types of retargeting stuff for you.

10:57 If you need that more than you need to talk to my audience, you go find some other audience.

11:02 Like seriously, I've told them to go away.

11:04 And usually they're like, okay, fine.

11:05 We'll work around it.

11:07 But I also just want to kind of put that out there like a look, these rules come in aimed at the top and sweep up a bunch of other people as well.

11:17 Right.

11:17 So I think there's like this mixed bag, I guess is what I'm trying to say.

11:20 Yeah, there definitely is.

11:22 And funny you mention that, I was shocked to find out that there's a podcast I'm subscribed to, not yours, so don't worry.

11:28 A podcast I've subscribed to that I can't even download on a VPN, because there's some tracking requirement that my VPN blocks, so it won't even let me download the episode. It's just crazy to think that that's kind of the point where we're at, even with podcasts.

11:42 - It's really terrible.

11:43 And a lot of that, I think, comes from people wanting to do dynamic ad insertion.

11:48 - Yeah.

11:48 - Right, they wanna go, okay, this person is coming from this IP address in this town, and we have this extra information we've gathered from like these nefarious back channels.

12:00 And we're pretty sure that's Thomas, and he works in tech, and we're gonna dynamically insert this ad, you know. And if there's walls that stop that, then maybe... Now, let me go on one more really, really quick rant here,

12:16 just to mention this cookie consent thing. No, I'll try to not be on the soapbox too much for this episode, but I think right here in the CCPA, the California law, there's this right to non-discrimination for exercising your rights.

12:29 When I look at the web these days, all these cookie pop-up notifications, they're like the plague.

12:36 They're just everywhere.

12:37 Many of them just say, you have to say okay, or go away.

12:40 And you know, it's like, okay, well that's not a lot of control I have.

12:44 on the other hand, we have a lot of technology control.

12:48 On my network, I have NextDNS, which will block almost all the tracking and retargeting and all of that stuff.

12:54 I use Vivaldi with the ad blocker on and I'll go to these places and they'll say, turn off your ad blocker.

13:00 Or they'll show you the cookie thing and they'll say, turn off your ad blocker.

13:05 If you don't turn it off, you don't get to come in.

13:07 I think what they should have done instead of having a law that says you have to tell people you're tracking them and then make them press okay.

13:14 And then track the heck out of them.

13:15 Right.

13:16 Say, like kind of this last line here in the CCPA, there's a right to non-discrimination for exercising your privacy, and say you should be able to have an ad blocker or other tracking protection mechanisms without being punished or blocked compared to the other visitors. That would have solved it.

13:34 And there'd be no need for these pop-ups everywhere.

13:38 If, as an informed citizenry, we decide we want to run ad blockers or other types of tracking blockers, we can, and we just go about our business, right?

13:47 Like that would have been a more sophisticated solution, I think, than making everybody say, okay, I agree, you're tracking me, let's go.

13:55 - Yeah, this is, yeah, obviously, I think it's been contentious, right?

13:58 Like I can barely remember the internet without this now, despite it being just a few years ago, and now it's just so normal.

14:04 You go to a website and you're just waiting. You're waiting for the other shoe to drop and the thing pops up and you click the thing.

14:08 Where's the cookie thing? Okay, I got it. I got it. It's out.

14:10 Exactly.

14:11 Is it in the top or the bottom of this one?

14:12 Yeah. It's a mini game every time you go to a new website. And there are even browsers now that have a toggle of just reject. Just don't show me cookie consents. Just do it for me type of thing.

14:22 Yes. I accept them. I accept them.

14:24 Yeah. And I understand why it was there. They want people to be aware of what's going on, But it's kind of like, you know, EULAs or end user license agreements or whatever.

14:34 Like people don't really stop to read it, so are you really informed, right?

14:37 And again, this is actually where, talking about non-discrimination, I think the privilege of being more privacy-aware comes in, because there are a lot of people, right?

14:44 If you look at, you know, someone like my parents who, you know, they're going to go to the websites, they're going to see the thing that's like, hey, we're using -- they don't know what a consent cookie is.

14:53 They have no idea what the concept of tracking on a website is.

14:55 They don't know what the data is actually used for, anything like that.

14:58 So once again, it's just kind of an inconvenience that I don't think really increases anyone's privacy, especially because a lot of companies are still using the dark pattern of, you can click yes, track everything, but if you wanna reject or you wanna only give the required cookies, it's like another step or two, and it's unobvious, and it's like, you know, just a tiny little hyperlink text you have to click on.

15:19 It's not a big, flashy button.

15:21 - Light gray on dark gray.

15:22 (laughing)

15:24 - Exactly, exactly. - Text or something, yeah.

15:25 Yeah, yeah, yeah.

15:26 - Exactly, something like that.

15:27 Hey folks, let's pause for an ad.

15:31 Remember last time I had an ad for Microsoft for Startups?

15:34 This time around, I asked the AI to have some fun with it just to show you how powerful what they have on offer really is.

15:40 Here we go.

15:41 Me.

15:42 Can you rewrite the Founders Hub ad in the style of a corporate executive using too many buzzwords?

15:47 AI.

15:48 Sure.

15:49 Here's a rephrased version of the ad in corporate speak.

15:52 This segment of Talk Python to Me is brought to you by the Microsoft for Startups Founders Hub, a synergetic platform of innovation and digital transformation.

16:00 As a visionary entrepreneur, you are invited to leverage the cutting edge technology solutions available through our collaborative ecosystem.

16:09 Our offering is unparalleled with over six figures worth of benefits, including $150,000 of Azure credit allocation, which is the epitome of value proposition in the cloud market.

16:19 You will have access to the new Azure OpenAI service and OpenAI APIs, enabling you to infuse AI-driven solutions into your applications with ease.

16:29 Our dedicated technical advisory team will provide you with a tailored guidance on best practices for implementation, scalability, and security, while our network of mentors will provide you with valuable connections within the startup landscape.

16:42 This program is designed to maximize the growth of your venture, with expert support in areas such as product development, fundraising, and go-to-market strategies.

16:51 There are no funding requirements, making it accessible to entrepreneurs at all stages of their journey.

16:56 So take advantage of this disruptive opportunity and join Microsoft for Startups Founders Hub today.

17:02 The process is simple with an effortless five minute application process and the benefits are substantial.

17:08 Transform your startup with the power of AI and become a leader in digital transformation.

17:13 Visit talkpython.fm/foundershub to enroll.

17:16 Thank you to Microsoft for supporting the show and to OpenAI for making this ad fun.

17:21 So I think we're finding our way still.

17:26 We're not totally figuring it out.

17:28 There are attempts sometimes that don't really have the outcome

17:32 I think people intended, like this cookie consent one,

17:34 but still, the idea behind it was pretty good, even if the way it came out wasn't that great.

17:40 At least my perception.

17:42 I know some people really appreciate the ability to have those buttons, but I just say, like, I'm just going to block it; no matter what I answer, it's not coming through.

17:48 So I don't care.

17:49 but companies have to live in this world, right?

17:53 They have to live in the world where the pendulum is swinging back.

17:56 And so I guess, you know, we talked about GDPR and I don't really want to go too much more into it at this point, because we've talked about it so much, but there's an interesting article called "The 30 biggest GDPR fines so far", and it's not updated for this year, but people can look through and see, you know, it's exactly the companies that I described for the most part, except there's like some weird ones in here.

18:17 I don't know if you've seen this article, but there's like H&M, the clothing company, where in Germany they filmed their employees and then shared that internally.

18:28 And that was the violation, which was unusual.

18:31 But so these are the ones that people might know.

18:34 Go ahead.

18:35 What are you gonna say about that?

18:36 I was gonna say that a lot of people actually forget.

18:37 I think that's an interesting one specifically because GDPR has kind of different classes of people it protects.

18:43 And actually, employees is absolutely one of them.

18:46 Because a thing that you will see is companies will use internal employee data and say, "Well, there are employees.

18:51 They don't have data privacy rights because they work for us.

18:54 So we can use their information however we want to." You can't do that, right?

18:56 GDPR says specifically, "Yeah, you can't sell your employees' data.

18:59 You can't use their biometrics for whatever you want to," all that kind of stuff.

19:05 So I think it is really important also that H&M got fined for that because it's showing, "Hey, you have to treat your employees as well as you treat your customers." when it comes to data privacy.

19:12 - Interesting. - It's pretty cool.

19:13 - Yeah, it's not just your web visitors, it's the people that--

19:16 - Exactly, it's trying to protect everyone, no matter what their relation is to said company.

19:21 - Another one I think that just highlights the greater subtlety of all of this is this article in The Register entitled, "Website fined by German court for leaking visitor's IP address via Google Fonts." Are you familiar with this at all?

19:35 - Yeah, I'm vaguely familiar with it.

19:38 I think it's an interesting case. You can see the fine is 100 pounds, right? Or sorry, not pounds, 100 euros, which ends up being 110 US dollars. And so I think it was very much meant to catch news headlines, right? And just kind of warn people, "Hey, we've now kind of decided that Google Fonts is not going to be great. And so this is a very inexpensive warning to everyone else that maybe you should start looking into whether you're using Google Fonts or not." Yeah, I find this very interesting. Because again, it's almost like the cookie consent thing.

20:08 This will ripple across most websites probably.

20:11 - Right, I think people think about Google Analytics and some of these other conversion tracking systems that you plug in, you're like, "Okay, I realize we're tracking." But even really subtle little things like linking to an image of a YouTube video, like that will drop cookies from YouTube and Google onto your visitors and all those things.

20:33 You're like, "Wait a minute, I just pointed an image?

20:35 Like that's nuts. And when I read this, I thought, oh, maybe there's like one person that sued this company and they got $100 or something, I don't know. But what if they had 100 million visitors and everyone decided, oh, we'll do a class action lawsuit? I mean, it could explode. I think that's why it caught the headlines so much.

20:52 Yeah. And so most of the time with these violations, and this is where it even gets a little bit sticky, it's individual countries' data protection agencies that will go after companies. So, for instance, a lot of the big ones happen in Ireland because, again, a lot of tech companies, especially Silicon Valley tech companies, have headquarters in Ireland.

21:11 So you see the Irish privacy authority levy a lot of these fines in some cases, although people are criticizing them for being kind of lenient. But I think in this case, it was very specifically, like you said, for one reason or another, someone or the government just decided, "Hey, we need to call out this Google hosted web font and warn everyone else that, "Hey, maybe you shouldn't be using those." I don't know, it is very interesting.

21:32 And I do feel like a lot of these go under the radar.

21:34 - I think they do, and I don't even think they name the company that this was applied to.

21:39 By the way, people, if they're like, "But what do we do?" Fonts.bunny.net is a really cool option.

21:46 Zero tracking, no logging, privacy first, you are compliant, and a drop-in replacement for Google Fonts.

21:52 So people should check that out.

21:55 If they're like, "I kind of want this functionality, "but I kind of don't want it anymore." (both laughing)

22:00 - Yeah, so that's a pretty cool option.

22:03 All right, so that sets the stage a little bit, but let's maybe talk about some of the problems that large organizations have.

22:12 So I know that you worked at a large organization where it was like, we have this, what data do you have about me request, or how do you use my data request?

22:21 And I can only imagine at a multiple thousand person company there's these databases and people like dip into them and take something and then who knows where it goes.

22:31 And then they hook up some third party other thing and then like, it's off to the races.

22:36 Like, tell me what you did with that.

22:37 Like, I don't know, it's out.

22:40 - Yeah, it feels like, maybe this is too much of an American cultural reference, but like a take a penny, leave a penny, but for data, right?

22:47 Like, you might drop some in there, you might take some out.

22:50 No one really knows where it went.

22:51 It's just now circulating in the broader economy.

22:54 And so with that company, this is my last company, also a startup, and I was a data engineer there.

23:00 So I've been mostly a data engineer until my current position.

23:00 And it got to the point where, and this is gonna sound crazy, luckily through the power of, like, you know, Snowflake and dbt and things like that, we were able to actually replicate a data warehouse per country.

23:14 So like all of our EU data stayed in EU, all of our Canada data stayed in Canada.

23:19 We were basically just spinning up as many warehouses as we needed to.

23:21 Like so when CCPA came online, we were like, All right, we're spinning one up in California and then the rest of the US has one somewhere else, right?

23:28 But it's just obviously, that's just not sustainable.

23:30 We were relatively small data engineering team and we automated most of it, but it was very clear that that became a huge problem.

23:36 - It might be sustainable if it's Europe, US and other, or something like that.

23:40 - Right, exactly.

23:41 - Where the US people-

23:42 - Australia was slowly gonna get on that list, yeah.

23:45 - We're like, all right, US people, they get no protections.

23:49 We sell them like crazy.

23:50 The Europeans, we'll be careful with them. - The Canadians, they're nice, we'll kind of be nice with them.

23:55 But we're seeing this stuff pop up more and more, like more regionally, and it's getting harder and harder to follow.

24:02 - Yeah, that was the problem, right?

24:03 When it's California and then the rest of the US, which at the time it was, this was a few years ago, you'd say, "Okay, so we have a data center in California to comply with CCPA and then we have a data center outside of California and we're good." And then it's like, "Well, Virginia just passed a privacy law.

24:16 I don't think we have servers in Virginia." Like, oh, Idaho just passed one.

24:20 I don't know if there's server farms in Idaho.

24:23 It's like, you know, it became a problem like that, where you can't, you're not gonna spin up 50 data centers, one in each state.

24:28 Like Idaho, I don't know, they probably don't have data centers.

24:30 Energy's tough there, so.

24:31 - And if they did, and you need to spin them up, you'd need to do it in person and it's gonna take a month.

24:36 - Yeah, exactly, exactly.

24:38 It's very true.

24:39 - Sorry, I got a bad tan, but the data center's coming along.

24:42 - Yeah, so it was, you know, so that complexity, right?

24:46 So all these different laws.

24:47 But then on top of that, like you said, putting in the access controls for figuring out where data is going and how, so definitely having a tool like dbt, which if people don't know, it's kind of like a very programming focused data analytics tool, building models and stuff.

25:01 So you had a good lineage graph of where all the data was coming from, what it was doing, but we still had to document our use, because legally you have to have a valid use for every piece of data that you are storing in there. And so I was just spending more and more time in calls with, you know, our security team and our privacy professionals, our compliance team, just answering questions like, hey, here's a gigantic graph of all of our tables, all of our databases, just everything we could possibly be doing.

25:29 Like, how does data flow through here?

25:30 Like explain to me how this PI goes from here to here and what it's used for exactly.

25:35 And that kind of, as we talked about before, it scales... Sorry, it should be, I think, just dbt.

25:40 Yeah.

25:42 That's the wrong one.

25:42 I know.

25:42 probably happens.

25:43 I'll find it.

25:44 Keep going.

25:45 But that's not really scalable.

25:48 So again, as we talked about before, this ended up punishing smaller companies a lot more, because if you're Google, you can just hire 20 people out of nowhere, call them privacy engineers and, you know, just say, hey, it is now your full-time job to just keep track of these things and help us stay compliant.

26:05 But if you're a smaller company like we were, then a lot of that fell to like myself and the data engineering team.

26:10 And then of course the product engineers as well.

26:12 And so that makes it really difficult. That adds a pretty large burden to doing business.

26:16 And after being there for a while, I then had an opportunity to come work at Ethyca. And I was absolutely sold on working at Ethyca when they said, "We're trying to build a platform that allows engineers to just handle this like engineers like to," which is with automated tooling, with CI checks, with open source. Yeah, with the YAML files, with open source Python code.

26:36 And I was like, "Hey, this sounds great. I'm spending most of my day worrying about this anyway, I'd love to just get paid to solve this problem for other people. And that's how I ended up at Ethica. And it's been been a journey ever since. And I think we're one of the challenges of tackling something like this is like we just talked about. It's such a broad problem space. So you can come in and you can handle the cookie consent thing, right? But then they're gonna say, well, to have a holistic private solution, we also need to handle knowing what our code does, and data mapping and DSRs, which you know, we're going to get to a second. So there's actually, it's like this multi-prong...

27:07 - DSR data, what's the DSR stand for?

27:11 - Yeah, so DSR is data subject request, and that is--

27:14 - Is that like what data you have about me kind of thing?

27:16 - Exactly, exactly.

27:17 So I think one that people have probably heard of before is the right to be forgotten, which is also included in that.

27:23 So that's the ability for me to go to a company and say, "Hey, I would like to see what data you have about me." And so you actually, I'm gonna give you my email, right?

27:32 And that's my primary identifier on your system.

27:33 I'm gonna give you my email, and then you need to go scour your entire infrastructure and every piece of PII and data you have related to me, you need to get back to me in like a CSV format so that I can very easily see what you're tracking.

27:46 Then you have, like I said, the right to be forgotten.

27:48 So that is, you know, say I'm using BigCo's whatever email service and I don't want to use their service anymore.

27:55 And I say, hey, I'm sending you a request to delete any and all data related to myself.

28:00 So I no longer want to be your customer.

28:02 I've deleted my account.

28:03 everything with, you know, john@somecompany.com.

28:07 You need to take that email, run it through your system and delete every single piece of PII related to it.
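A deliberately simplified sketch of both halves of that, the access request and the right to be forgotten, might look like the following. The table-to-column registry is hypothetical and the whole thing assumes one SQLite database; real systems span many databases and third-party APIs, which is exactly the scaling problem discussed later.

```python
# Toy data subject request handler: export or erase everything keyed by an email.
import csv
import sqlite3

# Which tables hold PII keyed by email. In reality, building and maintaining
# this inventory is the hard part. Identifiers come from this trusted registry,
# never from user input, so the f-strings below are safe for the sketch.
PII_TABLES = {
    "users": "email",
    "orders": "customer_email",
    "support_tickets": "reporter_email",
}

def export_subject_data(con: sqlite3.Connection, email: str, out_path: str) -> None:
    """Access request: dump every matching row, prefixed with its table name, to CSV."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for table, column in PII_TABLES.items():
            for row in con.execute(f"SELECT * FROM {table} WHERE {column} = ?", (email,)):
                writer.writerow([table, *row])

def erase_subject_data(con: sqlite3.Connection, email: str) -> None:
    """Right to be forgotten: delete every matching row."""
    for table, column in PII_TABLES.items():
        con.execute(f"DELETE FROM {table} WHERE {column} = ?", (email,))
    con.commit()
```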

28:11 - Yeah.

28:12 - And so these are the, these are kind of like privacy protections we're talking about, but that stuff is, is complicated.

28:17 And so.

28:18 - Yeah.

28:19 Well, I talked earlier about how it was really challenging for small, small companies.

28:24 I think this thing you're talking about now is, it's actually not that bad for small companies.

28:28 I think it's killer for the medium sized business that doesn't have the Google sized tech team to track it.

28:34 - Right.

28:35 - They've got a ton of people that mess with it and a ton of data.

28:37 - Yeah, and a lot of complexity.

28:39 - A lot of integrations, yeah.

28:40 - Yeah, and that's an interesting thing we've seen is that a lot of times when people are out of compliance, it's not actually because they are malicious and they don't care about people's privacy.

28:52 It is because they just, they physically cannot.

28:55 If you go to someone and say, "Hey, you have a hundred thousand," This is not uncommon, like 100,000 Postgres tables.

29:01 And you need to tell me exactly where every bit of PI is in those 100,000 Postgres tables.

29:06 It's not going to happen.

29:07 No one actually knows.

29:09 There's probably people that have left that may be new.

29:11 And now there's some dangling Postgres database out there in AWS somewhere that has PIs they don't even know about.

29:15 Just doesn't even show up on their maps anymore.

29:17 And that's the biggest challenge, is that it's not people doing things out of malice.

29:23 It is purely the technical scale of the problem that's just huge.

29:27 And again, like I said, even Google with an army of privacy engineers or Meta with an army of privacy engineers, they still get fined all the time because it's just not really possible to catch everything manually at that scale.

29:38 And that's what most people are still trying to do is to do everything manually.

29:41 - This portion of Talk Python to Me is brought to you by Sentry.

29:47 Is your Python application fast or does it sometimes suffer from slowdowns and unexpected latency?

29:54 Does this usually only happen in production?

29:56 It's really tough to track down the problems at that point, isn't it?

29:59 If you've looked at APM, Application Performance Monitoring products before, they may have felt out of place for software teams.

30:06 Many of them are more focused on legacy problems made for ops and infrastructure teams to keep their infrastructure and services up and running.

30:14 Sentry has just launched their new APM service.

30:18 And Sentry's approach to application monitoring is focused on being actionable, affordable, and actually built for developers.

30:25 Whether it's a slow running query or latent payment endpoint that's at risk of timing out and causing sales to tank, Sentry removes the complexity and does the analysis for you, surfacing the most critical performance issues so you can address them immediately.

30:39 Most legacy APM tools focus on an ingest everything approach, resulting in high storage costs, noisy environments, and an enormous amount of telemetry data most developers will never need to analyze.

30:51 Sentry has taken a different approach, building the most affordable APM solution in the market.

30:57 They remove the noise and extract the maximum value out of your performance data while passing the savings directly onto you, especially for Talk Python listeners who use the code Talk Python.

31:07 So get started at talkpython.fm/sentry and be sure to use their code, Talk Python, all lowercase, so you let them know that you heard about them from us.

31:18 My thanks to Sentry for keeping this podcast going strong.

31:21 What about bad actors?

31:26 By that I mean there are companies that try to do the right thing like mine.

31:32 You can go to the website, and I spent a lot of that two weeks on this part.

31:36 There's a "download everything you know about me" button, and there's a "nuke my account, completely wipe me off the face of the earth as far as you're concerned" button.

31:43 And to my knowledge those are totally accurate and sufficient.

31:47 However, what if there's a company that says here's all the data I have with you and here's the places I share it.

31:53 And they leave out the three most important and dangerous ones.

31:56 Like, do you know what recourse there is?

31:59 Because it looks like they're complying.

32:00 It looks like I requested the thing, they gave it to me, I asked it to be deleted, they did.

32:04 Except for in that dark market where they're selling it to shadow brokers for ad data and credit card mix-ins.

32:10 And that's way more valuable, we'll keep that.

32:12 - Yeah, I mean, this comes down to somehow they would just have to get found out.

32:16 There'd have to be an internal whistleblower.

32:18 There'd have to be an investigation.

32:19 There would have to be some kind of audit because they do, as part of GDPR, you are required to submit things like a data map, which we'll talk about in a little bit, which is basically, where is data going?

32:31 What is our valid use for said data and all that kinds of stuff.

32:34 But like you said, if there's a truly bad actor that is leaving things out of reports on purpose and not letting customers know that they're doing certain things with their data, I'm actually not sure how that would get kind of discovered. - I think you're right.

32:44 Maybe a whistleblower or maybe somebody says, "There's no way this data got over there without going through there until I'm--

32:50 - Right, exactly.

32:51 - I'm gonna try to get some legal recourse to, like, make you show us, make you testify, so at least you'd have to lie under oath.

32:57 (laughing)

32:58 Instead of just lying to me.

33:00 - And even this was, this actually was a big sticking point recently.

33:03 Florida's also, you know, working on their own privacy law and a big sticking point that I believe may not go through was that they could not agree on whether individual citizens should be allowed to sue companies for data misuse or if it should be purely something the government handles.

33:18 And I think that's, that's an interesting thing to think about.

33:20 It is.

33:21 It's one of those things that sounds amazing.

33:22 Like, yes, sure.

33:23 If you're abusing, you know, company X is abusing Thomas, Thomas should have some direct recourse.

33:28 But you could easily destroy a company just by going, like, let's get 50 people to all say, here's your cookie-cutter letter that we send over as part of the legal process, and, you know, just knock them offline.

33:39 Right.

33:39 Knock them out of business.

33:40 So I can see both sides of all that, all over again.

33:43 All right.

33:43 So I kind of derailed you.

33:44 we were talking about like the types of things that these medium scale organizations like really get hung up on and you touched on some, but.

33:52 Yeah. So sorry. Let me, yeah, I'll, I'll go back.

33:54 So number one, the largest issue that we see at this scale, and it's actually anywhere from medium to large, right.

34:01 Even with Google, and probably at Twitter size, you know, kind of thing, I would also bet really good money

34:07 there's no one there that really knows where everything is. It's just too much to handle manually or within people's heads.

34:14 So the number one problem is that people don't know where their data is.

34:16 That's a huge issue.

34:17 The number two problem is even if they know where all that data is, right?

34:21 Theoretically in a perfect world, if someone gives you an email and says, Hey, you need to delete this email across all of your tables.

34:29 And okay, I know we have this email and this PII in a hundred tables and three different APIs that we use, because we use, whatever, Zendesk and Salesforce.

34:37 Okay.

34:37 So now you've got that information in a perfect world.

34:40 How do you actually execute that?

34:41 Like there are plenty of companies that have someone on staff full time that just fulfills these DSRs and right-to-be-forgotten requests and things like that. So it is not really efficient to say, okay, I've now got to manually go run SQL queries in 100 different database tables, 100 different databases, and I've now got to log into three different APIs. It's just not doable manually; you need to automate it, is basically what I'm trying to say. So even if you know where everything is, how do you automate that? So that's another problem we were trying to solve. And then finally, it's the data mapping piece, right? So you not only need to know where your data is, you need to know what type of data it is and why you have it. And that's really difficult. Because maybe three years ago, I did some proof of concept where I was grabbing people's addresses and trying to figure out a way to find cheaper shipping for our e-commerce website. And, whatever, the table's still there. And so then three years later, someone comes in, hey, I found all this PII in the database, like, why did you collect this? What is this for? And I've already moved on to another company, because, you know, it's startups. And that's a problem, because you need to have a valid use for every bit of PII that you have in your system. And so it's kind of this lack of documentation and knowledge that just brings about all these problems. And again, without automated tooling, I just don't think it's really feasible, which is, again, where Ethyca saw a place to solve a huge problem.
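To illustrate that "valid use for every bit of PII" bookkeeping in the simplest possible terms, here is a toy inventory that records an owner and a documented purpose per field and flags anything undocumented. The field names and purposes are invented for the example; they are not a legal taxonomy or Fides' actual data model.

```python
# Toy PII inventory: every (table, column) gets a documented purpose and an owner.
# Anything without a purpose has no recorded valid use and should be investigated.
PII_INVENTORY = {
    ("users", "email"):       {"purpose": "account login and receipts", "owner": "platform-team"},
    ("users", "address"):     {"purpose": "order shipping",             "owner": "logistics-team"},
    ("analytics", "ip_hash"): {"purpose": None,                          "owner": None},  # forgotten POC table
}

def undocumented_fields(inventory: dict) -> list[tuple[str, str]]:
    """Return every (table, column) with no recorded purpose, i.e. no valid use."""
    return [field for field, meta in inventory.items() if not meta.get("purpose")]

print(undocumented_fields(PII_INVENTORY))  # [('analytics', 'ip_hash')]
```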

36:01 Probably also a little fear. By that I mean, the short times that I spent at these larger companies, there were systems that were like, don't touch that.

36:10 Runs.

36:11 It's important.

36:12 Nobody can make it.

36:13 Nobody can fix it.

36:14 We probably can't redeploy it.

36:16 Just don't touch it.

36:17 And what if it, what if it has a bit of data?

36:20 It cannot have a nullable foreign key relationship.

36:23 No, that's a strong one, and I want to remove it from this table, but the thing shall not be touched, and if no one can keep it running, it's my problem.

36:31 If I break it, I don't want that problem.

36:33 I could just stay.

36:34 Yeah.

36:35 That's a problem.

36:36 Right.

36:36 Yeah, that definitely becomes a problem too.

36:39 Things that get forgotten about, things that people don't want to touch, things they've lost kind of the institutional knowledge of how it got there, and how to even get out of it if they wanted to.

36:47 Like you said, fear of downstream breaking changes, right?

36:51 So say I come in and mask this username, I have no idea if it's going to break some analytics tool, if it's going to ruin our marketing department, like I have no idea, right?

37:00 Right.

37:01 Why can't we send email anymore?

37:02 Well, you see.

37:03 Yeah, exactly.

37:04 And so it's also this difficulty of communicating across the organization.

37:10 Because oftentimes you'll get privacy engineers, and they'll be embedded into a product, into a team.

37:16 And theoretically, their job is to talk across the entire org, but there's not like some centralized tool.

37:21 There's no Zendesk of privacy where like, okay, a whole organization uses this one tool, and we can put in tickets, or we can see what the state of privacy is across the organization, et cetera, things like that.

37:31 There's nothing like that that really existed.

37:34 And so that's when we kind of realized, okay, we need to build some kind of platform where it can just be like a one-stop shop for everything privacy engineering related.

37:43 So that's going to be engineers and privacy professionals.

37:46 The engineers do their work.

37:47 It all flows upwards into this tool.

37:49 And then the compliance professionals can get all the information they need out of that tool and trust that it's correct because it's done in a programmatic way.

37:56 And it's automated and all the stuff that you need, right?

37:59 Yeah.

38:00 All right.

38:01 Now, really one final question before we jump into your platform, which solves many of these problems. What about AI? What if it learned something from personal information and then you ask for your personal information? You can't go and say, show me the node in the neural network that has my information.

38:18 But at the same time it knows something about it, right?

38:21 Correct. Yeah, so it is trained on it. There are different ways to deal with this, but like you said, you can never really know. I mean, this is a rabbit hole. So for instance, you can use AI to generate fake PII and then train a model on the fake generated PII.

38:39 That's one way to do it.
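For a sense of what "train on fake PII instead of real PII" can look like in practice, here is a minimal sketch using the Faker library. Faker is not AI-based generation, it is just a common stand-in for the same idea, and the record fields shown are illustrative.

```python
# Generate synthetic-but-realistic user records that contain no real person's PII.
from faker import Faker  # pip install faker

fake = Faker()

def synthetic_user() -> dict:
    """One fake user record, safe to use as training or test data."""
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        "phone": fake.phone_number(),
    }

training_rows = [synthetic_user() for _ in range(1000)]
```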

38:40 Right, right, right.

38:41 But again, like you said, due to the very opaque nature of it, we're talking about actual neural nets.

38:46 We're not just talking about machine learning, statistical learning models.

38:48 It's like a neural network, that stuff becomes completely obfuscated.

38:51 Probably like Midjourney, DALL-E, these types of things.

38:53 Yeah, exactly.

38:54 Yeah, it becomes truly a black box.

38:55 There's really no way to know, right?

38:58 And that comes down to regulators stepping in and again, just saying, hey, you cannot use - Yeah. - AI in this model, regardless of the fact that eventually, theoretically, it would be obfuscated.

39:09 You know, that comes down to governments just say, hey, that's not cool regardless.

39:13 - It's gonna be so interesting as this evolves 'cause if it was trained on that information, it kind of is corrupted in a sense, like you can't take one person's information out, you'd have to redo the model. - Exactly.

39:23 - That's so much work. - Exactly.

39:24 - Yeah, it's so tricky. - Exactly.

39:25 So you got to think of that up front.

39:27 So, all right.

39:28 So let's talk about your project, Fides.

39:31 Tell us about Fides.

39:33 - Absolutely.

39:33 So Fides is an open source, I guess, tool for, platform maybe is a better word for it, an open source platform for privacy engineering.

39:42 And it's really designed towards those two personas that I've talked about, where you have privacy professionals, you have a compliance team, and they need an easier and a more accurate way to interface and work with the engineering team, other than just calling tons of Zoom calls to ask them, hey, what does this table do?

39:58 Which again, is fine.

39:59 Like that's their job.

40:00 They're supposed to be doing that for protecting the company and protecting the privacy of the user's data.

40:04 But then on the other side, you have engineers and engineers, they probably don't want to be in these Zoom calls all the time.

40:09 And they would probably much rather interface with privacy engineering in a way that's more familiar with them.

40:14 So CI checks, command line tools, YAML files, like we mentioned.

40:18 And so we thought, okay, we need to build a tool that bridges that gap, right?

40:22 Like we need to create an overlapping set of tools that both sides will be happy with and that provide a good user experience for both sides.

40:30 And so we have Fides, right?

40:32 So Fides is primarily Python.

40:34 Pretty much everything's in Python.

40:35 We also have TypeScript for the front end.

40:37 We use a lot of other open source stuff and we're also on GitHub.

40:41 So anyone that wants to use this for themselves is totally able to, right?

40:47 Because we kind of fundamentally, I think as a privacy company, it's important to believe that privacy is a human right.

40:52 While we do have some paid features, the vast majority, like compliance, is completely available for free and open source.

41:02 We don't think we should be saying, "Hey, your privacy is really important, but only if you pay us." We think any engineer should be able to come and look at our repo and grab Fides and then start working off of it and be able to respect user privacy within their applications without having to really pay anything.

41:20 - Right, and since you brought it up, so Fides is this open source project that people can grab and fulfill and automate much of what we've been talking about here, which is awesome. - Correct.

41:31 - And your business model with Ethyca is, I guess what you would probably classify as open core, right?

41:37 Like, you sell-- - Yeah, that's how we refer to it internally. - Is that how you consider it, yeah? - Yeah, open core.

41:42 So we have, internally we call it Fides Core, right, which is this repo, which is where a lot of the work happens. If you look, you'll see most of my PRs are in there.

41:51 So Fides Core is really what we build on.

41:53 And then we have an additional offering, what we call Fides Plus.

41:57 And that is where you would get additional features that are really more like enterprise focused, right?

42:02 So if you are, like I said, we're talking about those medium to very large enterprises where you have a hundred thousand tables and maybe you want a machine learning classifier to help you figure out what kind of data is in those tables, then like that'd be a paid feature.

42:16 - But if you just want to-- - Like you kind of bootstrap it, like, "We have this data, go look at it and tell me what you think about it." Something like that?

42:22 - Yes, exactly, exactly.

42:23 So it'll walk databases, tables, fields, all that kind of stuff, and say, "Hey, this is probably this type of data, this is probably this type of data." You know, obviously as accurate as we can get it.

42:33 For the most part, things are going to be happening in open core.

42:36 So in the open core product, we are going to tackle the three major things that we think are gonna be required for any kind of privacy-first application.

42:48 So first we're going to let people, and you can see for all the YouTube viewers, video viewers on the right side, here we've got YAML files.

42:54 So YAML files are where you're gonna define kind of the primitives you wanna use for your application.

42:59 So we have like data uses, we have different data category types, and you can define systems, data sets, kind of the building blocks of how you're going to describe and define your application from a privacy perspective.
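As a rough idea of the kind of metadata declaration being described, here is a hypothetical manifest loaded and inspected from Python. The field names and taxonomy values are invented for illustration and are not Fides' actual YAML schema, which is documented in the project's repo.

```python
# Load a hypothetical privacy manifest and pull out the declared data uses.
import yaml  # pip install pyyaml

MANIFEST = """
system:
  - name: checkout-service
    privacy_declarations:
      - data_categories: [user.contact.email, user.payment]
        data_use: provide.service.order_fulfillment
        data_subjects: [customer]
dataset:
  - name: orders_db
    fields:
      - name: customer_email
        data_category: user.contact.email
"""

metadata = yaml.safe_load(MANIFEST)
declared_uses = {
    decl["data_use"]
    for system in metadata["system"]
    for decl in system["privacy_declarations"]
}
print(declared_uses)  # {'provide.service.order_fulfillment'}
```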

43:10 Once you've done that, right, once we have all this information that you've given us as metadata, you've given us about your application and your datasets, we're then able to start enforcing that automatically.

43:18 We're able to start building those data maps and telling you, "Hey, based on what you told us in your metadata annotations, this is everywhere your data lives and this is the type of data that lives there." Additionally, based on those, we're going to say, "Hey, if you give us an e-mail, we have actually an execution engine with a bunch of different connectors." We're going to say, "Okay, so you've told us you have a Postgres database here and you have a Mongo database here, and we're looking for this email and these tables are going to be where the PII is.

43:46 So it'll automatically go and execute that. It's built on top of Dask, but we're doing our own logic for a directed acyclic graph to go out and find that data and delete it in the right order, or to retrieve it in the right order and then give it back to the user who requested it.
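The ordering problem being described, where rows holding foreign keys have to be purged before the parent rows they point at, can be sketched with the standard library's topological sorter. Fides builds its own execution graph on top of Dask; this is just the idea in miniature, with invented table names and dependencies.

```python
# Compute a safe deletion order across tables with foreign key dependencies.
from graphlib import TopologicalSorter

# table -> tables that must be purged before it (because they reference it)
DELETE_BEFORE = {
    "users": {"orders", "sessions"},
    "orders": {"order_items"},
}

def erasure_order() -> list[str]:
    """Return an order where every referencing table is handled before its parent."""
    return list(TopologicalSorter(DELETE_BEFORE).static_order())

print(erasure_order())  # e.g. ['order_items', 'sessions', 'orders', 'users']
```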

44:00 So we're really leveraging this power of using metadata to go ahead and automate all these tasks.

44:06 We'll also, more and more as we go into the future, we're trying to figure out ways to just automate it completely. So if we got to a point where engineers didn't even have to write these YAML files, and we could just introspect the code and figure out programmatically what was actually going on there and what we need to be concerned about, that's kind of where we want to get to and where we see the future of privacy being. Especially with the incredible explosion of these large language models and things like ChatGPT, you know, and OpenAI, doing some kind of natural language processing to allow us to understand what the code is doing without burdening developers with writing YAML is where we hope to get to eventually as well.

44:40 That's ambitious, but five years ago, I would have said, oh, that's insane.

44:45 Anymore, you give these large language models good chunks of code, and they have a really deep understanding of what's happening.

44:52 It's right.

44:53 Yeah, so it's very impressive.

44:55 So for now, we're still in YAML land.

44:58 Hopefully engineers are pretty comfortable there.

44:59 We've been there for a while, I think with Kubernetes and all kinds of other tools.

45:03 But yeah, we just want to keep lowering the barrier to entry for privacy compliance and for, you know, building applications that are private by design.

45:13 - Right.

45:14 So in the parlance of your website, that's privacy checks as code in continuous integration.

45:19 The two other things are programmatic subject requests and automated data mapping.

45:25 It sounded like you touched on the automated data mapping, but talk about the programmatic subject requests.

45:29 - Yeah, so the programmatic subject request is what I mentioned briefly about kind of how we build an execution graph for when those data requests come in.

45:38 Again, like I said, we have that metadata, we know where your data lives and what type of data it is.

45:42 When a user says, "Hey, here's my e-mail, please get rid of all the information you have about me," we're able to do that subject request programmatically, because we know, okay, we're going to reach out to the Salesforce API, we're going to reach out to this Postgres database, users table, where we know that data lives.

45:57 We're going to do that for you automatically, because like I mentioned before, there are plenty of relatively large and relatively small enterprises where there is someone on staff full time waiting for these emails to come in. And then they say, "Okay, this email needs to get deleted." And I've done this before, so I'm not above this. When I was a data engineer, we had to do this as well in Snowflake, right? You know, something comes in and they say, "Okay, I've got this email. Now I need to go to these 20 systems and run all these manual scripts and hope that I don't do it in the wrong order." Because like you said, if there are foreign key constraints, you need to know about that, because if you do it in the wrong order, it's going to mess things up. So we basically handle that for you based on the metadata.

46:34 Cannot complete transaction. There's a foreign key constraint violation. Sorry.

46:38 Correct. Right. It's like knowing that and being able to figure that out, that stuff is important. So we will handle that. And then, like I mentioned with the data mapping.

46:46 So this is really, really important for compliance professionals, because this is kind of like their bread and butter. Like they have to be able to produce these data maps to show compliance with GDPR. And that's going to show all the systems, you know, within their application, within their enterprise, and then what those systems are doing and with what kind of data.

47:04 And that's really, really important. And again, we can generate all this based off of the YAML file.

47:08 And for engineers, the thing that we have is what we call privacy checks as code in CI, where we're shifting privacy left, kind of the same way that we saw with security, right? Where you went from, okay, we push the application out, and now a security team is just going to play with the production version and figure out where there are problems.

47:27 Right.

47:28 We're going to disregard security and give it to you.

47:30 And then you tell us how you broke it.

47:31 Yeah, basically.

47:32 Yeah.

47:33 But that's how a lot of companies treat privacy now too is like, we'll kind of figure it out on production, right?

47:37 Like ship it, we'll figure it out on production.

47:41 And now you see, oh, actually, there are really great static analysis tools for code, right?

47:46 You have Snyk, you have, you know, various other open source versions that are like, we're going to scan your code before you even commit.

47:53 Are you leaking secrets?

47:55 Have you stored anything that maybe you shouldn't be?

47:57 We're trying to do that for privacy as well.

48:00 We're shifting privacy left, and based on this metadata and the policies you defined, we can say, "Hey, you've added this new feature and you've annotated it in YAML, and now you're stating that this system is using user data for third-party advertising, so we're going to fail your CI check."

48:17 We're going to throw an error and say, "Hey, there's a violation here of this privacy policy that your company has, because you've defined that you're using user data in this way, and that's going to break the policy." So again, that's just a way to short circuit that whole thing of, okay, the engineers have shipped it, and now someone comes running back and says, "Hey, hey, why did you ship this?

48:37 You're using personal data in a way you're not supposed to be." We're trying to kind of get around that by saying, pretty early on, we'll know what's going on, and we can avoid pushes to main or deploys that don't pass the CI checks.
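As a rough sketch of what that kind of CI gate boils down to (this isn't the real Fides policy engine or its YAML schema; the data-use strings and the hard-coded declarations are purely illustrative), a check can load each system's declared data uses, compare them against uses the company's policy forbids, and exit non-zero so the pipeline fails:

```python
import sys

# Illustrative declarations -- in practice these would be parsed from the
# project's privacy annotation files, not hard-coded.
declared_uses = {
    "recommendation_service": {"personalization", "analytics"},
    "ads_service": {"third_party_advertising"},  # newly added in this PR
}

# Data uses the company's privacy policy does not allow for user data.
forbidden_uses = {"third_party_advertising"}

def check_policy() -> int:
    violations = [
        (system, use)
        for system, uses in declared_uses.items()
        for use in uses & forbidden_uses
    ]
    for system, use in violations:
        print(f"POLICY VIOLATION: {system} declares data use '{use}'")
    return 1 if violations else 0  # non-zero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(check_policy())
```

Wired into a GitHub Action or any other CI step, a non-zero exit blocks the merge, which is the "fail the check before it ships" behavior being described here.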

48:51 - Okay, so as an engineer writing software in a system guarded by this, I have to be proactive and state how I'm using data if I'm bringing new data into the system, or does it somehow get discovered?

49:05 - No, so currently what's required is that it's up to the engineers to maintain that YAML.

49:12 So we're working on ways to automate that.

49:14 We actually have automated it for datasets because it's obviously much more programmatic.

49:18 If you say, "Hey, here's my application database, here's my Postgres database, and I've annotated every field and all that stuff." We can automatically scan that and we say, "Hey, in your YAML definition, you've left out these two columns," which maybe you added in this PR.

49:36 Before that PR goes in, it's going to remind you, "Hey, you need to add these two new columns to your dataset.yaml file so that we know what's going on in those new database columns."
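A simplified sketch of that dataset drift check: list the live columns with SQLAlchemy's inspector, compare them against the columns annotated in a dataset.yaml, and report anything missing. The YAML structure assumed here (tables -> table name -> columns list) is made up for the example and is not Fides' actual dataset format.

```python
from sqlalchemy import create_engine, inspect
import yaml  # PyYAML

def annotated_columns(dataset_yaml_path: str, table: str) -> set[str]:
    # Assumes an illustrative layout: {"tables": {"users": {"columns": [...]}}}
    with open(dataset_yaml_path) as f:
        dataset = yaml.safe_load(f)
    return set(dataset["tables"][table]["columns"])

def live_columns(database_url: str, table: str) -> set[str]:
    engine = create_engine(database_url)
    return {col["name"] for col in inspect(engine).get_columns(table)}

def missing_annotations(database_url: str, dataset_yaml_path: str, table: str) -> set[str]:
    """Columns that exist in the database but have no privacy annotation yet."""
    return live_columns(database_url, table) - annotated_columns(dataset_yaml_path, table)

missing = missing_annotations("postgresql://localhost/app", "dataset.yaml", "users")
if missing:
    print(f"Add these columns to dataset.yaml before merging: {sorted(missing)}")
```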

49:45 - Okay, I can see how you might do a lot of, sorry, I can see how you might do a lot of introspection, like, oh, I'm using SQLModel and here's the Pydantic thing describing the table and here's the two new things. But then you could also traverse the usages of those two columns and see where they're used elsewhere and possibly, is there any API call that that data is being passed to, for example?

50:09 And like, oh, is it coming out of--

50:11 - Exactly.

50:12 All right, you might be able to find the common integrations and see what's happening there.

50:15 - Yeah, that's exactly what we're looking at next.

50:18 Like I said, looking for ways to automate this, right?

50:21 So even if we just had a really basic dictionary of like, hey, these APIs are gonna be related to storing or sending user data, right?

50:29 And making sure that's annotated.

50:30 And like you said, maybe some of those code level annotations we're using, like if you think about Pydantic, right?

50:36 Where you can have the field object and you can define, okay, here's a field, here's the default, here's the description.

50:43 And then if there was another field that was like, privacy category, data category, whatever, data use or data subject, something like that, that's absolutely something we've been looking at as well of kind of like the next step, because we know, again, this comes down to, this is still partially manual and therefore still potentially error prone.
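Just to sketch what those in-code annotations could look like, here's one way to hang privacy metadata off Pydantic fields with typing.Annotated. To be clear, the Privacy marker and the category strings are hypothetical, not an existing Fides or Pydantic feature; the point is only that a scanner could read this metadata back out of the model.

```python
from dataclasses import dataclass
from typing import Annotated

from pydantic import BaseModel, Field

@dataclass(frozen=True)
class Privacy:
    """Hypothetical marker carrying privacy metadata alongside a field's type."""
    data_category: str
    data_use: str | None = None

class User(BaseModel):
    id: int
    email: Annotated[str, Privacy("user.contact.email", data_use="essential.service")] = Field(
        description="Primary contact address"
    )
    last_login_ip: Annotated[str, Privacy("user.device.ip_address")] = Field(
        description="Captured at login"
    )

def collect_privacy_metadata(model: type[BaseModel]) -> dict[str, Privacy]:
    """What a scanner might extract from the model (Pydantic v2 API)."""
    found = {}
    for name, field in model.model_fields.items():
        for meta in field.metadata:
            if isinstance(meta, Privacy):
                found[name] = meta
    return found

print(collect_privacy_metadata(User))
```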

51:01 So as long as we're scanning databases, that gives us some guarantee that, okay, there's probably not gonna be an entirely new thing that we miss out on.

51:09 But if it's something like sending data to third-party APIs, it would still be possible to miss.

51:13 So that's kind of the holy grail we're trying to get to is, how do we just make this even easier for developers?

51:17 How do we lower the barrier even further?

51:19 Because we know this is still somewhat of a barrier to entry, but also hopefully still a huge step up from nothing.

51:26 - Yeah, no, it's great.

51:27 And if your job is to put together a system that can explain how it's using data and how that's enforced and how you're checking that, then something like this seems way better than code review.

51:39 Not instead of, but it's certainly a huge bonus.

51:42 - Yeah, correct, in addition to.

51:44 - Yeah, so what we discussed here, this part feels like a GitHub Action type of plugin as part of a CI step or something along those lines.

51:54 What about the other one?

51:55 So for example, the programmatic subject request that's gonna cruise through the data and pull out the things, either show people what they got 'cause they asked for it or delete it.

52:04 Is that, like, how does that run?

52:07 Is that something you plug into your app?

52:09 Is that a service that just has access to your data?

52:12 - Yeah, so that would be something, so we actually do have a hosted version of Fides, right?

52:17 So for companies that don't want to bother kind of hosting their own, you know, a database and web server and things of that nature, we host it.

52:25 But all this is stuff that you could self host as well.

52:27 Like you can deploy in your own instance.

52:29 So we have something called the Privacy Center, which again is a thing that you spin up and run on your side, and that is what would direct the privacy requests to your backend Fides web server instance.

52:43 And then that's where you would go, we call it the admin UI, right?

52:47 Like the admin or the privacy professional would log in, and they would see, oh, these are all the requests that have come in.

52:51 I can approve these, I can deny these, et cetera, et cetera, based on what's going on there.

52:56 Yeah.

52:57 So the pre-deployment part would be the checks in CI and writing all those YAML files.

53:02 And then once you deploy the application, we have these runtime tools, like you said, these subject requests, and that's all stuff you would deploy.

53:09 Most people would do it using Docker containers that we build and publish, or you could just download it and run it directly with pip install.

53:17 - Yeah, okay, interesting.

53:18 So basically there's a web app that you log into, and you can either self-host it or have it hosted for you, and then it goes and does its magic at that part.

53:28 What about the data mapping bit?

53:30 - Yeah, and the data mapping is something where you can either log into the web portal, because again, we assume that most privacy professionals, most of whom have legal backgrounds, are probably not going to want to mess with a command line tool.

53:43 Although we do have a command line tool, right?

53:44 For the engineers.

53:45 So if you're an engineer, you can of course run a command that will give you the properly formatted Excel document, or CSV if you want it, with all of the rows in there that you need. Or again, like I said, if you're a privacy professional, you can log in with the UI and download it that way.

54:03 And just have it generate it.

54:04 It'll hit an endpoint and then it will just give you the file back.

54:07 - Okay, seems really neat.

54:09 You talked a bit about this, like automated checks that look at the code to try to reduce the need for explicitly stating how things work.

54:18 And you've got this ML higher order piece that will hunt down the private information.

54:24 Where else are you, what's the roadmap beyond those things?

54:27 Where is it going?

54:29 Just as much as you can talk.

54:31 Yeah, yeah.

54:31 I mean, just more and more automation, right?

54:33 Because again, if we look at the ultimate goal is how do we just make privacy easier?

54:39 Because it's not going away.

54:40 It's only going to get more stringent.

54:42 The fines are only going to get bigger.

54:44 I think it's interesting that that kind of GDPR fine list ended in May.

54:48 Was it May or something?

54:49 March 2022.

54:50 Since then, fines have actually only been getting bigger.

54:53 They're basically making fines larger and larger and larger because they realize that tech companies just don't care for the most part.

54:57 And so it's becoming more and more dangerous for people that aren't compliant.

55:00 So, okay, how do we just make this as easy as possible?

55:03 We know that there are people that probably don't want to maintain a bunch of YAML.

55:06 So it's really just anything along those lines of how do we lower the barrier to entry for people that want to be privacy compliant and use Fides.

55:14 So like you said, that's going to take the form of probably more machine learning models, NLP models that are going to help us introspect the code.

55:22 It could be things like the in code annotations, right?

55:22 where like, okay, maybe there's now, in Pydantic, a field to add privacy information, or maybe in dbt, the analytics tool we talked about before, you know, it's used by a lot of companies.

55:36 Maybe we add metadata in there where people can now define what PII is being used, and then we can just kind of natively read from those files instead of having to have our own file format, things of that nature.

55:45 Really, we're just looking at any possible place and getting feedback from people of where their pain points are and how we can help solve them.

55:52 - Sure.

55:53 Do you have any runtime things that you're thinking about?

55:56 You've got the deploy stuff with the YAML.

55:58 You've got the on request stuff with the other things, but we saw this JSON document come in.

56:06 Does it have a field called email?

56:06 We don't know.

56:07 - Yeah, so interesting you ask that.

56:09 So we have done some research and some proof of concepts.

56:13 I don't know if you're familiar with something called eBPF.

56:16 It is a way, yeah, eBPF.

56:18 So it's a way, I believe it's in the Linux kernel, to actually monitor the traffic going back and forth, basically, over the network.

56:29 What we've been able to do is, if you have an application running in Kubernetes, then we'll deploy something called a system scanner.

56:37 It is a runtime thing, and what it's doing is it'll actually watch the traffic that's happening across your application in Kubernetes, and it will come back to you and say, "Hey, you've got these systems.

56:47 They're talking to each other in this way," and basically build a map automatically.

56:51 This is a really useful tool if you're already running everything and you just want someone to tell you what you have running and kind of build the topography for you, tell you all your systems, tell you all your datasets, tell you what the traffic looks like across your application. Then yes, we are working on a tool that can do this.

57:08 - Okay, yeah, this eBPF at ebpf.io looks nuts.

57:12 Dynamically program the Linux kernel for efficient networking, observability, tracing, and security.

57:17 And yeah, you could just say things like, these are all the outbound connections all of our systems are making.

57:23 And these IP addresses resolve to, we know this one is MailChimp, we know that one's Zendesk.

57:28 What's this one?

57:29 Correct.

57:29 Yeah, exactly.

57:30 So it's like, okay, we see that we're making a bunch of calls to MailChimp, but you've never mentioned in your metadata that you're talking to MailChimp.

57:37 We now know that that's something we need to worry about, right?
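The eBPF capture itself happens down at the kernel, but the comparison step being described can be as simple as this sketch: resolve observed outbound addresses to known vendors and flag anything that isn't declared in the metadata. The vendor map, declared system list, and IPs here are all placeholders for illustration.

```python
import socket

# Vendors we can recognize by reverse-DNS hostname suffix (illustrative only).
KNOWN_VENDORS = {
    "mailchimp.com": "MailChimp",
    "zendesk.com": "Zendesk",
}

# Systems the team has declared in its privacy metadata.
declared_systems = {"Zendesk", "Postgres", "Salesforce"}

def vendor_for(ip: str) -> str:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return ip  # unresolvable; report the raw address
    for suffix, vendor in KNOWN_VENDORS.items():
        if hostname.endswith(suffix):
            return vendor
    return hostname

def undeclared_destinations(observed_ips: set[str]) -> set[str]:
    """Outbound destinations seen on the wire but missing from the declared metadata."""
    return {vendor_for(ip) for ip in observed_ips} - declared_systems

# Example input: addresses the runtime scanner observed (placeholder values).
print(undeclared_destinations({"203.0.113.10", "198.51.100.7"}))
```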

57:40 So this is also something where we have a little bit more.

57:42 This is a little bit of what I was thinking about with the runtime stuff.

57:45 Like, can you watch the data flow and go, okay, this is inconsistent with what you've told me is happening.

57:51 Yes, exactly.

57:52 So the way it works now is you would deploy our system scanner into your environment, into your Kubernetes environment.

57:58 It just sits off as its own set of nodes and does its thing, and then after a certain amount of time, it'll come back to you with, hey, this is what we saw, and it'll actually build the YAML files for you.

58:09 - Oh my goodness, okay.

58:10 - And say this is kind of the definition of what we saw.

58:12 - Very wild.

58:13 All right, that's cool.

58:14 That's cool.

58:16 So we're getting real short on time here.

58:19 I guess maybe give us just the quick report or thoughts on how the open core business model is going for you all.

58:28 I feel like, going through the life cycle of this podcast, we just had the eight-year anniversary of running the podcast.

58:35 I've talked to a lot of people and yeah, thank you very much.

58:38 Eight years ago, it was a lot of, ah, we accept donations. And mostly, to me, it looked like the success story for open source, at least traditionally at that point, had been: I'm a maintainer or major participant of this important library.

58:56 So I got a really good paying job at a high-end tech company, and the high-end tech company gives me 20% time to work back on it.

59:03 Kind of like I've got a benefactor-like employer a little bit.

59:09 And it's moving more and more towards open core and other things.

59:12 I'd say both GitHub sponsors and open core, and then, incredibly, some really interesting VC stuff that's happening as well.

59:19 Those are not disjointed necessarily.

59:22 Right.

59:22 Anyway, long story short, I think with the open core stuff, people have really hit on something that seems like it's kind of working for sustaining open source.

59:31 Yeah.

59:31 I think we've seen a very similar thing.

59:35 I think that it is really hard to walk that line, but where we've come down on it is, with the belief that privacy is a human right, all the tools required to be compliant need to be open source, kind of like full stop. So for instance, data mapping, that is a legal requirement. We're not going to put that behind a paywall. Handling those privacy requests, building those graphs and being able to go execute those, that is a requirement. You have to have that to be compliant, so we're not going to paywall that.

01:00:03 That's the line that we're trying to ride.

01:00:05 Is it, okay, anything that makes it super easy, the machine learning stuff, the kind of runtime scanning, all that kind of more advanced stuff that's more cutting edge, more R&D, then we'll probably put that on the paid offerings, right?

01:00:19 But anything required to just get things running, because we know we actually have some very, very large companies that are using Fides purely open source.

01:00:26 They have zero contact with us.

01:00:28 They don't really have any contact with us, but we know they're using Fides regularly to do their internal privacy work. And I think that's a good place to be, right? If three years ago, when you had to implement GDPR stuff for talkpython.fm, you had stumbled across this tool that could just help you do that, and instead of two weeks it was two or three days, for me, that still feels like a huge win. And there are still enough enterprise customers who have 100,000 Postgres tables that they want us to help them classify, right? There are still enough of those people out there to make it sustainable. And it's also a competitive advantage. It's easy for people to just think, "Oh, because it's open source, you're kind of throwing away contracts you would have had otherwise." We've had a lot of people engage with us because we're open source, right?

01:01:14 Sure.

01:01:15 And it's not even that they don't want to pay. They're just saying, "Hey, we like your paid offering, but the fact that you also are open source and you have an open core model we can go look at and contribute to and put ideas into, that's really attractive for a lot of engineering teams.

01:01:27 Because again, that is one of our target markets.

01:01:29 - Yeah, I think traditionally, looking back quite a ways, there was this idea of: I need a company and private commercial software behind this.

01:01:36 So I have a SLA and someone to sue when things go wrong.

01:01:39 And that's still, you might be able to do that.

01:01:41 You might be able to crush that company 'cause they did you wrong, but you're still not gonna get your bug fixed and your software working any better because of it, right?

01:01:48 - Right.

01:01:49 - And I think people are starting to see open source as the escape hatch, right?

01:01:53 Like I can work with a company that has this premium offering on top of open source.

01:01:56 But if worse comes to worst, we'll just fork the repo and keep on living.

01:02:00 You know what I mean?

01:02:01 And that's a way more, I think, useful and constructive way than we're gonna try to sue them to make them do our deal, right?

01:02:08 - Yeah, exactly.

01:02:09 You were talking about privacy engineering as being about risk mitigation, right?

01:02:13 Having an open source product is also quite a bit of risk mitigation.

01:02:16 You know, it gives engineers time to get comfortable with it.

01:02:18 Like you said, they can fork it.

01:02:19 They can, you know, we're very open to pull requests and people opening issues and feature requests and things of that nature.

01:02:27 So it just really makes it a much more pleasant process when you're just more transparent and more open, people can go and look at the code themselves.

01:02:35 They can, if they report a bug, they can probably go see it getting fixed in real time, et cetera, things like that.

01:02:41 I think it's definitely the way that I prefer working as an engineer.

01:02:44 You know, I can't speak for management or anything, but it's definitely more fun to be able to engage with the community in that way.

01:02:49 - Awesome, well, I'm happy to see you all making your way in the world based on your open source product.

01:02:56 That's really cool.

01:02:57 So I think we're about out of time to keep going on the subject.

01:03:00 So we'll have to just wrap it up with the final two questions here.

01:03:04 So if you're gonna write some code, work on Fides or something else Python based, what editor are you using these days?

01:03:11 - It's VS Code.

01:03:12 I was pretty hardcore Vim until, I think at one point, you had someone on your podcast who was evangelizing VS Code, and I went and tried it, and I'm very obviously more productive now.

01:03:23 So I've stuck with VS Code ever since then.

01:03:26 That was great.

01:03:27 Thank you for doing that.

01:03:29 - Yeah, you bet.

01:03:30 And a notable PyPI package.

01:03:32 - Yeah, I'm gonna give two here pretty quickly.

01:03:35 So the first one is Nox.

01:03:37 I think I also learned about it from this podcast.

01:03:39 This is, I do talkpython.fm-driven development.

01:03:43 Anyway, so we were using, you know, Makefiles before, 'cause I do a lot of work on the kind of dev experience, DevOps side.

01:03:51 And so I was a big Makefile user.

01:03:53 And when I basically heard, hey, it's like Make but more powerful and in Python, I immediately went and tried it out.

01:03:59 And I think I spent like the next three days just completely rewriting our Makefile in Nox.

01:04:04 It has been great.

01:04:05 It has been so empowering for the other developers.

01:04:08 Whereas before Makefile was pretty archaic and it was pretty much only myself that would touch it.

01:04:12 Now the other developers feel very comfortable jumping into Nox, and it's also cross-platform.

01:04:18 I develop on Windows and literally everyone else at the company develops on Macs.

01:04:22 So it gives us a much more cross-platform way to handle scripting and building and things like that.

01:04:28 - Yeah, make files are quite different over on the Windows side.

01:04:32 - Yes, yes, I learned this the hard way.

01:04:34 It was very often there was like, it works on my machine type of stuff going on there.
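For anyone who hasn't tried Nox, a minimal noxfile.py looks something like this; the session names and commands are generic examples rather than anything from the Fides repo.

```python
import nox

@nox.session(python=["3.10", "3.11"])
def tests(session: nox.Session) -> None:
    """Run the test suite in an isolated virtualenv per Python version."""
    session.install("pytest", "pytest-cov")
    session.install("-e", ".")
    session.run("pytest", "--cov")

@nox.session
def lint(session: nox.Session) -> None:
    """Static checks -- the kind of target that used to live in the Makefile."""
    session.install("ruff")
    session.run("ruff", "check", ".")
```

Then nox -s tests or nox -s lint runs the same on Windows, macOS, and Linux, which is the cross-platform win mentioned above.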

01:04:40 And then finally, just another notable package is rich-click.

01:04:44 And that is, if people don't know, Rich is a package in Python that makes text output, like terminal output, look very, very nice.

01:04:52 And rich-click is a nice wrapper on Click that makes Click look very, very nice, because it's wrapped in Rich.

01:04:58 So our CLI uses this.

01:05:01 I think it looks great because of it.

01:05:03 So I'd also highly recommend that.

01:05:05 If you're looking for a more modern-feeling CLI with a lot of flexibility and formatability.

01:05:11 - So it gives you kind of automated colorized help text and usage text and, like, options.

01:05:19 - Yeah, and you can customize that.

01:05:21 A really, really powerful thing also is that it will understand markdown in your doc strings, right?

01:05:28 So if you want to get a little more fancy, it also has kind of its own language that you can use.
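A tiny example of the drop-in pattern: import rich_click in place of click and the existing decorators keep working, with --help rendered by Rich. The command itself is made up, and rendering Markdown in help text is an opt-in setting covered in rich-click's docs.

```python
import rich_click as click  # drop-in replacement for click

@click.group()
def cli() -> None:
    """Example CLI whose --help output is rendered with Rich."""

@cli.command()
@click.option("--format", "fmt", default="csv", show_default=True,
              help="Output format for the export report.")
def export(fmt: str) -> None:
    """Export a report in the requested format."""
    click.echo(f"Exporting as {fmt}")

if __name__ == "__main__":
    cli()
```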

01:05:34 So yeah, it's been really nice 'cause I was feeling a little bit jealous.

01:05:38 Our UI looks really nice and our CLI didn't.

01:05:40 And so I went and wanted to find something that would make the CLI look a little bit nicer, probably for engineers.

01:05:44 It seems like, ah, that's kind of superficial.

01:05:47 Like who cares color for your silly CLI?

01:05:51 It makes a big difference.

01:05:52 Like the information bandwidth is so much higher.

01:05:55 It's high, yeah.

01:05:56 - Yeah, it really does make a huge difference.

01:05:58 - Awesome.

01:05:59 Well, I think we're gonna leave it here with this, but Thomas, thanks for being on the show.

01:06:03 It's been really great to have you here and a lot of fun to talk privacy.

01:06:07 Yeah, good luck with the project.

01:06:08 It looks great.

01:06:09 - Thank you, talk to you later.

01:06:10 - Yeah, see you later.

01:06:13 This has been another episode of Talk Python to Me.

01:06:16 Thank you to our sponsors.

01:06:17 Be sure to check out what they're offering.

01:06:19 It really helps support the show.

01:06:21 Don't miss out on the opportunity to level up your startup game with Microsoft for Startups Founders Hub.

01:06:25 Get over six figures in benefits, including Azure credits and access to OpenAI's APIs.

01:06:30 Apply now at talkpython.fm/foundershub.

01:06:34 Take some stress out of your life.

01:06:35 Get notified immediately about errors and performance issues in your web or mobile applications with Sentry.

01:06:41 just visit talkpython.fm/sentry and get started for free.

01:06:46 And be sure to use the promo code, talkpython, all one word.

01:06:50 Want to level up your Python?

01:06:51 We have one of the largest catalogs of Python video courses over at Talk Python.

01:06:55 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:07:01 And best of all, there's not a subscription in sight.

01:07:03 Check it out for yourself at training.talkpython.fm.

01:07:06 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

01:07:11 We should be right at the top.

01:07:12 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the Direct RSS feed at /rss on talkpython.fm.

01:07:21 We're live streaming most of our recordings these days.

01:07:25 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:07:33 This is your host, Michael Kennedy.

01:07:34 Thanks so much for listening.

01:07:36 I really appreciate it.

01:07:37 Now get out there and write some Python code.

01:07:39 [Music]

