#409: Privacy as Code with Fides Transcript

Recorded on Thursday, Mar 23, 2023.

00:00 We all know that privacy regulations are getting more strict and that many of our users no longer believe that privacy is dead.

00:06 But for even medium-sized organizations, actually tracking how we are using personal information in our myriad of applications and services is very tricky and error-prone.

00:17 On this episode, we have Thomas LaPiana from the Fides project here to discuss privacy in our applications and how the Fides project can enforce and track privacy requirements in your Python applications.

00:29 This is Talk Python to Me, episode 409, recorded March 23rd, 2023.

00:34 Welcome to Talk Python to Me, a weekly podcast on Python.

00:51 This is your host, Michael Kennedy.

00:52 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both on fosstodon.org.

01:00 Be careful with impersonating accounts on other instances.

01:02 There are many.

01:03 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

01:09 We've started streaming most of our episodes live on YouTube.

01:12 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:20 This episode is sponsored by Microsoft for Startups Founders Hub.

01:25 Check them out at talkpython.fm/foundershub to get early support for your startup.

01:31 And it's brought to you by Sentry.

01:33 Don't let those errors go unnoticed.

01:35 Use Sentry.

01:36 Get started at talkpython.fm/sentry.

01:39 Thomas, welcome to Talk Python to Me.

01:42 Hey, thank you so much for having me.

01:43 Yeah, it's great to have you here.

01:44 I'm excited to talk about privacy.

01:47 I feel like there was this period where everyone just gave up and decided privacy doesn't matter,

01:53 either because it was a good tradeoff for them at the time or they decided it was, you know,

01:58 trying to push a rock up a hill that was never going to make it to the top.

02:02 And so I just don't stress about it.

02:03 But I feel, you know, like things are coming back a little bit and, you know, we all get to be semi-autonomous beings again.

02:09 Yeah, there's definitely been that feeling that, and I think actually it a little bit mirrors the way things are going with AI now, right?

02:16 Where people feel like the genie's out of the bottle.

02:18 How do we put it back?

02:19 But I think we've actually seen that happen successfully with privacy where there was a long time when, you know,

02:24 you would talk to your parents about, hey, maybe don't use Facebook.

02:26 I know this happens to me at least, right?

02:28 Personal anecdotes.

02:29 So, hey, maybe don't use Facebook.

02:30 They sell your data.

02:31 And the response was always like, well, who cares?

02:33 You know, I'm not doing anything bad anyway.

02:36 Why does it matter?

02:36 And I think we've seen very much a reversion to, hey, actually, maybe I don't want my insurance company to know everything about me and my family's medical history type of thing.

02:46 And people are starting to care about it again.

02:47 And somehow we're getting that genie back in the bottle, which is great.

02:51 The internet used to be, it's this thing on the side.

02:54 It was like a hobby or something you were interested in.

02:57 Like, oh, I'll go on the internet and I'll read some user forums or I'll search for some interesting thing that I might be interested in.

03:04 And now it's become all encompassing, right?

03:06 Tech and everything else is interwoven so much that I think people are starting to realize, like, oh, if all these companies can buy, sell, and exchange too much information about me, then that might actually have a real effect in my regular life, my day-to-day life.

03:24 It's not just like, oh, I get weird ads on my hobby time off that I fiddle with the screen.

03:29 Like, no, this is everything, right?

03:31 And so we're going to talk a little bit about the laws and the rules that are coming into place, a little bit of these changes.

03:38 But mostly some platforms that you all are creating to allow companies, especially large companies with complex data intermingling, to abide by these laws and be good citizens of this new world that we're talking about.

03:54 Yeah, absolutely.

03:54 And I think that's another thing we've seen as part of this shift of consumers caring about privacy is you also have individual engineers or individual contributors or managers or people within the organizations that regardless of what laws may require them to do,

04:07 they also do care about building privacy-respecting software, just as the right thing to do.

04:12 And I think we've, yeah, we've seen kind of a general trend in that as well.

04:15 So that's been good to see.

04:16 Yeah.

04:17 Well, I'm looking forward to exploring the ideas and then the platform as well.

04:21 Before we get to that, though, let's start with your story.

04:23 How did you get into programming, Python, privacy, all these things?

04:27 So I actually studied politics in college, but my best friend was a computer science major.

04:33 And when I found out that in college, he was already freelancing, working at home and making way more money than I did in my part time job, I was like, hold on.

04:41 I think this computer science thing might have a future.

04:44 So I was just kind of self taught.

04:46 And I ended up doing some data.

04:49 I got a data intelligence job right out of school, despite having zero relevant experience or knowledge.

04:53 And I was told, like a week or two in, I was working on this case that we had, and I had to pull stuff from an API and put it in a database.

05:02 And I had basically never really written a line of code before.

05:05 And I somehow ended up on Python and somehow ended up with, you know, MySQL, and I made it work.

05:10 And just from there, just fell in love with the, really the problem solving aspect of coding and just creating value from basically nothing.

05:17 Right.

05:18 Yeah.

05:18 I can just see how that search goes.

05:20 You know, how do I easily pull data from API?

05:23 Well, use requests in Python.

05:25 I know.

05:25 Okay.

05:25 Let's give this a try.

05:26 It was probably something similar.

05:28 I think, I think I had, I had a friend, you know, in a Slack team at the time that was in, you know, into Python or something.

05:35 And I think I just ended up on Python and it's been one of the best accidents of my life.

05:39 Now, you know, however many years later, still working with Python daily.

05:42 Yeah.

05:43 Excellent.

05:43 And now what are you doing for a job?

05:46 Yeah.

05:46 So I'm working at a company called Ethyca.

05:48 So we focus on privacy tooling for engineers specifically, or in a broader sense these days, working on privacy tools in general, that can be kind of a meeting point between engineers and compliance professionals.

06:03 Right.

06:03 So like the compliance team, lawyers, things of that nature at your company, trying to build a common ground for them to kind of build off of and work together.

06:10 Excellent.

06:11 It sounds really fun.

06:12 And you've got a cool platform that you all have open sourced.

06:16 And we're going to talk about that in a minute, but let's keep it at high level for a moment.

06:21 I talked about the swinging pendulum where it went to like YOLO.

06:26 I don't care.

06:27 Internet's fun.

06:28 It's free.

06:28 It doesn't matter to you.

06:29 Oh my gosh, it matters.

06:30 I want my privacy back.

06:32 I can't believe people are doing X, Y, and Z and not just showing me ads with this.

06:36 So we got the GDPR, obviously that made such a huge splash that made me personally scramble to match, to meet the requirements.

06:46 And what I think is really interesting about that is those kinds of laws, and maybe I could get your thoughts on this.

06:52 I think it's a bit of a problem or a challenge.

06:56 And these kinds of laws, you can just see through the veil.

06:59 Like, okay, it talks about internet company or something, but what they mean is Facebook, Google.

07:04 You know, like there are five huge companies or something in the world, most of them on the West Coast of the U.S.

07:12 that are like bullseye.

07:13 The sights are on them, and these laws are trying to apply to them.

07:17 Yes.

07:18 In general, outside of it too, but like it's those five or whatever that really, you know, were the catalyst for this.

07:25 Whereas, you know, small companies like me are like, oh, well, I have to have a recorded history of an opt-in, not just an opt-in in principle, but I need a date and a time and I need a record.

07:37 So I got to go rewrite my systems so that I can have a record of opt-in for communications if I have it.
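
The "date and a time and a record" requirement described here could be sketched roughly as follows. This is a minimal illustration only, not how Talk Python's systems or any particular library do it; the `ConsentRecord` class, its field names, and the in-memory list used as a log are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentRecord:
    """One auditable opt-in event (all field names are hypothetical)."""
    email: str
    purpose: str           # e.g. "newsletter"
    granted: bool
    recorded_at: datetime  # timezone-aware UTC timestamp
    source: str            # where the opt-in happened, e.g. a form path


def record_opt_in(log: list, email: str, purpose: str, source: str) -> ConsentRecord:
    """Append a timestamped opt-in to an append-only log and return it."""
    record = ConsentRecord(
        email=email.strip().lower(),
        purpose=purpose,
        granted=True,
        recorded_at=datetime.now(timezone.utc),
        source=source,
    )
    log.append(record)
    return record


def has_consent(log: list, email: str, purpose: str) -> bool:
    """The most recent record for this email and purpose decides."""
    email = email.strip().lower()
    for record in reversed(log):
        if record.email == email and record.purpose == purpose:
            return record.granted
    return False
```

The point of keeping the timestamp and source alongside the yes/no flag is that the record can later be shown as evidence of when and where the opt-in happened; a real system would persist this in a database rather than a Python list.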

07:43 And when there's one or two of you at a company, that shuts the company down for weeks.

07:49 When there's 10 people at Google that have got to stop and go do that, Google just keeps going.

07:53 Right.

07:54 And so there's this tension of, I think, unintended harm that can come from these by asking for a blanket equal compliance when really what the laws are written for,

08:08 and what people have in mind, are these mega trillion-dollar companies that have unlimited money to solve these problems.

08:15 How do you see that in the world?

08:17 Yeah, it's been.

08:18 So I think specifically with the CCPA, which is the California Consumer Privacy Act, they kind of notice that.

08:25 And there is actually a cutoff, I believe.

08:27 I want to say something like revenue or valuation under 50 million.

08:30 And so there is kind of a safety clause for smaller businesses, because like you said, when GDPR came in and it just went after everyone, right, irrespective of size or resources.

08:41 And it was actually more of a punishment for smaller companies, because like you said, if they come for you, Michael, and they say, hey, Talk Python, you know, you're doing great.

08:51 You've got all this data, all these people are buying courses, but you're not keeping track of consent or whatever.

08:55 And like, you know, you're a one- or two-person team.

08:57 You know, now you've got to stop for weeks.

08:59 And Google and Facebook, they have the privilege and the ability to, yeah, they're going to hire privacy engineers and they're going to try to do things the right way.

09:06 But if they get fined a few hundred million, it's just the cost of doing business, right?

09:09 They are calculating these fines as part of their, you know, annual spend.

09:14 And that's just how they do it.

09:16 Whereas for you, that could be, you know, something that ends your business or puts you in hot water or does something else, right?

09:21 Or maybe, or maybe just, you don't even want to do business in the EU anymore because it's too much of a hassle compared to what it was before.

09:27 So I think, yes, I think GDPR really misstepped there and it did end up punishing a lot of smaller businesses.

09:33 But I think they've learned from that.

09:35 They're, they're trying to iterate on it.

09:36 I think cookie consent is a big one.

09:38 They're now kind of revisiting that and saying, hold on, did we implement this the right way?

09:42 Did everyone implement this the right way?

09:44 And I think the CCPA is building on that in a really good way.

09:47 And we can see there's also a lot of shared language.

09:49 So I think that even though GDPR was, it was disruptive and it probably hurt the people that it didn't mean to hurt, at least initially, that it was still a pretty good base for us to work off of just in terms of a general privacy framework.

10:01 And we'll hopefully get us to a place that's a little bit more equitable in terms of who's being punished and who's actually dangerous.

10:07 And like you said, it was like a few outliers that brought about this requirement, right?

10:12 It was, it was Facebook acquiring WhatsApp and then doing uncool things with, with access to both data.

10:16 So it was things like that, that this was designed to stop.

10:19 I think slowly we're getting to a place where it's being wielded more in that, in that vein.

10:23 Yeah.

10:23 The U.S. Supreme Court has not ruled yet on Section 230, but who knows what that's going to unleash.

10:28 Yeah.

10:29 That's a whole nother topic.

10:30 We're not on that one, but you know, I mean, there's still, there's still large waves like that could crest and crash and whatnot.

10:37 I do want to come out and say explicitly, I'm not against the GDPR and I'm not unhappy that I changed my systems around to comply to it.

10:46 Like it's really important to me.

10:48 I've talked to advertisers and told them, no, we're not inserting tracking pixels.

10:53 We're not inserting other types of retargeting stuff for you.

10:57 If you need that more than you need to talk to my audience, you go find some other audience.

11:02 Like seriously, I've told them to go away.

11:03 Yeah.

11:04 And usually they're like, ah, okay, fine.

11:05 We'll, we'll, we'll work around it.

11:07 But I also just want to kind of put that out there.

11:10 It's like, look, these rules come in aimed at the top and sweep up a bunch of other people as well.

11:17 Right.

11:17 So I think there's like this mixed bag, I guess is what I'm trying to say.

11:20 Yeah, there, there definitely is.

11:22 And funny you mentioned that.

11:23 Like, I was shocked to see, to find out that there are podcasts I'm subscribed to, not yours.

11:28 So don't worry.

11:28 A podcast I've subscribed to that I can't even download on a VPN because there's some tracking requirement that my VPN blocks, so it won't even let me download

11:37 the episode, which is, it's just crazy to think that that's kind of the point where we're at, even with podcasts.

11:42 It's really terrible.

11:43 And a lot of that I think comes from people wanting to do dynamic ad insertion.

11:48 Yeah.

11:48 Right.

11:48 They want to go, okay, this person is coming from this IP address in this town.

11:54 And we have this extra information we've gathered from like these nefarious back channels.

11:59 And we're pretty sure, pretty sure that's Thomas and he works in tech and we're going to offer, we're going to dynamically insert

12:07 this thing, you know, and if, if they, if there's walls that stop that, then, then maybe, maybe no.

12:13 Let me go on one more really, really quick rant here.

12:16 Just to mention this cookie consent, I'll try to not be on the soapbox too much for this.

12:21 But I think right here at the CCPA, the California law, this right to non-discrimination for exercising your rights.

12:30 When I look at the web these days, those, all these cookie pop-up notifications, they're like the plague.

12:36 They're just everywhere.

12:37 Many of them just say, you have to just say, okay, or go away.

12:40 And, you know, it's like, okay, well, that's not a lot of control I have.

12:44 On the other hand, we have a lot of technology control.

12:48 I have on my network, I have NextDNS, which will block almost all the tracking and retargeting and all of that stuff.

12:54 I use Vivaldi with the ad blocker on.

12:57 And I'll go to these places and they'll say, turn off your ad blocker.

13:00 And they'll show you the cookie thing and they'll say, turn off your ad blocker.

13:04 If you don't turn it off, you don't get to come in.

13:07 I think what they should have done instead of having a law that says, you have to tell people you're tracking them and then make them press okay and then track the heck out of them.

13:15 Say, like kind of this last line here in the CCPA, say there's a right to non-discrimination for exercising your privacy.

13:24 And say, you should be able to have an ad blocker or other tracking protection mechanisms without being punished or blocked compared to the other visitors.

13:33 That would have solved it and there'd be no need for these pop-ups everywhere.

13:38 We could just, as an informed citizenry, if we decide we want to run ad blockers or other types of tracking blockers, we can.

13:46 And we just go about our business, right?

13:47 Like that would have been a more sophisticated solution, I think, than making everybody say, okay, I agree.

13:54 You're tracking me.

13:54 Let's go.

13:55 Yeah, this is, yeah, obviously I think it's been contentious, right?

13:58 Like I can barely remember the internet now, despite it being just a few years ago when, and now it's just so normal.

14:04 You go to a website and you're just waiting, right?

14:06 You're waiting for the other shoe to drop and the thing pops up and you click the thing.

14:08 Where's the cookie thing?

14:09 Okay, I got it.

14:10 I got it.

14:10 It's out.

14:10 Exactly.

14:11 Is it in the top or the bottom of this one?

14:12 Yeah.

14:13 It's a minigame every time you go to a new website.

14:15 And there are even browsers now that have a toggle of like, just reject, just don't show

14:20 me cookie consent.

14:20 Just do it for me.

14:21 Yes.

14:21 I accept them.

14:23 I accept them.

14:24 Yeah.

14:24 And it's, you know, I understand why it was there, right?

14:28 They want people to be aware of what's going on, but it's kind of like, you know, EULAs

14:32 or end user license agreements or whatever.

14:33 Like people don't, like people don't really stop to read.

14:36 If you're informed, right?

14:37 And again, this is where actually I think the, talking about non-discrimination and kind of

14:41 the privilege of being more privacy-aware, where there are a lot of people, right?

14:44 If you look at, you know, someone like my parents who, you know, they're going to go

14:48 to the websites, they're going to see the thing that's like, hey, we're using cookies. They don't

14:52 know what cookie consent is, have no idea what the concept of tracking on a website is.

14:55 They don't know what the data is actually used for anything like that.

14:58 So it's, again, it's just kind of an inconvenience that I don't think really increases anyone's

15:02 privacy, especially because a lot of companies are still using the dark pattern of you can click

15:06 yes, track everything.

15:08 But if you want to reject, or you want to only give the required cookies, it's like another

15:13 step or two and it's unobvious.

15:15 And it's like, you know, just a tiny little hyperlink text you have to click on.

15:18 It's not a big, big flashy button.

15:20 Light gray on dark gray.

15:22 Exactly.

15:24 Yeah.

15:25 Yeah.

15:25 Yeah.

15:25 Yeah.

15:25 Something like that.

15:29 Hey folks.

15:29 Let's pause for an ad.

15:31 Remember last time I had an ad for Microsoft for Startups.

15:33 This time around, I asked the AI to have some fun with it just to show you how powerful what

15:38 they have on offer really is.

15:39 Here we go.

15:40 Me.

15:41 Can you rewrite the Founders Hub ad in the style of a corporate executive using too many buzzwords?

15:46 AI.

15:48 Sure.

15:48 Here's a rephrased version of the ad in corporate speak.

15:51 This segment of Talk Python to Me is brought to you by the Microsoft for Startups Founders

15:56 Hub, a synergetic platform of innovation and digital transformation.

16:01 As a visionary entrepreneur, you are invited to leverage the cutting edge technology solutions

16:05 available through our collaborative ecosystem.

16:08 Our offering is unparalleled with over six figures worth of benefits, including $150,000

16:14 of Azure credit allocation, which is the epitome of value proposition in the cloud market.

16:19 You will have access to the new Azure OpenAI Service and OpenAI APIs, enabling you to infuse

16:26 AI driven solutions into your applications with ease.

16:29 Our dedicated technical advisory team will provide you with tailored guidance on best

16:34 practices for implementation, scalability, and security, while our network of mentors will

16:39 provide you with valuable connections within the startup landscape.

16:41 This program is designed to maximize the growth of your venture with expert support in areas

16:47 such as product development, fundraising, and go-to-market strategies.

16:50 There are no funding requirements, making it accessible to entrepreneurs at all stages of their journey.

16:56 So take advantage of this disruptive opportunity and join Microsoft for Startups Founders Hub

17:01 today.

17:01 The process is simple with an effortless five minute application process and the benefits are

17:06 substantial.

17:07 Transform your startup with the power of AI and become a leader in digital transformation.

17:12 Visit talkpython.fm/foundershub to enroll.

17:16 Thank you to Microsoft for supporting the show and OpenAI for making this ad fun.

17:23 So I think we're finding our way still.

17:26 We're not totally figuring it out.

17:28 There's attempts sometimes that don't really have the outcome

17:32 I think people intended, like this cookie consent one.

17:34 But still, the idea behind it was pretty good, even if the way it came out wasn't that

17:40 great.

17:40 At least from my perception.

17:42 I know some people really appreciate the ability to have those buttons, but I just say like,

17:46 I'm just going to block it. No matter what I answer, it's not coming through.

17:48 So I don't care.

17:49 But companies have to live in this world, right?

17:53 They have to live in the world where the pendulum is swinging back.

17:56 And so I guess, you know, we talked about GDPR and I don't really want to go too much more into it at this

18:02 point because we talked about so much.

18:03 There's an interesting article called The 30 Biggest GDPR Fines So Far.

18:08 And it's not updated for this year, but people can look through and see what kind of, you know,

18:12 it's exactly the companies that I described for the most part, except for there's like some

18:16 weird ones in here.

18:17 I don't know if you've seen this article, but there's like H&M is German clothing company where

18:25 they like film their employees and then like shared that internally.

18:28 And that was the violation, which was unusual.

18:31 But so these are the ones that people might know.

18:34 Go ahead.

18:34 What are you going to say about that?

18:35 I was going to say, sorry, I was going to say that a lot of people actually forget.

18:38 I think that's an interesting one specifically because GDPR has kind of different classes

18:42 of people it protects.

18:43 And actually employees is absolutely one of them because a thing that you will see is companies

18:48 will use internal employee data and say, well, they're our employees.

18:51 They don't have data privacy rights because they work for us.

18:54 So we can use their information however we want to.

18:55 You can't do that, right?

18:56 GDPR says specifically, yeah, you can't sell your employees' data.

18:59 You can't use their, you know, biometrics for whatever you want to, all that kind of stuff.

19:05 So I think it is really important also that H&M got fined for that because it's showing,

19:08 hey, you have to treat your employees as well as your customers when it comes to data privacy.

19:12 Interesting.

19:13 Yeah.

19:13 It's not just your web visitors.

19:15 It's the people.

19:16 Exactly.

19:16 It's trying to protect everyone, no matter what their relation is to said company.

19:21 Another one, I think, just highlights the greater subtlety of all of this is this article in

19:27 The Register entitled, Website Fined by German Court for Leaking Visitor's IP Address Via Google

19:33 Fonts.

19:33 Are you familiar with this at all?

19:35 Yeah, I'm vaguely familiar with it.

19:38 I think it's an interesting case because you can see the fine is 100 pounds, right?

19:43 Sorry, not 100 pounds, 100 euros, which ends up being 110 US dollars.

19:46 And so I think it was very much meant to catch news headlines, right?

19:50 And just kind of warn people, hey, we've now kind of decided the Google Fonts is not going to be great.

19:56 And so this is a very inexpensive warning to everyone else that maybe you should start looking

20:00 into if you're using Google Fonts or not.

20:03 Yeah, I find this very interesting because, again, it's almost like the cookie consent thing.

20:08 This will ripple across most websites, probably.

20:10 Right.

20:11 I think people think about Google Analytics and some of these other conversion tracking systems

20:18 that you plug in.

20:19 You're like, OK, I realize we're tracking.

20:21 But even like really subtle little things like linking to an image of a YouTube video.

20:26 Yeah.

20:27 Like that will like drop cookies from YouTube and Google onto your visitors and all those

20:33 things.

20:33 You're like, wait a minute.

20:34 I just pointed an image like that's nuts.

20:37 And this is like this.

20:38 When I read this, I thought, oh, maybe there's like one person that sued this company and they

20:42 got $100 or something.

20:43 I don't know.

20:44 But what if they had 100 million visitors and everyone decided, oh, we'll do a class action

20:48 lawsuit.

20:48 I mean, it could explode.

20:50 I think that's why it caught the headline so much.

20:52 Yeah.

20:53 And so most of the time it is like with these violations, and this is where it even gets

20:57 a little bit sticky because individual countries' data protection agencies will go after

21:03 companies.

21:03 Right.

21:04 So like if you are, for instance, a lot of big ones happen in Ireland because, again, a

21:07 lot of tech companies, especially like Silicon Valley tech companies have headquarters in Ireland.

21:11 So you see the Irish privacy authority levy a lot of these fines, and in some cases

21:16 people are criticizing them for being kind of lenient.

21:18 But I think in this case, it was very specifically, you know, for like you said, for one reason

21:22 or another, someone or the government just kind of decided, hey, we need to call out this

21:26 Google hosted web font and kind of warn everyone else that, hey, maybe you shouldn't be using

21:30 those.

21:30 I don't know.

21:31 It is very interesting.

21:32 And I do feel like a lot of these go under the radar.

21:34 I think they do.

21:35 And I don't even think they name the company that this was applied to.

21:39 By the way, people, if they're like, but what do we do?

21:42 Fonts.bunny.net is a really cool option.

21:45 Zero tracking, no logging, privacy first, you are compliant and a drop in replacement for

21:51 Google fonts.

21:52 So people should check that out.
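
As a rough sketch of what "drop-in replacement" could mean in practice, the snippet below rewrites Google Fonts stylesheet links in a page to point at fonts.bunny.net instead. It assumes the Bunny endpoint accepts the same path and query string as the Google one, which is what a drop-in replacement implies but is worth confirming against their documentation; the helper name `use_bunny_fonts` is made up for this example.

```python
GOOGLE_FONTS_HOST = "https://fonts.googleapis.com/"
BUNNY_FONTS_HOST = "https://fonts.bunny.net/"  # assumed to mirror the same paths


def use_bunny_fonts(html: str) -> str:
    """Point every Google Fonts stylesheet URL in the markup at Bunny Fonts."""
    return html.replace(GOOGLE_FONTS_HOST, BUNNY_FONTS_HOST)


page = '<link href="https://fonts.googleapis.com/css?family=Roboto:400,700" rel="stylesheet">'
rewritten = use_bunny_fonts(page)
```

Because the visitor's browser then never contacts a Google host to fetch the font, the IP address leak discussed in the article does not occur on that request.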

21:55 If they're like, I kind of want this functionality, but I kind of don't want it anymore.

21:58 Yeah.

22:01 So that's a pretty cool option.

22:03 All right.

22:04 So that sets the stage a little bit, but let's maybe talk about some of the problems that like

22:11 large organizations have.

22:12 So I know that you worked at a large organization where it was like, we have this, what data do

22:18 you have about me request?

22:19 Or how do you use my data request?

22:21 And I can only imagine at like a multiple thousand person company, there's these databases and people

22:28 like dip into them and take something.

22:30 And then who knows where it goes?

22:31 And then they hook with some third party other thing.

22:34 And then like, it's off to the races.

22:36 Like, tell me what you did with that.

22:37 Like, I don't know.

22:38 It's out.

22:40 Yeah.

22:41 It feels like the, maybe this is too much of an American cultural reference, but like a

22:45 take a penny, leave a penny, but for data, right?

22:47 Like, you might drop some in there.

22:48 You might take some out.

22:49 No one really knows where it went.

22:51 It's just now circulating in the broader economy.

22:54 And so with that company, this is my last company, also a startup.

22:58 And I was a data engineer.

23:00 So I've been mostly a data engineer until my current position.

23:03 And it got to the point where, and this is going to sound crazy.

23:05 We were luckily through the power of like, you know, Snowflake and DBT and things like

23:10 that.

23:10 We were able to actually replicate a data warehouse per country.

23:14 So like all of our EU data stayed in the EU.

23:17 All of our Canada data stayed in Canada.

23:19 We're basically just spinning up as many warehouses as we needed to.

23:21 Like, so when CCPA came online, we were like, all right, we're spinning one up in California.

23:25 And then the rest of the US has one somewhere else.
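
The warehouse-per-jurisdiction setup described here amounts to a routing rule from a user's location to a storage region. A minimal sketch, assuming made-up warehouse names and a deliberately partial list of EU country codes (none of these identifiers come from the conversation):

```python
from typing import Optional

# Partial list for illustration only; a real system would use the full set.
EU_COUNTRIES = {"DE", "FR", "IE", "NL", "ES", "IT"}


def warehouse_for(country: str, us_state: Optional[str] = None) -> str:
    """Pick the (hypothetical) warehouse where a user's data must live."""
    country = country.upper()
    if country in EU_COUNTRIES:
        return "warehouse-eu"
    if country == "CA":  # Canada
        return "warehouse-canada"
    if country == "US":
        # California gets its own warehouse because of the CCPA.
        if us_state and us_state.upper() == "CA":
            return "warehouse-us-california"
        return "warehouse-us-other"
    raise ValueError(f"no warehouse configured for country {country!r}")
```

Every new regional law adds another branch and another warehouse, which is the scaling problem the conversation goes on to describe.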

23:26 Right.

23:27 But it's just, obviously, that's just not sustainable.

23:30 You know, we were a relatively small data engineering team and we automated most of it, but it was

23:34 very clear that that became a huge problem.

23:35 It might be sustainable if it's Europe, US, and other or something like that.

23:40 Right.

23:41 Exactly.

23:41 Where the US people.

23:42 Australia was slowly going to get on that list.

23:45 We're like, all right, the US people, they get no protections.

23:48 We sell them like crazy.

23:50 The Europeans will be a couple of them and the Canadians, they're nice.

23:53 We'll kind of be nice with them.

23:54 But these are, we're seeing this stuff pop up more and more, like more regionally.

24:00 And it's getting harder and harder to follow.

24:02 That was the problem, right?

24:03 When it's California and then the rest of the US, which at the time it was, this was

24:06 a few years ago.

24:07 Okay.

24:07 So we have a data center in California to comply with CCPA.

24:12 And then we have a data center outside of California and we're good.

24:14 And now it's like, well, like Virginia just passed a privacy law.

24:16 I don't think we have servers in Virginia.

24:18 I'm like, oh, like Idaho just passed.

24:20 I don't know if there's server farms in Idaho.

24:22 It's like, you know, it became a problem like that where you can't, you're not going to

24:25 spin up 50 data centers, one in each.

24:28 Like Hawaii probably didn't have data centers.

24:30 Any Jesus comes there.

24:31 But they did, if they, if you need to spin them up, you need to do it in person and it's

24:35 going to take a month.

24:36 Yeah, exactly.

24:37 Exactly.

24:38 It's very true.

24:38 Sorry.

24:39 I got a bad tan, but the data center is coming along.

24:41 Yeah.

24:43 So it was, you know, so that complexity, right?

24:46 So all these different laws.

24:47 But then on top of that, like you said, putting in the access controls for figuring out where

24:52 data is going and how.

24:54 So definitely having a tool like DBT, which if people don't know, it's kind of like a very

24:58 programming focused data analytics tool building models and such.

25:01 So we had a good lineage graph of where all the data was coming from, what it was doing,

25:04 but we still had to document our use because legally you have to have a valid use for every

25:10 piece of data that you are storing in there and things like that.

25:13 And so I was just spending more and more time in calls with our security team and our privacy

25:19 professionals, our compliance team, just answering questions of just like, hey, here's a gigantic

25:24 graph of all of our tables, all of our databases, just everything we could possibly be doing.

25:29 Like how does data flow through here?

25:30 Like explain to me how this PII goes from here to here and what it's used for exactly.

25:34 And that kind of, as we talked about before, it scales, sorry, it should be, I think just

25:40 DBT.

25:40 Yeah.

25:41 DBT.

25:42 That's the wrong one.

25:42 I know.

25:42 Probably happens.

25:43 I'll find it.

25:44 Keep talking.

25:46 And, but that's not really scalable.

25:48 So you, again, we talked about before this ended up punishing smaller companies a lot

25:52 more because, so if you're Google, you can throw, you can just hire 20 people out of nowhere,

25:57 call them privacy engineers.

25:58 And, you know, just say, Hey, it is now your full-time job to just keep track of these things

26:03 and help us stay compliant.

26:04 But if you're a smaller company like we were, then a lot of that fell to like myself and the

26:09 data engineering team.

26:10 And then of course the product engineers as well.

26:11 And so that makes it really difficult.

26:13 That adds a pretty large burden to doing business.

26:16 And after being there for a while, I then had an opportunity to come work at Ethyca and

26:20 I was absolutely sold on working at Ethyca.

26:22 And they said, you know, we're trying to build a platform that allows engineers to just handle

26:27 this like engineers like to, right?

26:29 Which is with automated tooling, with CI checks.

26:32 Yeah.

26:33 With the YAML files, with open source Python code.

26:36 And I was like, Hey, this sounds great.

26:38 I'm spending most of my day worrying about this anyway.

26:41 I'd love to just get paid to solve this problem for other people.

26:43 And that's how I ended up at Ethyca.

26:45 And it's been a journey ever since.

26:48 And I think we're, one of the challenges of tackling something like this is like we just

26:52 talked about, it's such a broad problem space.

26:54 So you can come in and you can handle the cookie consent thing, right?

26:57 But then they're going to say, well, to have a holistic privacy solution, we also need

27:01 to handle knowing what our code does and data mapping and DSRs, which, you know, we're

27:04 going to get to in a second.

27:05 So there's actually, it's like this multi-pronged data.

27:08 DSR data.

27:09 What's the DSR stand for?

27:11 Yeah.

27:11 So DSR is data subject request.

27:13 And that is-

27:14 Is that like what data you have about me kind of thing?

27:16 Exactly.

27:17 Exactly.

27:17 So I think one that people have probably heard before is there's also like the right to

27:21 be forgotten is included in that.

27:23 So that's the ability for me to go to a company and say, hey, I would like to see what data

27:28 you have about me.

27:29 And so you actually, I'm going to give you my email, right?

27:32 And that's my primary identifier in your system.

27:33 I'm going to give you my email.

27:34 And then you need to go scour your entire infrastructure.

27:37 And every piece of PII and data you have related to me, you need to give back to me in like a CSV

27:42 format so that I can very easily see what you're tracking.
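
The access request described here can be sketched in a few lines. This is only an illustration, assuming an in-memory SQLite database stands in for the real infrastructure; the table names, the `PII_TABLES` mapping, and the `export_subject_data` helper are all hypothetical.

```python
import csv
import io
import sqlite3


def export_subject_data(conn, email, tables):
    """Collect every row tied to `email` and return it as CSV text.

    `tables` maps a table name to the column holding the email address.
    The names come from trusted metadata, never from user input.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["table", "column", "value"])
    for table, email_column in tables.items():
        cursor = conn.execute(
            f"SELECT * FROM {table} WHERE {email_column} = ?", (email,)
        )
        columns = [d[0] for d in cursor.description]
        for row in cursor:
            for column, value in zip(columns, row):
                writer.writerow([table, column, value])
    return out.getvalue()


# Hypothetical stand-in for a company's data stores.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, name TEXT)")
conn.execute("CREATE TABLE orders (customer_email TEXT, address TEXT)")
conn.execute("INSERT INTO users VALUES ('john@somecompany.com', 'John')")
conn.execute("INSERT INTO users VALUES ('jane@example.com', 'Jane')")
conn.execute("INSERT INTO orders VALUES ('john@somecompany.com', '1 Main St')")

PII_TABLES = {"users": "email", "orders": "customer_email"}
report = export_subject_data(conn, "john@somecompany.com", PII_TABLES)
print(report)
```

The hard part in practice is the `PII_TABLES` mapping itself: the query is trivial once someone knows which tables and columns to look in.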

27:46 Then you have, like I said, the right to be forgotten.

27:48 So that is, you know, say I'm using BigCo's whatever email service, and I don't want to

27:54 use their service anymore.

27:55 And I say, hey, I'm sending you a request to delete all, any and all data related to myself.

28:00 So I no longer want to be your customer.

28:02 I've deleted my account.

28:03 Everything with, you know, John at somecompany.com.

28:06 You need to take that email, run it through your system and delete every single piece of

28:10 PII related to it.

28:11 Yeah.

28:11 And so these are the kind of like privacy protections we're talking about.

28:14 But that stuff is, is complicated.

28:16 And so.

28:18 Yeah.

28:18 Well, I talked earlier about how it was really challenging for small, small companies.

28:23 I think this thing you're talking about now is it's actually not that bad for small companies.

28:28 I think it's killer for the medium sized business that doesn't have the Google size tech team

28:33 to track it.

28:34 Right.

28:34 They've got a ton of people that mess with it and a ton of, ton of data.

28:38 A lot of integrations.

28:39 Yeah.

28:40 Yeah.

28:40 And that's, that's an interesting thing we've seen is that a lot of, a lot of times

28:44 when people are out of compliance, it's, it's not actually because they are malicious and

28:49 they don't care about people's privacy.

28:51 It is because they just, they physically cannot.

28:55 If you go to someone and say, Hey, you have a hundred thousand, this is not uncommon,

28:58 like a hundred thousand Postgres tables.

29:00 And you need to tell me exactly where every bit of PII is.

29:04 And there's 100,000 Postgres tables.

29:06 It's not going to happen.

29:07 Like no, no one actually knows.

29:08 Right.

29:08 Like there's probably people that have left that maybe knew.

29:11 And now there's some dangling Postgres database out there in AWS somewhere that has PII

29:14 that they don't even know about.

29:15 Right.

29:15 It just doesn't even show up on their maps anymore.

29:17 And that's the biggest challenge is that it's not, it's not people, you know, doing

29:22 things out of malice.

29:23 It is, is purely the technical scale of the problem is just huge.

29:26 And again, like I said, even Google with an army of privacy engineers or, or Meta with

29:31 an army of privacy engineers, they still get fined all the time because it's just not really

29:36 possible to catch everything manually at that scale.

29:38 And that's what most people are still trying to do is to do everything manually.

29:43 This portion of Talk Python to Me is brought to you by Sentry.

29:46 Is your Python application fast or does it sometimes suffer from slowdowns and unexpected

29:52 latency?

29:53 Does this usually only happen in production?

29:55 It's really tough to track down the problems at that point, isn't it?

29:59 If you've looked at APM, application performance monitoring, products before, they may have felt

30:04 out of place for software teams.

30:05 Many of them are more focused on legacy problems made for ops and infrastructure teams to keep their

30:11 infrastructure and services up and running.

30:13 Sentry has just launched their new APM service.

30:18 And Sentry's approach to application monitoring is focused on being actionable, affordable, and

30:23 actually built for developers.

30:25 Whether it's a slow running query or latent payment endpoint that's at risk of timing out

30:29 and causing sales to tank, Sentry removes the complexity and does the analysis for you, surfacing

30:35 the most critical performance issues so you can address them immediately.

30:38 Most legacy APM tools focus on an ingest everything approach, resulting in high storage costs, noisy

30:45 environments, and an enormous amount of telemetry data most developers will never need to analyze.

30:51 Sentry has taken a different approach, building the most affordable APM solution in the market.

30:56 They've removed the noise and extract the maximum value out of your performance data while passing

31:01 the savings directly onto you, especially for Talk Python listeners who use the code talkpython.

31:07 So get started at talkpython.fm/sentry and be sure to use their code talkpython, all

31:14 lowercase, so you let them know that you heard about them from us.

31:17 My thanks to Sentry for keeping this podcast going strong.

31:24 What about bad actors?

31:25 By that, I mean there are companies that try to do the right thing like mine.

31:31 You can go to the courses website.

31:33 I spent the better part of two weeks on this.

31:35 There's a button, download everything you know about me.

31:38 And there's a nuke my account, completely wipe me off the face of the earth as far as you're

31:42 concerned.

31:43 And to my knowledge, those are totally accurate and sufficient.

31:47 However, what if there's a company that says, here's all the data I have with you and here's

31:52 the places I share it.

31:53 And they leave out the three most important and dangerous ones.

31:56 Like, do you know what recourse there is?

31:59 Because it looks like they're complying.

32:00 It looks like I requested the thing they gave it to me.

32:02 I asked it to be deleted.

32:04 They did, except for in that dark market where they're selling it to shadow brokers for ad

32:08 data on credit card mix-ins.

32:10 And that's way more valuable.

32:11 We'll keep that.

32:12 Yeah.

32:12 I mean, this comes down to somehow they would just have to get found out.

32:16 There'd have to be an internal whistleblower.

32:18 There'd have to be an investigation.

32:19 There would have to be some kind of audit.

32:21 Because they do, as part of GDPR, you are required to submit things like a data map,

32:27 which we'll talk about in a little bit, which is basically, where is data going?

32:30 What is our valid use for said data?

32:32 And all that kinds of stuff.

32:34 But like you said, if there's a truly bad actor that is leaving things out of reports on purpose

32:38 and not letting customers know that they're doing certain things with their data,

32:41 I'm actually not sure how that would get kind of discovered.

32:44 I think you're right.

32:44 Maybe a whistleblower or maybe somebody says, there's no way this data got over there without going through there.

32:49 Right.

32:50 Exactly.

32:51 I'm going to try to get some legal recourse to make you show us, make you testify.

32:56 At least lie under oath instead of lie to EULA.

33:00 And even this was, this actually was a big sticking point recently.

33:03 Florida is also, you know, working on their own privacy law.

33:06 And a big sticking point that I believe made it not go through was that they could not agree

33:11 on whether individual citizens should be allowed to sue companies for data misuse or if it should

33:15 be purely something the government handles.

33:17 I mean, that's an interesting thing to think about.

33:20 It is.

33:21 It's one of those things that sounds amazing.

33:22 Like, yes, sure.

33:23 If you're abusing, you know, company X is abusing Thomas.

33:26 Thomas should have some direct recourse, but you could easily destroy a company just by going like,

33:31 let's get 50 people to all like, here's your cookie cutter letter that we send over as part of the legal process.

33:37 And, you know, just knock them offline.

33:39 Right.

33:39 Knock them out of business.

33:40 So I, I can see both sides all over again.

33:43 All right.

33:43 So I kind of derailed you.

33:44 We were talking about like the types of things that these medium scale organizations like really

33:50 get hung up on and you touched on some, but.

33:52 Yeah.

33:52 So, sorry, let me, yeah, I'll, I'll go back.

33:54 So number one is just the, the, the largest issue that we see in this scale is actually anywhere

34:00 from medium to large, right?

34:01 Even with Google and, you know, probably like Twitter size, you know, kind of the thing.

34:04 I would also bet really good money.

34:07 There's no one there that really knows where everything is.

34:09 It's just, it's just too much to, to handle manually or within people's heads.

34:14 So the number one problem is that people don't know where their data is.

34:16 That's a huge issue.

34:17 The number two problem is even if they know where all that data is, right?

34:21 Theoretically in a perfect world, if someone gives you an email and says, Hey, you need

34:25 to delete this email across all of your tables.

34:29 And okay.

34:29 I know we have this email and this PII in a hundred tables and three different APIs that

34:35 we use.

34:35 Cause we use whatever Zendesk and Salesforce.

34:37 Okay.

34:37 So now you've got that information in a perfect world.

34:39 How do you actually execute that?

34:41 Like there are plenty of companies that have someone.

34:43 On staff full time that just fulfills these DSRs and right to be forgotten and things like

34:49 that.

34:49 So it is not really efficient to say, okay, I've now got to manually go run SQL queries

34:54 in a hundred different, you know, database tables, a hundred different databases.

34:57 I've now got to log into three different APIs.

34:59 And it's just, it's not, it's again, not doable in an automated, you know, you, you need

35:04 to automate it.

35:05 So even if you know where everything is, how do you automate that?

35:09 So that's another problem we were trying to solve.

35:12 And then finally it's the data mapping piece, right?

35:14 So you need to understand, you not only need to know where your data is, you need to know

35:18 what type of data it is and why you have it.

35:20 And that's really difficult because maybe three years ago, I did some proof of concept where

35:25 I was grabbing people's addresses and trying to figure out a way to find cheaper shipping

35:30 for our e-commerce website and whatever the table's still there.

35:33 And so then three years later, someone comes and says, Hey, I found all this PII in this

35:36 database.

35:36 Like, why did you collect this?

35:38 Like what is this for?

35:38 And I've already moved on to another company because, you know, it's startups.

35:41 And that's a problem because you need to have a valid use for every bit of PII that you

35:46 have in your system.

35:47 And so it's this kind of this lack of documentation and knowledge that just brings about all these

35:52 problems.

35:53 And again, without, without automated tooling, it's just, I just don't think it's really

35:56 feasible, which is kind of, again, where Ethyca saw a place to solve a huge problem.

36:01 Probably also a little fear.

36:02 By that, I mean the time, the short times that I spent at these larger companies, there

36:08 were systems that were like, don't touch that.

36:10 That runs.

36:11 It's important.

36:12 Nobody can make it.

36:13 Nobody can fix it.

36:14 We probably can't redeploy it.

36:16 Just don't touch it.

36:17 And what if it, what if it has a bit of data?

36:20 It cannot have a nullable foreign key relationship.

36:23 No, that's a strong, and I want to remove it from this table, but, but the thing that shall

36:27 not be touched and no one can keep it running.

36:29 It's my problem.

36:31 If I break it, I don't want that problem.

36:33 I could just stay.

36:34 Yeah.

36:35 That's a problem.

36:35 Right.

36:36 Yeah.

36:37 That definitely becomes a problem too.

36:38 Things that get forgotten about, things that people don't want to touch.

36:41 They've lost kind of the institutional knowledge of how it got there and how to,

36:45 how to even get out of it if they wanted to.

36:47 Like you said, fear of, of downstream breaking changes.

36:50 Right.

36:51 So say, say I come in and mask this username.

36:53 I have no idea if it's going to break some analytics tool.

36:57 If it's going to ruin our marketing department.

36:58 Like I have no idea.

36:59 Right.

37:00 Right.

37:00 Why can't we send email anymore?

37:01 Well, you see.

37:03 Yeah, exactly.

37:04 And so it's also this, this difficulty of communicating across the organization.

37:10 Yeah.

37:10 Because oftentimes you'll get privacy engineers and they'll, they'll be embedded into a product,

37:15 into a team.

37:16 And theoretically, they're supposed to talk across the entire org, but there's not like some

37:19 centralized tool.

37:20 There's no Zendesk of privacy where like, okay, a whole organization uses this one tool

37:25 and we can put in, you know, we can put in tickets or we can see what the state of privacy

37:29 is across the organization, et cetera, things like that.

37:31 There's nothing like that that really, that really existed.

37:33 And so that's when we kind of realized, okay, we need to build some kind of platform where

37:39 it can just be like a one-stop shop for everything privacy engineering related.

37:43 So that's going to be engineers and privacy professionals.

37:45 The engineers do their work.

37:47 It all flows upwards into this tool.

37:48 And then the compliance professionals can get all the information they need out of that

37:52 tool and trust that it's correct because it's done in a programmatic way.

37:55 And it's automated and all the stuff that you need, right?

37:59 Yeah.

37:59 All right.

38:00 Now, really one final question before we jump into your platform, which solves many of

38:04 these problems.

38:04 What about AI?

38:05 What if it learned something through personal information and then you ask for your personal

38:10 information?

38:11 Like you can't go and show me like the node in the neural network that has my information.

38:17 Yeah, exactly.

38:18 But at the same time, it knows something about it, right?

38:21 Correct.

38:22 Yeah.

38:22 So it is trained.

38:23 There are different ways to deal with this.

38:27 So for instance, but like you said, you can never really know.

38:31 So, I mean, this is a rabbit hole.

38:32 So you can use AI to generate fake PII and then train a model on fake generated PII.

38:39 That's one way.

38:40 Right, right, right.

38:41 But again, like you said, due to the very opaque nature of, and like you said, we're

38:45 talking about actual neural nets.

38:46 We're not just talking about, you know, machine learning, statistical learning models.

38:49 It's like a neural net where that stuff becomes completely obfuscated.

38:51 Like Midjourney, DALL-E, these types of things.

38:54 Yeah, it becomes truly a black box.

38:55 There's really no way to know, right?

38:58 And that comes down to regulators stepping in and again, just saying, hey, you cannot use

39:03 PII in this model, regardless of the fact that it eventually, theoretically, would be obfuscated.

39:09 You know, that comes down to governments to just say, hey, that's not cool regardless.

39:13 It's going to be so interesting as this evolves because if it was trained on that information,

39:17 it kind of is corrupted in a sense.

39:20 Like you can't take one person's information out.

39:22 You'd have to redo the model.

39:23 Exactly.

39:24 That's so much work.

39:24 Yeah, it's so tricky.

39:25 So you've got to think of that up front.

39:27 All right.

39:28 So let's talk about your project, Fides.

39:30 Tell us about Fides.

39:32 Absolutely.

39:33 So Fides is an open source, I guess, tool, or a platform maybe is a better word for it,

39:40 an open source platform for privacy engineering.

39:42 And it's really designed towards those two personas that I've talked about, where you have

39:47 privacy professionals, you have a compliance team, and they need an easier and a more accurate

39:52 way to interface and work with the engineering team, other than just calling cons or Zoom calls

39:56 to ask them, hey, what does this table do?

39:57 Which again, it's fine.

39:59 Like that's their job.

39:59 They're supposed to be doing that for protecting the company and protecting the privacy of the

40:03 user's data.

40:04 But then on the other side, you have engineers and engineers, they probably don't want to

40:07 be in these Zoom calls all the time.

40:08 And they would probably much rather interface with privacy engineering in a way that's more

40:13 familiar to them.

40:13 So CI checks, command line tools, YAML files, like we mentioned.

40:18 And so we thought, okay, we need to build a tool that bridges that gap, right?

40:22 Like we need to create an overlapping set of tools.

40:26 And so we have a lot of tools that both sides will be happy with and that provide a good

40:28 user experience for both sides.

40:29 And so we have Fides, right?

40:31 So Fides is primarily Python.

40:34 Pretty much everything is Python.

40:35 We also have TypeScript for the front end.

40:37 We use a lot of other open source stuff.

40:39 And we're also on GitHub.

40:41 So anyone that wants to use this for themselves is totally able to, right?

40:47 Because we kind of fundamentally, I think as a privacy company, it's important to believe

40:51 that privacy is a human right.

40:52 And so while we do have some paid features, the vast majority are completely available.

40:58 Like compliance is completely available for free and open source.

41:02 We don't think we should be saying, you know, hey, your privacy is really important, but

41:04 only if you pay us.

41:06 But we think any engineers should be able to come and look at our repo and grab Fides and

41:12 then start working off of it and be able to respect user privacy within their applications

41:17 without having to really pay anything.

41:19 Right.

41:20 And since you brought it up, so Fides is this open source project that people can grab

41:25 and fulfill and automate much of what we've been talking about here, which is awesome.

41:30 And your business model with Ethyca is, I guess, what you would probably classify as open core,

41:37 right?

41:37 Yeah.

41:38 That's how we refer to it internally.

41:40 Is that how you consider it?

41:41 Yeah.

41:41 Yeah.

41:42 Open core.

41:42 So we have, you know, internally we'll call it like Fides core, right?

41:45 Which is this repo, which is where a lot of the work, it's not like you'll see me,

41:49 most of my PRs are in there.

41:50 So Fides core is really what we build on.

41:53 And then we have an additional, what we call Fides plus.

41:56 And that is where you would get additional features that are really more like enterprise

42:01 focused, right?

42:02 So if you are, like I said, we're talking about those medium to very large enterprises where

42:06 you have a hundred thousand tables and maybe you want a machine learning classifier to help

42:10 you figure out what kind of data is in those tables, then like that'd be a paid feature.

42:15 But if you just want to-

42:16 It's like you kind of bootstrap it.

42:17 Like we have this data, go look at it and tell me what you think about it.

42:21 Something like that.

42:22 Yes, exactly.

42:22 Exactly.

42:23 Okay.

42:23 So it'll walk databases, tables, you know, fields, all that kind of stuff and say, Hey,

42:28 this is probably this type of data.

42:29 This is probably this type of data.

42:30 You know, obviously as accurate as we can get it.

42:33 For the most part, things are going to be happening in open core.

42:36 So in the open core product, we are going to tackle the three major things that we think

42:43 are going to be required for any kind of privacy first application.

42:47 So first we're going to let people, and you can see for all the YouTube viewers, video viewers

42:52 on the right side here, we've got YAML files.

42:54 So YAML files are where you're going to define kind of the primitives you want to use for your

42:58 application, right?

42:59 So we have like data uses, we have different data category types, and you can define systems,

43:02 data sets, kind of the building blocks of how you're going to describe and define your application

43:08 from a privacy perspective.
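
To make those building blocks concrete, here is a rough Python model of the same ideas. It is a sketch only: the class names, field names, and dotted labels such as `user.contact.email` are illustrative assumptions, not the actual Fides schema or taxonomy, and the real definitions live in YAML rather than Python.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetField:
    """One column or field, tagged with the kinds of data it holds."""
    name: str
    data_categories: list[str]


@dataclass
class Dataset:
    """A database or similar store: collection name -> annotated fields."""
    key: str
    collections: dict[str, list[DatasetField]]


@dataclass
class System:
    """An application or service, with why it uses data and which kinds."""
    key: str
    data_use: str
    data_categories: list[str]
    dataset_keys: list[str] = field(default_factory=list)


# Hypothetical declarations an engineer might write for a small application.
app_db = Dataset(
    key="app_postgres",
    collections={
        "users": [
            DatasetField("email", ["user.contact.email"]),
            DatasetField("created_at", ["system.operations"]),
        ]
    },
)
checkout = System(
    key="checkout_service",
    data_use="essential.service",
    data_categories=["user.contact.email"],
    dataset_keys=[app_db.key],
)
print(checkout)
```

Everything that follows, such as data maps, subject requests, and CI checks, is derived from declarations of roughly this shape.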

43:09 Once you've done that, right, once we have all this information that you've given us as

43:13 metadata, you've given us about your application and your data sets, we're then able to start

43:17 enforcing that automatically.

43:18 We're able to start building those data maps and telling you, you know, Hey, based on what

43:24 you told us in your metadata annotations, this is everywhere your data lives.

43:27 And this is the type of data that lives there.
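
A data map of this kind is essentially the annotations regrouped by data type. The sketch below assumes a flat, hypothetical `ANNOTATIONS` dictionary; the dataset names and category labels are made up for illustration.

```python
from collections import defaultdict

# Hypothetical annotations: (dataset, table, column) -> data category label.
ANNOTATIONS = {
    ("app_postgres", "users", "email"): "user.contact.email",
    ("app_postgres", "orders", "shipping_address"): "user.contact.address",
    ("analytics_mongo", "events", "email"): "user.contact.email",
}


def build_data_map(annotations):
    """Group annotated locations by the category of data stored there."""
    data_map = defaultdict(list)
    for (dataset, table, column), category in annotations.items():
        data_map[category].append(f"{dataset}.{table}.{column}")
    return {category: sorted(paths) for category, paths in data_map.items()}


data_map = build_data_map(ANNOTATIONS)
for category, paths in sorted(data_map.items()):
    print(category, "->", ", ".join(paths))
```

The output is only as complete as the annotations, which is why keeping them in sync with the real schema matters so much.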

43:29 Additionally, based on those, we're going to say, Hey, if you give us an email, we have

43:34 actually an execution engine with a bunch of different connectors.

43:37 And we're going to say, okay, so you've told us you have a Postgres database here, and you

43:41 have a Mongo database here, and we're looking for this email.

43:44 And like these tables are going to be where the PII is.

43:46 So it'll automatically go and execute that, right?

43:48 Like it builds, it's built on top of Dask, but we're kind of doing our own logic for some

43:52 kind of directed acyclic graph, right?

43:54 To go out and find that data and delete it in the right order.

43:57 Or to retrieve it in the right order and then give it to user request.
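
The ordering problem can be illustrated with a topological sort over a directed acyclic graph. This sketch uses the standard library's `graphlib` rather than Dask, and the `REFERENCES` metadata is hypothetical; it shows the general technique, not how Fides implements its execution graph.

```python
from graphlib import TopologicalSorter

# Hypothetical foreign-key metadata: each table maps to the tables it references.
REFERENCES = {
    "users": set(),
    "orders": {"users"},
    "order_items": {"orders"},
    "support_tickets": {"users"},
}

# static_order() yields a table only after everything it references,
# so referenced (parent) tables come first: a safe order for retrieval.
access_order = list(TopologicalSorter(REFERENCES).static_order())

# Erasure runs the other way: delete referencing (child) rows before parents.
erasure_order = list(reversed(access_order))

print("access: ", access_order)
print("erasure:", erasure_order)
```

`TopologicalSorter` raises `CycleError` if the references loop back on themselves, which is a useful early warning that the metadata is wrong.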

43:59 So we're really leveraging this power of using metadata to go ahead and automate all these

44:05 tasks.

44:06 We'll also, more and more as we go into the future, we're trying to figure out ways to just

44:10 automate it completely.

44:11 So if we got to a point where engineers didn't even have to write these YAML files, and we

44:16 could just introspect the code and figure out quite programmatically what was actually going

44:21 on there, what we need to be concerned about.

44:22 That's kind of where we want to get to and where we see the future of privacy being is,

44:26 especially with the incredible explosion of these large language models and things like

44:30 ChatGPT and OpenAI, right?

44:31 Doing some kind of natural language processing to allow us to understand what the code is doing

44:35 without burdening developers with writing YAML is where we hope to get to eventually as well.

44:40 That's ambitious, but five years ago, I would have said, oh, that's insane.

44:43 Not anymore.

44:45 You can give these large language models good chunks of code, and they have a really deep

44:51 understanding of what's happening.

44:52 It's scary.

44:53 Yeah.

44:54 So it's very impressive.

44:55 So for now, we're still in YAML land.

44:57 Hopefully, engineers are pretty comfortable there.

44:59 We've been there for a while, I think, with Kubernetes and all kinds of other tools.

45:02 But yeah, hopefully, we just want to keep lowering the barrier to entry for privacy compliance

45:08 and for building applications that are private by design.

45:13 Right.

45:13 So in the parlance of your website, that's privacy checks as code in continuous integration.

45:19 The two other things are programmatic subject requests and automated data mapping.

45:25 It sounds like you touched on the automated data mapping, but talk about the programmatic

45:29 subject request.

45:29 Yeah.

45:29 So the programmatic subject request is what I mentioned briefly about kind of how we build

45:34 an execution graph for when those data requests come in.

45:38 So again, like I said, we have that metadata.

45:39 We know where your data lives and what type of data it is.

45:41 So when a user says, hey, here's my email.

45:44 Please get rid of all the information you have about me.

45:47 We're able to do that subject request programmatically because we know, okay, we're going to reach out

45:51 to the Salesforce API.

45:52 We're going to reach out to this Postgres database users table where we know that data lives.

45:57 And we're going to do that for you automatically.
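
One way to picture that fan-out is a set of connectors that share a small interface. The sketch below is an assumption-laden illustration: `Connector`, `InMemoryConnector`, and `run_erasure_request` are hypothetical names, and the in-memory lists stand in for a real database table and a real SaaS API.

```python
from typing import Protocol


class Connector(Protocol):
    """Anything that can erase one person's records from one system."""
    name: str

    def erase(self, email: str) -> int: ...


class InMemoryConnector:
    """Stand-in for a real integration (a database table, a SaaS API)."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # list of dicts, each with an "email" key

    def erase(self, email):
        before = len(self.records)
        self.records[:] = [r for r in self.records if r["email"] != email]
        return before - len(self.records)


def run_erasure_request(email, connectors):
    """Fan the request out to every connector and report rows removed."""
    return {c.name: c.erase(email) for c in connectors}


crm = InMemoryConnector("crm_api", [{"email": "john@somecompany.com"}])
db = InMemoryConnector(
    "postgres_users",
    [{"email": "john@somecompany.com"}, {"email": "jane@example.com"}],
)
result = run_erasure_request("john@somecompany.com", [crm, db])
print(result)
```

The per-connector counts double as an audit record that the request was actually carried out everywhere.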

45:59 Because like I mentioned before, there are plenty of relatively large and relatively small enterprises

46:04 where there is someone on staff full time waiting for these emails to come in.

46:09 And then they say, okay, this email needs to get deleted.

46:11 And I've done this before.

46:12 So I'm not above this.

46:13 When I was a data engineer, we had to do this as well in Snowflake, right?

46:17 You know, something comes in and they say, okay, I've got this email.

46:20 Now I need to go to these 20 systems and run all these manual scripts and hope that I don't do it in the wrong order.

46:25 Because like you said, if there's like foreign key constraints, you need to know about that.

46:28 Because if you do it in the wrong order, it's going to mess things up.

46:31 So we basically handle that for you based on the metadata.
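
The foreign key failure is easy to reproduce. This sketch uses an in-memory SQLite database with a hypothetical `users` and `orders` schema; other databases report the error differently, but the ordering rule is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "user_id INTEGER NOT NULL REFERENCES users(id))"
)
conn.execute("INSERT INTO users VALUES (1, 'john@somecompany.com')")
conn.execute("INSERT INTO orders VALUES (10, 1)")

# Wrong order: the user row is still referenced by an order.
try:
    conn.execute("DELETE FROM users WHERE id = 1")
    wrong_order_failed = False
except sqlite3.IntegrityError as exc:
    wrong_order_failed = True
    print("wrong order:", exc)

# Right order: remove the referencing rows first, then the user.
conn.execute("DELETE FROM orders WHERE user_id = 1")
conn.execute("DELETE FROM users WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print("users left:", remaining)
```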

46:33 Cannot complete transaction.

46:35 There's a foreign key constraint violation.

46:38 Sorry.

46:38 Correct, right?

46:39 And it's like knowing that and being able to figure that out, that stuff is important.

46:42 So we will handle that.

46:44 And then, like I mentioned with the data mapping, so this is really, really important for compliance professionals.

46:50 Because this is kind of like their bread and butter.

46:51 Like they have to be able to produce these data maps to show compliance with GDPR.

46:56 And that's going to show all the systems, you know, within their application, within their enterprise.

47:01 And then what those systems are doing and with what kind of data.

47:04 And that's really, really important.

47:05 And again, we can generate all this based off of the YAML file.

47:08 And for engineers, the thing that we have is, you know, what we call privacy checks to CI code.

47:14 Or privacy checks as code in CI.

47:16 Where we're shifting privacy left.

47:18 Kind of the same way that we saw with security, right?

47:20 Where you went from, okay, we pushed our application out.

47:23 And now a security team is just going to play with the production version and figure out where there are problems.

47:27 Right.

47:28 We're going to disregard security and give it to you.

47:30 And then you tell us how you broke it.

47:31 Yeah.

47:33 But that's how a lot of companies treat privacy now too.

47:35 Is like, we'll kind of figure it out in production, right?

47:37 Like ship it.

47:38 We'll figure it out in production.

47:39 And now you see, oh, actually there are really great static analysis tools for code, right?

47:46 You have Snyk.

47:47 You have, you know, various other open source versions that were like, hey, let's, we're going to scan your code before this commit even.

47:53 And are like, are you leaking secrets?

47:55 Have you stored anything that maybe shouldn't be?

47:57 And so we're trying to do that for privacy as well, right?

48:00 So we're shifting privacy left and based on this metadata and based on the policies you defined, we can say, hey, you've added this new feature and you've annotated it in YAML.

48:10 And now you're stating that this like system is using, you know, user data for third-party advertising and we're going to fail your CI check.

48:17 We're going to throw an error and say, hey, there's a violation here of this privacy policy that your company has because you define you're using, you know, user data in this way.

48:27 And that's going to break that.
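
A CI check along these lines boils down to comparing declared data uses against a policy and returning a non-zero exit code on a violation. The rule format, the `SYSTEMS` declarations, and the labels below are hypothetical and far simpler than a real policy engine.

```python
# Hypothetical policy: data use and category combinations that are not allowed.
POLICY_RULES = [
    {"data_use": "advertising.third_party", "data_category_prefix": "user"},
]

# Hypothetical system declarations, as an engineer might annotate them.
SYSTEMS = [
    {
        "key": "checkout",
        "data_use": "essential.service",
        "data_categories": ["user.contact.email"],
    },
    {
        "key": "ad_sync",
        "data_use": "advertising.third_party",
        "data_categories": ["user.contact.email"],
    },
]


def find_violations(systems, rules):
    """Return a message for each declared system that breaks a rule."""
    violations = []
    for system in systems:
        for rule in rules:
            if system["data_use"] != rule["data_use"]:
                continue
            for category in system["data_categories"]:
                if category.startswith(rule["data_category_prefix"]):
                    violations.append(
                        f"{system['key']}: {category} used for {system['data_use']}"
                    )
    return violations


def main():
    violations = find_violations(SYSTEMS, POLICY_RULES)
    for message in violations:
        print("POLICY VIOLATION:", message)
    return 1 if violations else 0


# A CI script would pass this value to sys.exit() so the job fails on violations.
exit_code = main()
```

Because the check only reads declarations, it can run on every pull request long before anything reaches production.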

48:28 So again, that's just a way for, to like short circuit that whole thing of, okay, the engineers have shipped it.

48:34 And now someone comes running back and says, hey, hey, why did you ship this?

48:37 You're using, you know, personal data in a way you're not supposed to be.

48:40 We're trying to kind of get around that by saying, you know, pretty early on we'll know what's going on and we can, you know, avoid pushes to main or deploys that don't pass these CI checks.

48:51 Okay. So as an engineer writing software in this sort of guarded by this, it's my, I have to be proactive and state how I'm using data if I'm bringing new data into the system or does it somehow get discovered?

49:05 No. So that is, that is currently what is required is it is up to the engineers to maintain that YAML.

49:11 So we're working on ways to automate that.

49:14 And we actually have automated it for data sets because it's obviously much more programmatic.

49:18 So if you, if you say, hey, here's my application database, here's my Postgres database.

49:23 And I have, you know, I've annotated every field and all that kind of stuff.

49:27 So we can automatically scan that.

49:28 We'd say, hey, in your, in your YAML definition, you've left out these two columns, right?

49:33 Which you maybe you added in this PR, right?

49:35 Right.

49:35 And so before that PR goes in, it's going to remind you, hey, you need to, you need to add these two new columns to your dataset.yml file so that we know what's going on in those new database columns.
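
The dataset scan amounts to diffing the live schema against what has been annotated. The sketch below assumes SQLite for portability (a Postgres version would query `information_schema.columns` instead), and the `ANNOTATED` mapping and `unannotated_columns` helper are hypothetical.

```python
import sqlite3


def unannotated_columns(conn, annotated):
    """Return {table: [columns]} present in the database but not annotated."""
    missing = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        gaps = [c for c in columns if c not in annotated.get(table, set())]
        if gaps:
            missing[table] = gaps
    return missing


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER, email TEXT, phone TEXT, birthday TEXT)"
)

# What the metadata file already covers; phone and birthday were just added.
ANNOTATED = {"users": {"id", "email"}}
missing = unannotated_columns(conn, ANNOTATED)
print(missing)
```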

49:44 Okay. I can see how you might do a lot of, sorry, I can see how you might do a lot of introspection, like, oh, I'm using SQLModel.

49:52 And here's the Pydantic thing describing the table.

49:55 And here's the two new things.

49:56 But then you could also, you know, traverse the usages of that, that those two columns and see where they're used elsewhere.

50:04 And possibly, is there any API call that is like that data is being passed to, for example, and like, oh, is it coming out of exactly?

50:11 All right.

50:12 You might be able to find the common integrations and see what's happening there.

50:15 And yeah, that's, that's exactly what we're, we're looking at next with, like I said, looking for, looking for ways to automate this.

50:20 Right.

50:20 So we, even if we just had a really basic dictionary of like, Hey, these APIs are going to be related to storing or sending, you know, user data.

50:29 Right.

50:29 And making sure that's annotated.

50:30 And like you said, maybe some of those, like those code level annotations we were using.

50:34 Like if you think about, you know, Pydantic, right, where you can have the field object and you can define, okay, here's a field.

50:39 Here's a, here's the default.

50:40 Here's the, here's the description.

50:42 And then if there was like another field that was like, you know, privacy category, data category, whatever data use or data subject, something like that.

50:49 That's absolutely something we've been looking at as well, of kind of like the next step, because we know, again, this comes down to, this is still partially manual and therefore still potentially error prone.
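
The code-level annotation idea might look something like the sketch below. It uses the standard library's `dataclasses` as a stand-in for Pydantic's field object so that it runs without dependencies; the `data_category` metadata key and the category strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class User:
    # "data_category" is a hypothetical metadata key chosen for illustration.
    id: int = field(metadata={"data_category": "system.operations"})
    email: str = field(metadata={"data_category": "user.contact.email"})
    nickname: str = field(default="")  # no privacy annotation yet


def privacy_annotations(model):
    """Return {field name: data category or None} for a dataclass."""
    return {f.name: f.metadata.get("data_category") for f in fields(model)}


annotations = privacy_annotations(User)
# Fields without a category are the ones a scanner would flag.
unannotated = [name for name, category in annotations.items() if category is None]
print(unannotated)
```

Because the annotation lives next to the field definition, a tool could read it directly from the model instead of from a separate file.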

51:00 So as long as we're scanning databases, that gives us some guarantee that, okay, there's probably not going to be an entirely new, you know, thing that we miss out on.

51:09 Yeah.

51:09 But if it's sending to, like, third-party APIs or something, it would still be possible.

51:13 So that's kind of the holy grail we're trying to get to is how do we just make this even easier for developers?

51:17 How do we lower the barrier even further?

51:19 Because we know this is still somewhat of a barrier to entry, but also hopefully still a huge step up from nothing.

51:25 Yeah, no, it's, it's great.

51:27 And, you know, if, if your job is to put together a system that can explain how it's using data and how that's enforced and how you're checking that, then something like this seems way better than code review, but not instead of, but it's certainly a huge bonus.

51:41 Yeah.

51:42 Yeah.

51:42 Correct.

51:43 In addition to.

51:43 Yeah.

51:44 So what we discussed here, this part feels like a GitHub action type of plugin as part of a CI step or something along those lines.

51:53 Mm-hmm.

51:54 What about the other one?

51:55 So for example, the programmatic subject request, that's going to cruise through the data and pull out the things, either show people what they got because they asked for it or delete it.

52:04 Is that like, how does that run?

52:07 Is that something you plug into your app?

52:09 Is that a service that just has access to your data?

52:12 Yeah.

52:12 So that would be something.

52:15 So we actually do have a hosted version of FIDES, right?

52:17 So for companies that don't want to bother kind of hosting their own, you know, a database and web server and things of that nature, we host it.

52:25 But all this is stuff that you could self-host as well, like you can deploy in your own instance.

52:29 So we have something called a privacy center, which again is a thing that you spin up and you run on your side that actually would then link to.

52:37 And that is then what would direct the privacy requests to your backend FIDES web server instance.

52:43 And then that's where you would go and you would go, we call it the admin UI, right?

52:47 Like the admin would log in or the privacy professional would log in.

52:49 They would see, oh, these are all the requests that have come in.

52:51 I can approve these.

52:52 I can deny these, et cetera, et cetera, based on what's going on there.
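
The request flow described above (a request comes in, an admin approves it, and it is then executed against the data) can be sketched in a few lines. This is an in-memory illustration only: the `PrivacyRequest` class, the `execute` function, and the sample datasets are assumptions, and a real deployment would query actual databases.

```python
from dataclasses import dataclass


@dataclass
class PrivacyRequest:
    email: str
    kind: str  # "access" or "erasure"
    status: str = "pending"


# Hypothetical in-memory "datasets" standing in for real databases.
DATASETS = {
    "orders": [{"email": "a@example.com", "total": 30}, {"email": "b@example.com", "total": 12}],
    "newsletter": [{"email": "a@example.com", "subscribed": True}],
}


def execute(request, datasets):
    """Run an approved request: collect the subject's rows, and delete them for erasure."""
    if request.status != "approved":
        raise ValueError("only approved requests can be executed")
    found = {
        name: [row for row in rows if row["email"] == request.email]
        for name, rows in datasets.items()
    }
    if request.kind == "erasure":
        for name in list(datasets):
            datasets[name] = [row for row in datasets[name] if row["email"] != request.email]
    request.status = "complete"
    return found


request = PrivacyRequest(email="a@example.com", kind="access")
request.status = "approved"  # the step an admin would perform in a review UI
report = execute(request, DATASETS)
print(report)
```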

52:55 Yeah.

52:56 So we have the kind of the pre-deployment of the code would be the checks in CI and writing all those YAML files.

53:02 And then once you deploy the application, we have these runtime tools, like you said, these subject requests.

53:07 And that's all stuff you would deploy.

53:09 Most people would do it using Docker containers that we build and publish.

53:13 Or you could just download it and run it directly.

53:16 pip install.

53:17 Yeah.

53:17 Okay.

53:18 Interesting.

53:19 So basically there's a web app that you log into and you can either self-host it or that can be yours.

53:24 And then it goes and does its magic at that point.

53:27 Yeah.

53:28 What about the data mapping bit?

53:30 Yeah.

53:31 And the data mapping is something where you can either log into the web portal, because it's going to be assumed

53:37 that most privacy professionals, most of whom have legal backgrounds, are probably not going to want to mess with a command line tool.

53:43 Although we do have a command line tool, right?

53:44 For the engineers.

53:45 So if you're an engineer, you can of course run a command line, you know, command that will give you the properly formatted, you know, Excel document or CSV if you want it with all of the rows in there that you need.

53:59 Or again, like I said, if you're a privacy professional, you can log into the UI and download it that way.

54:03 And just have it generate it.

54:04 It'll hit an endpoint and then it will just give you the file back.
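
Generating the data map as a CSV file could look roughly like this. The column names and the `export_data_map` helper are illustrative assumptions; they are not the format any regulator or FIDES itself mandates.

```python
import csv
import io

# Hypothetical system declarations; the column names are illustrative only.
SYSTEMS = [
    {"system": "checkout", "data_category": "user.contact.email", "data_use": "order_fulfilment"},
    {"system": "newsletter", "data_category": "user.contact.email", "data_use": "marketing"},
]


def export_data_map(systems):
    """Render the declarations as CSV text, one row per system and data use."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["system", "data_category", "data_use"])
    writer.writeheader()
    writer.writerows(systems)
    return buffer.getvalue()


data_map_csv = export_data_map(SYSTEMS)
print(data_map_csv)
```

The same function could back both a command line command and a download endpoint, since it only returns text.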

54:07 Okay.

54:08 Seems really neat.

54:09 So, again, where you talked a bit about this, like automated checks with looking at the code to try to reduce the need for explicitly stating how things work.

54:18 And you've got this ML higher order piece that will hunt down the private information.

54:24 Where else are you?

54:25 What's the roadmap beyond those things?

54:27 Where is it going?

54:29 Just.

54:30 As much as you can talk about.

54:31 Yeah.

54:32 Yeah.

54:33 I mean, just more and more automation, right?

54:34 Because again, if we look at the ultimate goal is how do we just make privacy easier?

54:39 Because it's not going away.

54:40 It's only going to get more stringent.

54:42 The fines are only going to get bigger.

54:44 I think it's interesting that that kind of GDPR fine list ended in, you know, May, was it May or something?

54:49 March 2022.

54:50 Since then, fines have actually only been getting bigger, right?

54:53 They're basically making fines larger and larger and larger because they realize that tech companies just don't care for the most part.

54:57 And so it's becoming more and more dangerous for people that aren't compliant.

55:00 So, okay, how do we just make this as easy as possible?

55:03 We know that there are people that probably don't want to maintain a bunch of YAML.

55:06 So it's really just anything along those lines of how do we lower the barrier to entry for people that want to be privacy compliant and use FIDES.

55:14 So like you said, that's going to take the form of probably more machine learning models, NLP models that are going to help us introspect the code.

55:22 It could be things like the in-code annotations, right?

55:26 Where, like, okay, maybe there's now, you know, in Pydantic, there's now a field to add privacy information, or maybe in dbt, the analytics tool we talked about before, which I know is used by a lot of companies.

55:36 Maybe we add metadata in there where people can now define what PII is being used in there and then we can just kind of natively read from those files instead of having to have our own file format, things of that nature.

55:45 Really, we're just looking at any possible place and getting feedback from people of where their pain points are and how we can help solve them.

55:52 Sure. Do you have any runtime things that you're thinking about?

55:55 You've got the deploy stuff with the YAML.

55:58 You've got the on request stuff with the other things, but you know, we saw this JSON document come in.

56:04 Yeah, yeah, yeah, yeah.

56:05 It has a thing called email.

56:06 We don't know, you know?

56:07 Yeah.

56:08 So interesting you ask that.

56:09 So we have done some research and some proof of concepts.

56:13 I don't know if you're familiar with something called eBPF.

56:15 No.

56:16 It's eBPF.

56:17 It is a way to use the Linux kernel to actually monitor the traffic going back and forth, basically, over the network, right?

56:29 And so what we've been able to do is if you have an application running in Kubernetes, then we'll deploy something called a system scanner.

56:38 And it is runtime and what it's doing is it'll actually watch the traffic that's happening across your application in Kubernetes.

56:43 And it will come back to you and say, hey, you've got these systems.

56:47 They're talking to each other in this way and basically build a map kind of automatically.

56:51 So this is a really useful tool for you're already running everything and you just want someone to tell you what you have running and kind of build the topography for you and tell you all your systems, tell you all your data sets, tell you what the traffic looks like across your application.

57:04 Then yes, we are working on a tool that can do this.

57:08 Okay. Yeah. This eBPF at eBPF.io looks nuts.

57:12 Dynamically program the Linux kernel for efficient networking, observability, tracing and security.

57:17 And yeah, you could just say things like, these are all the outbound connections, all of our systems are making.

57:23 Exactly.

57:24 And what these IP addresses resolve to. We know this one is MailChimp. We know that one's Zendesk. What's this one?

57:29 Correct. Yeah, exactly. So it's like, okay.

57:31 So we see that we're making a bunch of calls to MailChimp, but you haven't, you've never mentioned in your metadata that you're talking to MailChimp.

57:37 We now know that that's something we need to worry about. Right.
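
The comparison being described (observed outbound traffic versus what the metadata declares) can be sketched as below. The example does not touch eBPF at all: it assumes the destination hosts have already been captured, and the host names, vendor table, and `review_traffic` helper are hypothetical.

```python
# Hypothetical mapping from observed destination hosts to known vendors.
KNOWN_VENDORS = {
    "api.mailchimp.example": "MailChimp",
    "support.zendesk.example": "Zendesk",
}

# Hypothetical vendors the team has declared in its privacy metadata.
DECLARED_VENDORS = {"Zendesk"}


def review_traffic(observed_hosts, known_vendors, declared_vendors):
    """Split observed destinations into undeclared known vendors and unidentified hosts."""
    undeclared, unidentified = set(), set()
    for host in observed_hosts:
        vendor = known_vendors.get(host)
        if vendor is None:
            unidentified.add(host)
        elif vendor not in declared_vendors:
            undeclared.add(vendor)
    return sorted(undeclared), sorted(unidentified)


observed = ["api.mailchimp.example", "support.zendesk.example", "203.0.113.7"]
undeclared, unidentified = review_traffic(observed, KNOWN_VENDORS, DECLARED_VENDORS)
print("Undeclared vendors:", undeclared)
print("Unidentified hosts:", unidentified)
```

Undeclared vendors are the "you never mentioned MailChimp" case; unidentified hosts are the "what's this one?" case that needs a human to look.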

57:39 So this is also something that we have.

57:42 Yeah, this is a little bit of what I was thinking about at runtime stuff.

57:45 Like, yeah, like, can you watch the data flow and go, okay, this is inconsistent with what you've told me is happening.

57:51 Yes, exactly. So you would, the way it works now is you would deploy our system scanner into your environment, into your Kubernetes environment.

57:58 It just, it sits off as its own, you know, set of nodes and it does its thing.

58:02 And then it'll, after a certain amount of time, it'll just come back to you with, hey, this is what we, this is what we saw.

58:06 And it'll build you, it'll actually build the YAML files for you.

58:08 Oh my goodness. Okay.

58:09 And say this is, this is kind of the definition of what we saw.

58:12 Very wild. All right. That's, that's cool. That's cool.

58:16 Well, so we're getting real short on time here. I guess maybe give us just the, the, the quick report or thoughts on how the open core business model's going for you all.

58:28 I, I feel like going through the life cycle of this podcast, just had the eight year anniversary for running the podcast.

58:35 I've talked to a lot of people.

58:36 Oh, congratulations.

58:37 Yeah. Thank you very much. Eight, eight years ago, it was a lot of, ah, we accept donations and, and mostly to me, it looked like what the successful story for open source mostly, at least traditionally at that point had been, I'm a maintainer or major participant of this important library.

58:56 So I got a really good paying job at high end tech company and high end tech company gives me 20% time to work back on it. Right.

59:04 Kind of like I've, I've got a benefactor like employer a little bit and it's moving more and more towards open core and other things.

59:12 So I'd say both GitHub sponsors and open core and then incredibly some like really interesting VC stuff that's happening as well.

59:19 And those are not disjointed necessarily. Right. Anyway, long story short, I think with the open core stuff, people have really hit on something that seems like it's kind of working for sustaining open source.

59:31 Yeah, I think we've seen the similar, very similar thing. I think that we are, it is really hard to walk that line, but kind of where we've come down on it is, you know, with the belief that privacy is a human right, all the tools required to be compliant need to be open source kind of like full stop.

59:48 So, so for instance, like data mapping, that is a legal requirement. We're not going to put that behind the paywall. Handling those privacy requests by building those, those graphs and being able to go execute this thing. That is a requirement. You have to have that to be compliant. We're not going to paywall that.

01:00:02 That's, that's the line that we're trying to ride: okay, anything that makes it super easy, the machine learning stuff, the kind of runtime scanning, all that kind of more advanced stuff

01:00:12 that's more cutting edge, more R&D, then we'll probably put that in the paid offerings. Right. But not anything required to just get things running.

01:00:21 Cause we know we actually have some very, very large companies that are using FIDES purely open source. Like they don't, they have zero contract with us. They don't really have any contact with us, but we know they're using, we know they're using FIDES regularly to do their internal privacy. And that's, I think that's a good place to be. Right.

01:00:36 If, if three years ago when you had to implement GDPR stuff for talkpython.fm and you stumbled across this tool that could just help you do that. And instead of two weeks, it was two or three days.

01:00:46 Like for me, that still feels like a huge win. And there's still enough, there's still enough enterprise customers who have a hundred thousand Postgres tables that want us to help them classify. Right. There's still enough of those people out there to make it sustainable.

01:00:58 And it's also a competitive advantage. I think it's easy for people to just think that, oh, because it's open source, you are, you're kind of, you know, throwing away contracts you would have had otherwise.

01:01:10 We've had a lot of people engage with us because we're open source. Right. And it's not even, it's not even that they don't want to pay. They're just saying, hey, we like your paid offering, but the fact that you also are open source and you have an open core model that we can go look at and contribute to and, and put ideas into.

01:01:24 That's really attractive for a lot of engineering teams. Because again, that's, that is one of our target markets.

01:01:29 Yeah. I think traditionally, looking back quite a ways, there was, I need a company and private commercial software behind this. So I have an SLA and someone to sue when things go wrong.

01:01:39 And that's still the, you might be able to do that. You might be able to crush that company because they did you wrong, but you're still not going to get your, your bug fixed in your software working any better because of it. Right.

01:01:48 Right. And I think people are starting to see open source as the escape hatch, right? Like I can work with a company.

01:01:54 I can work with a company that has this premium offering on top of open source, but if worse comes to worst, we'll just fork the repo and keep on living. You know what I mean?

01:02:01 And that's a way more, I think useful and constructive way than we're going to try to sue them to make them do our deal. Right.

01:02:08 Yeah. Exactly. Yeah. Talking about, you know, privacy engineering as being about risk mitigation, right? Having, having an open source product is also quite a bit of risk mitigation.

01:02:16 You know, it gives engineers time to get comfortable with it. Like you said, they can, they can fork it. They can, you know, we're, we're very open to pull requests and, you know, people opening issues and feature requests and things of that nature.

01:02:26 So it just, it just really makes it a much more pleasant process when you're just more transparent and more open. People can go and look at the code themselves. You know, they can, if they report a bug, they can probably go see it getting fixed in real time, et cetera, things like that.

01:02:40 Yeah. Yeah. I think it's definitely the way that I prefer working as an engineer. You know, I can't speak for management or anything, but it's definitely more fun to be able to engage with the community in that way.

01:02:49 Awesome. Well, I'm happy to see you all making, making your way in the world based on your open source project. That's really cool.

01:02:56 So I think we're about out of time to keep going on the subject. So we'll have to just wrap it up with the final two questions here.

01:03:03 So if you're going to write some code, work on FIDES, or something else Python-based, what editor are you using these days?

01:03:10 It's, it's VS Code. I think, I think I was pretty hardcore Vim until, I think at one point you had someone on your podcast who was evangelizing VS Code. And I went and tried it and I was like, oh, I'm very obviously more productive now. So I've stuck with VS Code ever since then.

01:03:26 That was great. Thank you for, thank you for doing that. Yeah, you bet. And a notable PyPI package?

01:03:32 Yeah, I'm going to give two here pretty quickly. So the first one is Nox. I believe I also learned about it from this podcast. This is, I do talkpython.fm-driven development.

01:03:43 Anyway, so, so, so we were using, you know, Makefiles before, because I do a lot of work on the kind of dev experience, DevOps side. And so I was a big Makefile user. And when I basically heard, hey, it's like Make, but more powerful and in Python, I immediately went and tried it out. And I think I spent like the next few days just completely rewriting our Makefile in Nox.

01:04:04 It is. And it's been great. This has been, it has been so empowering for the other developers. Whereas before, Makefiles are pretty archaic and it was pretty much only myself that would touch it.

01:04:11 Now the other developers feel very comfortable jumping into Nox. And also being cross-platform: I develop on Windows and literally everyone else at the company develops on, on Macs.

01:04:22 So it gives us a much more cross-platform way to handle scripting and building, things like that.

01:04:27 Yeah. Makefiles are quite different over on the Windows side.

01:04:31 Yes. Yes. I learned this the hard way. It was like, you know, very, very often there was, you know, "it works on my machine" type of stuff going on there. And then finally, another notable package is rich-click.

01:04:44 And that is, if people don't know, Rich is a package in Python that makes text output, like terminal output, look very, very nice. And this is a nice wrapper on Click that makes Click look very, very, very nice, because it's wrapped in Rich.

01:04:58 So our CLI uses this. I think it looks great because of it. So I'd also highly recommend that if you're, if you're looking for a more modern-feeling CLI with a lot of, you know, flexibility and formattability.

01:05:11 So it gives you kind of automated colorized help text and usage text and, like, options.

01:05:19 Yeah. Yeah. And you can, and you can customize that. A really, really powerful thing also is that it will, it will understand Markdown in your docstrings. Right.

01:05:27 So if you want to get a little more fancy, it also has kind of its own language that you can use. So yeah, it's, it's been really nice, because I was feeling a little bit jealous.

01:05:38 Our UI looks really nice and our CLI didn't. And so I, I went and wanted to find something that would make the CLI look a little bit nicer for probably engineers.

01:05:44 It seems like, ah, that's kind of superficial. Like, who cares, color for your silly CLI? It makes a big difference. Like the information bandwidth is so much higher.

01:05:55 Yeah. Yeah. It makes, it really does make a huge difference.

01:05:58 Awesome. Well, I think we're going to leave it here, with this, but Thomas, thanks for being on the show. It's been really great to have you here and a lot of fun to talk privacy. Yeah. Good luck with the project. It looks great.

01:06:09 Thank you. Talk to you later. Yeah. Yeah. See you later.

01:06:12 Thank you.

01:06:13 This has been another episode of Talk Python to Me. Thank you to our sponsors. Be sure to check out what they're offering. It really helps support the show. Don't miss out on the opportunity to level up your startup game with Microsoft for Startups Founders Hub.

01:06:25 Get over six figures in benefits, including Azure credits and access to OpenAI APIs. Apply now at talkpython.fm/foundershub. Take some stress out of your life. Get notified immediately about errors and performance issues in your web or mobile applications with Sentry. Just visit talkpython.fm/sentry and get started for free. And be sure to use the promo code talkpython, all one word.

01:06:49 Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async. And best of all, there's not a subscription in sight. Check it out for yourself at training.talkpython.fm.

01:07:06 Be sure to subscribe to the show. Open your favorite podcast app and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:07:19 We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.

01:07:39 I'll see you next time.

01:07:58 Thank you.
