Data Science Year in Review 2018 Edition

Episode #193, published Mon, Dec 31, 2018, recorded Wed, Nov 21, 2018

This year, 2018, is the year that the number of data scientists doing Python equals the number of web developers doing Python. That's why I've invited Jonathon Morgan to join me to count down the top 10 stories in the data science space.

You'll find many accessible and interesting stories mixed in with a bunch of laughs. We hope you enjoyed it as much as we did.

Episode Deep Dive

Guest introduction and background

Jonathon Morgan is a seasoned data scientist and the founder of the company New Knowledge, specializing in disinformation defense. He’s also deeply involved in Data for Democracy, a nonprofit collective of thousands of technologists passionate about social impact and ethical data practices. Previously a host of the "Partially Derivative" podcast, Jonathon brings a unique blend of startup experience, research-driven data science, and community building to the discussion.

What to Know If You're New to Python

If you’re unfamiliar with Python but eager to dive into data science topics, you’ll want a handle on the basics: code structure, package management, and simple data manipulation. Below are resources mentioned or implied in this conversation to help you get up to speed:

Python for Absolute Beginners: A thorough introduction to Python, perfect for starting from zero.
Familiarity with Jupyter notebooks can help you follow the code-driven research examples discussed.
Understanding simple machine learning workflows (training data, making predictions) can contextualize stories around AI/ML from the episode.

Key points and takeaways

2018: Data Science on Par with Web Dev in Python
More data scientists now use Python than ever, matching or even exceeding Python’s popularity among web developers. This shift highlights Python’s expanding ecosystem of numerical, data manipulation, and machine learning libraries. The episode acknowledges how tools like Jupyter, pandas, and scikit-learn have made Python a top choice for data-intensive work.
- Links and tools:
  - Jupyter.org
  - pandas.pydata.org
AI Babysitter Screening Services
Some companies, such as Predictim, attempted to use “AI” to scan prospective babysitters’ social media for harassment, bullying, or other “risky” attributes. While well-intentioned in theory, these screening algorithms raise concerns about transparency, bias, and fairness. The discussion underscores the importance of understanding the limits of “AI” and being aware of unintended consequences - like rejecting great babysitters for trivial word choices online.
- Links and tools:
  - predictim.com (Referenced for context around AI babysitter screening)
Airline Algorithms Splitting Families
Certain airlines allegedly used revenue-optimizing seat algorithms to separate families across a plane, charging extra to sit together. The guest and host highlight how algorithmic decisions, aimed at profit, can harm customer experience. The newly formed UK Center for Data Ethics took notice, underscoring rising scrutiny over opaque data-driven systems.
- Links and tools:
  - Gov.uk (UK government site, referenced for data ethics in the conversation)
Ethics Pledge and Data for Democracy
Jonathon introduced a Data for Democracy project establishing an ethics pledge for data practitioners to adopt in their daily work. This pledge focuses on principles of fairness, openness, and transparency, offering a framework for responsible AI and data practices. Signing the pledge at Data for Democracy can help practitioners advocate for ethical standards within their organizations.
- Links and tools:
  - datafordemocracy.org
Scientific Papers vs. Interactive Notebooks
Traditional academic PDFs are giving way to interactive, reproducible research using Jupyter notebooks. The discussion shows how open-source tools let researchers showcase code, data, and methodology side by side, improving the transparency and repeatability of scientific work. This shift fosters quicker, more collaborative advancements across fields like data science and economics.
- Links and tools:
  - Jupyter.org
  - arxiv.org (mentioned as hosting open-access papers ahead of peer review)
Nobel Prize in Economics with Jupyter
Paul Romer’s Nobel Prize-winning research was partly done in Jupyter notebooks, illustrating how open technologies boost accessibility and reproducibility. He transitioned from proprietary tools like Mathematica to Python-based platforms, making it easier for peers to validate and build on his work. This demonstrates open source’s growing role in major scientific breakthroughs.
- Links and tools:
  - Nobelprize.org
  - Jupyter.org
Waze Data Reducing Traffic Accidents
Waze’s real-time traffic data offers a window into driver behavior across cities. By analyzing this aggregated data, city officials have predicted crash hotspots and preemptively deployed resources, reducing accidents by an estimated 20%. This case highlights data science as a positive force when used collaboratively for public good.
- Links and tools:
  - waze.com
Machine Learning for Voter Registration Integrity
Data scientist Jeff Jonas applied AI to match and clean voter rolls across multiple states, identifying millions of eligible but unregistered voters. The aim was increasing transparency and accuracy without tipping into partisan controversy. It’s another reminder of machine learning’s power to tackle large, messy, and socially important datasets.
- Links and tools:
  - erielections.org (example of local data, not specifically used but relevant to election data)
  - Jeff Jonas’s Twitter (Referenced in the conversation)
AI-Assisted Breast Cancer Detection
Using computer vision models to spot subtle cancer indicators can drastically improve early detection rates. Google’s experiments showed AI models catching over 99% of certain early-stage cancers, where human specialists might only hit 38% in difficult cases. Integrating such AI systems as “medical second opinions” may save lives and reduce review times for pathologists.
- Links and tools:
  - ai.google
  - TensorFlow.org
China’s Social Credit System
China’s extensive surveillance infrastructure assigns “social credit” scores that reward good behavior (like volunteering) and punish infractions (like traffic violations). Over 11 million flights and 4 million high-speed train trips were reportedly denied to low scorers. This Black Mirror-like scenario underscores how AI can reinforce societal norms - or create new tensions - when used at scale.
- Links and tools:
  - reuters.com (news coverage on social credit system)
  - bbc.com (additional references)

Google Dataset Search
As data needs increase, Google launched a dataset-specific search tool to help researchers and analysts find relevant sources. It crawls countless websites, identifying structured data and making it readily discoverable via semantic understanding. This is a significant step for the data community, which often struggles to locate curated, trustworthy information.
- Links and tools:
  - toolbox.google.com/datasetsearch

Interesting quotes and stories

"It really matters what's behind the curtain with data science work. You can't just rely on the final result." - Jonathon Morgan

"AI is essentially a computer making a decision for me - it might just be an if-statement behind the scenes." - Host’s reflection

"We want an authentic process, where as many people who can vote do vote. And we want to do it responsibly." - Jonathon Morgan

Key definitions and terms

Data Science: The practice of extracting insights and patterns from structured or unstructured data using statistics, machine learning, and domain expertise.
Machine Learning: A method of data analysis that automates analytical model building, allowing computers to learn and adapt from data without explicit programming.
Social Credit System: A national regulatory framework in China where individuals’ behaviors are rated, and privileges can be granted or revoked based on cumulative scores.
Jupyter Notebook: An open-source web application that allows you to create and share documents containing live code, equations, visualizations, and explanatory text.

Learning resources

Python for Absolute Beginners: Ideal if you’re starting your Python journey from scratch.
Data Science Jumpstart with 10 Projects: Get hands-on experience with practical data science tasks.
Python Data Visualization: Learn to communicate findings visually and effectively, a key skill for any data scientist.
Move from Excel to Python with Pandas: Great for analysts transitioning from spreadsheet work to more powerful Python-based data workflows.

Overall takeaway

We’re witnessing Python’s extraordinary rise as a data science powerhouse - surpassing its own success in web development. From AI babysitter screening to Nobel Prize research in Jupyter, these stories underscore both the promise and perils of data-driven decisions. When leveraged ethically, machine learning and open tools can improve society - whether by reducing traffic accidents, boosting voter registration accuracy, or spotting early-stage breast cancer. The opportunity and responsibility rest with each of us to shape these powerful technologies for the greater good.

Links from the show

Show guest: Jonathon Morgan: @jonathonmorgan

Top Data Science Stories of 2018

AI Finds the Perfect Babysitter: washingtonpost.com

The Scientific Paper Is Obsolete: theatlantic.com

Algorithm intentionally splits up families who are flying together: independent.co.uk

Data for Democracy launches ethical principles for data practitioners: datafordemocracy.org/pledge

This year’s Nobel Prize in economics was awarded to a Python convert: qz.com

AI platform, fed by Waze data, predicts accidents, reduces crashes by 20%: zdnet.com

AI finds millions of unregistered voters: nytimes.com

Google AI better than doctors at detecting breast cancer: sciencefocus.com

China’s new “social credit” system will go live by 2020: bloomberg.com

Google launches Data Set Search: toolbox.google.com/datasetsearch
Episode #193 deep-dive: talkpython.fm/193
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript

00:00 This year, 2018, is the year that the number of data scientists doing Python

00:04 equals, maybe even exceeds, the number of web developers doing Python.

00:09 That's why I've invited Jonathon Morgan to join me to count down the top 10 stories

00:13 in the data science space.

00:14 You'll find many accessible and interesting stories mixed in with a bunch of laughs.

00:19 We hope you enjoyed it as much as we did.

00:22 This is Talk Python To Me, recorded November 25th, 2018.

00:26 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the

00:44 ecosystem, and the personalities.

00:46 This is your host, Michael Kennedy.

00:48 Follow me on Twitter, where I'm @mkennedy.

00:50 Keep up with the show and listen to past episodes at talkpython.fm.

00:54 And follow the show on Twitter via @talkpython.

00:56 Jonathon, welcome back to Talk Python.

00:58 Hey, Michael.

00:59 Thank you so much.

01:00 It's super awesome to be back.

01:01 And it is great to have you back.

01:02 Where have you been?

01:03 I've been like a thousand places.

01:04 I have not been podcasting.

01:06 I feel a little bit guilty about it.

01:08 Very much miss the Partially Derivative audience.

01:11 Very much miss being on the show.

01:13 I don't think we did this last year, so I'm super pumped to be doing this again.

01:16 But it's because I've been doing two things.

01:20 I have a company, New Knowledge.

01:22 We're focused on disinformation defense, so a lot of data science.

01:26 Ultimately, we like to think that we're protecting public discourse, improving democracy, big

01:30 things like that.

01:31 Not at all pretentious.

01:32 That is a really awesome goal.

01:34 And I suspect you've probably been busy, right?

01:36 I'm just saying.

01:37 It's a lot.

01:38 It's a lot.

01:38 I mean, we've perhaps bitten off more than we can chew.

01:41 It's like every month we expect that to be fading from public consciousness.

01:45 Like, all right, this is the month when people are going to get tired of talking about

01:49 online manipulation and Facebook and Twitter and everything.

01:51 And then every month, it's like it gets worse, which is, you know, it's like it's a thing.

01:55 But it's great.

01:56 And then there's also Data for Democracy, which is the nonprofit that I work on as well.

01:59 And that's a community of about now 4,000 technologists and data scientists who are working

02:04 on social impact projects.

02:05 So also kind of mission aligned.

02:07 You know, democracy is cool.

02:08 But yeah, between those two things, I haven't had as much time to podcast as I would have

02:13 liked.

02:14 But I'm glad, you know, back on the saddle.

02:16 Well, I'll say if you're going to hang up your headphones and your microphone, you've hung

02:19 them up for pretty good reasons.

02:20 Like, those are pretty awesome projects.

02:22 Thanks, man.

02:23 Thanks.

02:23 Yeah, it's exciting times.

02:24 Lots of lots of good stuff to work on.

02:26 Yeah.

02:26 Well, you know what else is exciting?

02:27 I would say data science is as popular as ever, wouldn't you?

02:30 I agree.

02:31 I feel like data science is coming into its own a little bit.

02:33 It's actually it's been interesting to see some of the transition towards just some more

02:39 like established workflow driven team based data science, like a lot of things that I

02:44 think software engineers are probably super familiar with and comfortable with that were

02:48 still pretty nascent the last time that we talked even a couple of years ago.

02:51 So, yeah, it seems it might be here to stay.

02:54 I don't know.

02:54 I'm not going to call it, but I think it's possible that it will be here for a while.

02:57 I definitely think it's a thing.

02:58 You know, it's starting to show up as like full on courses at Berkeley and things like that,

03:03 which is pretty awesome.

03:04 We'll come back to that.

03:05 But I found a cool definition of a data scientist, an individual who does data science.

03:10 And I thought I'd throw that out there and just see what you thought in light of what

03:13 we're about to talk to.

03:14 So this guy named Josh Wills, I don't know him, but Twitter knows him.

03:17 And he said a data scientist is defined.

03:19 He is.

03:21 At least this tweet is.

03:22 He said that a data scientist is defined as a person who is better at statistics than

03:26 any software engineer and better at software engineering than any statistician.

03:30 What would you say?

03:31 Just like burn left and right.

03:34 I think those are both good.

03:36 I think it's a positive representation of a data scientist.

03:39 I think that that's actually true.

03:40 I don't think that Josh is trying to demean anybody with that tweet, although it kind of

03:45 sounds like it, you know, like take that statisticians.

03:49 My software engineering skills totally blow yours out of the water.

03:51 But I think that's right, because it is like this weird hybrid where the you're producing

03:56 software.

03:57 Ultimately, I think software engineers might disagree with that.

04:01 The data scientists are producing software.

04:03 And that's fair.

04:04 That's fair.

04:04 You can debate whether a notebook is a piece of software.

04:06 I think it's a very interesting debate.

04:08 That's a good point.

04:09 And actually, it's interesting how much people are trying to like turn notebooks into runtime,

04:14 like actual production execution environments.

04:17 But that's probably a whole we could go down that rabbit hole all day.

04:20 But it's true.

04:21 I think that's actually that really captures it.

04:23 It gets that blend right in the middle of the Venn diagram between stats and software engineering.

04:28 Yeah, it's cool.

04:28 Yeah.

04:29 So with that, that very precise definition, what we have done is, is mostly you, Jonathon,

04:36 have gathered up a bunch of topics that represent some of the bigger pieces of news from 2018.

04:43 And we're going to go through them for all the data science fans out there.

04:46 Yes.

04:46 I mean, it was a joint effort.

04:47 It was a collaboration.

04:49 I feel like we've touched on some important things here in our list.

04:51 I don't know if you were thinking along these lines, but I had a little bit of a theme with

04:56 a lot of the stories that I was choosing.

04:58 There's kind of a AI may be coming to kill us all or save us all.

05:02 It's like, it's one of the right now.

05:04 It's like, it's like dark times for machine learning.

05:07 It's very interesting.

05:08 It might accidentally kill us while trying to save us.

05:11 Right.

05:11 It's like really well intentioned.

05:13 Like, oh, I can see what you're trying to do there.

05:14 I mean, I'm dead, but I can, I can see.

05:18 You know, you had my best interest in mind.

05:20 All right.

05:21 Thanks for thinking of me, man.

05:22 Right.

05:22 Good try.

05:24 Good try.

05:24 A for effort.

05:25 That's right.

05:26 So would you say that the AIs need a babysitter or maybe the other way around?

05:29 Whoa, look at that.

05:31 That was like smooth like butter.

05:32 Speaking of babysitters, I'm not sure how many people have seen this.

05:37 It was kind of an odd little story, but I wanted to, it's like perfectly encapsulates

05:42 well intentioned, but perhaps unforeseen consequences, air quotes AI.

05:47 So a software company called Predictim thought to themselves, you know, it's really tough

05:54 to find a babysitter.

05:55 And it's true.

05:56 You know, you're a parent.

05:57 I'm a parent.

05:58 When you're trying to find somebody to watch your kids, you're like, well, maybe my friends

06:02 use somebody who worked out or maybe I use some online service and they do background checks

06:07 or whatever.

06:07 But it's tough to feel comfortable and confident that the person who's going to come into your

06:12 home and be responsible for your child is a good person, or at least not somebody who

06:16 will put them in danger.

06:17 Like somewhere in that like sweet spot.

06:19 Especially when the baby, when it's a baby, when the baby doesn't speak, it can't report

06:24 to you.

06:25 Yeah.

06:25 The babysitter beat me or the boyfriend came.

06:27 Like it's, you know, you don't even write.

06:30 It's just a baby.

06:30 It doesn't know.

06:31 Exactly.

06:31 It can't, it can't say.

06:32 Right.

06:33 So how do you know?

06:34 Well, the old fashioned way is to, is to use some of the social signals that I mentioned

06:39 before.

06:39 But the new AI way is to use social signals from social media.

06:45 So this gets into kind of creepy territory, I think.

06:49 So, but basically parents have started to turn to this application.

06:52 And, and what, what the system does is that it crawls the social media history of potential

06:57 babysitters and it ranks them on like a risk scale.

07:02 It gives them a risk rating.

07:03 And so it risks them on all sorts of, or it rates them on all sorts of things like

07:07 drug abuse, bullying, harassment, disrespectful, bad attitude, all sorts of things that I guess

07:13 you could get from social media.

07:15 Although I think for most of us, like we would get a five for all of those things because like

07:19 that's just how social media works.

07:20 Like that's what we want out of it, you know?

07:22 But nevertheless, like relatively speaking, I guess you get some kind of rating.

07:26 So it's caused a little bit of controversy for reasons that you might expect.

07:30 Like how would you classify?

07:32 So I should take a step back and say, I'm not sure if most of your listeners will

07:36 be familiar with how an AI system might even go about determining these things.

07:40 So how would I read your social media content and make a judgment that you were at risk of

07:46 bullying or harassment or disrespectful, but there's no way of knowing because the system

07:50 doesn't explain.

07:50 So it might be something really simple.

07:53 Like it just looks for quote unquote bullying words, like some keywords that they pulled out

07:58 of a dictionary that are related to bullying.

08:00 So really simple.

08:01 Or it might be that it's trained a machine learning classifier that it's somehow got a hold of

08:07 a bunch of example, bullying tweets or bullying Facebook posts or whatever it is.

08:11 And it said, Oh, I can now recognize bullying content versus non bullying content.

08:16 And it's trying to use that as a rating system.

08:18 Who knows?

08:18 Nobody knows.

08:18 But the same, the idea basically is that it's scoring any potential babysitters and it's

08:23 giving these to parents on a scale of one to five.
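
[Editor's sketch] The simpler of the two approaches described above, plain keyword matching, might look something like the Python below. Everything in it is an assumption made for illustration: the word list, the function names `keyword_hits` and `risk_score`, and the cut points for the 1-to-5 scale are hypothetical, and nothing in the episode indicates that Predictim actually works this way.

```python
import re

# Hypothetical keyword list (an assumption for this sketch, not any vendor's real list).
BULLYING_WORDS = {"stupid", "loser", "hate", "idiot"}


def keyword_hits(post: str) -> int:
    """Count the words in a post that appear in the keyword list."""
    words = re.findall(r"[a-z']+", post.lower())
    return sum(1 for word in words if word in BULLYING_WORDS)


def risk_score(posts: list[str]) -> int:
    """Map the share of flagged posts onto an arbitrary 1-5 scale."""
    if not posts:
        return 1
    flagged = sum(1 for post in posts if keyword_hits(post) > 0)
    share = flagged / len(posts)
    # Arbitrary cut points (assumption): no flagged posts -> 1, all flagged -> 5.
    return 1 + min(4, int(share * 5))


sample_posts = [
    "I hate this TV show, the finale was stupid",
    "Great day at the park with the kids",
    "Looking for weekend babysitting jobs",
    "Happy birthday, Mom!",
]
print(risk_score(sample_posts))  # prints 2
```

In this toy version, a single grumpy post about a TV show is enough to move an otherwise unremarkable feed from a 1 to a 2. That is the kind of false positive raised later in the conversation, and it shows why the gap between neighboring scores is hard to interpret without knowing the rules behind them.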

08:26 So somehow there's a difference between a risk assessment of one on the bullying scale, a risk assessment

08:34 of two on the bullying scale.

08:35 And so as parents, we'll have to decide in this kind of like arbitrary scale, am I comfortable

08:40 with a disrespectful score of three, but a bad attitude score of one?

08:45 I'm not really sure.

08:46 What are you teaching my child?

08:47 What kind of disrespect are you teaching my child?

08:50 But AI system has warned me that based on your social medias that maybe, you know, you're not

08:54 like a voice god or whatever it is.

08:56 So in any case, it's like it kind of gets at this idea that like, I think there's this

09:00 dream that we can look at people's digital exhaust on the internet, what they say on social media,

09:05 how they spend their money, where the places that they've been and get some kind of picture

09:10 about who they are as a person.

09:12 So that's the big leap.

09:13 Like you could probably make a guess about how people will behave on social media based

09:18 on how they behave on social media.

09:20 Or you could probably get a sense of what people are likely to buy in the future based on what

09:23 they purchased in the past.

09:24 But that like leap to say, now I know something about you as a person, like how you'll behave

09:30 in other environments where I've never observed you before.

09:32 That's what this application is doing.

09:34 And I think it's like a generally a trend in AI.

09:36 And I'm not sure anybody believes that's actually possible, which is where it's kind of tricky.

09:43 Like, should we use these types of tools and hiring and recruiting and other types of assessments

09:47 that we're making about really, you know, delicate, sensitive things like babysitters or maybe

09:52 less delicate and sensitive things?

09:54 Like, should you be my tax accountant if you get a four out of five for bullying on social

09:59 media?

09:59 Like, I don't know.

10:00 Maybe it's a good thing.

10:01 Maybe you want an aggressive tax accountant.

10:02 I don't know.

10:02 Exactly.

10:03 Exactly.

10:03 But what is a four, relatively speaking, to a three for bullying?

10:07 Like, I don't know.

10:08 Could you give me something on like a, you know, who's the most harassing person on social

10:13 media?

10:13 And we'll just call them a five.

10:14 And then everybody else is ranked on that person's scale.

10:16 I'm like, I'm trying not to call out anybody in particular.

10:18 Like, I'm really, really trying not to call out anybody in particular.

10:22 But I mean, I don't know.

10:24 So anyway, that's the debate that this has stirred.

10:28 So like, Predictim.

10:28 I'm sure.

10:29 Well-intentioned.

10:30 I'd love, you know, to know more about it.

10:33 It sounds so like Orwellian and creepy.

10:36 But let me read you just the quote from one of the founders of people that works there.

10:40 Because it sounds so like something you would want.

10:43 It says, if you search, this is like one of the people from Predictim speaking.

10:47 It says, if you search for abusive babysitters on Google, you'll see hundreds of results right

10:51 now.

10:51 There are people out there who either have mental illness or just born evil.

10:55 Our goal is to do anything we can to stop them.

10:57 And when you think of that description and your brand new baby alone, like, you really

11:03 don't want to put them together.

11:05 So it sounds so good.

11:06 But it's also got this really dark side, right?

11:09 It does.

11:10 Yeah, I agree.

11:11 Like, I think that the Predictim folks are pretty well-intentioned.

11:15 But it does highlight what is the unintended consequences of maybe giving too much weight to the output

11:24 of a data science model.

11:26 Like, just because we can package it in a data science workflow, just because it kind

11:31 of walks and talks like something that is algorithmically decided and therefore objective, like, it's

11:37 not necessarily objective at all.

11:39 Like, it's just as easy for me to encode my subjective bias into a machine learning model

11:43 as it is for me to act on that bias in real life.

11:45 So it's difficult for people who aren't familiar with data science to, I think, recognize that.

11:50 And it has these really strange implications.

11:52 What about all those babies?

11:53 Not the babysitters who are actually abusive.

11:56 Like, for sure, let's get rid of those.

11:57 But like, but what about the, you know, what about the people who have, who are otherwise excellent

12:03 babysitters, but at one point, you know, said something mean about a TV show they didn't like on social media, and their

12:11 bullying ranking went through the roof.

12:12 And so now they can't get jobs as a babysitter anymore.

12:15 Like, what are we, I think we need to think about how we're being transparent about the algorithms that we're using to make these sorts of choices.

12:22 And, and of course, that's a difficult thing for those developing those algorithms, because you want to, you know, it's your, it's your, it's your sort of

12:29 special sauce, your, you know, your secret sauce for your product.

12:32 So there's a real tension there.

12:33 And I'm not sure we're getting it right just yet.

12:35 Yeah, it's, it's pretty crazy.

12:36 I guess a few parting thoughts is like, okay, for babysitters, right?

12:40 That's usually a part time small thing for most folks.

12:42 It's not right.

12:43 If your babysitting career goes a little sideways, you could do something else.

12:47 But a lot of this is applied to all sorts of jobs.

12:50 Like they talked about a company called Fama that uses AI to police all of the workers at like these companies or how Amazon actually canceled what became a clearly biased algorithm for hiring across all of Amazon.

13:04 Then it starts to have more serious social effects, right?

13:07 Yeah.

13:08 And I think this is coming at a moment when it's like a reckoning for Silicon Valley and technology and machine learning in general.

13:16 We've been, I think, almost naively assuming that everybody who uses our technology will use it with the best intentions or the intentions that we had when we developed it.

13:26 And I think the theme of 2018 is like actually not.

13:31 It's actually possible for things to go horribly wrong in ways that you didn't intend.

13:37 And I think it's good.

13:38 I think it's a good time to have this debate about how close are we actually to encoding our real life values into our digital technologies.

13:45 Like maybe, maybe we're just not that good at it yet.

13:48 Yeah.

13:48 It's, we're definitely new on it.

13:50 Yeah.

13:50 And I feel like the goodwill tours.

13:52 Exactly.

13:54 I feel like a lot of the goodwill has kind of left that, that space a little bit.

13:57 So we've got to be careful.

13:58 Something that is not new was invented in the 1600s and the Renaissance, right?

14:05 Like we're talking the scientific paper.

14:07 What?

14:08 You know, Kepler's orbiting of the planets and all those things, right?

14:13 Written up.

14:14 So actually the scientific paper turned out to be a big invention.

14:18 It was, it used to be you have to write books or you just wouldn't write it down your, your inventions.

14:22 Or they'd be like private papers, like Einstein wrote to Niels Bohr or something like, we just have to run across those papers, right?

14:29 But it turns out that the scientific paper has been around and basically unchanged for like 400 years.

14:36 I would say science has changed and certainly the dependence upon computing has changed, right?

14:42 Yeah, absolutely.

14:43 I mean, I'm, I'm not in most of the like quote unquote hard sciences and even in social science.

14:48 I think almost the majority of the work is, I have no basis of that.

14:53 That's probably fake news, but, but it does, you know, a lot of it is data driven.

14:57 It requires some type of computation for sure.

14:59 And difficult to reproduce papers without having access to that data and computation.

15:03 Right.

15:03 And a lot of times the computations were not published, just the results.

15:07 But I think that's starting to give way.

15:09 And one of the articles, this was published in the Atlantic and it's, it's a serious research project really.

15:16 It's called "The Scientific Paper Is Obsolete."

15:19 Okay.

15:20 That's a big statement.

15:21 That is a big statement, right?

15:23 And the graphic is super intense, right?

15:26 On that homepage.

15:27 Yes.

15:28 So it's, it's this sort of traditional scientific paper with lots of, there's probably 15 authors and all sorts of, you know, stuff on there.

15:37 And it's literally on fire.

15:39 And it turns out to be a really interesting historical analysis, like sort of a storytelling of how do we go from scientific paper to closed source software with egomaniac sort of folks leading it like Mathematica to Python and open source and Jupyter and all that.

15:57 Right.

15:57 Yeah.

15:57 And I mean, I think it's interesting to see even the shift away from the traditional publishing model for anybody who's gotten into any type of research before that.

16:08 It used to be that research only happened primarily at academic institutions and everything goes through peer reviewed journals.

16:14 But almost like in the tradition of open source.

16:19 It's like a lot of people posting on something called arXiv.

16:22 So they'll write like in the style of a peer reviewed paper, but before it's peer reviewed, because partially because the technology changes so quickly, but also because they want to be open and transparent about their work.

16:32 They are uploading basically to this website called arXiv where you can search academic papers prior to them being published in some type of peer reviewed journal or in a conference or whatever, which is super interesting.

16:43 And I think it gives people a lot more access to a lot more techniques.

16:46 It feels kind of like posting your code to GitHub.

16:48 And this is anecdotal, but at least what I find is that the code that sits behind some of these papers is usually available on GitHub.

16:55 Like the same types of authors who post to arXiv, I find are the ones that also say, oh, by the way, here's the GitHub repository so you can go run it yourself.

17:02 And like, here's the sample data set that I used, which really gets at this idea that there's something to be said about reproducing these findings, especially for these, like, complicated.

17:17 And I mean, I'm thinking mostly from the perspective of new machine learning developments, but these like these complicated new modeling techniques have to be replicable to be usable.
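
One small, concrete piece of making a modeling result replicable is pinning down the randomness so that anyone who runs the shared code on the shared sample data gets the same split. The following is a minimal sketch under stated assumptions, not anything taken from the papers discussed: the function name `train_test_split`, the seed value, and the toy data are all hypothetical, and only the Python standard library is used.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of `rows` with a fixed seed and split it in two.

    Hypothetical helper for illustration: the same seed and the same
    input always produce the same train/test partition.
    """
    rng = random.Random(seed)      # private generator, independent of global state
    shuffled = rows[:]             # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


data = list(range(20))             # stand-in for a shared sample data set
first = train_test_split(data)
second = train_test_split(data)
print(first == second)             # both runs yield the identical partition
```

Publishing the seed alongside the code and data is what lets a reader rerun the experiment and land on the same numbers rather than merely similar ones.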

17:24 And in the same way that like your open source project is kind of only as good as it is usable.

17:29 So you might have the best new like JavaScript MVC framework.

17:34 Those probably aren't cool anymore.

17:35 But like when I was a software developer, everybody, blah, blah, blah.

17:40 But like, whatever, you might have the coolest new like JavaScript app.

17:42 But like if nobody uses it, it doesn't really matter how good it is.

17:45 Like the ones that ultimately the community rallies around are the ones that are the most usable, the most accessible, the most transparent.

17:49 And I think it's interesting to see that creeping into research as well.

17:53 So, I mean, I really dig the idea that perhaps it's just a model that needs to change.

17:59 Yeah.

17:59 And they talk a lot about this. It's a pretty deep research piece here.

18:03 It's quite the article.

18:03 It's not just a few pages.

18:05 And it really digs into the history of Mathematica, how it became really important for computation

18:10 and research.

18:12 But it just didn't really get accepted.

18:13 And then along comes Perez and Granger and those folks with their IPython and their open source and their not centralized way of doing things.

18:25 And it's just really interesting how that's become embraced by science and data science.

18:29 And I think how it's actually influencing science.

18:32 Like I hear all these scientists who I speak to talking about embracing open source principles and styles and more engineering.

18:39 And I think all of that is being brought to them from this.

18:41 Oh, yeah.

18:41 And I think it's even being embedded in the way that students are educated now, which is totally different.

18:48 I mean, I think in the article they talk about one of the authors of one of the open source projects, open source Python projects, like has a faculty appointment in the stats department at Berkeley now.

18:56 And I know that Brian Granger's work and his team focused on Jupyter Notebook is I think they're also based at Berkeley or not at Stanford.

19:06 But I mean, they're embedded inside a university department and they do some teaching as well.

19:10 And they're starting to teach courses that are entirely based around this like open source, like open source science, open science workflow.

19:16 So like Python as a programming language and then Jupyter Notebooks, which is if you're not familiar with Jupyter Notebooks, it's basically a way to like execute snippets of code sequentially, but in a web based environment.

19:27 So, you know, you can write a little bit of code, you can run it, you can see what the output is, and then you can build on that.

19:32 And then you can share your notebooks as you go.

19:34 So in the same way that like it basically captures the rough draft process of finding your way towards a data science solution or really any type of programming solution.
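
To make the run-a-bit, look, build-on-it loop concrete, here is a minimal sketch of what three notebook cells might look like. The `# %%` comments follow the "percent" cell convention that tools such as Jupytext and some editors recognize; the temperature values are made up for illustration.

```python
# %% Cell 1: load a tiny, made-up data set (hypothetical values)
temperatures_c = [12.5, 14.0, 13.2, 15.8, 16.1]

# %% Cell 2: run a quick summary and look at it before going further
import statistics

mean_c = statistics.mean(temperatures_c)
print(f"mean: {mean_c:.2f} C")

# %% Cell 3: build on the previous cell's result
deviations = [round(t - mean_c, 2) for t in temperatures_c]
print(deviations)
```

Each cell keeps its output next to the code that produced it, which is what lets a shared notebook double as a record of the rough-draft process.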

19:44 But often they're used by data scientists.

19:46 And these sorts of tools, I think, have made it possible to share like your entire thought process, which is a really important part, like kind of like showing your work to getting to the results that you need to, which I think is maybe more specific to data science and machine learning than it is to most types of software engineering.

20:01 Because like in software engineering, it works or it doesn't.

20:04 And it works fast enough.

20:05 Then like, all right, dope.

20:07 Like, let's move on.

20:07 Like, check that box and move on to the next task.

20:09 Push the button.

20:10 It did the thing.

20:11 We're good.

20:11 All right.

20:12 Fantastic.

20:12 I'm going to close that Jira ticket and march right along.

20:15 That is true to a certain extent in data science.

20:17 Like, you know, your code runs or it doesn't.

20:19 But often we're trying to evaluate like what's the quality of the findings or like what's the quality of the predictions that you're making and what tradeoffs is your model making.

20:26 And like all these like more like fine grained decisions.

20:29 It really matters like what's behind the curtain with most of the data science work and most of these like most of these research papers.

20:36 And so I think that's why these tools have basically like figured out how to capture all of that in a way that makes it really useful.

20:43 Like really usable, really easy to share, really open and transparent, which is why I think they've caught on.

20:48 Like they've caught on because they're usable and they have this great byproduct, like this great knock on effect that they make all of our work more transparent.

20:54 I see a lot of promise and I definitely see things going this way.

20:57 I think it's really good.

20:58 One of the things, as you were describing that to me, that really resonates is I feel like a lot of science is dull and boring to people who are not super into it because it's been massively sterilized down to its essence.

21:11 All right.

21:11 Here's the formula.

21:12 You plug this in and you get the volume of gas by temperature or whatever.

21:17 Right.

21:17 You're like, well, who cares about that?

21:18 That's so boring.

21:19 Right.

21:19 But if the story is like embedded with it and the thinking process that led up to it's so interesting, like one of the most interesting classes I ever had was this combination of sort of the history of mathematics and something called real analysis for the math people out there where we basically recreate calculus.

21:35 But from like the thinking about the building blocks and it was just so interesting because it had all the history of the people who created it in there.

21:43 Whereas if you just learn the formulas, you're like, well, this is boring.

21:45 Forget this.

21:47 This has the possibility of keeping more of the thinking and the exploration in the paper and in the reporting.

21:54 Oh, yeah, absolutely.

21:56 Which I agree is like is fascinating because it is an investigation at the end of the day.

21:59 And, you know, one of the authors of the Jupyter, like one of the leads of the Jupyter team, it's kind of meta because like his development of Jupyter is now part of this like upper level data science course at Berkeley in which all the students use Jupyter for all of the data science work that they're doing.

22:15 Like it's cool.

22:16 It's like, you know, it's it's it's notebooks all the way down.

22:18 Yeah.

22:19 Although I didn't know.

22:20 I mean, I know that you like pulled this out of the article, which I thought was is really spot on that the name Jupyter is actually in honor of Galileo.

22:28 So like going back to an early scientist, like going going way back into history.

22:33 So it's like we haven't forgotten where we came from, like the scientific method and how that gets encapsulated in these this structure that we've all accepted as, you know, research papers that we're standing on the shoulders of giants.

22:45 We're just moving forward into this new like kind of rapidly iterative, open source, more transparent era, which is cool.

22:51 Like why shouldn't research be democratized with all other types of information?

22:54 Like thank you, Internet.

22:55 And we're not forgetting where we came from, which I think is really important.

22:59 Like we don't want to throw the baby out with the bathwater.

23:01 Yeah, absolutely.

23:02 There's so many interesting parallels between sort of early scientific discovery and open source versus closed source.

23:07 We'll come back to this actually.

23:09 But like part of his quote that you were pointing out is like Galileo couldn't go anywhere and buy a telescope.

23:14 So he had to build his own.

23:15 It's sort of like, you know, we just put it on GitHub and we had to just make it.

23:19 It wasn't there.

23:19 It's awesome.

23:20 You got to scratch your own itch.

23:21 You know, if you're fine, I'll build it myself.

23:24 I was going to take the weekend off.

23:26 But, you know, whatever world.

23:27 Come on, we're doing this.

23:28 This portion of Talk Python To Me is brought to you by us.

23:35 Have you heard that Python is not good for concurrent programming problems?

23:39 Whoever told you that is living in the past because it's prime time for Python's asynchronous features.

23:44 With the widespread adoption of async methods and the async and await keywords,

23:49 Python's ecosystem has a ton of new and exciting frameworks based on async and await.

23:54 That's why we created a course for anyone who wants to learn all of Python's async capabilities,

23:59 async techniques and examples in Python.

24:02 Just visit talkpython.fm/async and watch the intro video to see if this course is for you.

24:08 It's only $49 and you own it forever.

24:10 No subscriptions.

24:11 And there are discounts for teams as well.

24:16 So this next one, we were speaking about advanced machine learning and AI and analyzing social media sentiment analysis.

24:24 This next one is more about algorithms and less about AI.

24:28 It probably could be implemented with if statements, but it's actually pretty evil.

24:31 And just the focus on algorithms at the core is pretty interesting.

24:34 I thought so.

24:35 Although I would like to point out that I think half the time that people are talking about AI,

24:38 in air quotes, they're talking about a thing that could have been implemented with an if statement.

24:42 And in fact, I would argue, I know that there's technical definitions of AI,

24:46 but what most people mean is software making a decision for me.

24:49 And it's like, well, there is a way that software does that.

24:51 It is an if statement.

24:52 It's AI.

24:54 If you've written an if statement in your code, your freshman programming class,

24:58 you've written AI.

24:59 You just hold your head up.

25:00 It could go crazy and do a switch statement.

25:03 It may seem sort of wow.

25:04 But yeah, one of these branching things.

25:07 The decisions are endless.

25:08 You know, theoretically endless.

25:10 As many decisions as you want to spend time programming by hand.

25:13 Yeah.

25:13 So this one goes back to the sort of everything's going to be used for good, right?

25:17 I'm not sure how anybody thought this could have been used for good, actually.

25:21 Like this algorithm was just intentionally designed to screw everybody over.

25:25 So, you know, this episode is going to come out around the holidays and people will be traveling.

25:30 And if you are traveling with your family, you may have had this experience where you're like

25:36 trying to buy cheap plane tickets because there's like you and like all your kids and you're

25:39 traveling across the country at an expensive time to travel.

25:41 And you're trying to like get to your parents' house in time for Christmas.

25:45 And it's really stressful and you're annoyed.

25:47 And then you book your tickets and you try and get seats together and you can't do it.

25:52 And you're like, wait a second.

25:53 Come on, airline.

25:54 Like, you know, I booked all these tickets at the same time.

25:57 I'm clearly traveling with like a couple of kids like under 10 years old.

26:00 Come on.

26:00 Like this is this is hard enough as it is.

26:03 Shake fists.

26:03 Surely they know they should put the kids with the parents, right?

26:06 Right.

26:07 That's like a common.

26:08 It's like, why is this system not smart enough to recognize that?

26:11 But it turns out it is smart.

26:13 It's just smart in exactly the evil opposite way that you don't want it to be.

26:19 Because it turns out that at least in the UK, some airlines are using algorithms not to put families together, which is what we all assume they would be doing, but to intentionally split us up.

26:32 And you might ask yourself, why?

26:35 Like, are you just like, you know, like a sadist?

26:38 I don't know.

26:38 Like, what's the somewhere?

26:40 There's like, you know, the developer of this algorithm at the back of every airplane twiddling his thumbs like Mr. Burns and like laughing maniacally.

26:46 Exactly.

26:47 But it's because that way they can charge people more money so that they pay to sit together.

26:53 They're like, oh, do you not want the inconvenience of asking 47 people whether or not they're going to switch with you so that you can sit with your like your child who may or may not have the like emotional fortitude and maturity to to not like freak out by themselves on an airplane going across the country.

27:09 Yes.

27:09 So this apparently this is common practice with a number of airlines.

27:13 They are algorithmically looking at people who share last names.

27:17 So if you share a surname, you will not be seated with each other when the seats are assigned, which seems really uncool.
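
For contrast, here is a minimal sketch of the family-friendly version: the same shared-surname check, used to keep a party together instead of splitting it up. Nothing here reflects any airline's actual system, which was not public. The function names, the booking, and the seat labels are hypothetical, and consecutive entries in the seat list are simply assumed to be adjacent.

```python
from collections import defaultdict


def group_by_surname(passengers):
    """Group (first, last) name tuples by case-insensitive last name."""
    groups = defaultdict(list)
    for first, last in passengers:
        groups[last.strip().lower()].append((first, last))
    return dict(groups)


def seat_together(passengers, seats):
    """Assign seats so that passengers sharing a surname sit consecutively.

    `seats` is an ordered list of free seat labels; neighbouring entries
    are assumed (for this sketch) to be physically adjacent.
    """
    if len(seats) < len(passengers):
        raise ValueError("not enough free seats")
    # Place the largest parties first, before the free seats get fragmented.
    parties = sorted(group_by_surname(passengers).values(), key=len, reverse=True)
    ordered = [person for party in parties for person in party]
    return dict(zip(ordered, seats))


booking = [("Ana", "Lopez"), ("Sam", "Reed"), ("Luis", "Lopez"), ("Mia", "Lopez")]
free_seats = ["14A", "14B", "14C", "14D"]
assignment = seat_together(booking, free_seats)
for person, seat in assignment.items():
    print(person, seat)
```

The point of the sketch is that the grouping step is identical either way. Whether it is used to seat people together or apart is a business decision, not a technical limitation.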

27:26 I just come on.

27:27 So you can pay for the reserved seat, the extra twenty seven dollars per traveler or whatever.

27:32 Right.

27:32 How much do you care about your kids or like how uncomfortable are you with asking a bunch of strangers during the holidays?

27:39 To switch with you when it becomes that like that like really complicated calculus problem where you're like, well, my wife's sitting seven seats up and and she's at the window.

27:46 But we're also traveling with my son who's four seats up on the aisle.

27:49 So if you switch with her and this person in row 47 switches with him, then I actually think that we could get, you know, approximately close to each other.

27:56 It's like it's absurd.

27:57 So if you don't want to go through that, yeah, you can just like pay an extra 50 bucks.

28:00 But like wouldn't the decent thing to do just be.

28:04 It's so evil to split the families apart so that they'll pay to go back together.

28:09 Although, to be clear, I'm well out of the range where this matters.

28:13 Right.

28:13 My kids could sit alone and it wouldn't be that big of a deal.

28:16 But my thought is, all right, evil airline and your algorithms.

28:20 I see your play and I raise you a lone child, age three, in the back by the business traveler.

28:26 I'm going to bed now.

28:28 Thank you very much.

28:29 I love it.

28:33 Just like probably shouldn't implement it.

28:35 But it seems like, you know, you could just turn it around.

28:38 I love it.

28:38 It's like, yeah.

28:39 Oh, you know what we did?

28:40 We skipped nap.

28:41 You know who had and I just like dumped like three, you know, pop rock sugar packets down their throat right before we got on the plane.

28:48 Because, you know, in 30 minutes they're going to freak out.

28:50 Yeah, exactly.

28:52 I like it.

28:53 Let's see like how many kids we can stack up, how many screaming children we can stack up on airplane at holiday.

28:58 It's really bad.

28:59 It's really bad.

29:00 Something like this actually happened to my daughter.

29:01 Not from algorithms, just other bad, bad stuff.

29:04 So it's not great.

29:06 And I do think it's really evil, the airlines, to do this.

29:08 It is.

29:09 It is.

29:09 But what was interesting is that it actually got referred to in the UK.

29:13 There's a new government organization called the Center for Data Science Ethics and Innovation.

29:18 So they actually have.

29:19 That's crazy.

29:20 It's cool, right?

29:21 It's cool.

29:21 So like.

29:21 Yeah, it's very cool.

29:22 Yeah.

29:23 And like we have similar types of stuff here in the US.

29:24 There's an Office of Science and Technology Policy that's in like in the administration, like it's part of the White House.

29:29 And so anybody who's kind of follows that sort of stuff or is interested in data science, like that's where our first US chief data scientist sits is in the OSTP.

29:39 I don't think there is a data science, chief data scientist in the current administration yet.

29:44 But there's a chief technology officer.

29:45 And anyway, it's where like it's where the geeks sit.

29:48 So but in the UK, there's a Center for Data Science Ethics and Innovation.

29:50 And this case was actually referred to them.

29:54 So I think they've just formed and they just formed.

29:57 And basically they got handed, like, this is the most offensive thing that algorithms have done.

30:00 Good luck with it.

30:02 Center for Data Science Ethics and Innovation.

30:03 Like fix this, man.

30:05 They're like, great.

30:07 This is why we exist.

30:08 Oh, my gosh.

30:09 It is.

30:09 I guess like, you know, it's like because it's a real softball.

30:12 You know, it's like what would the ethical thing to do be in this situation?

30:16 Like, I don't I don't know.

30:17 Gosh, I'm I'm spent.

30:19 I couldn't I couldn't possibly come up with a better, more ethical alternative.

30:24 than splitting parents up from their children.

30:25 Even bureaucrats could totally solve this.

30:28 Yeah, for sure.

30:30 Yeah.

30:30 So, you know, that's a thing.

30:31 I mean, it seems like one of those like open and shut cases.

30:35 I think government ministers are calling it exploitative.

30:38 So that's usually not a good sign for your business practices.

30:40 No, but I.

30:43 Yeah, no, that's a bad start.

30:44 It's a bad start.

30:45 It's a bad start.

30:45 I think they may.

30:46 You know, they've got nowhere to go but up.

30:47 We can say that, you know.

30:48 That's kind of like a pun, I guess, because there are airplanes that go up in the air.

30:52 But anyway, but it brings me to something that I was really excited to talk about with all

30:57 of your listeners because it's something that's important to me personally and something that

31:04 I'm involved with is actually an ethics project for data scientists so that hopefully we could

31:11 prevent these types of mishaps in the future.

31:14 So as I mentioned at the top of the show, I'm involved in an organization called Data

31:17 for Democracy.

31:18 It's a nonprofit.

31:19 And we have recently launched what we're calling our ethical principles for data practitioners.

31:26 So the global data ethics principles.

31:30 So this is like the Hippocratic oath, like the doctor's take, but for.

31:33 Exactly, exactly.

31:34 And so because like we mentioned, this has been kind of a bad year, I would say, for technology

31:39 in general and technologists and and, you know, Silicon Valley and Silicon Valley culture

31:43 and data science and machine learning and AI.

31:45 And everybody's wondering, like, well, is this a good thing for society?

31:49 Is it not?

31:50 Like, how did we get here?

31:51 Like, how did we kind of like stumble into this dystopia where our minds are being manipulated

31:57 by propagandists on Facebook and in China?

32:00 They're doing social credit like that terrible Black Mirror episode that I saw that we'll get

32:03 to like what's happening to us.

32:05 And, you know, fundamentally, as the people who are implementing this technology, we have a real

32:10 opportunity to think about our values, think about our ethics, like think about the way

32:15 that our technology might be used in ways that we hadn't intended because, you know, we're

32:19 a pretty optimistic group, technologists.

32:21 I think we assume that like we want to put something really useful and meaningful out in

32:25 the world.

32:25 Maybe not this like family splitting algorithm.

32:27 That was probably.

32:28 There's probably the business department.

32:29 They say, hey, guys, can you?

32:30 Yeah.

32:31 They're like, well, this will, you know, we'll improve revenue by like 17 percent in

32:35 Q4.

32:35 Like, that'll be good for the world, you know, which and there's nothing wrong with improving

32:39 revenue.

32:39 I'm great.

32:41 Businesses are fantastic.

32:42 Not on the backs of three-year-olds, maybe.

32:44 Right.

32:44 Maybe maybe not in this way, you know.

32:47 But I think also like the stuff that we do is actually pretty complicated.

32:51 People don't really understand at a deep level like what the software is doing and all

32:55 the potential ways that it might be used other than the intended use case.

32:59 Like that's something that really only we think about or are in a good position to think

33:03 about as technologists.

33:04 And so anyway, that was the idea behind this Data for Democracy project.

33:07 It's a global initiative and it's basically like what's a framework for thinking through

33:13 like putting ethics into your process.

33:16 So how do you incorporate these principles just in your everyday data and technology work?

33:21 What does it look like for data science?

33:23 I know what it looks like for doctors.

33:24 You won't do any harm and that kind of thing.

33:26 How about for data scientists?

33:27 Yeah, exactly.

33:28 So like we have a, we call it FORTS.

33:30 There's the high level, which is kind of like do no harm.

33:33 So you think about fairness, openness, reliability, trust and social benefit.

33:36 There's a handful of principles that is kind of like a checklist.

33:41 And so you can kind of go through this checklist.

33:43 And as you're developing a new feature or maybe developing a new model, if you're a data scientist

33:49 or, you know, some system for processing data, like anything that touches data, any of the

33:55 kind of technology work that you're doing, you can go through and, you know, it may be that

33:59 some of these principles like you don't have to check every box every single time.

34:02 But it's a nice like it's a framework for thinking about catching potential blind spots.

34:07 And so what's your intention when you're coding, like building this feature or developing this

34:12 model?

34:12 Have you made your best effort to guarantee the security of any data that you're going to use?

34:16 I mean, that seems like a no brainer.

34:18 But, you know, it's easy to forget when you're moving fast and you're maybe thinking about,

34:22 you know, the deadline or the fact that, you know, you're trying to ship this really important

34:25 feature because your customer really needs it.

34:27 Like, you know, take a second to remind yourself that the data security is important.

34:31 Have you made your best effort to protect anonymous data subjects, which is really important in a lot

34:36 of data science research?

34:37 Sometimes if we don't think about this, we can inadvertently leak private data to the public

34:43 when they consume our research, even though that was never our intention and potentially is,

34:47 you know, irresponsible.

34:49 You can practice transparency, which is a lot of times understanding how our algorithms work.

34:54 So in this case, there was no transparency around how the algorithm that chose seat assignments

35:00 was functioning.

35:01 And then after an investigation, it was revealed that, well, it was actually examining whether

35:05 or not you shared a last name.

35:06 And if you do, it was splitting you up.

35:07 And if that was a transparent algorithm, we would have said, wait a second, this is totally

35:12 uncool.

35:12 You can't explain this algorithm transparently and have people still accept it.

35:17 So, you know, that's kind of like, light is a great disinfectant, right?

35:20 In politics and all that in business.

35:22 Yeah, yeah, absolutely.

35:23 Another principle that maybe would have mitigated this problem was to communicate responsibly.

35:28 If the engineer who was responsible for implementing that went, hey, this is going to split up families, and, like, during holiday travel.

35:33 You can respect relevant tensions of stakeholders, which I also

35:39 think is really interesting because this is exactly what probably happened here.

35:42 The business team said, hey, but it'll improve revenue by 17 percent throughout the year.

35:46 And or whatever the number is, I'm making that up.

35:48 So it's a set of principles.
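
As a rough illustration of using such principles as a checklist during everyday work, here is a minimal sketch. The questions are paraphrased from this conversation and are not the official wording of the pledge. The `CHECKLIST` list, the `open_items` helper, and the sample `review` are hypothetical.

```python
# Checklist items paraphrased from the conversation; wording is not official.
CHECKLIST = [
    "Is the intention behind this feature or model clear?",
    "Has a best effort been made to secure the data used?",
    "Has a best effort been made to protect anonymous data subjects?",
    "Is it transparent how the algorithm works?",
    "Have concerns been communicated responsibly?",
    "Have relevant tensions between stakeholders been respected?",
]


def open_items(answers):
    """Return the checklist questions not yet answered 'yes'.

    `answers` maps a question to True or False; questions that are
    missing from the mapping count as still open.
    """
    return [q for q in CHECKLIST if not answers.get(q, False)]


# Hypothetical review of a new feature: two items settled, one flagged.
review = {CHECKLIST[0]: True, CHECKLIST[1]: True, CHECKLIST[3]: False}
for question in open_items(review):
    print("TODO:", question)
```

As noted in the discussion, not every box has to be checked every time. The list is there to surface blind spots, so treating unanswered questions as open is the conservative default in this sketch.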

35:50 You can sign on to it. So if you go to datafordemocracy.org, you will find

35:54 a link to sign the ethics pledge.

35:56 If you think that ethics are important, then you should totally sign up for it.

35:59 And here's why you should sign the pledge.

36:01 A, I think it's an important.

36:03 It's a cool thing to do if you think that ethics are important and you want to have this

36:07 like kind of mental checklist.

36:09 But it's also important because, you know, we all work for organizations or we're students

36:13 at academic institutions or we're otherwise doing this as our profession.

36:18 And the organizations that we work for are starting to adopt more ethical practices as companies

36:25 and academic institutions and governments.

36:27 like this is becoming more prominent.

36:29 But what's the way that we can make sure that our values are encoded in these kind of larger

36:36 business processes and these larger institutions is by making ourselves heard by sort of showing

36:41 our numbers.

36:42 And so if we show up as a technology community and we sign a pledge or we kind of communicate

36:47 to our manager or whatever it is, I mean, ultimately, like this isn't going to come from the top down.

36:51 And I don't think we want it to come from the top down.

36:53 I don't think we want this to come from people who aren't doing the work every day, who hear

36:57 about this ethics thing and they want this as like a stamp of approval on whatever products

37:01 they're making.

37:02 That's fine.

37:03 But ultimately, the systems that they design won't actually accommodate the technology work

37:07 that we have to do every day.

37:09 So I think for those of us who are like writing software, for those of us who are developing

37:13 models, who are doing data science and software engineering every day, I think we need to make

37:18 our voices heard about the ethical principles that we want to see applied to the work that

37:23 we're doing.

37:23 So anyway, that's why I think it's important.

37:25 datafordemocracy.org.

37:26 Don't be the developer who splits up families at the holidays.

37:31 You're better than that.

37:32 That's right.

37:34 That's awesome.

37:35 So I think this is great.

37:36 And I think it ties in really well to a lot of the themes that seem to be happening around

37:41 tech companies.

37:41 Like it used to just be, oh, tech companies are amazing.

37:45 Of course, we want to just encourage them.

37:46 And now there's some real skepticism around Facebook and Uber and all these types of companies.

37:51 And they kind of have to earn it.

37:53 And this oath is part of earning it, I think.

37:56 It's cool.

37:57 Yeah, I think so, too.

37:58 I think so, too.

37:59 And it's a good opportunity, I think, for, like I was saying, the actual doers, like

38:04 those of us who are doing the work, to participate in that conversation.

38:07 Because it's happening, like you're saying, it's happening whether we might want it to or

38:10 not.

38:11 The kind of attitude has changed.

38:13 People are thinking about legislating Silicon Valley, which would have been a totally foreign,

38:17 like bizarre idea almost, even a year or two years ago.

38:21 Yes, that's exactly what I was thinking of, is that kind of stuff.

38:23 It's like, wait, what?

38:24 What do you mean?

38:25 Right.

38:25 It's like, we're just over here doing good, making cool stuff.

38:28 Come on, like pull out that iPhone, buddy.

38:30 Like, let's play some Angry Birds.

38:31 And times have changed.

38:35 And more people are aware of the potential negative consequences.

38:38 And so, you know, now's our time to have a conversation and make the industry that we want

38:43 to be a part of.

38:44 Yeah.

38:44 So some of the things we've covered have been a little creepy.

38:46 Indeed.

38:47 Like the babysitter one.

38:49 But this next one is, I think it's just pure goodness.

38:52 100%.

38:53 I couldn't agree more.

38:54 So we know that Python is being used more in general scientific work.

39:00 And it's probably being used more in sort of the hard sciences.

39:03 Would you consider economics hard science?

39:06 Ooh.

39:06 Oh.

39:07 I'd put it right on the edge, right?

39:09 I mean, that's a lot of math in there.

39:11 There's numbers.

39:12 There's a lot of math in there.

39:14 I guess because I never really took economics.

39:16 I've only listened to like a couple economics textbooks in adulthood to try and something

39:21 I feel like I should know a little bit about.

39:22 And like, I feel like they were behavioral economics.

39:25 And it seems like a lot of correlation without causation.

39:30 Personally, no offense, economists out there.

39:32 But it's really data driven, which I think is really cool.

39:36 So there is a huge amount of information to consume when you're doing good economics.

39:41 So in any case, yeah, let's do it.

39:43 I'm on board.

39:44 I'm going to go one step further.

39:45 Let's call it a hard science.

39:46 I'm in.

39:46 All right.

39:46 Right on.

39:47 I certainly think some aspects of it are.

39:50 Now, we talked about "The Scientific Paper Is Obsolete," the move from just PDF or written

39:56 paper over to Mathematica to Jupyter.

39:58 And that's sort of a high level conversation of a trend.

40:03 But this year, the Nobel Prize in economics was basically won with Jupyter and Python, which

40:10 is awesome.

40:10 That is amazing and not surprising.

40:13 I mean, if you're going to do some scientific research, what other programming language would

40:17 you choose?

40:18 Yeah, absolutely.

40:19 Especially if you're like an economist.

40:21 So there was a bunch of folks who did a bunch of mathematics, Mathematica, and included in them

40:27 were these two guys named Nordhaus and Romer.

40:31 I think they're both American university professors.

40:34 And their, let me see if I can mess up, poorly describe what their Nobel Prize was, their work

40:41 generally is about.

40:42 But it was like, basically looking at how do you create long-term sustainable growth in a

40:48 global economy that improves the world for everybody through things like capitalism and

40:53 whatnot that mostly focus on very, very narrow self-interest.

40:57 They think they cracked that nut, which is pretty interesting.

41:01 And they cracked it with Jupyter.

41:03 I mean, that sounds Nobel Prize worthy to me.

41:05 And also the economic discoveries are probably useful.

41:09 No, but what I think is interesting, because we talked about this in one of the previous stories,

41:16 is that the team, or Romer in particular, was the Python user.

41:20 And he wanted to make his research transparent and open.

41:23 That was like a key part of the research.

41:24 So that people could understand how he was reaching his conclusion.

41:26 So like all of my, you know, joking about behavioral economists a minute ago aside, like this is

41:31 actually an important part of the work.

41:32 So that you can understand his assumptions and, you know, at least understand the choices that

41:37 he's making and the data that he's choosing and that he tried to do it with Mathematica,

41:41 which is another way that people use or sort of perform computation for their research.

41:46 And apparently that just made it too difficult to share his work, because anybody who wanted

41:52 to try and understand his work would have to also use this proprietary software, which is a really,

41:56 really high bar.

41:57 Like it's really expensive.

41:58 Not everybody knows how to use it.

42:00 And it's not as simple, intuitive, open, and transparent as Python and Jupyter Notebooks.

42:04 So it was really core to the work that he did that ultimately won him the Nobel Prize,

42:08 which is pretty awesome.

42:10 Yeah, it's super cool.

42:11 And I do think that there's, if your goal is to share your work, having it in these super

42:15 expensive proprietary systems is not amazing, right?

42:19 I mean, we're talking about Mathematica here.

42:21 My experience was I did a bunch of stuff in MATLAB at one point and we worked with some

42:27 extra toolkits.

42:28 And these toolkits were like $2,000 a user just to run the code.

42:32 And if you wanted to run the code and check it out, you also paid $2,000.

42:36 Like that's really prohibitive.

42:37 It really is.

42:38 And I mean, I think that was my experience with proprietary software.

42:42 I mean, not only do those like programming languages or applications make it, it's just

42:47 difficult to collaborate.

42:48 And I think this will be of no surprise.

42:51 I would think that this is something that everybody accepts as fact amongst all of your

42:54 listeners.

42:54 But the open source community has made this current software revolution possible.

42:59 The fact that we are able to collaborate at this scale and share ideas and share code and

43:06 build on top of each other's work, I think is the reason that we've had this explosion in

43:10 entrepreneurship, this explosion in the kind of energy and excitement that comes out of Silicon

43:15 Valley, even though we were just talking about it, you know, maybe being a bad thing.

43:19 But this level of sort of rapid innovation that we've been going through is because programming

43:23 is just so much more accessible than it's ever been before.

43:27 And I think that that's largely because we transitioned away from these proprietary software

43:31 models.

43:32 Yeah.

43:32 And you think about the people who that benefits.

43:34 Obviously, it benefits everyone.

43:36 But if you live in a country where the average monthly income is a tenth of what it is in the

43:41 U.S. and you have to pay $2,000, it doesn't mean it's expensive.

43:44 It means that you can't have it.

43:45 Right.

43:46 I mean, it's just inaccessible to you.

43:47 And so these sort of opening ups of this research and these capabilities, I think it benefits the

43:54 people who need the most benefit as well.

43:55 Yeah, absolutely.

43:56 And it almost like distributes creativity much more broadly than we would have been able to

44:01 before.

44:01 Like we can tap sources of creativity and innovation that, like you're saying, would

44:08 have been just taken off the board because proprietary software is only accessible

44:13 to the small portion of the global population that can afford to spend thousands of dollars

44:18 annually on licensing it.

44:21 Yeah, absolutely.

44:21 So there's a couple more thoughts.

44:23 I just want to share really quickly from this before we move on to something I guess I'll call

44:27 positive.

44:27 So this you got it in 2018.

44:32 You got to put it up there positive or not.

44:34 So one thing I think is really awesome about this is this guy is 62.

44:37 He transitioned into Python recently.

44:40 Right.

44:41 You feel a lot of people are like, oh, I'm 32.

44:44 I couldn't possibly learn programming or get into this.

44:46 Like this guy made this Nobel research change into Python programming and the data science

44:51 tools in his 60s, late 50s.

44:54 That's awesome.

44:54 It is awesome.

44:55 And what's interesting is that he's now been like exploring how software works.

44:59 And you pulled out a really great quote where he says, the more I learn about proprietary

45:03 software, the more I worry that, wait for it, objective truth might perish from the earth,

45:09 which is an insanely powerful statement.

45:12 That is a powerful statement from a guy who won the Nobel Prize.

45:15 So he's probably right.

45:15 Yeah.

45:16 He definitely sees a change in the world.

45:18 And I think this opens, this is a little bit of what I was saying, like thinking when

45:22 I said, look, we've got open source being used by scientists, but also changing science.

45:27 Right.

45:28 Right.

45:28 Absolutely.

45:29 Because it's not only making the work more approachable and accessible, but it's making

45:35 it more repeatable.

45:36 And it's allowing us, especially now that so much research is based on computation, like we've

45:40 been saying before, it really does make it possible.

45:43 Otherwise it would be impossible to come to a consensus about whether or not something is correct, acceptable,

45:50 whether we can say that this is an established fact.

45:52 You can't really say it if you can't show your work.

45:55 And in this case, showing your work is the computation and the data.

45:58 Right.

45:58 It's like just writing the final number down in calculus.

46:00 You're like, no, no, you got to show your work.

46:02 Exactly.

46:03 It's very important.

46:03 Like science and objective truth depends on it.

46:06 Show your work.

46:07 That's right.

46:08 Show your work.

46:09 All right.

46:09 So the next one is more like a mundane day to day thing, but actually makes a big difference.

46:15 So what's the story with Waze and Waze reducing crashes?

46:19 So this is pretty cool.

46:21 So we've talked about using quote unquote AI to predict things.

46:25 Sorry.

46:26 Maybe we should tell people what Waze is.

46:27 I don't know how global Waze is.

46:29 Maybe they don't have experience.

46:30 Just give us a real quick summary.

46:31 That is a really good point.

46:32 So Waze is a mapping application that helps you find the shortest route from one point

46:41 to another in your car.

46:42 And so there's a whole thing about Waze.

46:44 Like it's a community of people.

46:46 And so you're kind of collaborating to point out things on the road.

46:49 So I'm stuck in traffic and you can tap a button or there's a police car trying to catch people

46:55 for speeding and you can tap the button or there's been an accident.

46:58 You can tap a button.

46:59 And so it's this way for like drivers to communicate with each other on the road.

47:02 And it's so big.

47:03 That community is so large that it can help you.

47:06 It routes you from one point to another in a city in a way that helps you avoid obstacles

47:11 that might be pretty dynamic and changing all the time.

47:13 So it's kind of like Google Maps.

47:15 It's more of a two-way street for sure, right?

47:17 Like the users send info back a lot more to it.

47:20 Okay, cool.

47:20 With that in mind, how does Waze play into this story?

47:23 Well, as you can imagine, it captures a ton of data about driving patterns.

47:29 So not only does it know what's happening in real time, but all that data is stored.

47:33 And so you start to get a sense about how people move through cities in general.

47:37 And once you have data that captures a behavior in general, in data science, you can start to make predictions using that data.

47:44 So you can train a model and say, generally, this is how things work.

47:49 And then maybe you could make some predictions about how things work in the future.

47:52 And assuming that historical data is accurate, then you can usually make pretty decent predictions.

47:57 And so the cool thing about an application like Waze that captures not only traffic patterns, but also events like accidents or car crashes,

48:07 is that you could predict when and where car crashes are likely to occur, which is kind of mind-blowing.
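[Editor's sketch] The idea described here, using stored historical reports to estimate when and where crashes are likely, can be illustrated with a simple frequency count. This is only a minimal sketch under stated assumptions: the zone names, hours, report list, and 30-day window are all hypothetical, and nothing in the episode says this is how Waze or any city actually models crashes. Real systems would use far richer features and proper statistical models.

```python
from collections import Counter

# Hypothetical historical reports: one (zone, hour_of_day) pair per reported crash.
crash_reports = [
    ("downtown", 17), ("downtown", 17), ("downtown", 8),
    ("highway_exit_4", 17), ("downtown", 17), ("suburb", 8),
]
days_observed = 30  # hypothetical length of the history, in days


def crash_rates(reports, days):
    """Estimate crashes per day for each (zone, hour) bucket from past counts."""
    counts = Counter(reports)
    return {bucket: n / days for bucket, n in counts.items()}


def riskiest(rates, top_n=3):
    """Return the top_n buckets, sorted from highest to lowest estimated rate."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:top_n]


rates = crash_rates(crash_reports, days_observed)
top = riskiest(rates, top_n=2)
for (zone, hour), rate in top:
    print(f"{zone} at {hour}:00 -> {rate:.2f} crashes/day")
```

The ranking only says where past crashes clustered. It assumes, as the speaker notes, that the historical data is accurate and that the future resembles the past.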

48:13 Like I think traffic seems like this total, super complex, impossible to understand, messy organic...

48:18 If you ever thought of chaos, it should totally apply to this, right?

48:22 Like if a butterfly can affect weather, like people should just be crazy in cars.

48:27 They can be crazy in cars.

48:28 And I don't know if everybody will be familiar with this.

48:31 There's this like famous experiment where scientists set up a circular track, and

48:37 they had cars drive around the circular track, all like equidistant from one another.

48:41 So in theory, they could all maintain their speed.

48:43 But because there are human beings driving the cars, they would occasionally make these little choices.

48:48 Like they would feel like they got a little bit too close to the car in front of them,

48:51 or they put on a little bit too much gas and they would get too close.

48:53 And so they'd brake.

48:53 And then as soon as they put their foot on the brake, then the car behind them will put their foot on the brake.

48:57 And then there was this like cascading effect.

48:59 And no matter what, even though there was enough room for all of these cars on the road,

49:02 they wound up in a traffic jam driving around in the circle.

49:05 It's like...

49:05 That's awesome.

49:06 Yeah.

49:06 It's like a beautifully designed system that humans cannot participate in fully.

49:11 Because we're just too human.

49:14 So we could cause traffic.

49:15 We're flawed.

49:16 We're flawed.

49:17 And all those flaws are captured by Waze.

49:20 And so what Waze basically started doing is that they have all this data from connected

49:26 cars and road cameras and apps and whatever.

49:28 And they have an overview of how the city works.

49:31 And they've shared that data with local authorities who then basically have developed models that

49:37 predict when and where these accidents are going to occur.

49:39 And so the city's traffic and safety management agencies were able to take that data and say,

49:45 oh, there's likely to be an accident in this area at this time.

49:49 And then they sent...

49:50 They went to those areas to take preventative measures.

49:53 So they basically identified what are the most high risk areas for where accidents are likely

49:56 to occur at certain times of day or in certain conditions or whatever.

50:00 And if you can then take action to make those areas more safe, then, of course, you can reduce

50:06 the number of crashes.

50:07 And they reduced crashes by 20%.

50:09 So if on a normal day there's 100 car crashes, there's only 80, which, I mean, that's pretty

50:14 amazing.

50:14 So not only that, but if you're in an area where an accident's likely to occur, all of the other

50:20 services that happen around that accident are more readily available.

50:23 So you can get faster treatment for anybody who was injured.

50:26 You can more quickly clear and restore normal traffic flow.

50:30 And so in addition to the health implications, like the public health implications

50:37 of having fewer car crashes, you can actually make it easier for people to get around the

50:41 city more quickly by sort of quickly dealing with accidents as they occur, because

50:46 you kind of knew or had a pretty strong indication that that accident was going to happen in advance.

50:50 So it's like Minority Report, but for car crashes.

50:53 It's like a pre-crash.

50:54 It is.

50:55 So they've got the pre-crash.

50:57 If that system is not called pre-crash.

50:59 Sir, do you know why I'm pulling you over?

51:00 No, you were going to crash up there.

51:03 What?

51:05 Right.

51:05 You were the cause of an accident that was about to occur.

51:08 Like what?

51:09 I actually, I kind of like this idea.

51:12 I mean, we're getting into territory that sounds like the wrong, like the unintended use of AI,

51:17 like we've been talking about this entire episode.

51:19 But I kind of love it.

51:21 I would love for the AI to tell me in advance, even if it meant I got a ticket.

51:25 Like you were going to cause an accident.

51:27 You didn't because we told you and we've changed the future.

51:30 But you still have a $25 ticket.

51:32 I'd pay that.

51:35 I would pay that $25 ticket.

51:36 That's so interesting.

51:37 That's like the 2025 edition.

51:39 The 2018 edition is they were able to cause fewer crashes, address the ones that happened better,

51:45 and basically just improve life for all the drivers.

51:48 That's awesome.

51:49 Yeah, it's totally cool.

51:50 And I think this is one of those like almost like classic data science problems now,

51:54 where it's weird to say that there is such a thing as a classic data science problem,

51:58 but like maybe an archetypal data science problem, where you have a bunch of data about the past,

52:02 and whatever system it is that you're interested in understanding is so predictable and repeatable

52:07 that all you really need to do is understand how it behaved in the past,

52:10 and then you can have a pretty good idea about how it's going to behave in the future.

52:13 Like if only you could crunch enough numbers.

52:15 So it's kind of like we could have a grand theory of the universe if only we could compute the universe.

52:20 Sorry, not to get too philosophical, but like, but the systems that we have the computing power to understand are increasingly large and complex.

52:29 So like a city's traffic flow is pretty large and complex, and probably too much data to really process in any meaningful way until now.

52:38 And now we not only have the data, but we have the ability to process it and make predictions.

52:41 And so this is a real significant, tangible improvement on human life.

52:45 Pretty awesome.

52:46 Yeah, that's really awesome.

52:47 I think another thing that would make an improvement on life is if more people participated in speaking about what they would like to happen in their country,

52:56 and what they would not like to happen in their country.

52:58 What do you think?

52:59 That seems like a good idea.

53:00 You know, like participating in the process.

53:02 I think that'd be a win all around.

53:04 You know way better than I do.

53:06 I feel like the average voting turnout is something like 65% or something in the US, maybe 70 on a crazy period.

53:13 But then that's just among registered voters.

53:15 What about all the people who aren't even registered, or they've become unregistered because they moved and they forgot to update it?

53:21 Right. This is, as we probably heard about in the 2018 election cycle that we just had, a big part of the process.

53:28 And anybody who, well, you probably met somebody who was like standing in front of your office or your church or your school or walked around your neighborhood and was saying,

53:37 hey, are you registered to vote?

53:38 Like there's these get out the vote campaigns that are really important because what?

53:41 Twitter asked me at least five times if I was registered.

53:44 Twitter did.

53:44 I think I Googled something the day before the election.

53:47 They were like, hey, here's your local polling location where you can go get registered to vote right now.

53:52 It's kind of cool.

53:53 I think people have recognized that there's low voter turnout.

53:56 And so there's all of these interesting initiatives to try and get people to go out, make sure that they follow the appropriate procedures and they actually cast their vote, which, you know, ultimately, you know, democracy dies in darkness.

54:08 Right.

54:08 Like it's important that we're, you know, participating in the process.

54:12 So technologists are, of course, trying to help.

54:15 And there's a guy called Jeff Jonas.

54:18 He's a prominent data scientist.

54:19 And he is interested in the integrity of voter rolls.

54:24 And so this is an interesting aspect of it.

54:27 So in most places, I think in all places, you can't just show up to the polling place and vote.

54:32 You have to be registered to vote.

54:33 And there's lots of reasons that you might get unregistered like you were just talking about.

54:37 So I recently moved from one place to another in Austin and the people were coming around my neighborhood.

54:42 And I actually didn't know that this impacted my ability to vote here where I live.

54:46 They said, hey, are you registered to vote?

54:48 I said, of course I am.

54:48 And I mentioned offhand like, oh, just moved to the neighborhood.

54:52 Love it so much.

54:52 And they were like, oh, well, then you're not registered to vote because you've just changed your address.

54:56 And so there's all these like little details that are important, sometimes hard to understand.

55:01 Like it's a big government bureaucratic process at the end of the day.

55:03 And so this guy, Jeff Jonas, used his software for this multi-state project that he called the Electronic Registration Information Center that basically uses machine learning to identify eligible voters.

55:17 And then it cleans up the existing voter rolls.

55:20 So, OK, great.

55:22 Because the reason that you might need machine learning for this and not just like normal string matching is because people's names are sometimes slightly different on one form or another.

55:30 Or maybe their address has changed or maybe multiple people have the same name.

55:34 And so there's like all sorts of ways where like it's sometimes difficult to know whether a record in one database that represents a human being represents the same human being in another database, even if their names are identical or similar.

55:47 So it gets a little bit complicated.

55:49 And so that's why, you know, machine learning and AI are useful here because they can recognize these kind of subtle variations and patterns that ultimately lead you to be able to kind of triangulate a bunch of different data to point at the same human being.
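[Editor's sketch] To illustrate why exact string matching falls short for this kind of record linkage, here is a minimal sketch that scores whether two records might describe the same person by combining fuzzy name, fuzzy address, and birth-date agreement. Everything here is an assumption for illustration: the field names, the sample records, and the weights are hypothetical, and this hand-weighted similarity stands in for, rather than reproduces, the machine learning approach the episode attributes to the project.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Fuzzy string similarity between 0.0 and 1.0, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def same_person_score(rec_a, rec_b):
    """Combine several weak signals into one score (weights are hypothetical)."""
    name = similarity(rec_a["name"], rec_b["name"])
    address = similarity(rec_a["address"], rec_b["address"])
    birth = 1.0 if rec_a["birth_date"] == rec_b["birth_date"] else 0.0
    return 0.4 * name + 0.2 * address + 0.4 * birth


# Hypothetical records from two different databases.
voter = {"name": "Jonathan Smith", "address": "12 Oak St", "birth_date": "1980-03-14"}
dmv = {"name": "Jon Smith", "address": "12 Oak Street", "birth_date": "1980-03-14"}
namesake = {"name": "Jonathan Smith", "address": "98 Pine Ave", "birth_date": "1955-11-02"}

likely_match = same_person_score(voter, dmv)
likely_other = same_person_score(voter, namesake)
print(f"voter vs dmv:      {likely_match:.2f}")
print(f"voter vs namesake: {likely_other:.2f}")
```

Note that the record with the identical name scores lower than the one with the abbreviated name, because the other fields disagree. That is the point made above: an identical name alone does not establish that two records are the same human being.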

55:59 And so this nonprofit, the Electronic Registration Information Center, identified 26 million people who are eligible to vote but unregistered and then 10 million registered voters who have moved.

56:12 Who maybe became ineligible because of that.

56:15 Yeah, that's right.

56:15 So they somehow became ineligible.

56:17 So they've moved.

56:18 They appear on more than one list for whatever reason.

56:21 This does happen.

56:22 Or they died.

56:23 So important to nobody.

56:26 After your death, you remain registered to vote.

56:28 And because, who are you going to even... you just died.

56:32 It's not the thing that you think of on your deathbed.

56:34 It's like, wait, take me off the voter rolls.

56:36 The end is near.

56:37 So, you know, this is actually pretty common.

56:39 But again, like, matching a record of death with a name on a voter registration roll is much more difficult than it sounds.

56:48 So anyway, super interesting project because I think in every sense, like we want our voter rolls to be authentic.

56:55 We want them to have integrity because we want our democracy to have integrity.

56:58 Like, you know, we should make sure that one person, one vote.

57:00 But at the same time, we want to make sure that as many people who can vote do because that's how the whole process works.

57:06 We want to make sure that the people are represented in their elected leaders.

57:09 So, yeah, really, really interesting project.

57:12 And I think hard to imagine this going wrong.

57:16 Yeah, it's really good.

57:18 There was some conversation in the article about this project where it talks about trying to thread the needle of not looking partisan.

57:28 We just want people to vote.

57:29 How do we do that?

57:30 And that's an interesting challenge.

57:31 Right.

57:32 Right.

57:32 Because, of course, that's become a polarized topic of conversation.

57:35 There's a lot of concern about whether or not voting rolls are authentic.

57:40 There's this concern on one side of the aisle about voting fraud.

57:44 And then there's concern from the other side of the aisle about voter suppression.

57:48 So how do we get to some common ground?

57:51 Because I think ultimately the common ground that everybody has is that we want an authentic process, where as many people who are eligible to vote can.

57:58 I think we all believe in democracy at the end of the day.

58:01 But there's these two kind of opposing points of view on what to prioritize when making sure that our process has integrity.

58:06 So it was great to thread the needle and find a technology driven, sort of a nonpartisan approach to making sure that the system functions as best as it can.

58:15 Right.

58:15 Let's take this technology applied to a problem that is inherently political and try to make our political system better without, you know, raising the hairs and anger of any particular group.

58:26 It's pretty good.

58:27 Yeah.

58:28 It's a challenge.

58:28 So it's almost like these days it's hard to find anything that AI can do that makes everybody happy.

58:35 However, I think this next story that you found actually is something that I think universally we can all agree on.

58:42 I think it's pretty awesome.

58:43 Apolitical and universally beneficial, sort of unequivocally beneficial.

58:47 Yeah.

58:47 So I think there's two really interesting angles to this.

58:50 Let's take the super positive one first.

58:52 So this is about using machine learning and computer vision to fight breast cancer, which there's been several projects around this.

59:02 And they're universally good, I think.

59:05 Right.

59:05 You have these mammograms.

59:07 They have pictures of potentially cancerous regions.

59:11 But the key to catch cancer is to catch it early.

59:13 But to catch it early means to see things that are utterly subtle.

59:17 Right.

59:17 So they were saying something like 38% of radiologists or whatever group of doctors is called that looks at this.

59:26 They were doing like a 38% catch rate on these really super early cancers.

59:35 So Google came up with an AI.

59:38 It's even just like, well, they took one off the shelf and they applied it.

59:41 It's like something ridiculous.

59:43 Like off the cuff.

59:44 You're just like off.

59:45 We pointed at these and asked it some questions.

59:47 And apparently it found the cancerous regions 99% of the time.

59:52 Yeah, it is.

59:53 And for exactly the reasons that you mentioned.

59:55 Because it's capable of seeing much subtler patterns than human beings can.

59:59 Which is, I mean, exactly what machine vision is good for.

01:00:03 Yeah.

01:00:04 So first it started out like we're just going to show the AI all the pictures and just get its opinion.

01:00:09 Like forget the doctors.

01:00:10 And then they said, well, what if, like we have doctors.

01:00:13 And I think there still always will be doctors.

01:00:15 But what if we give the doctors a stethoscope?

01:00:18 We give them a tongue depressor.

01:00:20 We give them a camera.

01:00:21 What if we gave them AI as one of the tools they could point at the people and ask, you know, and analyze what the results are coming out of that machine, right?

01:00:28 So they basically said, all right, the second one would take six pathologists.

01:00:32 And they said they found it easier to detect small cancerous regions.

01:00:46 And it only took half the time.

01:00:48 At least for me, like in terms of where AI can be helpful in health care, especially these life-saving moments when the earlier that you detect cancer, for example, the more likely you are to be able to treat it.

01:00:59 Like it's so complicated.

01:01:01 Like I feel like the medical profession, the amount that doctors are expected to hold in their minds in order to like recognize relevant pieces of information and put together like a theory of the case.

01:01:13 You know, what might this be based on all the pieces of information that I'm seeing?

01:01:16 It is really like a marvel.

01:01:19 I have nothing but respect for doctors who go through the amount of training that they go through, the amount of information that they're able to retain, and the kind of creativity that's required to recognize these different symptoms and put together, you know, some likely candidates for what might be ailing the patient.

01:01:33 However, that type of pattern matching, assuming that you can capture accurate data, is exactly what machines have become really, really, really good at.

01:01:43 And so I do think this is an area where, again, assuming that good data is available, and I think when it comes to something that is more kind of binary to determine, like whether or not there's cancer tissue present, and we can take an image of a region of the body and sort of reliably provide that to the algorithm that's trying to make a determination about whether or not cancer is present.

01:02:03 Like, in examples like that, where the data is readily available, I think there's no reason that AI can't be a really powerful assistant so that doctors with all their knowledge and creativity can augment that and sort of help recognize where they might have blind spots or where there might be technical limitations to their ability as human beings.

01:02:22 Like, our eyes only work so well.

01:02:52 I think that's open source earlier.

01:03:22 Aggregate knowledge of the medical community can be available to any practitioner, which is kind of amazing.

01:03:27 Yeah, it's pretty amazing.

01:03:28 So you spoke about the computers detecting these really careful, nuanced details in images.

01:03:34 And I would say China is doing a whole lot of interesting stuff like that to sort of assess people.

01:03:40 So they've got crazy facial recognition stuff going on over there.

01:03:45 They have a system that will detect your gait and identify you by the way you walk.

01:03:50 That's, yeah.

01:03:51 And all of these types of things are generally around this idea of like a social credit.

01:03:55 Like, can we just like turn cameras and machines on our population and find the good ones in the background?

01:04:00 It's true.

01:04:05 I mean, it's a really interesting.

01:04:06 It sounds crazy, right?

01:04:08 Like you're laughing, but it's kind of like you're going to laugh or you're going to cry.

01:04:11 Pick one.

01:04:12 Well, it's true.

01:04:12 It's true.

01:04:13 So, A, I think China is a really interesting counterpoint to the debate that we're having here in the U.S.

01:04:19 where here our values are very much privacy, individual freedom.

01:04:25 And whenever technology encroaches on that privacy, we're very suspicious.

01:04:30 So even the fact that advertisers can target me based on my browsing history, like, you know, makes a lot of people uncomfortable.

01:04:37 Like maybe that's a violation of my privacy.

01:04:39 Even though the advertiser doesn't know who I am as an individual, it's possible to make a pretty good guess about who I might be based on my behavior.

01:04:46 They know you want a Leatherman pocket knife, but they just don't know your name.

01:04:49 Right, right.

01:04:50 And they know that people like me also like other products that people who like Leathermans like.

01:04:55 I don't know.

01:04:56 We're outside of my comfort zone.

01:04:57 I'm sorry.

01:04:59 I'm just making this up.

01:04:59 But you get the idea, right?

01:05:01 Yeah, exactly.

01:05:02 But here, you know, a lot of the conversation is like, is that okay?

01:05:05 Whereas in China, they've gone the other direction.

01:05:07 They were like, I mean, it's a way that we can find the best people and maybe maintain social order.

01:05:12 I mean, I don't know.

01:05:13 I don't know what they're thinking.

01:05:13 Just quickly, because not everybody will have seen this, but there's a Netflix TV show called Black Mirror.

01:05:20 If you are into kind of a dystopian technology future or technology enabled dystopian future or whatever, I highly recommend Black Mirror.

01:05:28 There is an entire episode dedicated to a social credit system.

01:05:32 And everybody in the episode is ranked.

01:05:34 Like every social interaction you have, you can basically like rate the person you had the interaction with.

01:05:40 So did you get a smile from the barista at your favorite coffee shop?

01:05:43 Five stars.

01:05:44 Did you tip?

01:05:45 You get five stars back from the barista.

01:05:47 Like, did you take an Uber and the person was friendly?

01:05:50 Five stars.

01:05:50 And some of these things we already do.

01:05:52 Like, we rate our Uber drivers and our Uber drivers rate us.

01:05:55 But they're usually kept within the little space of Uber or Starbucks or whatever.

01:06:01 It's not like your global rating.

01:06:02 Like, you don't get a mortgage based on how you treated your Uber driver.

01:06:05 Or not.

01:06:06 Right.

01:06:07 Well, I mean, you get your babysitting gig depending on what you posted on social media.

01:06:11 But that's, again, here, like, it's like, oh, we're debating.

01:06:14 Is it okay to judge potential babysitters based on that rage post they made about the ending of the Black Mirror episode that they thought was stupid?

01:06:24 Like, whatever.

01:06:25 But China's gone the other direction.

01:06:27 And so they, like, exactly like you were saying, they have all of these ways in which they are scoring and rating people based on the social behaviors that they can observe, either through surveillance or people's behavior online, their social media behavior.

01:06:41 And so the city of Shanghai in China, you may be familiar with it, is going to pool data from several departments.

01:06:49 And they're going to reward and punish about 22 million citizens based on their actions and reputations by the end of 2020.

01:06:55 So social.

01:06:57 Right.

01:06:57 Like a year and a half.

01:06:58 Yeah, exactly.

01:06:59 Like.

01:06:59 It's not in the future very far.

01:07:01 We're.

01:07:02 Yeah, this is not something like far future dystopia.

01:07:05 This is today, people.

01:07:06 And so pro-social behaviors are rewarded.

01:07:09 So if you do volunteer work, if you donate blood, which actually sounds cool, like I'd like to be rewarded for those things.

01:07:15 This doesn't seem that bad.

01:07:16 But people who violate traffic laws or charge under the table fees are punished.

01:07:21 OK.

01:07:22 If we're getting in.

01:07:24 I mean, in a way, like we're already socially rewarded for doing good things and we're socially harmed for doing mean things, I guess.

01:07:32 Or things that aren't cool.

01:07:34 And so in a way, it's like China wants to use technology to optimize the social systems that we already have in place.

01:07:42 Like if I'm rude to you, then I suffer social consequences for that.

01:07:46 But it's not really captured anywhere.

01:07:47 I guess like that damage is pretty localized.

01:07:50 Like it doesn't go on your permanent record.

01:07:52 Exactly.

01:07:55 Everybody's permanent record is now captured digitally and managed by artificial intelligence.

01:08:00 That's welcome to.

01:08:01 Welcome to China.

01:08:03 Yeah.

01:08:03 So, yeah, it's pretty interesting.

01:08:05 I mean, on one hand, I can see this benefiting society, but it just seems Black Mirror-esque to me as well.

01:08:12 And so you might be wondering, like, well, what the heck is a punishment, right?

01:08:16 So they say in another city, Hangzhou, they rolled out a credit system earlier this year, rewarding pro-social behaviors such as volunteer work or blood donations and punishing those who violate traffic laws.

01:08:29 Or charge under-the-table fees and whatnot.

01:08:32 And statistically, they said by the end of May, they've been blocking more than 11 million flights and 4 million high-speed train trips from the bad guys.

01:08:42 Right.

01:08:43 Right.

01:08:44 And not bad guys, like, broke a law.

01:08:47 Not.

01:08:48 Bad guys, like, doesn't volunteer enough.

01:08:52 Doesn't seem like, you know, heart's probably not in the right place.

01:08:55 We're not going to let you take this train trip, buddy.

01:08:57 Just denied.

01:09:00 I'm sorry.

01:09:01 Yeah, you're like, I was wrong.

01:09:02 You got three and a half.

01:09:03 Yeah, totally.

01:09:04 Like, I was going to go visit my family.

01:09:05 I wasn't going to be able to sit with them on the plane, but at least I was going to take a trip home for the holiday or whatever.

01:09:10 And that, you know, trip was thwarted.

01:09:12 Like, actual real-life consequences for things that we've all come to accept as, like, pretty much being a right if we can afford to pay for them.

01:09:19 Right.

01:09:20 Yeah.

01:09:20 Yeah, kind of bizarre.

01:09:22 Well, it's going to be a very interesting social experiment.

01:09:25 Yeah.

01:09:25 I don't want to be part of it.

01:09:26 I'm glad that I won't be part of it either, although I wonder if everybody will just be, like, super friendly.

01:09:30 Well, but hey, like, maybe that's the outcome that we're just not seeing.

01:09:33 It's like, we'll go to Shanghai and be like, man, people are just, this is like a Leave it to Beaver episode.

01:09:40 Exactly.

01:09:41 Well, you know, how much of that would be disingenuous?

01:09:44 You know, like, oh, bless your heart.

01:09:46 You know, that type of stuff.

01:09:47 You know?

01:09:48 Then we just need to optimize the algorithm, man.

01:09:50 Like, downvotes.

01:09:51 Exactly.

01:09:52 Downvotes for disingenuous social pleasantries.

01:09:54 It's a double downvote.

01:09:55 You were mean and you were disingenuous about your niceness.

01:09:57 Boom.

01:09:58 All right.

01:09:59 So, the last one is, I thought it'd be fun to leave people with something practical on this one.

01:10:04 Have you seen this new data set search from Google?

01:10:07 Yes.

01:10:08 I am so, so into this.

01:10:10 Partially because it's useful and partially because I've worked on a project where we tried to accomplish something similar.

01:10:16 And it is super freaking hard.

01:10:20 So, props to Google.

01:10:21 Of course, if someone's going to get search right.

01:10:23 Google.

01:10:24 Yeah, for sure.

01:10:25 So, tell people what this is.

01:10:26 Well, I mean, it's just like it sounds.

01:10:27 It's a data set search.

01:10:29 So, sometimes the data is like an actual data set that got captured somewhere as like a CSV.

01:10:35 Or it's data that was extracted from sort of less structured material, like a table in a web page or something.

01:10:43 But the way that you can search for it is like by topic.

01:10:46 And I just want, as somebody who suffered through this very difficult problem, I just want people to understand what that actually would mean.

01:10:54 Like to know what a data set is about in air quotes.

01:10:57 And a data set in this case, if you're not like a data person, it's like a spreadsheet with column names at the top and a bunch of rows, for example, in the simplest case.

01:11:07 And so, maybe you can look at the column headers and perform some machine learning or something and get a sense of the topic, if the column headers are really well named.

01:11:14 But they never are.

01:11:15 Trust me.

01:11:15 They never are.

01:11:16 They always have some random name, like the name of like the column in the database table that was named in the 1970s that like this thing ultimately spit out in the first place.

01:11:25 And there's like random characters in there that have no place in any data set ever.

01:11:29 And there's a bunch of random missing values.

01:11:30 And everything is a number.

01:11:32 And so, you can't actually make any guesses about what's in there because it's just a bunch of random numbers.

01:11:35 And so, you end up looking at weird things like the structure of the data set.

01:11:39 And like you go to some strange places, my friends.

01:11:42 But the fact that Google's figured out how to catalog this type of information and make it accessible to people, this like almost like semantic search for data sets, it really is a real feat of machine learning and engineering.
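
As a rough illustration of the problem described above (cryptic column headers, stray characters, missing values, and everything being a number), here is a minimal sketch using only the Python standard library. The sample data, the column names, and the `profile` and `is_number` helpers are hypothetical and invented for this example; this is not how Google's dataset search works, only a look at the kind of structural signals you are left with when the headers tell you nothing.

```python
import csv
import io

# Hypothetical sample: cryptic headers, a stray character, missing values,
# and nothing but numbers in the cells.
RAW = """FLD_01,x#2,QTY_7
1,,3
4,5,
7,8,9
"""


def is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile(text):
    """Summarize each column: count of missing cells and whether all present values are numeric."""
    rows = list(csv.DictReader(io.StringIO(text)))
    summary = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        present = [v for v in values if v not in ("", None)]
        summary[name] = {
            "missing": len(values) - len(present),
            "all_numeric": all(is_number(v) for v in present),
        }
    return summary


report = profile(RAW)
for column, stats in report.items():
    print(column, stats)
```

Every column here comes back as numeric with a few gaps, which is exactly why the headers and values alone say so little about what the data set is "about".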

01:11:53 Yeah, it's awesome.

01:11:54 So, you just go to the data search page, which is toolbox.google.com/datasetsearch, at least for the time being.

01:12:03 And you just type in a search and it'll give you a list, like here's all the places we found.

01:12:06 And some of them will be like legitimate just raw data.

01:12:09 And sometimes it's just embedded tables in a web page and all sorts of stuff.

01:12:13 It's really well done.

01:12:13 Yeah.

01:12:14 It kind of lets you look for data in the way that you would look for any other content on the internet.

01:12:19 Just the way that you would, well, the way that you would search normal Google.

01:12:21 Yeah.

01:12:23 Which is pretty impressive because there's usually not that much context around data sets on the internet.

01:12:28 But like that kind of semantic real world context.

01:12:30 So.

01:12:30 Yeah, it's great.

01:12:31 So, if people are out there looking for data sets, definitely drop in on Google data set search and throw some stuff in there.

01:12:37 There's some good answers.

01:12:38 I also really like the 538 data.

01:12:40 Have you seen that?

01:12:41 Oh, yeah, totally.

01:12:42 So, the website 538, for folks who aren't familiar with it, kind of a data journalism focused website.

01:12:47 A lot about sports, a lot about politics.

01:12:49 But most of the work is driven by some type of data analysis.

01:12:53 Yeah.

01:12:53 And over at github.com/fivethirtyeight/data, 538 all spelled out, they have all of the data sets they use to drive their journalism,

01:12:59 which is like hundreds of different data sets.

01:13:02 So, I'd love to go and grab data there for various things I'm looking into.

01:13:05 Yeah, absolutely.

01:13:06 It's a great resource.

01:13:07 Especially because then you can go see how they use the data and what questions they asked.

01:13:11 And then it's kind of an easy way to, well, it's a great way to learn.

01:13:14 Especially if you're kind of toying around with data for the first time and just getting comfortable with some of the amazing data exploration and modeling tools in Python.

01:13:23 Best software language.

01:13:24 Take that.

01:13:25 R users.

01:13:26 That's right.

01:13:27 Stick it.

01:13:28 So, if you're looking for data sets, where do you go?

01:13:33 Now, I go to data set search.

01:13:36 But there's also a lot of great, it depends on kind of what you're looking for.

01:13:40 But there's a company based in Austin, actually, called data.world that is trying to do kind of a similar thing to Google's data set search.

01:13:46 But it's more curated.

01:13:48 So, it's kind of community-based.

01:13:49 People are sharing interesting data sets that they found and contextualizing them.

01:13:53 So, a lot of Data for Democracy volunteers, when they're working on projects, they'll upload them to data.world to make the data available.

01:13:58 But then there's also a lot of kind of open data portals, both at the national level.

01:14:04 So, there's still data.gov.

01:14:06 I believe that's still up.

01:14:07 There were some rumors that it might all come down eventually, but it's still around, I think.

01:14:11 And a lot of cities that have had open data projects or open data initiatives have collected all of those into a portal that's specific to where you live.

01:14:19 So, if you want to find out, you know, how many dogs and cats are in your local animal shelter from one month to the next or whatever the analysis is that you're curious about that's super relevant to your city,

01:14:28 you can probably find one that is for the city closest to where you live.

01:14:32 I know there's one here in Austin, New York, Chicago, L.A., San Francisco.

01:14:38 A lot of cities have this now, and it's a great way to find data that answer interesting questions and just toy around a little bit and learn more about where you live.
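
The animal shelter question mentioned above could be sketched roughly like this. The records, the `intake_date` and `species` column names, and the `monthly_counts` helper are all hypothetical; a real city portal will publish its own schema. Note that this counts intake records per month and species, which is only a stand-in for how many animals are actually in the shelter at a given time.

```python
import csv
import io
from collections import Counter

# Hypothetical intake records; a real open data portal will use its own columns.
RAW = """intake_date,species
2018-10-03,Dog
2018-10-17,Cat
2018-11-02,Dog
2018-11-09,Dog
2018-11-21,Cat
"""


def monthly_counts(text):
    """Count records per (month, species), where month is the 'YYYY-MM' prefix of the date."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(text)):
        month = row["intake_date"][:7]
        counts[(month, row["species"])] += 1
    return counts


counts = monthly_counts(RAW)
for (month, species), n in sorted(counts.items()):
    print(month, species, n)
```

With a CSV downloaded from a portal, the same function would work on the file's text once the two column names are adjusted to match.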

01:14:44 Yeah, that's cool.

01:14:45 And data.world's cool as well.

01:14:46 I haven't seen that.

01:14:47 All right, Jonathan.

01:14:48 Well, that's the 10 items for the year in review.

01:14:51 And I definitely think there's an interesting trend, and it's been super fun to talk to you about it.

01:14:56 Yeah, of course.

01:14:56 Thanks so much for having me on the show.

01:14:58 And I hope everybody had a wonderful 2018 and is looking forward to an exciting year ahead.

01:15:02 Absolutely.

01:15:03 So, I have a few quick questions before you get to go, though.

01:15:05 First, when are you coming back to podcasting?

01:15:08 Are you just going to make these guest appearances?

01:15:10 Do you have any plans to come back, or are you going to just keep working on your projects?

01:15:13 You know, I'm thinking maybe towards the end of 2019, I'll try and find a year in review podcast where I can sit in.

01:15:19 Well, you can definitely come back in 2019.

01:15:21 That'll be good.

01:15:22 I'm just pre-positioning for the next year.

01:15:25 I love doing it.

01:15:26 It's really fun.

01:15:27 And I have no timetable for actually returning, though, because even the thought of committing to the amount of work that it takes to put on an episode.

01:15:34 Yeah, it's a crazy amount of work per episode.

01:15:36 That's for sure.

01:15:37 Well, I'm really glad you came back for, you came out of retirement to do this one.

01:15:41 Oh, thanks, man.

01:15:42 I really appreciate it.

01:15:42 And I'm super happy to come on the show and do it.

01:15:44 I mean, especially because I won't be responsible for any of the work once this is recorded.

01:15:49 Exactly.

01:15:51 That's the way to do it.

01:15:52 Be a guest.

01:15:53 All right.

01:15:54 Last two questions, which I know you answered a couple years ago, but it could have changed.

01:15:57 So if you write some Python code, data science or otherwise, what editor do you use?

01:16:01 Well, I use Jupyter Notebooks pretty heavily.

01:16:04 And actually, more recently than not, I write much less software.

01:16:09 Like almost everything I write now is some type of like exploratory data analysis.

01:16:13 So it's almost entirely in Python.

01:16:14 But when I do code, I still use Sublime Text a lot, which I feel like is kind of old school.

01:16:21 I hear that the world has moved on.

01:16:22 There's like Python specific, like quasi IDEs.

01:16:26 And I'm out of the game, Michael.

01:16:28 What can I tell you?

01:16:29 It's all right.

01:16:30 No, it's all good.

01:16:30 It's a great one.

01:16:31 And then a notable package on PyPI.

01:16:35 Oh, man.

01:16:36 Well, I mean, given what we've been talking about, I would strongly encourage everybody to check out a package called Keras, K-E-R-A-S, or TensorFlow,

01:16:48 which are both very popular machine learning libraries, or kind of machine learning frameworks.

01:16:54 Keras is more high level.

01:16:55 It's a very accessible way to get into exploring neural networks, basically.

01:17:00 So there's lots of kind of popular machine learning techniques that are sort of more traditional and super effective.

01:17:07 There's nothing wrong with them.

01:17:08 But the world has kind of moved into deep learning, and AI is largely based on neural networks.

01:17:15 And Keras is a great way to explore those without getting too much into the weeds.

01:17:18 And then once you get into the weeds and you find it kind of interesting and you want to get down lower level and really play around with some of the network structures yourself,

01:17:24 then you can get into TensorFlow, which is one of the underlying libraries that Keras is built on top of.

01:17:28 So highly recommend both of those.

01:17:31 Right on.

01:17:31 Yeah, I've definitely heard nothing but good things about them.

01:17:33 All right.

01:17:34 Final call to action.

01:17:34 Maybe especially around this whole democracy data pledge and stuff.

01:17:38 People heard all your stories.

01:17:40 They're interested.

01:17:41 What can they do?

01:17:42 They can.

01:17:42 If you're like, I don't want to live in a Black Mirror episode, know that you have the power to change it.

01:17:47 Python programmer and podcast listener.

01:17:49 The power is in your hands.

01:17:51 First up, go to datafordemocracy.org and you can sign the ethics pledge and you can let the world know that you believe in ethical technology.

01:18:01 And together, we'll make our voices heard.

01:18:04 We'll make sure the practitioners have a voice in this whole thing.

01:18:05 And so datafordemocracy.org, we really would love to both have you sign the pledge but then also participate in the conversation.

01:18:12 Because there's a whole community there that's really hashing out these ethical principles, making sure they work for real-world technologists who have real-world jobs.

01:18:20 So you can contribute to that process.

01:18:22 It's open source.

01:18:23 It's happening on GitHub.

01:18:24 We'd love to have you participate.

01:18:25 All right.

01:18:26 It's a great project.

01:18:26 Hopefully people go and check it out.

01:18:28 Jonathan, thanks for being on the show.

01:18:30 It's been great to have you back, if just for an hour this year.

01:18:33 Thanks, Michael.

01:18:34 I really appreciate the opportunity.

01:18:35 It was super fun.

01:18:35 You bet.

01:18:36 All right.

01:18:36 Bye-bye.

01:18:36 This has been another episode of Talk Python To Me.

01:18:40 Our guest on this episode was Jonathan Morgan.

01:18:42 And it's been brought to you by us over at Talk Python Training.

01:18:46 Want to level up your Python?

01:18:48 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

01:18:53 Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python.

01:19:01 And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle.

01:19:06 It's like a subscription that never expires.

01:19:08 Be sure to subscribe to the show.

01:19:10 Open your favorite podcatcher and search for Python.

01:19:12 We should be right at the top.

01:19:13 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:19:22 This is your host, Michael Kennedy.

01:19:25 Thanks so much for listening.

01:19:26 I really appreciate it.

01:19:27 Now get out there and write some Python code.

01:19:29 I'll see you next time.

01:19:29 Bye.