WEBVTT

00:00:00.001 --> 00:00:04.920
This year, 2018, is the year that the number of data scientists doing Python

00:00:04.920 --> 00:00:09.300
equals, maybe even exceeds, the number of web developers doing Python.

00:00:09.300 --> 00:00:13.680
That's why I've invited Jonathan Morgan to join me to count down the top 10 stories

00:00:13.680 --> 00:00:14.940
in the data science space.

00:00:14.940 --> 00:00:19.640
You'll find many accessible and interesting stories mixed in with a bunch of laughs.

00:00:19.640 --> 00:00:22.120
We hope you enjoyed it as much as we did.

00:00:22.120 --> 00:00:26.460
This is Talk Python To Me, recorded November 25th, 2018.

00:00:26.460 --> 00:00:44.820
Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the

00:00:44.820 --> 00:00:46.280
ecosystem, and the personalities.

00:00:46.280 --> 00:00:48.220
This is your host, Michael Kennedy.

00:00:48.220 --> 00:00:50.340
Follow me on Twitter, where I'm @mkennedy.

00:00:50.340 --> 00:00:54.100
Keep up with the show and listen to past episodes at talkpython.fm.

00:00:54.100 --> 00:00:56.540
And follow the show on Twitter via @talkpython.

00:00:56.540 --> 00:00:58.980
Jonathan, welcome back to Talk Python.

00:00:58.980 --> 00:00:59.740
Hey, Michael.

00:00:59.740 --> 00:01:00.540
Thank you so much.

00:01:00.540 --> 00:01:01.560
It's super awesome to be back.

00:01:01.560 --> 00:01:02.980
And it is great to have you back.

00:01:02.980 --> 00:01:03.620
Where have you been?

00:01:03.620 --> 00:01:04.960
I've been like a thousand places.

00:01:04.960 --> 00:01:06.480
I have not been podcasting.

00:01:06.480 --> 00:01:08.620
I feel a little bit guilty about it.

00:01:08.620 --> 00:01:11.640
Very much miss the Partially Derivative audience.

00:01:11.640 --> 00:01:13.400
Very much miss being on the show.

00:01:13.400 --> 00:01:16.380
I don't think we did this last year, so I'm super pumped to be doing this again.

00:01:16.380 --> 00:01:20.060
But it's because I've been doing two things.

00:01:20.060 --> 00:01:22.100
I have a company, New Knowledge.

00:01:22.100 --> 00:01:26.180
We're focused on disinformation defense, so a lot of data science.

00:01:26.180 --> 00:01:30.800
Ultimately, we like to think that we're protecting public discourse, improving democracy, big

00:01:30.800 --> 00:01:31.360
things like that.

00:01:31.360 --> 00:01:32.420
Not at all pretentious.

00:01:32.420 --> 00:01:34.100
That is a really awesome goal.

00:01:34.100 --> 00:01:36.400
And I suspect you've probably been busy, right?

00:01:36.400 --> 00:01:37.880
I'm just saying.

00:01:37.880 --> 00:01:38.200
It's a lot.

00:01:38.200 --> 00:01:38.640
It's a lot.

00:01:38.640 --> 00:01:41.500
I mean, we've perhaps bitten off more than we can chew.

00:01:41.500 --> 00:01:45.420
It's like every month we expect that to be fading from public consciousness.

00:01:45.420 --> 00:01:48.320
Like, all right, this is the month when people are going to get tired of talking about

00:01:49.380 --> 00:01:51.360
online manipulation and Facebook and Twitter and everything.

00:01:51.360 --> 00:01:55.280
And then every month, it's like it gets worse, which is, you know, it's like it's a thing.

00:01:55.280 --> 00:01:56.320
But it's great.

00:01:56.320 --> 00:01:59.880
And then there's also Data for Democracy, which is the nonprofit that I work on as well.

00:01:59.880 --> 00:02:04.500
And that's a community of about now 4,000 technologists and data scientists who are working

00:02:04.500 --> 00:02:05.720
on social impact projects.

00:02:05.720 --> 00:02:07.580
So also kind of mission aligned.

00:02:07.580 --> 00:02:08.860
You know, democracy is cool.

00:02:08.860 --> 00:02:13.140
But yeah, between those two things, I haven't had as much time to podcast as I would have

00:02:13.140 --> 00:02:13.420
liked.

00:02:14.000 --> 00:02:16.260
But I'm glad, you know, back in the saddle.

00:02:16.260 --> 00:02:19.560
Well, I'll say if you're going to hang up your headphones and your microphone, you've hung

00:02:19.560 --> 00:02:20.560
them up for pretty good reasons.

00:02:20.560 --> 00:02:22.000
Like, those are pretty awesome projects.

00:02:22.000 --> 00:02:23.200
Thanks, man.

00:02:23.200 --> 00:02:23.400
Thanks.

00:02:23.400 --> 00:02:24.900
Yeah, it's exciting times.

00:02:24.900 --> 00:02:26.040
Lots and lots of good stuff to work on.

00:02:26.040 --> 00:02:26.280
Yeah.

00:02:26.280 --> 00:02:27.580
Well, you know what else is exciting?

00:02:27.580 --> 00:02:30.200
I would say data science is as popular as ever, wouldn't you?

00:02:30.200 --> 00:02:31.200
I agree.

00:02:31.200 --> 00:02:33.460
I feel like data science is coming into its own a little bit.

00:02:33.460 --> 00:02:39.700
It's actually it's been interesting to see some of the transition towards just some more

00:02:39.700 --> 00:02:44.360
like established workflow driven team based data science, like a lot of things that I

00:02:44.360 --> 00:02:48.080
think software engineers are probably super familiar with and comfortable with that were

00:02:48.080 --> 00:02:51.680
still pretty nascent the last time that we talked even a couple of years ago.

00:02:51.680 --> 00:02:54.360
So, yeah, it seems it might be here to stay.

00:02:54.360 --> 00:02:54.940
I don't know.

00:02:54.940 --> 00:02:57.640
I'm not going to call it, but I think it's possible that it will be here for a while.

00:02:57.640 --> 00:02:58.920
I definitely think it's a thing.

00:02:58.920 --> 00:03:03.380
You know, it's starting to show up as like full on courses at Berkeley and things like that,

00:03:03.380 --> 00:03:04.360
which is pretty awesome.

00:03:04.360 --> 00:03:05.040
We'll come back to that.

00:03:05.540 --> 00:03:10.380
But I found a cool definition of a data scientist, an individual who does data science.

00:03:10.380 --> 00:03:13.780
And I thought I'd throw that out there and just see what you thought in light of what

00:03:13.780 --> 00:03:14.540
we're about to talk to.

00:03:14.540 --> 00:03:17.740
So this guy named Josh Wills, I don't know him, but Twitter knows him.

00:03:17.740 --> 00:03:19.840
And he said a data scientist is defined.

00:03:19.840 --> 00:03:21.300
He is.

00:03:21.300 --> 00:03:22.520
At least this tweet is.

00:03:22.520 --> 00:03:26.760
He said that a data scientist is defined as a person who is better at statistics than

00:03:26.760 --> 00:03:30.640
any software engineer and better at software engineering than any statistician.

00:03:30.640 --> 00:03:31.380
What would you say?

00:03:31.380 --> 00:03:33.780
Just like burn left and right.

00:03:34.780 --> 00:03:36.180
I think those are both good.

00:03:36.180 --> 00:03:39.060
I think it's a positive representation of a data scientist.

00:03:39.060 --> 00:03:40.740
I think that that's actually true.

00:03:40.740 --> 00:03:45.760
I don't think that Josh is trying to demean anybody with that tweet, although it kind of

00:03:45.760 --> 00:03:49.020
sounds like it, you know, like take that statisticians.

00:03:49.020 --> 00:03:51.560
My software engineering skills totally blow yours out of the water.

00:03:51.560 --> 00:03:56.740
But I think that's right, because it is like this weird hybrid where you're producing

00:03:56.740 --> 00:03:57.460
software.

00:03:57.460 --> 00:04:01.700
Ultimately, I think software engineers might disagree with that.

00:04:01.700 --> 00:04:03.200
The data scientists are producing software.

00:04:03.200 --> 00:04:03.840
And that's fair.

00:04:04.060 --> 00:04:04.400
That's fair.

00:04:04.400 --> 00:04:06.940
You can debate whether a notebook is a piece of software.

00:04:06.940 --> 00:04:08.620
I think it's a very interesting debate.

00:04:08.620 --> 00:04:09.500
That's a good point.

00:04:09.500 --> 00:04:14.620
And actually, it's interesting how much people are trying to like turn notebooks into runtime,

00:04:14.620 --> 00:04:17.780
like actual production execution environments.

00:04:17.780 --> 00:04:20.420
But that's probably a whole we could go down that rabbit hole all day.

00:04:20.420 --> 00:04:21.480
But it's true.

00:04:21.480 --> 00:04:23.020
I think that's actually that really captures it.

00:04:23.020 --> 00:04:27.880
It gets that blend right in the middle of the Venn diagram between stats and software engineering.

00:04:28.100 --> 00:04:28.740
Yeah, it's cool.

00:04:28.740 --> 00:04:29.020
Yeah.

00:04:29.380 --> 00:04:36.840
So with that, that very precise definition, what we have done is, is mostly you, Jonathan,

00:04:36.840 --> 00:04:43.320
have gathered up a bunch of topics that represent some of the bigger pieces of news from 2018.

00:04:43.440 --> 00:04:46.480
And we're going to go through them for all the data science fans out there.

00:04:46.480 --> 00:04:46.800
Yes.

00:04:46.800 --> 00:04:47.860
I mean, it was a joint effort.

00:04:47.860 --> 00:04:48.580
It was a collaboration.

00:04:49.020 --> 00:04:51.900
I feel like we've touched on some important things here in our list.

00:04:51.900 --> 00:04:56.880
I don't know if you were thinking along these lines, but I had a little bit of a theme with

00:04:56.880 --> 00:04:58.300
a lot of the stories that I was choosing.

00:04:58.300 --> 00:05:02.920
There's kind of a AI may be coming to kill us all or save us all.

00:05:02.920 --> 00:05:04.860
It's like, it's one of the right now.

00:05:04.860 --> 00:05:07.020
It's like, it's like dark times for machine learning.

00:05:07.020 --> 00:05:08.360
It's very interesting.

00:05:08.360 --> 00:05:11.320
It might accidentally kill us while trying to save us.

00:05:11.320 --> 00:05:11.540
Right.

00:05:11.540 --> 00:05:13.160
It's like really well intentioned.

00:05:13.160 --> 00:05:14.960
Like, oh, I can see what you're trying to do there.

00:05:14.960 --> 00:05:17.640
I mean, I'm dead, but I can, I can see.

00:05:18.260 --> 00:05:20.260
You know, you had my best interest in mind.

00:05:20.260 --> 00:05:21.480
All right.

00:05:21.480 --> 00:05:22.480
Thanks for thinking of me, man.

00:05:22.480 --> 00:05:22.880
Right.

00:05:22.880 --> 00:05:24.060
Good try.

00:05:24.060 --> 00:05:24.600
Good try.

00:05:24.600 --> 00:05:25.540
A for effort.

00:05:25.540 --> 00:05:26.020
That's right.

00:05:26.020 --> 00:05:29.840
So would you say that the AIs need a babysitter or maybe the other way around?

00:05:29.840 --> 00:05:31.100
Whoa, look at that.

00:05:31.100 --> 00:05:32.840
That was like smooth like butter.

00:05:32.840 --> 00:05:37.020
Speaking of babysitters, I'm not sure how many people have seen this.

00:05:37.020 --> 00:05:42.900
It was kind of an odd little story, but I wanted to, it's like perfectly encapsulates

00:05:42.900 --> 00:05:47.600
well intentioned, but perhaps unforeseen consequences, air quotes AI.

00:05:47.980 --> 00:05:54.420
So a software company called Predictim thought to themselves, you know, it's really tough

00:05:54.420 --> 00:05:55.220
to find a babysitter.

00:05:55.220 --> 00:05:56.740
And it's true.

00:05:56.740 --> 00:05:57.820
You know, you're a parent.

00:05:57.820 --> 00:05:58.540
I'm a parent.

00:05:58.540 --> 00:06:02.520
When you're trying to find somebody to watch your kids, you're like, well, maybe my friends

00:06:02.520 --> 00:06:07.260
use somebody who worked out or maybe I use some online service and they do background checks

00:06:07.260 --> 00:06:07.660
or whatever.

00:06:07.660 --> 00:06:12.380
But it's tough to feel comfortable and confident that the person who's going to come into your

00:06:12.380 --> 00:06:16.720
home and be responsible for your child is a good person, or at least not somebody who

00:06:16.720 --> 00:06:17.400
will put them in danger.

00:06:17.400 --> 00:06:19.320
Like somewhere in that like sweet spot.

00:06:19.320 --> 00:06:24.840
Especially when the baby, when it's a baby, when the baby doesn't speak, it can't report

00:06:24.840 --> 00:06:25.140
to you.

00:06:25.140 --> 00:06:25.640
Yeah.

00:06:25.640 --> 00:06:27.780
The babysitter beat me or the boyfriend came.

00:06:27.780 --> 00:06:30.100
Like it's, you know, you don't even write.

00:06:30.100 --> 00:06:30.740
It's just a baby.

00:06:30.740 --> 00:06:31.380
It doesn't know.

00:06:31.380 --> 00:06:31.920
Exactly.

00:06:31.920 --> 00:06:32.880
It can't, it can't say.

00:06:32.880 --> 00:06:33.160
Right.

00:06:33.240 --> 00:06:34.300
So how do you know?

00:06:34.300 --> 00:06:39.420
Well, the old fashioned way is to, is to use some of the social signals that I mentioned

00:06:39.420 --> 00:06:39.920
before.

00:06:39.920 --> 00:06:45.580
But the new AI way is to use social signals from social media.

00:06:45.920 --> 00:06:49.280
So this gets into kind of creepy territory, I think.

00:06:49.280 --> 00:06:52.160
So, but basically parents have started to turn to this application.

00:06:52.160 --> 00:06:57.580
And, and what, what the system does is that it crawls the social media history of potential

00:06:57.580 --> 00:07:02.000
babysitters and it ranks them on like a risk scale.

00:07:02.000 --> 00:07:03.140
It gives them a risk rating.

00:07:03.400 --> 00:07:07.120
And so it risks them on all sorts of, or it rates them on all sorts of things like

00:07:07.120 --> 00:07:13.620
drug abuse, bullying, harassment, disrespectful, bad attitude, all sorts of things that I guess

00:07:13.620 --> 00:07:15.300
you could get from social media.

00:07:15.300 --> 00:07:19.120
Although I think for most of us, like we would get a five for all of those things because like

00:07:19.120 --> 00:07:20.900
that's just how social media works.

00:07:20.900 --> 00:07:22.660
Like that's what we want out of it, you know?

00:07:22.660 --> 00:07:26.100
But nevertheless, like relatively speaking, I guess you get some kind of rating.

00:07:26.100 --> 00:07:30.840
So it's caused a little bit of controversy for reasons that you might expect.

00:07:30.840 --> 00:07:32.560
Like how would you classify?

00:07:32.680 --> 00:07:36.240
So I should take a step back and say, I'm not sure if most of your listeners will

00:07:36.240 --> 00:07:40.720
be familiar with how an AI system might even go about determining these things.

00:07:40.720 --> 00:07:46.340
So how would I read your social media content and make a judgment that you were at risk of

00:07:46.340 --> 00:07:50.260
bullying or harassment or disrespect? There's no way of knowing, because the system

00:07:50.260 --> 00:07:50.920
doesn't explain.

00:07:50.920 --> 00:07:53.820
So it might be something really simple.

00:07:53.820 --> 00:07:58.560
Like it just looks for quote unquote bullying words, like some keywords that they pulled out

00:07:58.560 --> 00:08:00.060
of a dictionary that are related to bullying.

00:08:00.060 --> 00:08:01.380
So really simple.

00:08:01.960 --> 00:08:07.820
Or it might be that it's trained a machine learning classifier that it's somehow got a hold of

00:08:07.820 --> 00:08:11.540
a bunch of example, bullying tweets or bullying Facebook posts or whatever it is.

00:08:11.540 --> 00:08:16.520
And it said, Oh, I can now recognize bullying content versus non bullying content.

00:08:16.520 --> 00:08:18.260
And it's trying to use that as a rating system.

00:08:18.260 --> 00:08:18.520
Who knows?

00:08:18.520 --> 00:08:18.900
Nobody knows.

00:08:18.900 --> 00:08:23.300
But the same, the idea basically is that it's scoring any potential babysitters and it's

00:08:23.300 --> 00:08:25.580
giving these to parents on a scale of one to five.
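
NOTE
Editor's sketch of the simplest approach described above: keyword lookup mapped onto a 1-to-5 scale.
This is a guess at how such a score could work, not Predictim's actual method, which is not public.
The word list, the function names, and the cut points are all hypothetical.
```python
# Hypothetical sketch: keyword-based "risk" scoring, NOT any vendor's real algorithm.
import re
# Assumed, made-up keyword list; a real system would need far more than this.
BULLYING_WORDS = {"stupid", "loser", "hate", "idiot"}
def keyword_hits(post: str) -> int:
    """Count words in one post that appear in the keyword list."""
    words = re.findall(r"[a-z']+", post.lower())
    return sum(1 for w in words if w in BULLYING_WORDS)
def risk_score(posts: list[str]) -> int:
    """Map the share of flagged posts onto an arbitrary 1-to-5 scale."""
    if not posts:
        return 1
    flagged = sum(1 for p in posts if keyword_hits(p) > 0)
    share = flagged / len(posts)
    # The cut points are arbitrary: nothing says what separates a 2 from a 3.
    return 1 + min(4, int(share * 5))
posts = ["I hate this TV show", "Lovely day at the park", "Babysitting my niece tonight"]
print(risk_score(posts))
```
In this sketch one harmless complaint about a TV show is enough to move the score off the bottom of the scale, and the parent sees only the number.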

00:08:26.060 --> 00:08:34.020
So somehow there's a difference between a risk assessment of one on the bullying scale, a risk assessment

00:08:34.020 --> 00:08:35.760
of two on the bullying scale.

00:08:35.760 --> 00:08:40.540
And so as parents, we'll have to decide in this kind of like arbitrary scale, am I comfortable

00:08:40.540 --> 00:08:45.620
with a disrespectful score of three, but a bad attitude score of one?

00:08:45.620 --> 00:08:46.720
I'm not really sure.

00:08:46.720 --> 00:08:47.740
What are you teaching my child?

00:08:47.740 --> 00:08:50.080
What kind of disrespect are you teaching my child?

00:08:50.080 --> 00:08:54.620
But the AI system has warned me, based on your social media, that maybe, you know, you're not

00:08:54.620 --> 00:08:56.240
like a voice god or whatever it is.

00:08:56.400 --> 00:09:00.600
So in any case, it's like it kind of gets at this idea that like, I think there's this

00:09:00.600 --> 00:09:05.540
dream that we can look at people's digital exhaust on the internet, what they say on social media,

00:09:05.540 --> 00:09:10.900
how they spend their money, the places that they've been, and get some kind of picture

00:09:10.900 --> 00:09:12.460
about who they are as a person.

00:09:12.460 --> 00:09:13.980
So that's the big leap.

00:09:13.980 --> 00:09:18.840
Like you could probably make a guess about how people will behave on social media based

00:09:18.840 --> 00:09:20.240
on how they behave on social media.

00:09:20.240 --> 00:09:23.640
Or you could probably get a sense of what people are likely to buy in the future based on what

00:09:23.640 --> 00:09:24.780
they purchased in the past.

00:09:24.920 --> 00:09:30.280
But that like leap to say, now I know something about you as a person, like how you'll behave

00:09:30.280 --> 00:09:32.520
in other environments where I've never observed you before.

00:09:32.520 --> 00:09:34.380
That's what this application is doing.

00:09:34.380 --> 00:09:36.860
And I think it's like a generally a trend in AI.

00:09:36.860 --> 00:09:43.320
And I'm not sure anybody believes that's actually possible, which is where it's kind of tricky.

00:09:43.320 --> 00:09:47.280
Like, should we use these types of tools in hiring and recruiting and other types of assessments

00:09:47.280 --> 00:09:52.580
that we're making about really, you know, delicate, sensitive things like babysitters or maybe

00:09:52.580 --> 00:09:54.640
less delicate and sensitive things?

00:09:54.640 --> 00:09:59.020
Like, should you be my tax accountant if you get a four out of five for bullying on social

00:09:59.020 --> 00:09:59.260
media?

00:09:59.260 --> 00:10:00.340
Like, I don't know.

00:10:00.340 --> 00:10:01.160
Maybe it's a good thing.

00:10:01.160 --> 00:10:02.640
Maybe you want an aggressive tax accountant.

00:10:02.640 --> 00:10:02.960
I don't know.

00:10:02.960 --> 00:10:03.260
Exactly.

00:10:03.260 --> 00:10:03.540
Exactly.

00:10:03.540 --> 00:10:07.220
But what is a four, relatively speaking, to a three for bullying?

00:10:07.220 --> 00:10:08.060
Like, I don't know.

00:10:08.060 --> 00:10:13.300
Could you give me something on like a, you know, who's the most harassing person on social

00:10:13.300 --> 00:10:13.540
media?

00:10:13.540 --> 00:10:14.660
And we'll just call them a five.

00:10:14.700 --> 00:10:16.920
And then everybody else is ranked on that person's scale.

00:10:16.920 --> 00:10:18.880
I'm like, I'm trying not to call out anybody in particular.

00:10:18.880 --> 00:10:22.820
Like, I'm really, really trying not to call out anybody in particular.

00:10:22.820 --> 00:10:24.420
But I mean, I don't know.

00:10:24.420 --> 00:10:28.080
So anyway, that's the debate that this has stirred.

00:10:28.080 --> 00:10:28.960
So like, Predictim.

00:10:28.960 --> 00:10:29.200
I'm sure.

00:10:29.200 --> 00:10:30.000
Well-intentioned.

00:10:30.000 --> 00:10:33.160
I'd love, you know, to know more about it.

00:10:33.320 --> 00:10:36.500
It sounds so like Orwellian and creepy.

00:10:36.500 --> 00:10:40.360
But let me just read you the quote from one of the founders, one of the people that works there.

00:10:40.360 --> 00:10:43.540
Because it sounds so like something you would want.

00:10:43.540 --> 00:10:47.400
It says, if you search, this is like one of the people from Predictim speaking.

00:10:47.400 --> 00:10:51.520
It says, if you search for abusive babysitters on Google, you'll see hundreds of results right

00:10:51.520 --> 00:10:51.780
now.

00:10:51.780 --> 00:10:55.340
There are people out there who either have mental illness or just born evil.

00:10:55.340 --> 00:10:57.660
Our goal is to do anything we can to stop them.

00:10:57.660 --> 00:11:03.800
And when you think of that description and your brand new baby alone, like, you really

00:11:03.800 --> 00:11:05.080
don't want to put them together.

00:11:05.080 --> 00:11:06.320
So it sounds so good.

00:11:06.320 --> 00:11:09.060
But it's also got this really dark side, right?

00:11:09.060 --> 00:11:10.580
It does.

00:11:10.580 --> 00:11:11.640
Yeah, I agree.

00:11:11.640 --> 00:11:15.620
Like, I think that the Predictim folks are pretty well-intentioned.

00:11:15.620 --> 00:11:24.460
But it does highlight what are the unintended consequences of maybe giving too much weight to the output

00:11:24.460 --> 00:11:26.380
of a data science model.

00:11:26.380 --> 00:11:31.780
Like, just because we can package it in a data science workflow, just because it kind

00:11:31.780 --> 00:11:37.300
of walks and talks like something that is algorithmically decided and therefore objective, like, it's

00:11:37.300 --> 00:11:39.040
not necessarily objective at all.

00:11:39.040 --> 00:11:43.340
Like, it's just as easy for me to encode my subjective bias into a machine learning model

00:11:43.340 --> 00:11:45.400
as it is for me to act on that bias in real life.

00:11:45.400 --> 00:11:50.420
So it's difficult for people who aren't familiar with data science to, I think, recognize that.

00:11:50.420 --> 00:11:52.960
And it has these really strange implications.

00:11:52.960 --> 00:11:53.860
What about all those babies?

00:11:53.860 --> 00:11:55.920
Not the babysitters who are actually abusive.

00:11:56.180 --> 00:11:57.660
Like, for sure, let's get rid of those.

00:11:57.660 --> 00:12:03.660
But like, but what about the, you know, what about the people who have, who are otherwise excellent

00:12:03.660 --> 00:12:11.060
babysitters, but at one point, you know, said something mean about a TV show they didn't like on social media, and their

00:12:11.060 --> 00:12:12.540
bullying ranking went through the roof.

00:12:12.540 --> 00:12:15.000
And so now they can't get jobs as a babysitter anymore.

00:12:15.580 --> 00:12:22.200
Like, what are we, I think we need to think about how we're being transparent about the algorithms that we're using to make these sorts of choices.

00:12:22.200 --> 00:12:29.680
And, and of course, that's a difficult thing for those developing those algorithms, because you want to, you know, it's your, it's your, it's your sort of

00:12:29.680 --> 00:12:32.300
special sauce, your, you know, your secret sauce for your product.

00:12:32.300 --> 00:12:33.740
So there's a real tension there.

00:12:33.740 --> 00:12:35.540
And I'm not sure we're getting it right just yet.

00:12:35.640 --> 00:12:36.920
Yeah, it's, it's pretty crazy.

00:12:36.920 --> 00:12:40.100
I guess a few parting thoughts is like, okay, for babysitters, right?

00:12:40.100 --> 00:12:42.720
That's usually a part time small thing for most folks.

00:12:42.720 --> 00:12:43.720
It's not right.

00:12:43.720 --> 00:12:47.780
If your babysitting career goes a little sideways, you could do something else.

00:12:47.920 --> 00:12:50.780
But a lot of this is applied to all sorts of jobs.

00:12:50.780 --> 00:13:04.460
Like they talked about a company called Fama that uses AI to police all of the workers at like these companies or how Amazon actually canceled what became a clearly biased algorithm for hiring across all of Amazon.

00:13:04.460 --> 00:13:07.960
Then it starts to have more serious social effects, right?

00:13:07.960 --> 00:13:08.360
Yeah.

00:13:08.360 --> 00:13:16.100
And I think this is coming at a moment when it's like a reckoning for Silicon Valley and technology and machine learning in general.

00:13:16.680 --> 00:13:26.720
We've been, I think, almost naively assuming that everybody who uses our technology will use it with the best intentions or the intentions that we had when we developed it.

00:13:26.720 --> 00:13:31.300
And I think the theme of 2018 is like actually not.

00:13:31.300 --> 00:13:37.020
It's actually possible for things to go horribly wrong in ways that you didn't intend.

00:13:37.020 --> 00:13:38.100
And I think it's good.

00:13:38.100 --> 00:13:45.240
I think it's a good time to have this debate about how close are we actually to encoding our real life values into our digital technologies.

00:13:45.440 --> 00:13:48.080
Like maybe, maybe we're just not that good at it yet.

00:13:48.080 --> 00:13:48.380
Yeah.

00:13:48.380 --> 00:13:48.740
Yeah.

00:13:48.740 --> 00:13:50.500
It's, we're definitely new on it.

00:13:50.500 --> 00:13:50.780
Yeah.

00:13:50.780 --> 00:13:52.900
And I feel like the goodwill tours.

00:13:52.900 --> 00:13:54.260
Exactly.

00:13:54.260 --> 00:13:57.460
I feel like a lot of the goodwill has kind of left that, that space a little bit.

00:13:57.460 --> 00:13:58.340
So we've got to be careful.

00:13:58.340 --> 00:14:05.180
Something that is not new was invented in the 1600s and the Renaissance, right?

00:14:05.200 --> 00:14:07.660
Like we're talking the scientific paper.

00:14:07.660 --> 00:14:08.060
What?

00:14:08.060 --> 00:14:13.780
You know, Kepler's orbiting of the planets and all those things, right?

00:14:13.780 --> 00:14:14.400
Written up.

00:14:14.400 --> 00:14:18.140
So actually the scientific paper turned out to be a big invention.

00:14:18.140 --> 00:14:22.860
It used to be you had to write books, or you just wouldn't write down your inventions.

00:14:22.860 --> 00:14:29.400
Or they'd be like private papers, like Einstein wrote to Niels Bohr or something like, we just have to run across those papers, right?

00:14:29.460 --> 00:14:36.720
But it turns out that the scientific paper has been around and basically unchanged for like 400 years.

00:14:36.720 --> 00:14:42.600
I would say science has changed and certainly the dependence upon computing has changed, right?

00:14:42.600 --> 00:14:43.580
Yeah, absolutely.

00:14:43.580 --> 00:14:48.140
I mean, I'm, I'm not in most of the like quote unquote hard sciences and even in social science.

00:14:48.140 --> 00:14:53.260
I think almost the majority of the work is, I have no basis of that.

00:14:53.260 --> 00:14:57.320
That's probably fake news, but, but it does, you know, a lot of it is data driven.

00:14:57.320 --> 00:14:59.260
It requires some type of computation for sure.

00:14:59.340 --> 00:15:03.540
And difficult to reproduce papers without having access to that data and computation.

00:15:03.540 --> 00:15:03.960
Right.

00:15:03.960 --> 00:15:07.100
And a lot of times the computations were not published, just the results.

00:15:07.100 --> 00:15:09.760
But I think that's starting to give way.

00:15:09.760 --> 00:15:16.580
And one of the articles, this was published in The Atlantic, and it's a serious research project really.

00:15:16.580 --> 00:15:19.700
It's called "The Scientific Paper Is Obsolete."

00:15:19.700 --> 00:15:20.460
Okay.

00:15:20.460 --> 00:15:21.520
That's a big statement.

00:15:21.520 --> 00:15:23.040
That is a big statement, right?

00:15:23.040 --> 00:15:26.920
And the graphic is super intense, right?

00:15:26.920 --> 00:15:27.500
On that homepage.

00:15:27.500 --> 00:15:27.860
Yes.

00:15:27.860 --> 00:15:28.420
Yes.

00:15:28.940 --> 00:15:37.800
So it's, it's this sort of traditional scientific paper with lots of, there's probably 15 authors and all sorts of, you know, stuff on there.

00:15:37.800 --> 00:15:39.320
And it's literally on fire.

00:15:39.320 --> 00:15:57.060
And it turns out to be a really interesting historical analysis, like sort of a storytelling of how do we go from scientific paper to closed source software with egomaniac sort of folks leading it like Mathematica to Python and open source and Jupyter and all that.

00:15:57.140 --> 00:15:57.580
Right.

00:15:57.580 --> 00:15:57.580
Yeah.

00:15:57.580 --> 00:15:57.880
Yeah.

00:15:57.880 --> 00:16:08.100
And I mean, I think it's interesting to see even the shift away from the traditional publishing model for anybody who's gotten into any type of research before that.

00:16:08.260 --> 00:16:14.580
It used to be that research only happened primarily at academic institutions and everything goes through peer reviewed journals.

00:16:14.580 --> 00:16:15.320
But almost like in the tradition of open source.

00:16:15.320 --> 00:16:15.760
And I think it's like, yeah.

00:16:19.480 --> 00:16:22.580
It's like a lot of people posting on something called arXiv.

00:16:22.580 --> 00:16:32.740
So they'll write like in the style of a peer reviewed paper, but before it's peer reviewed, because partially because the technology changes so quickly, but also because they want to be open and transparent about their work.

00:16:32.920 --> 00:16:43.100
They are uploading basically to this website called arXiv where you can search academic papers prior to them being published in some type of peer reviewed journal or in a conference or whatever, which is super interesting.

00:16:43.100 --> 00:16:46.080
And I think it gives people a lot more access to a lot more techniques.

00:16:46.080 --> 00:16:48.060
It feels kind of like posting your code to GitHub.

00:16:48.060 --> 00:16:55.080
And this is anecdotal, but at least what I find is that the code that sits behind some of these papers is usually available on GitHub.

00:16:55.080 --> 00:17:02.380
Like the same types of authors who post to arXiv, I find are the ones that also say, oh, by the way, here's the GitHub repository so you can go run it yourself.

00:17:02.380 --> 00:17:17.320
And like, here's the sample data set that I used, which really gets at this idea that there's something to be said about reproducing these findings, especially for these like complicated.

00:17:17.320 --> 00:17:24.220
And I mean, I'm thinking mostly from the perspective of new machine learning developments, but these like these complicated new modeling techniques have to be replicable to be usable.

00:17:24.580 --> 00:17:29.340
And in the same way that like your open source project is kind of only as good as it is usable.

00:17:29.340 --> 00:17:34.500
So you might have the best new like JavaScript MVC framework.

00:17:34.500 --> 00:17:35.420
Those probably aren't cool anymore.

00:17:35.420 --> 00:17:40.000
But like when I was a software developer, everybody, blah, blah, blah.

00:17:40.000 --> 00:17:42.220
But like, whatever, you might have the coolest new like JavaScript app.

00:17:42.220 --> 00:17:45.080
But like if nobody uses it, it doesn't really matter how good it is.

00:17:45.080 --> 00:17:49.860
Like the ones that ultimately the community rallies around are the ones that are the most usable, the most accessible, the most transparent.

00:17:49.860 --> 00:17:53.480
And I think it's interesting to see that creeping into research as well.

00:17:53.660 --> 00:17:59.020
So, I mean, I really dig the idea that perhaps it's just a model that needs to change.

00:17:59.020 --> 00:17:59.260
Yeah.

00:17:59.260 --> 00:18:03.040
And a lot of this is a pretty deep research piece here.

00:18:03.040 --> 00:18:03.940
It's quite the article.

00:18:03.940 --> 00:18:05.040
It's not just a few pages.

00:18:05.040 --> 00:18:10.920
And it really digs into the history of Mathematica, how it became really important for computation

00:18:10.920 --> 00:18:11.840
and research.

00:18:12.000 --> 00:18:13.860
But it just didn't really get accepted.

00:18:13.860 --> 00:18:25.040
And then along comes Perez and Granger and those folks with their IPython and their open source and their not centralized way of doing things.

00:18:25.040 --> 00:18:29.620
And it's just really interesting how that's become embraced by science and data science.

00:18:29.780 --> 00:18:32.200
And I think how it's actually influencing science.

00:18:32.200 --> 00:18:39.040
Like I hear all these scientists who I speak to talking about embracing open source principles and styles and more engineering.

00:18:39.040 --> 00:18:41.080
And I think all of that is being brought to them from this.

00:18:41.080 --> 00:18:41.480
Oh, yeah.

00:18:41.560 --> 00:18:47.540
And I think it's even being embedded in the way that students are educated now, which is totally different.

00:18:48.080 --> 00:18:56.540
I mean, I think in the article they talk about one of the authors of one of the open source projects, open source Python projects, like has a faculty appointment in the stats department at Berkeley now.

00:18:56.540 --> 00:19:06.540
And I know that Brian Granger's work and his team focused on Jupyter Notebook is I think they're also based at Berkeley or not at Stanford.

00:19:06.540 --> 00:19:10.140
But I mean, they're embedded inside a university department and they do some teaching as well.

00:19:10.140 --> 00:19:16.740
And they're starting to teach courses that are entirely based around this like open source, like open source science, open science workflow.

00:19:16.740 --> 00:19:27.700
So like Python as a programming language and then Jupyter Notebooks, which is if you're not familiar with Jupyter Notebooks, it's basically a way to like execute snippets of code sequentially, but in a web based environment.

00:19:27.700 --> 00:19:32.760
So, you know, you can write a little bit of code, you can run it, you can see what the output is, and then you can build on that.

00:19:32.760 --> 00:19:34.700
And then you can share your notebooks as you go.

00:19:34.700 --> 00:19:44.320
So in the same way that like it basically captures the rough draft process of finding your way towards a data science solution or really any type of programming solution.

00:19:44.320 --> 00:19:45.960
But often they're used by data scientists.

00:19:46.560 --> 00:20:01.180
And these sorts of tools, I think, have made it possible to share like your entire thought process, which is a really important part, like kind of like showing your work to getting to the results that you need to, which I think is maybe more specific to data science and machine learning than it is to most types of software engineering.

00:20:01.180 --> 00:20:04.360
Because like in software engineering, it works or it doesn't.

00:20:04.360 --> 00:20:05.620
And if it works fast enough,

00:20:05.620 --> 00:20:07.220
then like, all right, dope.

00:20:07.220 --> 00:20:07.980
Like, let's move on.

00:20:07.980 --> 00:20:09.780
Like, check that box and move on to the next task.

00:20:09.780 --> 00:20:10.700
Push the button.

00:20:10.700 --> 00:20:11.280
It did the thing.

00:20:11.280 --> 00:20:11.560
We're good.

00:20:11.560 --> 00:20:12.100
All right.

00:20:12.100 --> 00:20:12.700
Fantastic.

00:20:12.700 --> 00:20:14.640
I'm going to close that Jira ticket and march right along.

00:20:15.100 --> 00:20:17.820
That is true to a certain extent in data science.

00:20:17.820 --> 00:20:19.700
Like, you know, your code runs or it doesn't.

00:20:19.700 --> 00:20:26.480
But often we're trying to evaluate like what's the quality of the findings or like what's the quality of the predictions that you're making and what tradeoffs is your model making.

00:20:26.480 --> 00:20:29.720
And like all these like more like fine grained decisions.

00:20:29.720 --> 00:20:36.220
It really matters like what's behind the curtain with most of the data science work and most of these research papers.

00:20:36.220 --> 00:20:43.800
And so I think that's why these tools have basically like figured out how to capture all of that in a way that makes it really useful.

00:20:43.800 --> 00:20:48.380
Like really usable, really easy to share, really open and transparent, which is why I think they've caught on.

00:20:48.380 --> 00:20:54.380
Like they've caught on because they're usable and they have this great byproduct, like this great knock-on effect, that they make all of our work more transparent.

00:20:54.780 --> 00:20:57.380
I see a lot of promise and I definitely see things going this way.

00:20:57.380 --> 00:20:58.380
I think it's really good.

00:20:58.380 --> 00:21:10.620
One of the things, as you were describing that to me, that really resonates is I feel like a lot of science is dull and boring to people who are not super into it because it's been massively sterilized down to its essence.

00:21:11.180 --> 00:21:11.340
All right.

00:21:11.340 --> 00:21:12.200
Here's the formula.

00:21:12.200 --> 00:21:17.000
You plug this in and you get the volume of gas by temperature or whatever.

00:21:17.000 --> 00:21:17.280
Right.

00:21:17.280 --> 00:21:18.640
You're like, well, who cares about that?

00:21:18.640 --> 00:21:19.560
That's so boring.

00:21:19.560 --> 00:21:19.820
Right.

00:21:19.820 --> 00:21:34.800
But if the story is like embedded with it and the thinking process that led up to it, it's so interesting, like one of the most interesting classes I ever had was this combination of sort of the history of mathematics and something called real analysis for the math people out there where we basically recreate calculus.

00:21:35.280 --> 00:21:43.160
But from like the thinking about the building blocks and it was just so interesting because it had all the history of the people who created it in there.

00:21:43.160 --> 00:21:45.640
Whereas if you just learn the formulas, you're like, well, this is boring.

00:21:45.640 --> 00:21:46.040
Forget this.

00:21:47.140 --> 00:21:54.720
This has the possibility of keeping more of the thinking and the exploration in the paper and in the reporting.

00:21:54.720 --> 00:21:56.020
Oh, yeah, absolutely.

00:21:56.020 --> 00:21:59.760
Which I agree is like is fascinating because it is an investigation at the end of the day.

00:21:59.760 --> 00:22:14.900
And, you know, one of the authors of the Jupyter, like one of the leads of the Jupyter team, it's kind of meta because like his development of Jupyter is now part of this like upper level data science course at Berkeley in which all the students use Jupyter for all of the data science work that they're doing.

00:22:15.000 --> 00:22:16.160
Like it's cool.

00:22:16.160 --> 00:22:18.640
It's like, you know, it's notebooks all the way down.

00:22:18.640 --> 00:22:18.980
Yeah.

00:22:18.980 --> 00:22:19.260
Yeah.

00:22:19.260 --> 00:22:20.360
Although I didn't know.

00:22:20.360 --> 00:22:28.600
I mean, I know that you like pulled this out of the article, which I thought was really spot on, that the name Jupyter is actually in honor of Galileo.

00:22:28.600 --> 00:22:32.980
So like going back to an early scientist, like going way back into history.

00:22:33.120 --> 00:22:45.900
So it's like we haven't forgotten where we came from, like the scientific method and how that gets encapsulated in this structure that we've all accepted as, you know, research papers, that we're standing on the shoulders of giants.

00:22:45.900 --> 00:22:51.440
We're just moving forward into this new like kind of rapidly iterative, open source, more transparent era, which is cool.

00:22:51.520 --> 00:22:54.780
Like why shouldn't research be democratized with all other types of information?

00:22:54.780 --> 00:22:55.960
Like thank you, Internet.

00:22:55.960 --> 00:22:59.880
And we're not forgetting where we came from, which I think is really important.

00:22:59.880 --> 00:23:01.420
Like we don't want to throw the baby out with the bathwater.

00:23:01.420 --> 00:23:02.140
Yeah, absolutely.

00:23:02.140 --> 00:23:07.880
There's so many interesting parallels between sort of early scientific discovery and open source versus closed source.

00:23:07.880 --> 00:23:09.300
We'll come back to this actually.

00:23:09.540 --> 00:23:14.560
But like part of his quote that you were pointing out is like Galileo couldn't go anywhere and buy a telescope.

00:23:14.560 --> 00:23:15.660
So he had to build his own.

00:23:15.660 --> 00:23:19.200
It's sort of like, you know, we just put it on GitHub and we had to just make it.

00:23:19.200 --> 00:23:19.740
It wasn't there.

00:23:19.740 --> 00:23:20.180
It's awesome.

00:23:20.180 --> 00:23:21.760
You got to scratch your own itch.

00:23:21.760 --> 00:23:24.980
You know, if you're fine, I'll build it myself.

00:23:24.980 --> 00:23:26.080
I was going to take the weekend off.

00:23:26.080 --> 00:23:27.960
But, you know, whatever world.

00:23:27.960 --> 00:23:28.940
Come on, we're doing this.

00:23:28.940 --> 00:23:35.100
This portion of Talk Python To Me is brought to you by us.

00:23:35.100 --> 00:23:39.360
Have you heard that Python is not good for concurrent programming problems?

00:23:39.860 --> 00:23:44.940
Whoever told you that is living in the past because it's prime time for Python's asynchronous features.

00:23:44.940 --> 00:23:49.500
With the widespread adoption of async methods and the async and await keywords,

00:23:49.500 --> 00:23:54.620
Python's ecosystem has a ton of new and exciting frameworks based on async and await.

00:23:54.620 --> 00:23:59.720
That's why we created a course for anyone who wants to learn all of Python's async capabilities,

00:23:59.720 --> 00:24:02.180
async techniques and examples in Python.

00:24:02.180 --> 00:24:08.080
Just visit talkpython.fm/async and watch the intro video to see if this course is for you.

00:24:08.400 --> 00:24:10.540
It's only $49 and you own it forever.

00:24:10.540 --> 00:24:11.580
No subscriptions.

00:24:11.580 --> 00:24:14.320
And there are discounts for teams as well.

00:24:16.540 --> 00:24:24.680
So this next one, we were speaking about advanced machine learning and AI and analyzing social media sentiment analysis.

00:24:24.680 --> 00:24:28.580
This next one is more about algorithms and less about AI.

00:24:28.580 --> 00:24:31.620
It probably could be implemented with if statements, but it's actually pretty evil.

00:24:31.620 --> 00:24:34.380
And just the focus on algorithms at the core is pretty interesting.

00:24:34.380 --> 00:24:35.120
I thought so.

00:24:35.120 --> 00:24:38.500
Although I would like to point out that I think half the time that people are talking about AI,

00:24:38.500 --> 00:24:42.680
in air quotes, they're talking about a thing that could have been implemented with an if statement.

00:24:42.820 --> 00:24:46.740
And in fact, I would argue, I know that there's technical definitions of AI,

00:24:46.740 --> 00:24:49.980
but what most people mean is software making a decision for me.

00:24:49.980 --> 00:24:51.480
And it's like, well, there is a way that software does that.

00:24:51.480 --> 00:24:52.240
It is an if statement.

00:24:52.240 --> 00:24:54.840
It's AI.

00:24:54.840 --> 00:24:58.520
If you've written an if statement in your code, your freshman programming class,

00:24:58.520 --> 00:24:59.520
you've written AI.

00:24:59.520 --> 00:25:00.600
You just hang your head up.

00:25:00.600 --> 00:25:03.520
It could go crazy and do a switch statement.

00:25:03.520 --> 00:25:04.520
It may seem sort of wow.

00:25:04.520 --> 00:25:07.380
But yeah, one of these branching things.

00:25:07.380 --> 00:25:08.680
The decisions are endless.

00:25:08.680 --> 00:25:10.420
You know, theoretically endless.

00:25:10.420 --> 00:25:13.720
As many decisions as you want to spend time programming by hand.

00:25:13.720 --> 00:25:13.960
Yeah.

00:25:13.960 --> 00:25:17.500
So this one goes back to the sort of everything's going to be used for good, right?

00:25:17.500 --> 00:25:21.600
I'm not sure how anybody thought this could have been used for good, actually.

00:25:21.600 --> 00:25:25.040
Like this algorithm was just intentionally designed to screw everybody over.

00:25:25.160 --> 00:25:29.500
So, you know, this episode is going to come out around the holidays and people will be traveling.

00:25:29.500 --> 00:25:30.100
People will be traveling.

00:25:30.100 --> 00:25:36.240
And if you are traveling with your family, you may have had this experience where you're like

00:25:36.240 --> 00:25:39.600
trying to buy cheap plane tickets because there's like you and like all your kids and you're

00:25:39.600 --> 00:25:41.940
traveling across the country at an expensive time to travel.

00:25:41.940 --> 00:25:45.320
And you're trying to like get to your parents' house in time for Christmas.

00:25:45.320 --> 00:25:47.180
And it's really stressful and you're annoyed.

00:25:47.180 --> 00:25:52.260
And then you book your tickets and you try and get seats together and you can't do it.

00:25:52.260 --> 00:25:53.540
And you're like, wait a second.

00:25:53.540 --> 00:25:54.420
Come on, airline.

00:25:54.620 --> 00:25:57.060
Like, you know, I booked all these tickets at the same time.

00:25:57.060 --> 00:26:00.160
I'm clearly traveling with like a couple of kids like under 10 years old.

00:26:00.160 --> 00:26:00.860
Come on.

00:26:00.860 --> 00:26:03.120
Like this is hard enough as it is.

00:26:03.120 --> 00:26:03.900
Shake fists.

00:26:03.900 --> 00:26:06.880
Surely they know they should put the kids with the parents, right?

00:26:06.880 --> 00:26:07.160
Right.

00:26:07.160 --> 00:26:08.100
That's like a common.

00:26:08.100 --> 00:26:11.780
It's like, why is this system not smart enough to recognize that?

00:26:11.780 --> 00:26:13.380
But it turns out it is smart.

00:26:13.380 --> 00:26:17.760
It's just smart in exactly the evil opposite way that you don't want it to be.

00:26:19.260 --> 00:26:31.900
Because it turns out that at least in the UK, some airlines are using algorithms not to put families together, which is what we all assume they would be doing, but to intentionally split us up.

00:26:32.120 --> 00:26:34.560
And you might ask yourself, why?

00:26:35.160 --> 00:26:38.020
Like, are you just like, you know, like a sadist?

00:26:38.020 --> 00:26:38.420
I don't know.

00:26:38.420 --> 00:26:40.460
Like, what's the somewhere?

00:26:40.460 --> 00:26:46.240
There's like, you know, the developer of this algorithm at the back of every airplane twiddling his thumbs like Mr. Burns and like laughing maniacally.

00:26:46.640 --> 00:26:47.080
Exactly.

00:26:47.080 --> 00:26:53.560
But it's because that way they can charge people more money so that they pay to sit together.

00:26:53.560 --> 00:27:08.740
They're like, oh, do you not want the inconvenience of asking 47 people whether or not they're going to switch with you so that you can sit with your child, who may or may not have the like emotional fortitude and maturity to not like freak out by themselves on an airplane going across the country.

00:27:09.280 --> 00:27:09.680
Yes.

00:27:09.680 --> 00:27:13.500
So apparently this is common practice with a number of airlines.

00:27:13.500 --> 00:27:17.900
They are algorithmically looking at people who share last names.

00:27:17.900 --> 00:27:26.640
So if you have a common surname, you will not be seated with each other when the seats are assigned, which seems really uncool.
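
NOTE
Editor's sketch, not from the episode: the rule described above relies on spotting travellers on one booking who share a last name. The snippet below shows only that matching step, under the assumption of naive "last word of the name" surname extraction. The function name group_by_surname and the sample booking are hypothetical, and nothing here reflects any airline's actual system. The same signal could just as easily be used to seat a group together.
```python
from collections import defaultdict
def group_by_surname(passengers):
    """Group passenger names on one booking by last name (hypothetical helper)."""
    groups = defaultdict(list)
    for full_name in passengers:
        # Assumption: the surname is the last whitespace-separated word.
        surname = full_name.split()[-1].lower()
        groups[surname].append(full_name)
    # Keep only surnames shared by two or more travellers.
    return {s: names for s, names in groups.items() if len(names) > 1}
# Made-up booking used purely for illustration.
booking = ["Ana Lopez", "Luis Lopez", "Sam Reed"]
families = group_by_surname(booking)
print(families)
```
Real bookings would need more care than this, since shared surnames do not always mean a family and families do not always share a surname.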

00:27:26.640 --> 00:27:27.700
I just come on.

00:27:27.700 --> 00:27:32.240
So you can pay for the reserved seat, the extra twenty seven dollars per traveler or whatever.

00:27:32.240 --> 00:27:32.580
Right.

00:27:32.580 --> 00:27:32.780
Right.

00:27:32.780 --> 00:27:39.180
How much do you care about your kids or like how uncomfortable are you with asking a bunch of strangers during the holidays?

00:27:39.180 --> 00:27:46.580
To switch with you when it becomes that like really complicated calculus problem where you're like, well, my wife's sitting seven seats up and she's at the window.

00:27:46.580 --> 00:27:49.960
But we're also traveling with my son who's four seats up on the aisle.

00:27:49.960 --> 00:27:56.560
So if you switch with her and this person in row 47 switches with him, then I actually think that we could get, you know, approximately close to each other.

00:27:56.560 --> 00:27:57.820
It's like it's absurd.

00:27:57.820 --> 00:28:00.800
So if you don't want to go through that, yeah, you can just like pay an extra 50 bucks.

00:28:00.800 --> 00:28:03.040
But like wouldn't the decent thing to do just be.

00:28:04.600 --> 00:28:09.880
It's so evil to split the families apart so that they'll pay to go back together.

00:28:09.880 --> 00:28:13.580
Although, to be clear, I'm well out of the range where this matters.

00:28:13.580 --> 00:28:13.800
Right.

00:28:13.800 --> 00:28:16.160
My kids could sit alone and it wouldn't be that big of a deal.

00:28:16.160 --> 00:28:20.600
But my thought is, all right, evil airline and your algorithms.

00:28:20.840 --> 00:28:26.520
I see your play and I raise you a lone child, at three, in the back by the business traveler.

00:28:26.520 --> 00:28:28.040
I'm going to bed now.

00:28:28.040 --> 00:28:29.040
Thank you very much.

00:28:29.040 --> 00:28:33.780
I love it.

00:28:33.780 --> 00:28:35.520
Just like probably shouldn't implement it.

00:28:35.520 --> 00:28:38.160
But it seems like, you know, you could just turn it around.

00:28:38.160 --> 00:28:38.780
I love it.

00:28:38.780 --> 00:28:39.440
It's like, yeah.

00:28:39.440 --> 00:28:40.180
Oh, you know what we did?

00:28:40.180 --> 00:28:41.020
We skipped nap.

00:28:41.020 --> 00:28:48.400
You know who had and I just like dumped like three, you know, pop rock sugar packets down their throat right before we got on the plane.

00:28:48.560 --> 00:28:50.840
Because, you know, in 30 minutes they're going to freak out.

00:28:50.840 --> 00:28:52.360
Yeah, exactly.

00:28:52.360 --> 00:28:52.940
I like it.

00:28:52.940 --> 00:28:53.620
I like it.

00:28:53.620 --> 00:28:58.680
Let's see like how many kids we can stack up, how many screaming children we can stack up on airplane at holiday.

00:28:58.680 --> 00:28:59.420
It's really bad.

00:28:59.420 --> 00:29:00.120
It's really bad.

00:29:00.120 --> 00:29:01.760
Something like this actually happened to my daughter.

00:29:01.760 --> 00:29:04.900
Not from algorithms, just other bad, bad stuff.

00:29:04.900 --> 00:29:06.180
So it's not great.

00:29:06.180 --> 00:29:08.880
And I do think it's really evil of the airlines to do this.

00:29:08.880 --> 00:29:09.300
It is.

00:29:09.300 --> 00:29:09.560
It is.

00:29:09.600 --> 00:29:13.240
But what was interesting is that it actually got referred to in the UK.

00:29:13.240 --> 00:29:18.440
There's a new government organization called the Center for Data Science Ethics and Innovation.

00:29:18.440 --> 00:29:19.900
So they actually have.

00:29:19.900 --> 00:29:20.500
That's crazy.

00:29:20.500 --> 00:29:21.180
It's cool, right?

00:29:21.180 --> 00:29:21.500
It's cool.

00:29:21.500 --> 00:29:21.780
So like.

00:29:21.780 --> 00:29:22.760
Yeah, it's very cool.

00:29:22.760 --> 00:29:23.000
Yeah.

00:29:23.000 --> 00:29:24.700
And like we have similar types of stuff here in the US.

00:29:24.820 --> 00:29:29.880
There's an Office of Science and Technology Policy that's in like in the administration, like it's part of the White House.

00:29:29.880 --> 00:29:39.320
And so anybody who kind of follows that sort of stuff or is interested in data science, like that's where our first US chief data scientist sits, is in the OSTP.

00:29:39.320 --> 00:29:44.320
I don't think there is a data science, chief data scientist in the current administration yet.

00:29:44.320 --> 00:29:45.980
But there's a chief technology officer.

00:29:45.980 --> 00:29:48.280
And anyway, it's where like it's where the geeks sit.

00:29:48.280 --> 00:29:50.980
So but in the UK, there's a Center for Data Science Ethics and Innovation.

00:29:50.980 --> 00:29:53.700
And this case was actually referred to them.

00:29:54.060 --> 00:29:57.020
So I think they've just formed and they just formed.

00:29:57.020 --> 00:30:00.720
And basically they got handed, like, this is the most offensive thing that algorithms have done.

00:30:00.720 --> 00:30:02.000
Good luck with it.

00:30:02.000 --> 00:30:03.560
Center for Data Science Ethics and Innovation.

00:30:03.560 --> 00:30:05.700
Like fix this, man.

00:30:05.700 --> 00:30:07.380
They're like, great.

00:30:07.380 --> 00:30:08.380
This is why we exist.

00:30:08.380 --> 00:30:09.020
Oh, my gosh.

00:30:09.020 --> 00:30:09.380
It is.

00:30:09.380 --> 00:30:09.700
It is.

00:30:09.700 --> 00:30:12.600
I guess like, you know, it's like because it's a real softball.

00:30:12.600 --> 00:30:16.720
You know, it's like what would the ethical thing to do be in this situation?

00:30:16.720 --> 00:30:17.780
Like, I don't I don't know.

00:30:17.780 --> 00:30:19.000
Gosh, I'm I'm spent.

00:30:19.000 --> 00:30:23.700
I couldn't I couldn't possibly come up with a better, more ethical alternative.

00:30:24.020 --> 00:30:25.520
than splitting parents up from their children.

00:30:25.520 --> 00:30:28.160
Even bureaucrats could totally solve this.

00:30:28.160 --> 00:30:30.040
Yeah, for sure.

00:30:30.040 --> 00:30:30.260
Yeah.

00:30:30.260 --> 00:30:31.980
So, you know, that's a thing.

00:30:31.980 --> 00:30:35.000
I mean, it seems like one of those like open and shut cases.

00:30:35.000 --> 00:30:37.900
I think government ministers are calling it exploitative.

00:30:38.280 --> 00:30:40.940
So that's usually not a good sign for your business practices.

00:30:40.940 --> 00:30:43.100
No, but I.

00:30:43.100 --> 00:30:44.520
Yeah, no, that's a bad start.

00:30:44.520 --> 00:30:45.000
It's a bad start.

00:30:45.000 --> 00:30:45.500
It's a bad start.

00:30:45.500 --> 00:30:46.200
I think they may.

00:30:46.200 --> 00:30:47.700
You know, they've got nowhere to go but up.

00:30:47.700 --> 00:30:48.900
We can say that, you know.

00:30:48.900 --> 00:30:52.440
That's kind of like a pun, I guess, because there are airplanes that go up in the air.

00:30:52.480 --> 00:30:57.820
But anyway, but it brings me to something that I was really excited to talk about with all

00:30:57.820 --> 00:31:04.300
of your listeners because it's something that's important to me personally and something that

00:31:04.300 --> 00:31:11.820
I'm involved with is actually an ethics project for data scientists so that hopefully we could

00:31:11.820 --> 00:31:13.920
prevent these types of mishaps in the future.

00:31:14.340 --> 00:31:17.560
So as I mentioned at the top of the show, I'm involved in an organization called Data

00:31:17.560 --> 00:31:18.020
for Democracy.

00:31:18.020 --> 00:31:19.180
It's a nonprofit.

00:31:19.180 --> 00:31:26.320
And we have recently launched what we're calling our ethical principles for data practitioners.

00:31:26.320 --> 00:31:30.060
So the global data ethics principles.

00:31:30.060 --> 00:31:33.740
So this is like the Hippocratic oath, like the doctors take, but for.

00:31:33.740 --> 00:31:34.860
Exactly, exactly.

00:31:34.860 --> 00:31:39.560
And so because like we mentioned, this has been kind of a bad year, I would say, for technology

00:31:39.560 --> 00:31:43.620
in general and technologists and, you know, Silicon Valley and Silicon Valley culture

00:31:43.620 --> 00:31:45.420
and data science and machine learning and AI.

00:31:45.420 --> 00:31:49.800
And everybody's wondering, like, well, is this a good thing for society?

00:31:49.800 --> 00:31:50.560
Is it not?

00:31:50.560 --> 00:31:51.700
Like, how did we get here?

00:31:51.700 --> 00:31:57.060
Like, how did we kind of like stumble into this dystopia where our minds are being manipulated

00:31:57.060 --> 00:32:00.320
by propagandists on Facebook, and in China

00:32:00.320 --> 00:32:03.880
they're doing social credit like that terrible Black Mirror episode that I saw that we'll get

00:32:03.880 --> 00:32:05.100
to, like, what's happening to us?

00:32:05.100 --> 00:32:10.520
And, you know, fundamentally, as the people who are implementing this technology, we have a real

00:32:10.520 --> 00:32:15.460
opportunity to think about our values, think about our ethics, like think about the way

00:32:15.460 --> 00:32:19.060
that our technology might be used in ways that we hadn't intended because, you know, we're

00:32:19.060 --> 00:32:21.060
a pretty optimistic group, technologists.

00:32:21.060 --> 00:32:25.280
I think we assume that like we want to put something really useful and meaningful out in

00:32:25.280 --> 00:32:25.640
the world.

00:32:25.640 --> 00:32:27.520
Maybe not this like family splitting algorithm.

00:32:27.520 --> 00:32:28.220
That was probably.

00:32:28.220 --> 00:32:29.760
There's probably the business department.

00:32:29.760 --> 00:32:30.940
They say, hey, guys, can you?

00:32:30.940 --> 00:32:31.360
Yeah.

00:32:31.360 --> 00:32:35.040
They're like, well, this will, you know, we'll improve revenue by like 17 percent in

00:32:35.040 --> 00:32:35.400
Q4.

00:32:35.400 --> 00:32:39.540
Like, that'll be good for the world, you know, which and there's nothing wrong with improving

00:32:39.540 --> 00:32:39.880
revenue.

00:32:39.880 --> 00:32:41.120
I'm great.

00:32:41.120 --> 00:32:42.100
Businesses are fantastic.

00:32:42.100 --> 00:32:44.120
Not on the backs of three-year-olds, maybe.

00:32:44.120 --> 00:32:44.660
Right.

00:32:44.660 --> 00:32:47.080
Maybe maybe not in this way, you know.

00:32:47.080 --> 00:32:51.180
But I think also like the stuff that we do is actually pretty complicated.

00:32:51.180 --> 00:32:55.960
People don't really understand at a deep level like what the software is doing and all

00:32:55.960 --> 00:32:59.300
the potential ways that it might be used other than the intended use case.

00:32:59.420 --> 00:33:03.280
Like that's something that really only we think about or are in a good position to think

00:33:03.280 --> 00:33:04.580
about as technologists.

00:33:04.580 --> 00:33:07.800
And so anyway, that was the idea behind this Data for Democracy project.

00:33:07.800 --> 00:33:13.880
It's a global initiative and it's basically like what's a framework for thinking through

00:33:13.880 --> 00:33:16.700
like putting ethics into your process.

00:33:16.700 --> 00:33:21.120
So how do you incorporate these principles just in your everyday data and technology work?

00:33:21.120 --> 00:33:23.380
What does it look like for data science?

00:33:23.380 --> 00:33:24.460
I know what it looks like for doctors.

00:33:24.460 --> 00:33:26.340
You won't do any harm and that kind of thing.

00:33:26.340 --> 00:33:27.980
How about for data scientists?

00:33:27.980 --> 00:33:28.600
Yeah, exactly.

00:33:28.700 --> 00:33:30.560
So like we have a, we call it FORTS.

00:33:30.560 --> 00:33:33.180
There's the high level, which is kind of like do no harm.

00:33:33.180 --> 00:33:36.420
So you think about fairness, openness, reliability, trust and social benefit.

00:33:36.420 --> 00:33:41.120
There's a handful of principles that is kind of like a checklist.

00:33:41.120 --> 00:33:43.140
And so you can kind of go through this checklist.

00:33:43.140 --> 00:33:49.500
And as you're developing a new feature or maybe developing a new model, if you're a data scientist

00:33:49.500 --> 00:33:55.340
or, you know, some system for processing data, like anything that touches data, any of the

00:33:55.340 --> 00:33:59.300
kind of technology work that you're doing, you can go through and, you know, it may be that

00:33:59.300 --> 00:34:02.560
some of these principles like you don't have to check every box every single time.

00:34:02.560 --> 00:34:07.020
But it's a nice like it's a framework for thinking about catching potential blind spots.

00:34:07.180 --> 00:34:12.400
And so what's your intention when you're coding, like building this feature or developing this

00:34:12.400 --> 00:34:12.700
model?

00:34:12.700 --> 00:34:16.900
Have you made your best effort to guarantee the security of any data that you're going to use?

00:34:16.900 --> 00:34:18.580
I mean, that seems like a no brainer.

00:34:18.580 --> 00:34:22.200
But, you know, it's easy to forget when you're moving fast and you're maybe thinking about,

00:34:22.200 --> 00:34:25.440
you know, the deadline or the fact that, you know, you're trying to ship this really important

00:34:25.440 --> 00:34:27.180
feature because your customer really needs it.

00:34:27.320 --> 00:34:31.700
Like, you know, give a second to remind yourself whether the data security is important.

00:34:31.700 --> 00:34:36.360
Have you made your best effort to protect anonymous data subjects, which is really important in a lot

00:34:36.360 --> 00:34:37.260
of data science research?

00:34:37.260 --> 00:34:43.560
Sometimes if we don't think about this, we can inadvertently leak private data to the public

00:34:43.560 --> 00:34:47.920
when they consume our research, even though that was never our intention and potentially is,

00:34:47.920 --> 00:34:48.740
you know, irresponsible.

00:34:49.300 --> 00:34:54.160
You can practice transparency, which is a lot of times understanding how our algorithms work.

00:34:54.160 --> 00:35:00.160
So in this case, there was no transparency around how the algorithm that chose seat assignments

00:35:00.160 --> 00:35:01.140
was functioning.

00:35:01.140 --> 00:35:05.120
And then after an investigation, it was revealed that, well, it was actually examining whether

00:35:05.120 --> 00:35:06.180
or not you shared a last name.

00:35:06.180 --> 00:35:07.880
And if you did, it was splitting you up.

00:35:07.880 --> 00:35:12.340
And if that was a transparent algorithm, we would have said, wait a second, this is totally

00:35:12.340 --> 00:35:12.680
uncool.

00:35:12.680 --> 00:35:17.040
You can't explain this algorithm transparently and people still accept it.

00:35:17.040 --> 00:35:20.960
So, you know, that's kind of like the light is a great disinfectant, right?

00:35:20.960 --> 00:35:22.920
In politics and all that in business.

00:35:22.920 --> 00:35:23.880
Yeah, yeah, absolutely.

00:35:23.880 --> 00:35:28.220
Another principle that maybe would have mitigated this problem was to communicate responsibly.

00:35:28.220 --> 00:35:33.580
If the engineer who was responsible for implementing that went, hey, this is going to split up families,

00:35:33.580 --> 00:35:39.300
and like during holiday travel. You can respect relevant tensions of stakeholders, which I also

00:35:39.300 --> 00:35:42.040
think is really interesting because this is exactly what probably happened here.

00:35:42.040 --> 00:35:45.780
The business team said, hey, but it'll improve revenue by 17 percent throughout the year.

00:35:46.160 --> 00:35:48.520
Or whatever the number is, I'm making that up.

00:35:48.520 --> 00:35:50.080
So it's a set of principles.
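
NOTE
Editor's sketch, not from the episode: one way to picture the "checklist" idea described above is a plain list of questions plus a helper that reports which ones have not been considered yet. The item wording paraphrases the questions mentioned in this conversation and is not the official text of the principles. The names CHECKLIST and open_items and the sample review are hypothetical.
```python
# Hypothetical checklist; wording paraphrases the questions raised in the conversation.
CHECKLIST = [
    "Considered the intention behind the feature or model",
    "Made a best effort to secure the data used",
    "Made a best effort to protect anonymous data subjects",
    "Practiced transparency about how the algorithm works",
    "Communicated responsibly about likely effects",
    "Respected relevant tensions between stakeholders",
]
def open_items(answers):
    """Return checklist items not yet marked done; missing answers count as open."""
    return [item for item in CHECKLIST if not answers.get(item, False)]
# Made-up review in which only two items have been addressed so far.
review = {CHECKLIST[1]: True, CHECKLIST[3]: True}
remaining = open_items(review)
print(len(remaining), "items still to think through")
```
As noted in the conversation, not every box has to be checked every time. The list is a prompt for catching blind spots, not a gate.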

00:35:50.080 --> 00:35:54.800
You can sign on to it, so if you go to datafordemocracy.org, you will find

00:35:54.800 --> 00:35:56.380
a link to sign the ethics pledge.

00:35:56.380 --> 00:35:59.980
If you think that ethics are important, then you should totally sign up for it.

00:35:59.980 --> 00:36:01.840
And here's why you should sign the pledge.

00:36:01.840 --> 00:36:03.100
A, I think it's an important.

00:36:03.100 --> 00:36:07.780
It's a cool thing to do if you think that ethics are important and you want to have this

00:36:07.780 --> 00:36:08.860
like kind of mental checklist.

00:36:09.060 --> 00:36:13.680
But it's also important because, you know, we all work for organizations or we're students

00:36:13.680 --> 00:36:18.220
at academic institutions or we're otherwise doing this as our profession.

00:36:18.220 --> 00:36:25.920
And the organizations that we work for are starting to adopt more ethical practices as companies

00:36:25.920 --> 00:36:27.560
and academic institutions and governments.

00:36:27.560 --> 00:36:29.440
Like this is becoming more prominent.

00:36:29.440 --> 00:36:36.360
But the way that we can make sure that our values are encoded in these kind of larger

00:36:36.360 --> 00:36:41.160
business processes and these larger institutions is by making ourselves heard by sort of showing

00:36:41.160 --> 00:36:41.660
our numbers.

00:36:42.020 --> 00:36:47.800
And so if we show up as a technology community and we sign a pledge or we kind of communicate

00:36:47.800 --> 00:36:51.640
to our manager or whatever it is, I mean, ultimately, like this isn't going to come from the top down.

00:36:51.640 --> 00:36:53.440
And I don't think we want it to come from the top down.

00:36:53.440 --> 00:36:57.940
I don't think we want this to come from people who aren't doing the work every day, who hear

00:36:57.940 --> 00:37:01.640
about this ethics thing and they want it as like a stamp of approval on whatever products

00:37:01.640 --> 00:37:02.060
they're making.

00:37:02.060 --> 00:37:03.180
That's fine.

00:37:03.180 --> 00:37:07.800
But ultimately, the systems that they design won't actually accommodate the technology work

00:37:07.800 --> 00:37:08.920
that we have to do every day.

00:37:09.360 --> 00:37:13.340
So I think for those of us who are like writing software, for those of us who are developing

00:37:13.340 --> 00:37:18.180
models, who are doing data science and software engineering every day, I think we need to make

00:37:18.180 --> 00:37:23.060
our voices heard about the ethical principles that we want to see applied to the work that

00:37:23.060 --> 00:37:23.380
we're doing.

00:37:23.380 --> 00:37:25.200
So anyway, that's why I think it's important.

00:37:25.200 --> 00:37:26.880
datafordemocracy.org.

00:37:26.880 --> 00:37:31.460
Don't be the developer who splits up families at the holidays.

00:37:31.460 --> 00:37:32.980
You're better than that.

00:37:32.980 --> 00:37:34.540
That's right.

00:37:34.540 --> 00:37:35.160
That's awesome.

00:37:35.160 --> 00:37:36.260
So I think this is great.

00:37:36.360 --> 00:37:41.220
And I think it ties in really well to a lot of the themes that seem to be happening around

00:37:41.220 --> 00:37:41.820
tech companies.

00:37:41.820 --> 00:37:45.060
Like it used to just be, oh, tech companies are amazing.

00:37:45.060 --> 00:37:46.720
Of course, we want to just encourage them.

00:37:46.720 --> 00:37:51.840
And now there's some real skepticism around Facebook and Uber and all these types of companies.

00:37:51.840 --> 00:37:53.880
And they kind of have to earn it.

00:37:53.880 --> 00:37:56.760
And this oath is part of earning it, I think.

00:37:56.760 --> 00:37:57.100
It's cool.

00:37:57.100 --> 00:37:58.060
Yeah, I think so, too.

00:37:58.060 --> 00:37:58.800
I think so, too.

00:37:59.060 --> 00:38:04.340
And it's a good opportunity, I think, for, like I was saying, the actual doers, like

00:38:04.340 --> 00:38:07.180
those of us who are doing the work, to participate in that conversation.

00:38:07.180 --> 00:38:10.760
Because it's happening, like you're saying, it's happening whether we might want it to or

00:38:10.760 --> 00:38:11.020
not.

00:38:11.020 --> 00:38:13.420
The kind of attitude has changed.

00:38:13.420 --> 00:38:17.900
People are thinking about legislating Silicon Valley, which would have been a totally foreign,

00:38:17.900 --> 00:38:21.320
like bizarre idea almost, even a year or two years ago.

00:38:21.500 --> 00:38:23.620
Yes, that's exactly what I was thinking of, is that kind of stuff.

00:38:23.620 --> 00:38:24.320
It's like, wait, what?

00:38:24.320 --> 00:38:25.420
What do you mean?

00:38:25.420 --> 00:38:25.660
Right.

00:38:25.660 --> 00:38:28.320
It's like, we're just over here doing good, making cool stuff.

00:38:28.320 --> 00:38:30.360
Come on, like pull out that iPhone, buddy.

00:38:30.360 --> 00:38:31.840
Like, let's play some Angry Birds.

00:38:31.840 --> 00:38:35.860
And times have changed.

00:38:35.860 --> 00:38:38.780
And more people are aware of the potential negative consequences.

00:38:38.780 --> 00:38:43.520
And so, you know, now's our time to have a conversation and make the industry that we want

00:38:43.520 --> 00:38:44.100
to be a part of.

00:38:44.100 --> 00:38:44.320
Yeah.

00:38:44.320 --> 00:38:46.840
So some of the things we've covered have been a little creepy.

00:38:46.840 --> 00:38:47.360
Indeed.

00:38:47.360 --> 00:38:49.120
Like the babysitter one.

00:38:49.560 --> 00:38:52.980
But this next one is, I think it's just pure goodness.

00:38:52.980 --> 00:38:53.880
100%.

00:38:53.880 --> 00:38:54.940
I couldn't agree more.

00:38:54.940 --> 00:39:00.820
So we know that Python is being used more in general scientific work.

00:39:00.820 --> 00:39:03.740
And it's probably being used more in sort of the hard sciences.

00:39:03.740 --> 00:39:06.160
Would you consider economics hard science?

00:39:06.160 --> 00:39:06.860
Ooh.

00:39:06.860 --> 00:39:07.720
Oh.

00:39:07.720 --> 00:39:09.600
I'd put it right on the edge, right?

00:39:09.600 --> 00:39:11.080
I mean, that's a lot of math in there.

00:39:11.080 --> 00:39:12.080
There's numbers.

00:39:12.080 --> 00:39:14.060
There's a lot of math in there.

00:39:14.060 --> 00:39:16.980
I guess because I never really took economics.

00:39:16.980 --> 00:39:21.200
I've only listened to like a couple economics textbooks in adulthood to try and learn something

00:39:21.200 --> 00:39:22.580
I feel like I should know a little bit about.

00:39:22.580 --> 00:39:25.480
And like, I feel like they were behavioral economics.

00:39:25.480 --> 00:39:30.300
And it seems like a lot of correlation without causation.

00:39:30.300 --> 00:39:32.880
Personally, no offense, economists out there.

00:39:32.880 --> 00:39:36.300
But it's really data driven, which I think is really cool.

00:39:36.300 --> 00:39:41.340
So there is a huge amount of information to consume when you're doing good economics.

00:39:41.340 --> 00:39:43.380
So in any case, yeah, let's do it.

00:39:43.380 --> 00:39:44.040
I'm on board.

00:39:44.040 --> 00:39:44.940
I'm going to go one step further.

00:39:45.040 --> 00:39:46.220
Let's call it a hard science.

00:39:46.220 --> 00:39:46.540
I'm in.

00:39:46.540 --> 00:39:46.880
All right.

00:39:46.880 --> 00:39:47.260
Right on.

00:39:47.260 --> 00:39:50.180
I certainly think some aspects of it are.

00:39:50.180 --> 00:39:56.200
Now, we talked about "The Scientific Paper Is Obsolete," the move from just PDF or written

00:39:56.200 --> 00:39:58.980
paper over to Mathematica to Jupyter.

00:39:58.980 --> 00:40:03.300
And that's sort of a high level conversation of a trend.

00:40:03.300 --> 00:40:10.080
But this year, the Nobel Prize in economics was basically won with Jupyter and Python, which

00:40:10.080 --> 00:40:10.500
is awesome.

00:40:10.880 --> 00:40:13.040
That is amazing and not surprising.

00:40:13.040 --> 00:40:17.680
I mean, if you're going to do some scientific research, what other programming language would

00:40:17.680 --> 00:40:18.200
you choose?

00:40:18.200 --> 00:40:19.060
Yeah, absolutely.

00:40:19.060 --> 00:40:21.300
Especially if you're like an economist.

00:40:21.300 --> 00:40:27.700
So there was a bunch of folks who did a bunch of mathematics, Mathematica, and included in them

00:40:27.700 --> 00:40:30.280
were these two guys named Nordhaus and Romer.

00:40:31.020 --> 00:40:34.580
I think they're both American university professors.

00:40:34.580 --> 00:40:41.360
And, let me see if I can mess up, poorly describe what their Nobel Prize was, what their work

00:40:41.360 --> 00:40:42.040
generally is about.

00:40:42.040 --> 00:40:48.580
But it was like, basically looking at how do you create long-term sustainable growth in a

00:40:48.580 --> 00:40:53.860
global economy that improves the world for everybody through things like capitalism and

00:40:53.860 --> 00:40:57.460
whatnot that mostly focus on very, very narrow self-interest.

00:40:57.460 --> 00:41:01.080
They think they cracked that nut, which is pretty interesting.

00:41:01.080 --> 00:41:03.100
And they cracked it with Jupyter.

00:41:03.100 --> 00:41:05.540
I mean, that sounds Nobel Prize worthy to me.

00:41:05.540 --> 00:41:08.940
And also the economic discoveries are probably useful.

00:41:09.460 --> 00:41:16.300
No, but what I think is interesting, because we talked about this in one of the previous stories,

00:41:16.300 --> 00:41:20.500
is that the team, or Romer in particular, was the Python user.

00:41:20.500 --> 00:41:23.220
And he wanted to make his research transparent and open.

00:41:23.220 --> 00:41:24.660
That was like a key part of the research.

00:41:24.660 --> 00:41:26.940
So that people could understand how he was reaching his conclusion.

00:41:26.940 --> 00:41:31.740
So like all of my, you know, joking about behavioral economists a minute ago aside, like this is

00:41:31.740 --> 00:41:32.940
actually an important part of the work.

00:41:32.940 --> 00:41:37.100
So that you can understand his assumptions and, you know, at least understand the choices that

00:41:37.100 --> 00:41:41.520
he's making and the data that he's choosing. And he tried to do it with Mathematica,

00:41:41.520 --> 00:41:46.480
which is another way that people use or sort of perform computation for their research.

00:41:46.480 --> 00:41:52.560
And apparently that just made it too difficult to share his work, because anybody who wanted

00:41:52.560 --> 00:41:56.260
to try and understand his work would have to also use this proprietary software, which is a really,

00:41:56.260 --> 00:41:57.140
really high bar.

00:41:57.140 --> 00:41:58.380
Like it's really expensive.

00:41:58.380 --> 00:42:00.020
Not everybody knows how to use it.

00:42:00.020 --> 00:42:04.840
And it's not as simple, intuitive, open, and transparent as Python and Jupyter Notebooks.

00:42:04.920 --> 00:42:08.740
So it was really core to the work that he did that ultimately won him the Nobel Prize,

00:42:08.740 --> 00:42:10.020
which is pretty awesome.

00:42:10.020 --> 00:42:11.060
Yeah, it's super cool.

00:42:11.060 --> 00:42:15.980
And I do think that there's, if your goal is to share your work, having it in these super

00:42:15.980 --> 00:42:19.980
expensive proprietary systems is not amazing, right?

00:42:19.980 --> 00:42:21.540
I mean, we're talking about Mathematica here.

00:42:21.540 --> 00:42:27.200
My experience was I did a bunch of stuff in MATLAB at one point and we worked with some

00:42:27.200 --> 00:42:28.440
extra toolkits.

00:42:28.500 --> 00:42:32.920
And these toolkits were like $2,000 a user just to run the code.

00:42:32.920 --> 00:42:36.360
And if you wanted to run the code and check it out, you also paid $2,000.

00:42:36.360 --> 00:42:37.860
Like that's really prohibitive.

00:42:37.860 --> 00:42:38.540
It really is.

00:42:38.540 --> 00:42:42.780
And I mean, I think that was my experience with proprietary software.

00:42:42.780 --> 00:42:47.740
I mean, not only do those like programming languages or applications make it, it's just

00:42:47.740 --> 00:42:48.580
difficult to collaborate.

00:42:48.580 --> 00:42:50.960
And I think this will be of no surprise.

00:42:51.060 --> 00:42:54.360
I would think that this is something that everybody accepts as fact amongst all of your

00:42:54.360 --> 00:42:54.640
listeners.

00:42:54.640 --> 00:42:59.440
But the open source community has made this current software revolution possible.

00:42:59.440 --> 00:43:06.120
The fact that we are able to collaborate at this scale and share ideas and share code and

00:43:06.120 --> 00:43:10.960
build on top of each other's work, I think is the reason that we've had this explosion in

00:43:10.960 --> 00:43:15.140
entrepreneurship, this explosion in the kind of energy and excitement that comes out of Silicon

00:43:15.140 --> 00:43:18.300
Valley, even though we were just talking about it, you know, maybe being a bad thing.

00:43:19.000 --> 00:43:23.880
But this level of sort of rapid innovation that we've been going through is because programming

00:43:23.880 --> 00:43:27.700
is just so much more accessible than it's ever been before.

00:43:27.700 --> 00:43:31.740
And I think that that's largely because we transitioned away from these proprietary software

00:43:31.740 --> 00:43:32.100
models.

00:43:32.100 --> 00:43:32.340
Yeah.

00:43:32.340 --> 00:43:34.660
And you think about the people who that benefits.

00:43:34.660 --> 00:43:36.060
Obviously, it benefits everyone.

00:43:36.060 --> 00:43:41.260
But if you live in a country where the average monthly income is a tenth of what it is in the

00:43:41.260 --> 00:43:44.240
U.S. and you have to pay $2,000, it doesn't mean it's expensive.

00:43:44.240 --> 00:43:45.560
It means that you can't have it.

00:43:45.560 --> 00:43:46.080
Right.

00:43:46.080 --> 00:43:47.600
I mean, it's just inaccessible to you.

00:43:47.600 --> 00:43:54.120
And so these sort of opening ups of this research and these capabilities, I think it benefits the

00:43:54.120 --> 00:43:55.920
people who need the most benefit as well.

00:43:55.920 --> 00:43:56.720
Yeah, absolutely.

00:43:56.720 --> 00:44:01.520
And it almost like distributes creativity much more broadly than we would have been able to

00:44:01.520 --> 00:44:01.820
before.

00:44:01.820 --> 00:44:08.280
Like we can tap sources of creativity and innovation that, like you're saying, would

00:44:08.280 --> 00:44:13.180
have been just taken off the board because proprietary software is only accessible

00:44:13.180 --> 00:44:18.940
to the small portion of the global population that can afford to spend thousands of dollars

00:44:18.940 --> 00:44:21.180
annually on licensing it.

00:44:21.180 --> 00:44:21.820
Yeah, absolutely.

00:44:21.820 --> 00:44:23.280
So there's a couple more thoughts.

00:44:23.280 --> 00:44:27.420
I just want to share really quickly from this before we move on to something I guess I'll call

00:44:27.420 --> 00:44:27.740
positive.

00:44:27.740 --> 00:44:32.380
So this you got it in 2018.

00:44:32.380 --> 00:44:34.200
You got to put it up there positive or not.

00:44:34.200 --> 00:44:37.640
So one thing I think is really awesome about this is this guy is 62.

00:44:37.640 --> 00:44:40.520
He transitioned into Python recently.

00:44:40.520 --> 00:44:41.220
Right.

00:44:41.220 --> 00:44:43.860
You feel a lot of people are like, oh, I'm 32.

00:44:44.060 --> 00:44:46.420
I couldn't possibly learn programming or get into this.

00:44:46.420 --> 00:44:51.980
Like this guy made this Nobel research change into Python programming and the data science

00:44:51.980 --> 00:44:54.180
tools in his 60s, late 50s.

00:44:54.180 --> 00:44:54.560
That's awesome.

00:44:54.560 --> 00:44:55.160
It is awesome.

00:44:55.160 --> 00:44:59.020
And what's interesting is that he's now been like exploring how software works.

00:44:59.020 --> 00:45:03.640
And you pulled out a really great quote where he says, the more I learn about proprietary

00:45:03.640 --> 00:45:09.560
software, the more I worry that, wait for it, objective truth might perish from the earth,

00:45:09.560 --> 00:45:12.300
which is an insanely powerful statement.

00:45:12.400 --> 00:45:15.080
That is a powerful statement from a guy who won the Nobel Prize.

00:45:15.080 --> 00:45:15.920
So he's probably right.

00:45:15.920 --> 00:45:16.280
Yeah.

00:45:16.280 --> 00:45:18.800
He definitely sees a change in the world.

00:45:18.800 --> 00:45:22.180
And I think this opens, this is a little bit of what I was saying, like thinking when

00:45:22.180 --> 00:45:27.140
I said, look, we've got open source being used by scientists, but also changing science,

00:45:27.140 --> 00:45:27.840
science.

00:45:27.840 --> 00:45:28.500
Right.

00:45:28.500 --> 00:45:28.860
Right.

00:45:28.860 --> 00:45:29.580
Absolutely.

00:45:29.580 --> 00:45:35.720
Because it's not only making the work more approachable and accessible, but it's making

00:45:35.720 --> 00:45:36.560
it more repeatable.

00:45:36.560 --> 00:45:40.940
And it's allowing us, especially now that so much research is based on computation, like we've

00:45:40.940 --> 00:45:43.200
been saying before, it really does make it possible.

00:45:43.200 --> 00:45:50.600
It would be impossible to come to a consensus about whether or not something is correct, acceptable,

00:45:50.600 --> 00:45:52.620
whether we can say that this is an established fact.

00:45:52.620 --> 00:45:55.460
You can't really say it if you can't show your work.

00:45:55.460 --> 00:45:58.000
And in this case, showing your work is the computation and the data.

00:45:58.000 --> 00:45:58.220
Right.

00:45:58.220 --> 00:46:00.780
It's like just writing the final number down in calculus.

00:46:00.780 --> 00:46:02.020
You're like, no, no, you got to show your work.

00:46:02.020 --> 00:46:02.480
Exactly.

00:46:02.480 --> 00:46:03.020
Exactly.

00:46:03.020 --> 00:46:03.680
It's very important.

00:46:03.680 --> 00:46:06.340
Like science and objective truth depends on it.

00:46:06.340 --> 00:46:07.060
Show your work.

00:46:07.060 --> 00:46:08.380
That's right.

00:46:08.380 --> 00:46:09.100
Show your work.

00:46:09.100 --> 00:46:09.660
All right.

00:46:09.660 --> 00:46:15.720
So the next one is more like a mundane day to day thing, but actually makes a big difference.

00:46:15.720 --> 00:46:19.960
So what's the story with Waze and Waze reducing crashes?

00:46:19.960 --> 00:46:21.220
So this is pretty cool.

00:46:21.220 --> 00:46:25.940
So we've talked about using quote unquote AI to predict things.

00:46:25.940 --> 00:46:26.260
Sorry.

00:46:26.260 --> 00:46:27.960
Maybe we should tell people what Waze is.

00:46:27.960 --> 00:46:29.580
I don't know how global Waze is.

00:46:29.580 --> 00:46:30.300
Maybe they don't have experience.

00:46:30.300 --> 00:46:31.540
Just give us a real quick summary.

00:46:31.680 --> 00:46:32.500
That is a really good point.

00:46:32.500 --> 00:46:41.480
So Waze is a mapping application that helps you find the shortest route from one point

00:46:41.480 --> 00:46:42.440
to another in your car.

00:46:42.440 --> 00:46:44.980
And so there's a whole thing about Waze.

00:46:44.980 --> 00:46:46.500
Like it's a community of people.

00:46:46.500 --> 00:46:49.960
And so you're kind of collaborating to point out things on the road.

00:46:49.960 --> 00:46:55.560
So I'm stuck in traffic and you can tap a button or there's a police car trying to catch people

00:46:55.560 --> 00:46:58.820
for speeding and you can tap the button or there's been an accident.

00:46:58.820 --> 00:46:59.580
You can tap a button.

00:46:59.880 --> 00:47:02.320
And so it's this way for like drivers to communicate with each other on the road.

00:47:02.320 --> 00:47:03.320
And it's so big.

00:47:03.320 --> 00:47:06.500
That community is so large that it can help you.

00:47:06.500 --> 00:47:11.540
It routes you from one point to another in a city in a way that helps you avoid obstacles

00:47:11.540 --> 00:47:13.820
that might be pretty dynamic and changing all the time.

00:47:13.820 --> 00:47:15.580
So it's kind of like Google Maps.

00:47:15.580 --> 00:47:17.720
It's more of a two-way street for sure, right?

00:47:17.720 --> 00:47:20.000
Like the users send info back a lot more to it.

00:47:20.000 --> 00:47:20.720
Okay, cool.

00:47:20.720 --> 00:47:23.580
With that in mind, how does Waze play into this story?

00:47:23.580 --> 00:47:29.760
Well, as you can imagine, it captures a ton of data about driving patterns.

00:47:29.760 --> 00:47:33.140
So not only does it know what's happening in real time, but all that data is stored.

00:47:33.140 --> 00:47:37.020
And so you start to get a sense about how people move through cities in general.

00:47:37.540 --> 00:47:44.700
And once you have data that captures a behavior in general, in data science, you can start to make predictions using that data.

00:47:44.700 --> 00:47:49.000
So you can train a model and say, generally, this is how things work.

00:47:49.000 --> 00:47:52.760
And then maybe you could make some predictions about how things work in the future.

00:47:52.980 --> 00:47:57.800
And assuming that historical data is accurate, then you can usually make pretty decent predictions.

00:47:57.800 --> 00:48:07.700
And so the cool thing about an application like Waze that captures not only traffic patterns, but also events like accidents or car crashes,

00:48:07.700 --> 00:48:13.220
is that you could predict when and where car crashes are likely to occur, which is kind of mind-blowing.

00:48:13.220 --> 00:48:18.080
Like I think traffic seems like this total, super complex, impossible to understand, messy organic...

00:48:18.080 --> 00:48:22.340
If you ever thought of chaos, it should totally apply to this, right?

00:48:22.340 --> 00:48:27.060
Like if a butterfly can affect weather, like people should just be crazy in cars.

00:48:27.060 --> 00:48:28.220
They can be crazy in cars.

00:48:28.220 --> 00:48:31.540
And I don't know if everybody will be familiar with this.

00:48:31.540 --> 00:48:37.600
There's this like famous experiment where scientists set up a circular track and

00:48:37.600 --> 00:48:41.540
they had cars drive around the circular track, all like equidistant from one another.

00:48:41.540 --> 00:48:43.400
So in theory, they could all maintain their speed.

00:48:43.400 --> 00:48:48.320
But because there are human beings driving the cars, they would occasionally make these little choices.

00:48:48.320 --> 00:48:51.020
Like they would feel like they got a little bit too close to the car in front of them,

00:48:51.080 --> 00:48:53.280
or they put on a little bit too much gas and they would get too close.

00:48:53.280 --> 00:48:53.880
And so they'd brake.

00:48:53.880 --> 00:48:56.200
And then as soon as they put their foot on the brake,

00:48:56.200 --> 00:48:57.880
then the car behind them will put their foot on the brake.

00:48:57.880 --> 00:48:59.580
And then there was this like cascading effect.

00:48:59.580 --> 00:49:02.960
And no matter what, even though there was enough room for all of these cars on the road,

00:49:02.960 --> 00:49:05.360
they wound up in a traffic jam driving around in the circle.

00:49:05.360 --> 00:49:05.820
It's like...

00:49:05.820 --> 00:49:06.260
That's awesome.

00:49:06.260 --> 00:49:06.740
Yeah.

00:49:06.740 --> 00:49:11.100
It's like a beautifully designed system that humans cannot participate in fully.

00:49:11.620 --> 00:49:14.160
Because we're just too human.

00:49:14.160 --> 00:49:15.180
So we could cause traffic.

00:49:15.180 --> 00:49:15.780
We're flawed.

00:49:15.780 --> 00:49:16.960
We're flawed.

00:49:16.960 --> 00:49:17.580
We're flawed.

00:49:17.580 --> 00:49:20.680
And all those flaws are captured by Waze.

00:49:20.680 --> 00:49:26.360
And so what Waze basically started doing is that they have all this data from connected

00:49:26.360 --> 00:49:28.180
cars and road cameras and apps and whatever.

00:49:28.180 --> 00:49:31.140
And they have an overview of how the city works.

00:49:31.140 --> 00:49:37.600
And they've shared that data with local authorities who then basically have developed models that

00:49:37.600 --> 00:49:39.840
predict when and where these accidents are going to occur.

00:49:39.840 --> 00:49:45.620
And so the city's traffic and safety management agencies were able to take that data and say,

00:49:45.620 --> 00:49:49.440
oh, there's likely to be an accident in this area at this time.

00:49:49.440 --> 00:49:50.480
And then they sent...

00:49:50.480 --> 00:49:53.060
They went to those areas to take preventative measures.

00:49:53.060 --> 00:49:56.980
So they basically identified what are the most high risk areas for where accidents are likely

00:49:56.980 --> 00:50:00.020
to occur at certain times of day or in certain conditions or whatever.
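
NOTE
Editor's sketch, not part of the recording. The idea of ranking high-risk areas by time of day
can be illustrated with plain frequency counting over historical records. This is only an assumed
stand-in: the episode does not describe the models the local authorities actually built, and the
record layout, road names, and function names below are hypothetical.
```python
from collections import Counter
# Hypothetical historical records: (road_segment, hour_of_day, crashed)
history = [
    ("5th & Main", 17, True),
    ("5th & Main", 17, True),
    ("5th & Main", 17, False),
    ("5th & Main", 9, False),
    ("Elm St", 17, False),
    ("Elm St", 17, True),
    ("Elm St", 17, False),
    ("Elm St", 17, False),
]
def crash_rates(records):
    """Return the observed crash rate for each (segment, hour) pair."""
    trips = Counter()
    crashes = Counter()
    for segment, hour, crashed in records:
        key = (segment, hour)
        trips[key] += 1
        if crashed:
            crashes[key] += 1
    return {key: crashes[key] / trips[key] for key in trips}
def highest_risk(records, top_n=3):
    """Rank (segment, hour) pairs from highest to lowest historical crash rate."""
    rates = crash_rates(records)
    return sorted(rates, key=lambda key: rates[key], reverse=True)[:top_n]
print(highest_risk(history, top_n=2))
```
A real system would presumably also weigh weather, road conditions, and traffic volume; the point
here is only that past rates per place and time give a first guess at where to send help.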

00:50:00.620 --> 00:50:06.620
And if you can then take action to make those areas more safe, then, of course, you can reduce

00:50:06.620 --> 00:50:07.800
the number of crashes.

00:50:07.800 --> 00:50:09.260
And they reduce crashes by 20%.

00:50:09.260 --> 00:50:14.360
So if on a normal day there's 100 car crashes, there's only 80, which, I mean, that's pretty

00:50:14.360 --> 00:50:14.720
amazing.

00:50:14.720 --> 00:50:20.040
So not only that, but if you're in an area where an accident's likely to occur, all of the other

00:50:20.040 --> 00:50:23.080
services that happen around that accident are more readily available.

00:50:23.080 --> 00:50:26.160
So you can get faster treatment for anybody who was injured.

00:50:26.160 --> 00:50:30.560
You can more quickly clear and restore normal traffic flow.

00:50:30.560 --> 00:50:37.420
And so in addition to the health implications, like the public health implications

00:50:37.420 --> 00:50:41.420
of having fewer car crashes, you can actually make it easier for people to get around the

00:50:41.420 --> 00:50:46.180
city more quickly by moving, by sort of quickly dealing with accidents as they occur, because

00:50:46.180 --> 00:50:50.400
you kind of knew or had a pretty strong indication that that accident was going to happen in advance.

00:50:50.620 --> 00:50:53.280
So it's like Minority Report, but for car crashes.

00:50:53.280 --> 00:50:54.280
It's like a pre-crash.

00:50:54.280 --> 00:50:55.780
It is.

00:50:55.780 --> 00:50:57.140
So they've got the pre-crash.

00:50:57.140 --> 00:50:59.360
If that system is not called pre-crash.

00:50:59.360 --> 00:51:00.580
Sir, do you know why I'm pulling you over?

00:51:00.580 --> 00:51:03.880
No, you were going to crash up there.

00:51:03.880 --> 00:51:04.440
What?

00:51:05.220 --> 00:51:05.640
Right.

00:51:05.640 --> 00:51:08.340
You were the cause of an accident that was about to occur.

00:51:08.340 --> 00:51:09.040
Like what?

00:51:09.040 --> 00:51:12.480
I actually, I kind of like this idea.

00:51:12.480 --> 00:51:17.360
I mean, we're getting into territory that sounds like the wrong, like the unintended use of AI,

00:51:17.360 --> 00:51:19.200
like we've been talking about this entire episode.

00:51:19.200 --> 00:51:21.260
But I kind of love it.

00:51:21.260 --> 00:51:25.200
I would love for the AI to tell me in advance, even if it meant I got a ticket.

00:51:25.200 --> 00:51:27.320
Like you were going to cause an accident.

00:51:27.320 --> 00:51:30.480
You didn't because we told you and we've changed the future.

00:51:30.860 --> 00:51:32.860
But you still have a $25 ticket.

00:51:32.860 --> 00:51:35.160
I'd pay that.

00:51:35.160 --> 00:51:36.740
I would pay that $25 ticket.

00:51:36.740 --> 00:51:37.460
That's so interesting.

00:51:37.460 --> 00:51:39.080
That's like the 2025 edition.

00:51:39.080 --> 00:51:45.380
The 2018 edition is they were able to cause fewer crashes, address the ones that happened better,

00:51:45.380 --> 00:51:48.880
and basically just improve life for all the drivers.

00:51:48.880 --> 00:51:49.300
That's awesome.

00:51:49.300 --> 00:51:50.280
Yeah, it's totally cool.

00:51:50.280 --> 00:51:54.640
And I think this is one of those like almost like classic data science problems now,

00:51:54.640 --> 00:51:58.260
where it's weird to say that, that there's like such a thing as a classic data science problem,

00:51:58.260 --> 00:52:00.020
but like maybe like an archetypal data science problem,

00:52:00.080 --> 00:52:02.720
where you have a bunch of data about the past,

00:52:02.720 --> 00:52:07.600
and whatever system it is that you're interested in understanding is so predictable and repeatable

00:52:07.600 --> 00:52:10.800
that all you really need to do is understand how it behaved in the past,

00:52:10.800 --> 00:52:13.420
and then you can have a pretty good idea about how it's going to behave in the future.

00:52:13.420 --> 00:52:15.660
Like if only you could crunch enough numbers.

00:52:15.660 --> 00:52:20.940
So it's kind of like we could have a grand theory of the universe if only we could compute the universe.

00:52:20.940 --> 00:52:23.860
Sorry, not to get too philosophical,

00:52:23.860 --> 00:52:29.480
but the systems that we have the computing power to understand are increasingly large and complex.

00:52:29.760 --> 00:52:32.980
So like a city's traffic flow is pretty large and complex,

00:52:32.980 --> 00:52:38.200
and probably too much data to really process in any meaningful way until now.

00:52:38.200 --> 00:52:41.960
And now we not only have the data, but we have the ability to process it and make predictions.

00:52:41.960 --> 00:52:45.240
And so this is a real significant, tangible improvement on human life.

00:52:45.240 --> 00:52:45.840
Pretty awesome.

00:52:46.080 --> 00:52:47.000
Yeah, that's really awesome.

00:52:47.000 --> 00:52:56.480
I think another thing that would make an improvement on life is if more people participated in speaking about what they would like to happen in their country,

00:52:56.480 --> 00:52:58.580
and what they would not like to happen in their country.

00:52:58.580 --> 00:52:59.240
What do you think?

00:52:59.240 --> 00:53:00.340
That seems like a good idea.

00:53:00.340 --> 00:53:02.500
You know, like participating in the process.

00:53:02.760 --> 00:53:04.360
I think that'd be a win all around.

00:53:04.360 --> 00:53:06.040
You know, you know, way better than I do.

00:53:06.040 --> 00:53:13.160
I feel like the average voting turnout is something like 65% or something in the US, maybe 70 on a crazy period.

00:53:13.160 --> 00:53:15.100
But then that's just among registered voters.

00:53:15.100 --> 00:53:21.160
What about all the people who aren't even registered, or they've become unregistered because they moved and they forgot to update it?

00:53:21.280 --> 00:53:28.020
Right. This is, as we probably heard about in the 2018 election cycle that we just had, a big part of the process.

00:53:28.020 --> 00:53:37.420
And anybody who, well, you probably met somebody who was like standing in front of your office or your church or your school or walked around your neighborhood and was saying,

00:53:37.420 --> 00:53:38.520
hey, are you registered to vote?

00:53:38.520 --> 00:53:41.920
Like there's these get out the vote campaigns that are really important because what?

00:53:41.920 --> 00:53:44.160
Twitter asked me at least five times if I was registered.

00:53:44.160 --> 00:53:44.980
Twitter did.

00:53:44.980 --> 00:53:47.260
I think I Googled something the day before the election.

00:53:47.380 --> 00:53:52.220
They were like, hey, here's your local polling location where you can go get registered to vote right now.

00:53:52.220 --> 00:53:53.460
It's kind of cool.

00:53:53.460 --> 00:53:56.400
I think people have recognized that there's low voter turnout.

00:53:56.400 --> 00:54:08.200
And so there's all of these interesting initiatives to try and get people to go out, make sure that they follow the appropriate procedures and they actually cast their vote, which, you know, ultimately, you know, democracy dies in darkness.

00:54:08.200 --> 00:54:08.520
Right.

00:54:08.520 --> 00:54:12.740
Like it's important that we're, you know, participating in the process.

00:54:12.740 --> 00:54:15.140
So technologists are, of course, trying to help.

00:54:15.760 --> 00:54:18.020
And there's a guy called Jeff Jonas.

00:54:18.020 --> 00:54:19.260
He's a prominent data scientist.

00:54:19.260 --> 00:54:24.420
And he is interested in the integrity of voter rolls.

00:54:24.420 --> 00:54:27.440
And so this is an interesting aspect of it.

00:54:27.440 --> 00:54:32.260
So in most places, I think in all places, you can't just show up to the polling place and vote.

00:54:32.260 --> 00:54:33.860
You have to be registered to vote.

00:54:33.860 --> 00:54:37.300
And there's lots of reasons that you might get unregistered like you were just talking about.

00:54:37.300 --> 00:54:42.620
So I recently moved from one place to another in Austin and the people were coming around my neighborhood.

00:54:42.620 --> 00:54:46.120
And I actually didn't know that this impacted my ability to vote here where I live.

00:54:46.120 --> 00:54:48.000
They said, hey, are you registered to vote?

00:54:48.000 --> 00:54:48.980
I said, of course I am.

00:54:48.980 --> 00:54:52.000
And I mentioned offhand, like, oh, just moved to the neighborhood.

00:54:52.000 --> 00:54:52.660
Love it so much.

00:54:52.860 --> 00:54:56.380
And they were like, oh, well, then you're not registered to vote because you've just changed your address.

00:54:56.380 --> 00:55:01.220
And so there's all these like little details that are important, sometimes hard to understand.

00:55:01.220 --> 00:55:03.560
Like it's a big government bureaucratic process at the end of the day.

00:55:03.560 --> 00:55:17.400
And so this guy, Jeff Jonas, used his software for this multi-state project that he called the Electronic Registration Information Center that basically uses machine learning to identify eligible voters.

00:55:17.400 --> 00:55:20.100
And then it cleans up the existing voter rolls.

00:55:20.100 --> 00:55:22.500
So, OK, great.

00:55:22.860 --> 00:55:30.760
Because the reason that you might need machine learning for this and not just like normal string matching is because people's names are sometimes slightly different on one form or another.

00:55:30.760 --> 00:55:34.880
Or maybe their address has changed or maybe multiple people have the same name.

00:55:34.880 --> 00:55:47.060
And so there's like all sorts of ways where like it's sometimes difficult to know whether a record in one database that represents a human being represents the same human being in another database, even if their names are identical or similar.

00:55:47.060 --> 00:55:49.380
So it gets a little bit complicated.

00:55:49.500 --> 00:55:59.520
And so that's why, you know, machine learning and AI are useful here because they can recognize these kind of subtle variations and patterns that ultimately lead you to be able to kind of triangulate a bunch of different data to point at the same human being.
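
NOTE
Editor's sketch, not from the episode: a minimal illustration of why exact string matching falls short when linking records across databases. The records, field names, equal weighting, and 0.85 threshold are all made-up assumptions. It uses plain string similarity from Python's standard library rather than machine learning, and it is not ERIC's actual method.
```python
from difflib import SequenceMatcher
def normalize(text):
    """Lowercase and drop punctuation and extra spaces so trivial differences do not count."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())
def similarity(a, b):
    """Return a 0..1 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
def likely_same_person(rec_a, rec_b, threshold=0.85):
    """Hypothetical rule: same birth date, plus a high average name/address similarity."""
    if rec_a["dob"] != rec_b["dob"]:
        return False
    name_score = similarity(rec_a["name"], rec_b["name"])
    address_score = similarity(rec_a["address"], rec_b["address"])
    return (name_score + address_score) / 2 >= threshold
# Invented example records, as they might appear in two different databases.
voter_roll = {"name": "Jonathan A. Smith", "address": "12 Oak St., Austin", "dob": "1980-04-02"}
dmv_record = {"name": "Jonathon A Smith", "address": "12 Oak St Austin", "dob": "1980-04-02"}
namesake = {"name": "Jonathan A. Smith", "address": "12 Oak St., Austin", "dob": "1951-11-30"}
print(voter_roll["name"] == dmv_record["name"])
print(likely_same_person(voter_roll, dmv_record))
print(likely_same_person(voter_roll, namesake))
```
The exact comparison fails on a one-letter spelling difference, while the fuzzy rule links the two records. The namesake with an identical name but a different birth date is kept separate.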

00:55:59.520 --> 00:56:12.360
And so this nonprofit, the Electronic Registration Information Center, identified 26 million people who are eligible to vote but unregistered and then 10 million registered voters who have moved.

00:56:12.360 --> 00:56:15.100
Who maybe became ineligible because of that.

00:56:15.100 --> 00:56:15.680
Yeah, that's right.

00:56:15.680 --> 00:56:17.500
So it somehow became ineligible.

00:56:17.500 --> 00:56:18.320
So they've moved.

00:56:18.540 --> 00:56:21.580
They appear on more than one list for whatever reason.

00:56:21.580 --> 00:56:22.280
This does happen.

00:56:22.280 --> 00:56:23.060
Or they died.

00:56:23.060 --> 00:56:26.080
So important to nobody.

00:56:26.080 --> 00:56:28.620
After your death, you remain registered to vote.

00:56:28.620 --> 00:56:32.700
And because, who are you going to even... you just died.

00:56:32.700 --> 00:56:34.520
It's not the thing that you think of on your deathbed.

00:56:34.520 --> 00:56:36.320
It's like, wait, take me off the voter rolls.

00:56:36.320 --> 00:56:37.600
The end is near.

00:56:37.600 --> 00:56:39.140
So, you know, this is actually pretty common.

00:56:39.140 --> 00:56:47.860
But again, like, matching a record of death with a name on a voter registration roll is much more difficult than it sounds.

00:56:48.340 --> 00:56:55.080
So anyway, super interesting project because I think in every sense, like we want our voter rolls to be authentic.

00:56:55.080 --> 00:56:58.220
We want them to have integrity because we want our democracy to have integrity.

00:56:58.220 --> 00:57:00.820
Like, you know, we should make sure that one person, one vote.

00:57:00.820 --> 00:57:06.600
But at the same time, we want to make sure that as many people who can vote do because that's how the whole process works.

00:57:06.600 --> 00:57:09.640
We want to make sure that the people are represented in their elected leaders.

00:57:09.640 --> 00:57:12.240
So, yeah, really, really interesting project.

00:57:12.240 --> 00:57:16.220
And I think hard to imagine this going wrong.

00:57:16.220 --> 00:57:18.520
Yeah, it's really good.

00:57:18.520 --> 00:57:28.200
There was some conversation in the article about this project where it talks about trying to thread the needle of not looking partisan.

00:57:28.200 --> 00:57:29.660
We just want people to vote.

00:57:29.660 --> 00:57:30.640
How do we do that?

00:57:30.640 --> 00:57:31.880
And that's an interesting challenge.

00:57:31.880 --> 00:57:32.340
Right.

00:57:32.340 --> 00:57:32.560
Right.

00:57:32.560 --> 00:57:32.720
Right.

00:57:32.720 --> 00:57:35.660
Because, of course, that's become a polarized topic of conversation.

00:57:35.920 --> 00:57:40.200
There's a lot of concern about whether or not voting rolls are authentic.

00:57:40.200 --> 00:57:44.700
There's this concern on one side of the aisle about voting fraud.

00:57:44.700 --> 00:57:48.420
And then there's concern from the other side of the aisle about voter suppression.

00:57:48.420 --> 00:57:51.220
So how do we get to some common ground?

00:57:51.220 --> 00:57:58.820
Because I think ultimately the common ground that everybody has is that we want an authentic process, where as many people who are eligible to vote can.

00:57:58.820 --> 00:58:01.340
I think we all believe in democracy at the end of the day.

00:58:01.460 --> 00:58:06.920
But there's these two kind of opposing points of view on what to prioritize when making sure that our process has integrity.

00:58:06.920 --> 00:58:15.040
So it was great to thread the needle and find a technology driven, sort of a nonpartisan approach to making sure that the system functions as best as it can.

00:58:15.040 --> 00:58:15.280
Right.

00:58:15.320 --> 00:58:26.520
Let's take this technology applied to a problem that is inherently political and try to make our political system better without, you know, raising the hairs and anger of any particular group.

00:58:26.520 --> 00:58:27.780
It's pretty good.

00:58:27.780 --> 00:58:28.000
Yeah.

00:58:28.000 --> 00:58:28.600
It's a challenge.

00:58:28.600 --> 00:58:28.900
It's a challenge.

00:58:28.940 --> 00:58:35.160
So it's almost like these days it's hard to find anything that AI can do that makes everybody happy.

00:58:35.160 --> 00:58:42.740
However, I think this next story that you found actually is something that I think universally we can all agree on.

00:58:42.740 --> 00:58:43.560
I think it's pretty awesome.

00:58:43.560 --> 00:58:47.100
Apolitical and universally beneficial, sort of unequivocally beneficial.

00:58:47.100 --> 00:58:47.560
Yeah.

00:58:47.780 --> 00:58:50.460
So I think there's two really interesting angles to this.

00:58:50.460 --> 00:58:52.200
Let's take the super positive one first.

00:58:52.200 --> 00:59:02.960
So this is about using machine learning and computer vision to fight breast cancer, which there's been several projects around this.

00:59:02.960 --> 00:59:05.680
And they're universally good, I think.

00:59:05.680 --> 00:59:05.940
Right.

00:59:05.940 --> 00:59:07.840
You have these mammograms.

00:59:07.840 --> 00:59:10.940
They have pictures of potentially cancerous regions.

00:59:11.160 --> 00:59:13.840
But the key to catch cancer is to catch it early.

00:59:13.840 --> 00:59:17.040
But to catch it early means to see things that are utterly subtle.

00:59:17.040 --> 00:59:17.600
Right.

00:59:17.600 --> 00:59:26.060
So they were saying something like 38% of radiologists or whatever group of doctors is called that looks at this.

00:59:26.060 --> 00:59:35.360
They were doing like a 38% catch rate on these really super early cancers.

00:59:35.360 --> 00:59:38.040
So Google came up with an AI.

00:59:38.580 --> 00:59:41.740
It's even just like, well, they took one off the shelf and they applied it.

00:59:41.740 --> 00:59:43.020
It's like something ridiculous.

00:59:43.020 --> 00:59:44.260
Like off the cuff.

00:59:44.260 --> 00:59:45.280
You're just like off.

00:59:45.280 --> 00:59:47.820
We pointed at these and asked it some questions.

00:59:47.820 --> 00:59:52.820
And apparently it found the cancerous regions 99% of the time.

00:59:52.820 --> 00:59:53.680
Yeah, it is.

00:59:53.680 --> 00:59:55.200
And for exactly the reasons that you mentioned.

00:59:55.200 --> 00:59:59.960
Because it's capable of seeing much subtler patterns than human beings can.

00:59:59.960 --> 01:00:03.880
Which is, I mean, exactly what machine vision is good for.

01:00:03.880 --> 01:00:04.200
Yeah.

01:00:04.460 --> 01:00:09.260
So first it started out like we're just going to show the AI all the pictures and just get its opinion.

01:00:09.260 --> 01:00:10.220
Like forget the doctors.

01:00:10.220 --> 01:00:13.560
And then they said, well, what if, like we have doctors.

01:00:13.560 --> 01:00:15.680
And I think there still always will be doctors.

01:00:15.680 --> 01:00:18.660
But what if we give the doctors a stethoscope?

01:00:18.660 --> 01:00:20.660
We give them a tongue depressor.

01:00:20.740 --> 01:00:21.860
We give them a camera.

01:00:21.860 --> 01:00:28.820
What if we gave them AI as one of the tools they could point at the people and ask, you know, and analyze what the results are coming out of that machine, right?

01:00:28.820 --> 01:00:32.340
So they basically said, all right, the second one would take six pathologists.

01:00:32.340 --> 01:00:46.480
And they said they found it easier to detect small cancerous regions.

01:00:46.480 --> 01:00:48.500
And it only took half the time.

01:00:48.500 --> 01:00:59.860
At least for me, like in terms of where AI can be helpful in health care, especially these life-saving moments when the earlier that you detect cancer, for example, the more likely you are to be able to treat it.

01:00:59.860 --> 01:01:01.800
Like it's so complicated.

01:01:01.800 --> 01:01:13.540
Like I feel like the medical profession, the amount that doctors are expected to hold in their minds in order to like recognize relevant pieces of information and put together like a theory of the case.

01:01:13.660 --> 01:01:16.880
You know, what might this be based on all the pieces of information that I'm seeing?

01:01:16.880 --> 01:01:19.000
It is really like a marvel.

01:01:19.000 --> 01:01:33.260
I have nothing but respect for doctors who go through the amount of training that they go through, the amount of information that they're able to retain, and the kind of creativity that's required to recognize these different symptoms and put together, you know, some likely candidates for what might be ailing the patient.

01:01:33.260 --> 01:01:42.400
However, that type of pattern matching, assuming that you can capture accurate data, is exactly what machines have become really, really, really good at.

01:01:43.000 --> 01:02:03.860
And so I do think this is an area where, again, assuming that good data is available, and I think when it comes to something that is more kind of binary to determine, like whether or not there's cancer tissue, like cancer tissue present, and we can take an image of a region of the body and sort of reliably provide that to the algorithm that's trying to make a determination about whether or not cancer is present.

01:02:03.860 --> 01:02:22.560
Like, in examples like that, where the data is readily available, I think there's no reason that AI can't be a really powerful assistant so that doctors with all their knowledge and creativity can augment that and sort of help recognize where they might have blind spots or where there might be technical limitations to their ability as human beings.

01:02:22.560 --> 01:02:52.540
Like, our eyes only work so well.

01:02:52.540 --> 01:03:22.520
I think that's open source earlier.

01:03:22.520 --> 01:03:27.440
Aggregate knowledge of the medical community can be available to any practitioner, which is kind of amazing.

01:03:27.440 --> 01:03:28.240
Yeah, it's pretty amazing.

01:03:28.240 --> 01:03:33.740
So you spoke about the computers detecting these really careful, nuanced details in images.

01:03:34.360 --> 01:03:40.320
And I would say China is doing a whole lot of interesting stuff like that to sort of assess people.

01:03:40.320 --> 01:03:45.240
So they've got crazy facial recognition stuff going on over there.

01:03:45.240 --> 01:03:50.300
They have a system that will detect your gait and identify you by the way you walk.

01:03:50.300 --> 01:03:51.180
That's, yeah.

01:03:51.180 --> 01:03:55.560
And all of these types of things are generally around this idea of like a social credit.

01:03:55.560 --> 01:04:00.560
Like, can we just like turn cameras and machines on our population and find the good ones in the background?

01:04:00.560 --> 01:04:05.520
It's true.

01:04:05.520 --> 01:04:06.880
I mean, it's a really interesting.

01:04:06.880 --> 01:04:08.260
It sounds crazy, right?

01:04:08.260 --> 01:04:11.660
Like you're laughing, but it's kind of like you're going to laugh or you're going to cry.

01:04:11.660 --> 01:04:12.200
Pick one.

01:04:12.200 --> 01:04:12.980
Well, it's true.

01:04:12.980 --> 01:04:13.340
It's true.

01:04:13.400 --> 01:04:19.240
So, A, I think China is a really interesting counterpoint to the debate that we're having here in the U.S.

01:04:19.240 --> 01:04:25.500
where here our values are very much privacy, individual freedom.

01:04:25.500 --> 01:04:30.000
And whenever technology encroaches on that privacy, we're very suspicious.

01:04:30.720 --> 01:04:37.600
So even the fact that advertisers can target me based on my browsing history, like, you know, makes a lot of people uncomfortable.

01:04:37.600 --> 01:04:39.520
Like maybe that's a violation of my privacy.

01:04:39.520 --> 01:04:46.180
Even though the advertiser doesn't know who I am as an individual, it's possible to make a pretty good guess about who I might be based on my behavior.

01:04:46.180 --> 01:04:49.620
They know you want a Leatherman pocket knife, but they just don't know your name.

01:04:49.620 --> 01:04:50.240
Right, right.

01:04:50.520 --> 01:04:55.740
And they know that people like me also like other products that people who like Leathermans like.

01:04:55.740 --> 01:04:56.100
I don't know.

01:04:56.100 --> 01:04:57.320
We're outside of my comfort zone.

01:04:57.320 --> 01:04:59.040
I'm sorry.

01:04:59.040 --> 01:04:59.980
I'm just making this up.

01:04:59.980 --> 01:05:01.880
But you get the idea, right?

01:05:01.880 --> 01:05:02.180
Yeah, exactly.

01:05:02.180 --> 01:05:05.580
But here, you know, a lot of the conversation is like, is that okay?

01:05:05.580 --> 01:05:07.260
Whereas in China, they've gone the other direction.

01:05:07.260 --> 01:05:12.500
They were like, I mean, it's a way that we can find the best people and maybe maintain social order.

01:05:12.500 --> 01:05:13.280
I mean, I don't know.

01:05:13.280 --> 01:05:13.980
I don't know what they're thinking.

01:05:13.980 --> 01:05:19.720
Just quickly, because not everybody will have seen this, but there's a Netflix TV show called Black Mirror.

01:05:20.140 --> 01:05:28.380
If you are into kind of a dystopian technology future or technology enabled dystopian future or whatever, I highly recommend Black Mirror.

01:05:28.380 --> 01:05:32.060
There is an entire episode dedicated to a social credit system.

01:05:32.060 --> 01:05:34.920
And everybody in the episode is ranked.

01:05:34.920 --> 01:05:40.760
Like every social interaction you have, you can basically like rate the person you had the interaction with.

01:05:40.760 --> 01:05:43.960
So did you get a smile from the barista at your favorite coffee shop?

01:05:43.960 --> 01:05:44.800
Five stars.

01:05:44.800 --> 01:05:45.760
Did you tip?

01:05:45.760 --> 01:05:47.500
You get five stars back from the barista.

01:05:47.500 --> 01:05:49.940
Like, did you take an Uber and the person was friendly?

01:05:50.020 --> 01:05:50.360
Five stars.

01:05:50.360 --> 01:05:52.460
And some of these things we already do.

01:05:52.460 --> 01:05:55.020
Like, we rate our Uber drivers and our Uber drivers rate us.

01:05:55.020 --> 01:06:01.140
But they're usually kept within the little space of Uber or Starbucks or whatever.

01:06:01.140 --> 01:06:02.940
It's not like your global rating.

01:06:02.940 --> 01:06:05.520
Like, you don't get a mortgage based on how you treated your Uber driver.

01:06:05.520 --> 01:06:06.640
Or not.

01:06:06.640 --> 01:06:07.020
Right.

01:06:07.020 --> 01:06:11.480
Well, I mean, you get your babysitting gig depending on what you posted on social media.

01:06:11.960 --> 01:06:14.420
But that's, again, here, like, it's like, oh, we're debating.

01:06:14.420 --> 01:06:24.900
Is it okay to judge potential babysitters based on that rage post they made about the ending of the Black Mirror episode that they thought was stupid?

01:06:24.900 --> 01:06:25.340
Like, whatever.

01:06:25.740 --> 01:06:27.060
But China's gone the other direction.

01:06:27.060 --> 01:06:40.920
And so they, like, exactly like you were saying, they have all of these ways in which they are scoring and rating people based on the social behaviors that they can observe, either through surveillance or people's behavior online, their social media behavior.

01:06:41.500 --> 01:06:49.520
And so the city of Shanghai in China, you may be familiar with it, is going to pool data from several departments.

01:06:49.520 --> 01:06:55.920
And they're going to reward and punish about 22 million citizens based on their actions and reputations by the end of 2020.

01:06:55.920 --> 01:06:57.280
So social.

01:06:57.280 --> 01:06:57.600
Right.

01:06:57.600 --> 01:06:58.440
Like a year and a half.

01:06:58.440 --> 01:06:59.220
Yeah, exactly.

01:06:59.220 --> 01:06:59.640
Like.

01:06:59.640 --> 01:07:01.040
It's not in the future very far.

01:07:01.040 --> 01:07:02.740
We're.

01:07:02.740 --> 01:07:05.520
Yeah, this is not something like far future dystopia.

01:07:05.520 --> 01:07:06.480
This is today, people.

01:07:06.480 --> 01:07:09.540
And so pro-social behaviors are rewarded.

01:07:09.540 --> 01:07:15.320
So if you do volunteer work, if you donate blood, which actually sounds cool, like I'd like to be rewarded for those things.

01:07:15.320 --> 01:07:16.680
This doesn't seem that bad.

01:07:16.680 --> 01:07:21.760
But people who violate traffic laws or charge under-the-table fees are punished.

01:07:21.760 --> 01:07:22.740
OK.

01:07:22.740 --> 01:07:24.000
If we're getting in.

01:07:24.000 --> 01:07:32.220
I mean, in a way, like we're already socially rewarded for doing good things and we're socially harmed for doing mean things, I guess.

01:07:32.220 --> 01:07:34.360
Or things that aren't cool.

01:07:34.980 --> 01:07:42.940
And so in a way, it's like China wants to use technology to optimize the social systems that we already have in place.

01:07:42.940 --> 01:07:46.260
Like if I'm rude to you, then I suffer social consequences for that.

01:07:46.260 --> 01:07:47.980
But it's not really captured anywhere.

01:07:47.980 --> 01:07:50.060
I guess like that damage is pretty localized.

01:07:50.060 --> 01:07:52.060
Like it doesn't go on your permanent record.

01:07:52.060 --> 01:07:54.100
Exactly.

01:07:55.100 --> 01:08:00.580
Everybody's permanent record is now captured digitally and managed by artificial intelligence.

01:08:00.580 --> 01:08:01.720
That's welcome to.

01:08:01.720 --> 01:08:03.240
Welcome to China.

01:08:03.240 --> 01:08:03.780
Yeah.

01:08:03.780 --> 01:08:05.740
So, yeah, it's pretty interesting.

01:08:05.740 --> 01:08:12.240
I mean, on one hand, I can see this benefiting society, but it just seems Black Mirror-esque to me as well.

01:08:12.920 --> 01:08:16.780
And so you might be wondering, like, well, what the heck is a punishment, right?

01:08:16.780 --> 01:08:29.220
So they say in another city, Hangzhou, they rolled out a credit system earlier this year, rewarding pro-social behaviors such as volunteer work or blood donations and punishing those who violate traffic laws,

01:08:29.220 --> 01:08:31.380
charge under-the-table fees, and whatnot.

01:08:32.520 --> 01:08:42.520
And statistically, they said that by the end of May, they had blocked more than 11 million flights and 4 million high-speed train trips from the bad guys.

01:08:42.520 --> 01:08:43.520
Right.

01:08:43.520 --> 01:08:44.080
Right.

01:08:44.080 --> 01:08:47.940
And not bad guys, like, broke a law.

01:08:47.940 --> 01:08:48.320
Not.

01:08:48.320 --> 01:08:52.260
Bad guys, like, doesn't volunteer enough.

01:08:52.260 --> 01:08:55.500
Doesn't seem like, you know, heart's probably not in the right place.

01:08:55.500 --> 01:08:57.560
We're not going to let you take this train trip, buddy.

01:08:57.560 --> 01:09:00.700
Just denied.

01:09:00.700 --> 01:09:01.120
I'm sorry.

01:09:01.120 --> 01:09:02.440
Yeah, you're like, I was wrong.

01:09:02.440 --> 01:09:03.600
You got three and a half.

01:09:03.600 --> 01:09:04.360
Yeah, totally.

01:09:04.360 --> 01:09:05.740
Like, I was going to go visit my family.

01:09:05.740 --> 01:09:10.360
I wasn't going to be able to sit with them on the plane, but at least I was going to take a trip home for the holiday or whatever.

01:09:10.360 --> 01:09:12.520
And that, you know, trip was thwarted.

01:09:12.520 --> 01:09:19.500
Like, actual real-life consequences for things that we've all come to accept as, like, pretty much being a right if we can afford to pay for them.

01:09:19.500 --> 01:09:20.280
Right.

01:09:20.280 --> 01:09:20.860
Yeah.

01:09:20.860 --> 01:09:22.880
Yeah, kind of bizarre.

01:09:22.880 --> 01:09:25.600
Well, it's going to be a very interesting social experiment.

01:09:25.600 --> 01:09:25.840
Yeah.

01:09:25.840 --> 01:09:26.980
I don't want to be part of it.

01:09:26.980 --> 01:09:30.940
I'm glad that I won't be part of it either, although I wonder if everybody will just be, like, super friendly.

01:09:30.940 --> 01:09:33.920
Well, but hey, like, maybe that's the outcome that we're just not seeing.

01:09:33.920 --> 01:09:40.260
It's like, we'll go to Shanghai and be like, man, people are just, this is like a Leave it to Beaver episode.

01:09:40.260 --> 01:09:41.860
Exactly.

01:09:41.860 --> 01:09:44.760
Well, you know, how much of that would be disingenuous?

01:09:44.760 --> 01:09:46.760
You know, like, oh, bless your heart.

01:09:46.760 --> 01:09:47.900
You know, that type of stuff.

01:09:47.900 --> 01:09:48.240
You know?

01:09:48.880 --> 01:09:50.720
Then we just need to optimize the algorithm, man.

01:09:50.720 --> 01:09:51.520
Like, downvotes.

01:09:51.520 --> 01:09:52.140
Exactly.

01:09:52.140 --> 01:09:54.360
Downvotes for disingenuous social pleasantries.

01:09:54.360 --> 01:09:55.640
It's a double downvote.

01:09:55.640 --> 01:09:57.960
You were mean and you were disingenuous about your niceness.

01:09:57.960 --> 01:09:58.640
Boom.

01:09:58.640 --> 01:09:59.360
All right.

01:09:59.360 --> 01:10:04.680
So, the last one is, I thought it'd be fun to leave people with something practical on this one.

01:10:04.680 --> 01:10:07.500
Have you seen this new data set search from Google?

01:10:07.860 --> 01:10:08.180
Yes.

01:10:08.180 --> 01:10:10.800
I am so, so into this.

01:10:10.800 --> 01:10:16.320
Partially because it's useful and partially because I've worked on a project where we tried to accomplish something similar.

01:10:16.320 --> 01:10:17.860
And it is super freaking hard.

01:10:20.560 --> 01:10:21.640
So, props to Google.

01:10:21.640 --> 01:10:23.960
Of course, if someone's going to get search right.

01:10:23.960 --> 01:10:24.540
Google.

01:10:24.540 --> 01:10:25.260
Yeah, for sure.

01:10:25.260 --> 01:10:26.420
So, tell people what this is.

01:10:26.420 --> 01:10:27.920
Well, I mean, it's just like it sounds.

01:10:27.920 --> 01:10:29.080
It's a data set search.

01:10:29.080 --> 01:10:35.220
So, sometimes the data is like an actual data set that got captured somewhere as like a CSV.

01:10:35.220 --> 01:10:43.820
Or it's data that was extracted from sort of less structured material, like a table in a web page or something.

01:10:43.820 --> 01:10:46.580
But the way that you can search for it is like by topic.

01:10:46.760 --> 01:10:54.160
And I just want, as somebody who suffered through this very difficult problem, I just want people to understand what that actually would mean.

01:10:54.160 --> 01:10:57.720
Like to know what a data set is about in air quotes.

01:10:57.720 --> 01:11:07.000
And a data set in this case, if you're not like a data person, it's like a spreadsheet with column names at the top and a bunch of rows, for example, in the simplest case.

01:11:07.000 --> 01:11:14.080
And so, maybe you can look at the column headers and perform some machine learning or something and get a sense, if the column headers are really well named.

01:11:14.080 --> 01:11:15.180
But they never are.

01:11:15.180 --> 01:11:15.740
Trust me.

01:11:15.740 --> 01:11:16.500
They never are.

01:11:16.500 --> 01:11:25.420
They always have some random name, like the name of like the column in the database table that was named in the 1970s that like this thing ultimately spit out in the first place.

01:11:25.420 --> 01:11:29.120
And there's like random characters in there that have no place in any data set ever.

01:11:29.120 --> 01:11:30.620
And there's a bunch of random missing values.

01:11:30.620 --> 01:11:32.100
And everything is a number.

01:11:32.100 --> 01:11:35.940
And so, you can't actually make any guesses about what's in there because it's just a bunch of random numbers.

01:11:35.940 --> 01:11:39.300
And so, you end up looking at weird things like the structure of the data set.

01:11:39.300 --> 01:11:42.000
And like you go to some strange places, my friends.

01:11:42.000 --> 01:11:53.540
But the fact that Google's figured out how to catalog this type of information and make it accessible to people, this like almost like semantic search for data sets, it really is a real feat of machine learning and engineering.
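
NOTE
Editor's sketch, not from the episode: when column names are cryptic, about all that is left to inspect is a data set's structure. This is a minimal illustration of profiling that structure with Python's standard library. The sample data, the column names, and the `profile` helper are invented for illustration, and this is not how Google's data set search works.
```python
import csv
import io
# Invented sample: cryptic legacy column names, a stray character, and missing values.
RAW = """FLD_01,X2#,amt_q
1001,3,19.5
1002,,20.25
1003,7,
"""
def looks_numeric(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False
def profile(csv_text):
    """Summarize each column: row count, missing values, and whether every present value is numeric."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v != ""]
        summary[column] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "all_numeric": all(looks_numeric(v) for v in present),
        }
    return summary
for name, stats in profile(RAW).items():
    print(name, stats)
```
Every column here comes back as numeric with a few gaps, which says almost nothing about what the data set is about. That is the difficulty described above.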

01:11:53.780 --> 01:11:54.380
Yeah, it's awesome.

01:11:54.380 --> 01:12:02.820
So, you just go to the data search page, which is toolbox.google.com/datasetsearch, at least for the time being.

01:12:03.040 --> 01:12:06.680
And you just type in a search and it'll give you a list, like here's all the places we found.

01:12:06.680 --> 01:12:09.460
And some of them will be like legitimate just raw data.

01:12:09.460 --> 01:12:13.120
And sometimes it's just embedded tables in a web page and all sorts of stuff.

01:12:13.120 --> 01:12:13.800
It's really well done.

01:12:13.800 --> 01:12:14.040
Yeah.

01:12:14.040 --> 01:12:19.060
It kind of lets you look for data in the way that you would look for any other content on the internet.

01:12:19.060 --> 01:12:21.820
Just the way that you would, well, the way that you would search normal Google.

01:12:21.820 --> 01:12:23.240
Yeah.

01:12:23.240 --> 01:12:28.080
Which is pretty impressive because there's usually not that much context around data sets on the internet.

01:12:28.080 --> 01:12:30.520
But like that kind of semantic real world context.

01:12:30.520 --> 01:12:30.780
So.

01:12:30.780 --> 01:12:31.380
Yeah, it's great.

01:12:31.380 --> 01:12:37.260
So, if people are out there looking for data sets, definitely drop in on Google data set search and throw some stuff in there.

01:12:37.260 --> 01:12:38.140
There's some good answers.

01:12:38.140 --> 01:12:40.740
I also really like the 538 data.

01:12:40.740 --> 01:12:41.680
Have you seen that?

01:12:41.680 --> 01:12:42.420
Oh, yeah, totally.

01:12:42.420 --> 01:12:47.680
So, the website 538, for folks who aren't familiar with it, kind of a data journalism focused website.

01:12:47.680 --> 01:12:49.320
A lot about sports, a lot about politics.

01:12:49.320 --> 01:12:53.480
But most of the work is driven by some type of data analysis.

01:12:53.480 --> 01:12:53.780
Yeah.

01:12:53.780 --> 01:12:59.980
And over at github.com/538, all spelled out, slash data, they have all of the data sets they use to drive their journalism,

01:12:59.980 --> 01:13:02.460
which is like hundreds of different data sets.

01:13:02.460 --> 01:13:05.900
So, I'd love to go and grab data there for various things I'm looking into.

01:13:05.900 --> 01:13:06.640
Yeah, absolutely.

01:13:06.640 --> 01:13:07.480
It's a great resource.

01:13:07.480 --> 01:13:11.440
Especially because then you can go see how they use the data and what questions they asked.

01:13:11.440 --> 01:13:14.860
And then it's kind of an easy way to, well, it's a great way to learn.

01:13:14.860 --> 01:13:23.140
Especially if you're kind of toying around with data for the first time and just getting comfortable with some of the amazing data exploration and modeling tools in Python.

01:13:23.700 --> 01:13:24.420
Best software language.

01:13:24.420 --> 01:13:25.340
Take that.

01:13:25.340 --> 01:13:26.420
R users.

01:13:26.420 --> 01:13:27.340
That's right.

01:13:27.340 --> 01:13:28.420
Stick it.

01:13:28.420 --> 01:13:33.600
So, if you're looking for data sets, where do you go?

01:13:33.600 --> 01:13:35.380
Now, I go to data set search.

01:13:36.340 --> 01:13:39.780
But there's also a lot of great, it depends on kind of what you're looking for.

01:13:40.280 --> 01:13:46.980
But there's a company based in Austin, actually, called data.world that is trying to do kind of a similar thing to Google's data set search.

01:13:46.980 --> 01:13:48.260
But it's more curated.

01:13:48.260 --> 01:13:49.500
So, it's kind of community-based.

01:13:49.500 --> 01:13:53.160
People are sharing interesting data sets that they found and contextualizing them.

01:13:53.260 --> 01:13:58.920
So, a lot of Data for Democracy volunteers, when they're working on projects, they'll upload them to data.world to make the data available.

01:13:58.920 --> 01:14:04.080
But then there's also a lot of kind of open data portals, both at the national level.

01:14:04.080 --> 01:14:06.420
So, there's still data.gov.

01:14:06.420 --> 01:14:07.860
I believe that's still up.

01:14:07.860 --> 01:14:11.480
There were some rumors that it might all come down eventually, but it's still around, I think.

01:14:11.820 --> 01:14:19.080
And a lot of cities that have had open data projects or open data initiatives have collected all of those into a portal that's specific to where you live.

01:14:19.080 --> 01:14:28.620
So, if you want to find out, you know, how many dogs and cats are in your local animal shelter from one month to the next or whatever the analysis is that you're curious about that's super relevant to your city,

01:14:28.620 --> 01:14:32.780
you can probably find one that is for the city closest to where you live.

01:14:32.780 --> 01:14:37.300
I know there's one here in Austin, New York, Chicago, L.A., San Francisco.

01:14:38.060 --> 01:14:44.780
A lot of cities have this now, and it's a great way to find data that answer interesting questions and just toy around a little bit and learn more about where you live.

01:14:44.780 --> 01:14:45.480
Yeah, that's cool.

01:14:45.480 --> 01:14:46.940
And data.world's cool as well.

01:14:46.940 --> 01:14:47.540
I haven't seen that.

01:14:47.540 --> 01:14:48.500
All right, Jonathan.

01:14:48.500 --> 01:14:51.020
Well, that's the 10 items for the year in review.

01:14:51.020 --> 01:14:56.200
And I definitely think there's an interesting trend, and it's been super fun to talk to you about it.

01:14:56.200 --> 01:14:56.720
Yeah, of course.

01:14:56.720 --> 01:14:58.000
Thanks so much for having me on the show.

01:14:58.000 --> 01:15:02.640
And I hope everybody had a wonderful 2018 and is looking forward to an exciting year ahead.

01:15:02.640 --> 01:15:03.040
Absolutely.

01:15:03.040 --> 01:15:05.620
So, I have a few quick questions before you get to go, though.

01:15:05.620 --> 01:15:07.920
First, when are you coming back to podcasting?

01:15:08.380 --> 01:15:10.440
Are you just going to make these guest appearances?

01:15:10.440 --> 01:15:13.720
Do you have any plans to come back, or are you going to just keep working on your projects?

01:15:13.720 --> 01:15:19.860
You know, I'm thinking maybe towards the end of 2019, I'll try and find a year in review podcast where I can sit in.

01:15:19.860 --> 01:15:21.680
Well, you can definitely come back in 2019.

01:15:21.680 --> 01:15:22.240
That'll be good.

01:15:22.240 --> 01:15:25.140
I'm just pre-positioning for the next year.

01:15:25.140 --> 01:15:26.380
I love doing it.

01:15:26.380 --> 01:15:27.080
It's really fun.

01:15:27.080 --> 01:15:34.240
And I have no timetable for actually returning, though, because even the thought of committing to the amount of work that it takes to put on an episode.

01:15:34.240 --> 01:15:36.640
Yeah, it's a crazy amount of work per episode.

01:15:36.640 --> 01:15:37.480
That's for sure.

01:15:37.800 --> 01:15:41.320
Well, I'm really glad you came out of retirement to do this one.

01:15:41.320 --> 01:15:42.000
Oh, thanks, man.

01:15:42.000 --> 01:15:42.580
I really appreciate it.

01:15:42.580 --> 01:15:44.920
And I'm super happy to come on the show and do it.

01:15:44.920 --> 01:15:49.320
I mean, especially because I won't be responsible for any of the work once this is recorded.

01:15:49.320 --> 01:15:51.080
Exactly.

01:15:51.080 --> 01:15:52.660
That's the way to do it.

01:15:52.660 --> 01:15:53.540
Be a guest.

01:15:53.540 --> 01:15:54.180
All right.

01:15:54.180 --> 01:15:57.320
Last two questions, which I know you answered a couple years ago, but it could have changed.

01:15:57.920 --> 01:16:01.560
So if you write some Python code, data science or otherwise, what editor do you use?

01:16:01.560 --> 01:16:04.300
Well, I use Jupyter Notebooks pretty heavily.

01:16:04.300 --> 01:16:09.240
And actually, more recently than not, I write much less software.

01:16:09.240 --> 01:16:13.080
Like almost everything I write now is some type of like exploratory data analysis.

01:16:13.080 --> 01:16:14.260
So it's almost entirely in Python.

01:16:14.260 --> 01:16:20.840
But when I do code, I still use Sublime Text a lot, which I feel like is kind of old school.

01:16:21.260 --> 01:16:22.940
I hear that the world has moved on.

01:16:22.940 --> 01:16:26.420
There's, like, Python-specific quasi-IDEs.

01:16:26.420 --> 01:16:28.540
And I'm out of the game, Michael.

01:16:28.540 --> 01:16:29.220
What can I tell you?

01:16:29.220 --> 01:16:30.280
It's all right.

01:16:30.280 --> 01:16:30.960
No, it's all good.

01:16:30.960 --> 01:16:31.580
It's a great one.

01:16:31.580 --> 01:16:35.240
And then a notable package on PyPI.

01:16:35.240 --> 01:16:36.020
Oh, man.

01:16:36.020 --> 01:16:48.180
Well, I mean, given what we've been talking about, I would strongly encourage everybody to check out a package called Keras, K-E-R-A-S, or TensorFlow,

01:16:48.180 --> 01:16:54.200
which are both very popular machine learning libraries or kind of machine learning frameworks.

01:16:54.200 --> 01:16:55.960
Keras is more high level.

01:16:55.960 --> 01:17:00.160
It's a very accessible way to get into exploring neural networks, basically.

01:17:00.160 --> 01:17:07.040
So there's lots of kind of popular machine learning techniques that are sort of more traditional and super effective.

01:17:07.040 --> 01:17:08.020
There's nothing wrong with them.

01:17:08.020 --> 01:17:14.680
But the world has kind of moved into deep learning, and AI is largely based on neural networks.

01:17:15.440 --> 01:17:18.720
And Keras is a great way to explore those without getting too much into the weeds.

01:17:18.720 --> 01:17:24.540
And then once you get into the weeds and you find it kind of interesting and you want to get down lower level and really play around with some of the network structures yourself,

01:17:24.540 --> 01:17:28.660
then you can get into TensorFlow, which is one of the underlying libraries that Keras is built on top of.

01:17:28.660 --> 01:17:31.140
So highly recommend both of those.

01:17:31.140 --> 01:17:31.500
Right on.

01:17:31.500 --> 01:17:33.500
Yeah, I've definitely heard nothing but good things about them.

01:17:33.500 --> 01:17:34.200
All right.

01:17:34.200 --> 01:17:34.920
Final call to action.

01:17:34.920 --> 01:17:38.980
Maybe especially around this whole democracy data pledge and stuff.

01:17:38.980 --> 01:17:40.660
People heard all your stories.

01:17:40.660 --> 01:17:41.220
They're interested.

01:17:41.220 --> 01:17:42.280
What can they do?

01:17:42.280 --> 01:17:42.600
They can.

01:17:42.720 --> 01:17:47.720
If you're like, I don't want to live in a Black Mirror episode, know that you have the power to change it.

01:17:47.720 --> 01:17:49.920
Python programmer and podcast listener.

01:17:49.920 --> 01:17:51.920
The power is in your hands.

01:17:51.920 --> 01:18:01.520
First up, go to datafordemocracy.org and you can sign the ethics pledge and you can let the world know that you believe in ethical technology.

01:18:01.520 --> 01:18:04.080
And together, we'll make our voices heard.

01:18:04.080 --> 01:18:05.920
We'll make sure the practitioners have a voice in this whole thing.

01:18:05.920 --> 01:18:12.240
And so datafordemocracy.org, we really would love to both have you sign the pledge but then also participate in the conversation.

01:18:12.240 --> 01:18:20.020
Because there's a whole community there that's really hashing out these ethical principles, making sure they work for real-world technologists who have real-world jobs.

01:18:20.020 --> 01:18:22.540
So you can contribute to that process.

01:18:22.540 --> 01:18:23.280
It's open source.

01:18:23.280 --> 01:18:24.080
It's happening on GitHub.

01:18:24.080 --> 01:18:25.680
We'd love to have you participate.

01:18:25.680 --> 01:18:26.040
All right.

01:18:26.040 --> 01:18:26.720
It's a great project.

01:18:26.720 --> 01:18:28.500
Hopefully people go and check it out.

01:18:28.500 --> 01:18:30.200
Jonathan, thanks for being on the show.

01:18:30.200 --> 01:18:32.620
It's been great to have you back, if just for an hour this year.

01:18:33.500 --> 01:18:34.020
Thanks, Michael.

01:18:34.020 --> 01:18:35.160
I really appreciate the opportunity.

01:18:35.160 --> 01:18:35.740
It was super fun.

01:18:35.740 --> 01:18:36.100
You bet.

01:18:36.100 --> 01:18:36.380
All right.

01:18:36.380 --> 01:18:36.640
Bye-bye.

01:18:36.640 --> 01:18:40.040
This has been another episode of Talk Python To Me.

01:18:40.040 --> 01:18:42.580
Our guest on this episode was Jonathan Morgan.

01:18:42.580 --> 01:18:46.320
And it's been brought to you by us over at Talk Python Training.

01:18:46.320 --> 01:18:48.440
Want to level up your Python?

01:18:48.440 --> 01:18:53.280
If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

01:18:53.280 --> 01:19:01.440
Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python.

01:19:01.700 --> 01:19:06.100
And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle.

01:19:06.100 --> 01:19:08.000
It's like a subscription that never expires.

01:19:08.000 --> 01:19:10.300
Be sure to subscribe to the show.

01:19:10.300 --> 01:19:12.700
Open your favorite podcatcher and search for Python.

01:19:12.700 --> 01:19:13.920
We should be right at the top.

01:19:13.920 --> 01:19:22.920
You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:19:22.920 --> 01:19:25.020
This is your host, Michael Kennedy.

01:19:25.020 --> 01:19:26.520
Thanks so much for listening.

01:19:26.520 --> 01:19:27.580
I really appreciate it.

01:19:27.580 --> 01:19:29.340
Now get out there and write some Python code.

01:19:29.340 --> 01:19:29.480
I'll see you next time.


