WEBVTT

00:00:00.001 --> 00:00:02.840
You've often heard me talk about Python as a superpower.

00:00:02.840 --> 00:00:07.960
It can amplify whatever you're interested in or what you've specialized in for your career.

00:00:07.960 --> 00:00:10.260
This episode is an amazing example of this.

00:00:10.260 --> 00:00:11.740
You'll meet Cornelius van Litt.

00:00:11.740 --> 00:00:17.060
He's a scholar of medieval Islamic philosophy and works at Utrecht University in the Netherlands.

00:00:17.060 --> 00:00:19.260
What he's doing with Python is pretty awesome.

00:00:19.260 --> 00:00:22.480
Even if you aren't interested in digital humanities and that type of research,

00:00:22.480 --> 00:00:27.740
the example set by Cornelius is a blueprint for bringing Python into your world and for those around you.

00:00:27.740 --> 00:00:29.080
I think you'll enjoy this conversation.

00:00:29.400 --> 00:00:34.020
This is Talk Python To Me, episode 230, recorded August 27th, 2019.

00:00:34.020 --> 00:00:53.720
Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

00:00:53.720 --> 00:00:55.660
This is your host, Michael Kennedy.

00:00:55.660 --> 00:00:57.780
Follow me on Twitter where I'm @mkennedy.

00:00:57.980 --> 00:01:01.540
Keep up with the show and listen to past episodes at talkpython.fm.

00:01:01.540 --> 00:01:03.980
And follow the show on Twitter via @talkpython.

00:01:03.980 --> 00:01:08.940
This episode is brought to you by the podcast Command Line Heroes from Red Hat and Linode.

00:01:08.940 --> 00:01:11.080
Please check out what they're offering during their segments.

00:01:11.080 --> 00:01:12.360
It really helps support the show.

00:01:12.360 --> 00:01:14.040
Hey, folks.

00:01:14.040 --> 00:01:16.420
Before we get to the interview, I have some exciting news.

00:01:16.420 --> 00:01:21.700
We've teamed up with Humble Bundle to launch a great bundle of Python educational goodness.

00:01:22.100 --> 00:01:30.680
For a couple of weeks, you can get three of our courses along with great content from Real Python, PyBites, and many others for as little as just $1.

00:01:30.680 --> 00:01:36.820
If you've been on the fence about trying one of our courses, here's a chance to get three of them along with a bunch of other great stuff.

00:01:36.820 --> 00:01:41.020
Just visit talkpython.fm/HB2019.

00:01:41.020 --> 00:01:43.140
That's HB2019.

00:01:43.700 --> 00:01:45.740
And be sure to check it out before time runs out.

00:01:45.740 --> 00:01:47.340
Now, let's get to that interview.

00:01:47.340 --> 00:01:49.800
Cornelius, welcome to Talk Python To Me.

00:01:49.800 --> 00:01:50.640
Thanks for having me.

00:01:50.640 --> 00:01:51.800
It's great to have you here.

00:01:51.800 --> 00:01:59.700
I know we're going to have a fun conversation talking about digital humanities, which, honestly, I didn't know a whole lot about before we started talking.

00:01:59.840 --> 00:02:03.620
But it's really a cool intersection of, well, humanities and software.

00:02:03.620 --> 00:02:04.360
That's right.

00:02:04.360 --> 00:02:06.120
And it's only growing and growing.

00:02:06.120 --> 00:02:09.200
And soon it will just consume the entirety of the humanities.

00:02:09.200 --> 00:02:10.360
That's our goal.

00:02:10.360 --> 00:02:13.800
There was this article recently written that Python is eating the world.

00:02:13.800 --> 00:02:15.100
And it may be true.

00:02:15.100 --> 00:02:16.440
We'll see about that as we go.

00:02:16.440 --> 00:02:19.180
Before we get into all that, though, let's start with your story.

00:02:19.180 --> 00:02:20.580
How did you get into programming in Python?

00:02:20.580 --> 00:02:22.440
Well, this goes back many years.

00:02:22.440 --> 00:02:29.720
It was actually, if you will believe it, the last year of my elementary school, I programmed the website of the school.

00:02:29.720 --> 00:02:31.240
This is 1999.

00:02:31.240 --> 00:02:32.800
So I just did it in Notepad.

00:02:32.800 --> 00:02:33.440
That's amazing.

00:02:33.440 --> 00:02:33.840
Yep.

00:02:33.840 --> 00:02:35.440
Just HTML tags.

00:02:35.440 --> 00:02:38.960
There weren't really any IDEs, at least not that I know of.

00:02:38.960 --> 00:02:40.660
There was like FrontPage and Dreamweaver.

00:02:40.660 --> 00:02:43.020
But, you know, that was kind of cheating in a way.

00:02:43.020 --> 00:02:43.440
Yeah.

00:02:43.440 --> 00:02:46.980
It was like you write in Word and then like you publish it as a web page.

00:02:46.980 --> 00:02:47.700
That was weird, right?

00:02:47.700 --> 00:02:50.340
Everybody knew that that was not the right way to do it.

00:02:50.340 --> 00:02:50.840
That's right.

00:02:50.840 --> 00:02:51.820
Yes, they did.

00:02:51.820 --> 00:02:53.700
So, wow, that's really young.

00:02:53.700 --> 00:02:56.000
How did you even get that opportunity to do that?

00:02:56.000 --> 00:02:58.300
I don't know exactly how I got this book.

00:02:58.300 --> 00:03:01.140
It was some sort of, you know, HTML for dummies kind of book.

00:03:01.140 --> 00:03:02.900
And that's what I used.

00:03:02.900 --> 00:03:08.400
And I guess I just showed it to one of the teachers and he was like, well, actually, this school doesn't even have a website yet.

00:03:08.400 --> 00:03:09.800
So that's really cool.

00:03:09.800 --> 00:03:19.240
After that, my dad gave me this brick of a book on Visual Basic, thinking that, well, okay, it has visual elements to it, you know, but that was way too hard.

00:03:19.240 --> 00:03:24.820
And it's really so great to see that there's so many resources out there now for children to learn programming.

00:03:24.820 --> 00:03:25.860
That's really cool.

00:03:25.860 --> 00:03:26.980
It's really amazing.

00:03:26.980 --> 00:03:32.340
I talk to people who are our age, or generally that age, about what their experience was.

00:03:32.340 --> 00:03:38.300
And it's like, well, I got a magazine and then I would type in, like, the C code into something.

00:03:38.300 --> 00:03:39.780
And then I would make that run.

00:03:39.780 --> 00:03:40.900
And that's how I learned programming.

00:03:40.900 --> 00:03:43.720
And then I see my daughters and stuff.

00:03:43.720 --> 00:03:51.140
And, you know, there's, like, adventure games where you program your way through dungeons and, you know, robots and all sorts of stuff.

00:03:51.180 --> 00:03:57.960
And I'm like, wow, that is a long way from, you know, typing in, hello world, you know, 10, hello world, 20, go to 10.

00:03:57.960 --> 00:03:59.380
That's right.

00:03:59.380 --> 00:04:00.000
That's right.

00:04:00.000 --> 00:04:03.880
But I guess I did feel very attracted to that kind of thing.

00:04:03.880 --> 00:04:09.380
So, eventually, I learned of Flash, which is dying as we speak.

00:04:09.380 --> 00:04:10.600
It's horrible.

00:04:10.600 --> 00:04:14.560
So, I got into that when it was still developed by Macromedia, Flash 4.

00:04:14.560 --> 00:04:19.600
And this was a really great combination for me to be both a designer and developer.

00:04:19.600 --> 00:04:21.600
So, I did this for my daughters.

00:04:21.600 --> 00:04:22.880
You can do really interesting stuff.

00:04:22.880 --> 00:04:26.120
And, like, Flash is probably the whipping boy these days.

00:04:26.120 --> 00:04:28.300
It gets beat up and is getting ostracized.

00:04:28.300 --> 00:04:29.560
And that's probably a good thing.

00:04:29.740 --> 00:04:33.820
But, you know, the timeframe that you're talking about learning it, it was really powerful.

00:04:33.820 --> 00:04:35.060
And it was really unique.

00:04:35.060 --> 00:04:39.440
And it, you know, we didn't have HTML5 and JavaScript that worked well everywhere.

00:04:39.440 --> 00:04:41.320
And you could do interesting stuff with that.

00:04:41.320 --> 00:04:42.320
A lot of things were done with it.

00:04:42.320 --> 00:04:45.860
I've only kind of recently come back into programming.

00:04:45.860 --> 00:04:50.320
And I was just stunned by what JavaScript had become.

00:04:50.320 --> 00:04:52.980
You know, in 1999, you could do an alert box.

00:04:52.980 --> 00:04:53.880
That was kind of it.

00:04:53.880 --> 00:04:54.520
Yeah, exactly.

00:04:54.520 --> 00:04:56.720
We can validate this text or whatever, yeah.

00:04:56.720 --> 00:04:59.460
It's been such a huge change in that world.

00:04:59.460 --> 00:05:06.320
And that was so recently, like 2016 or so, that's when I got sort of reintroduced to real programming.

00:05:06.320 --> 00:05:09.500
And Python quickly came on my radar.

00:05:09.500 --> 00:05:14.440
And I've been kind of incorporating it into my work as a Swiss army knife.

00:05:14.440 --> 00:05:15.800
I use it for all kinds of things.

00:05:15.800 --> 00:05:20.160
And I also noticed that I reach for it faster and faster.

00:05:20.160 --> 00:05:24.220
More quickly, I think Python has a good solution to my problem.

00:05:24.220 --> 00:05:25.460
Yeah, that's really interesting.

00:05:25.460 --> 00:05:28.840
And when you say you're incorporating it, I know the stuff that you're doing that we're going to talk about.

00:05:28.840 --> 00:05:31.320
It's deep and meaningful stuff.

00:05:31.320 --> 00:05:35.180
It's not like a little quick little automation of like some Excel file or something.

00:05:35.180 --> 00:05:40.260
There's real stuff that I think is really powerful programming that you're doing.

00:05:41.000 --> 00:05:46.520
So maybe that's a good way to segue into what do you do day to day as your main job.

00:05:46.520 --> 00:05:50.320
By trade and profession, I'm a scholar of Islamic studies.

00:05:50.320 --> 00:05:53.060
So that means that I spend most of my time alone.

00:05:53.060 --> 00:05:54.540
I do research.

00:05:54.540 --> 00:05:57.180
I'm a postdoctoral researcher at Utrecht University.

00:05:57.560 --> 00:06:01.440
And I've got my own project funded by the Dutch Research Council.

00:06:01.440 --> 00:06:09.660
And this is really about philosophy from the 12th century, from the Islamic world up until could be the 19th century even.

00:06:10.120 --> 00:06:15.000
And so I read a whole discussion of all kinds of people through these centuries

00:06:15.000 --> 00:06:17.500
who are all talking. In my project specifically,

00:06:17.500 --> 00:06:18.940
I work on the imagination.

00:06:18.940 --> 00:06:24.120
And a lot of these texts are in printed editions, but some of them are in manuscripts.

00:06:24.540 --> 00:06:29.340
So there's a lot of just sitting down with the text and reading them.

00:06:29.340 --> 00:06:31.820
That's what my job is supposed to be.

00:06:31.820 --> 00:06:34.040
But then came Python, basically.

00:06:34.040 --> 00:06:35.000
Yeah, of course.

00:06:35.000 --> 00:06:41.420
Well, I look at these manuscripts and I haven't read old Islamic ones because I don't read any of those languages.

00:06:41.420 --> 00:06:46.180
But I do remember trying to read some of the mathematical ones from like Newton and stuff when I was studying it.

00:06:46.180 --> 00:06:52.840
And these do not seem like writing that would easily be understood by computers.

00:06:53.060 --> 00:06:57.440
It's not like neatly printed text or something like that, right?

00:06:57.440 --> 00:07:01.140
It's handwritten, kind of one-off, unique stuff.

00:07:01.140 --> 00:07:03.320
The calligraphy was really beautiful back then.

00:07:03.320 --> 00:07:04.420
People could write really well.

00:07:04.420 --> 00:07:08.200
But still, it doesn't seem like, at first blush,

00:07:08.200 --> 00:07:09.720
a computer should be able to just take it on.

00:07:09.720 --> 00:07:12.680
But working with manuscripts is a real skill.

00:07:12.680 --> 00:07:21.580
And it's a real complicated skill that needs a lot of training for very specific periods, very specific types of hands, of scripts, basically.

00:07:22.320 --> 00:07:28.080
And as far as what you just mentioned about, you called it calligraphy, and rightly so.

00:07:28.080 --> 00:07:29.700
And just think of it.

00:07:29.700 --> 00:07:38.900
Back then, when you wanted to put down knowledge, you just really had to take a pen, take ink, a paper, and then write it down by hand.

00:07:38.900 --> 00:07:45.800
And the fastest way to do that is to not release your hand, not release your pen from the paper.

00:07:45.960 --> 00:07:51.500
So that's why cursive is all connected, because that's simply the fastest way to put this down.

00:07:51.720 --> 00:07:57.960
I'm looking at texts that are sometimes spanning like 600 folios of handwritten text.

00:07:57.960 --> 00:08:03.840
It's amazing to think that somebody 500, 600 years ago did this.

00:08:03.840 --> 00:08:05.460
It's really amazing, those old manuscripts.

00:08:05.480 --> 00:08:15.800
So maybe before we get into any of the technical side of things, it might be worth just setting out some of the questions that you try to answer, right?

00:08:15.800 --> 00:08:18.800
As part of your research, you know, put the tech aside for a minute.

00:08:18.800 --> 00:08:22.280
Like, what are some of the things, some of the outcomes you're looking for and stuff?

00:08:22.360 --> 00:08:35.360
In my real research, you're looking for, so the basic, sort of the very overarching thesis that I am approaching is the late medieval, early modern period in Islam, in the Islamic world.

00:08:35.360 --> 00:08:43.920
Is that a period of sort of intellectual decay and sort of darkness, or is there actually much activity going on?

00:08:43.920 --> 00:08:53.980
And the former has been argued for and also sort of made into a political argument and plays a significant role in geopolitical discussions right now.

00:08:53.980 --> 00:08:57.920
So that's sort of the societal relevance of my research.

00:08:57.920 --> 00:09:08.120
But then to do that, I really go down to the actual evidence that we have, the texts, and where other people kind of discarded it and did not look at it,

00:09:08.120 --> 00:09:11.340
I say, okay, let's look at what are these people talking about.

00:09:11.340 --> 00:09:14.220
And I do that mainly through what are called commentary traditions.

00:09:14.220 --> 00:09:25.900
This was a very much-used device in those centuries that you didn't really write a text of yourself, but you took a text from before, and then you copied it, and then you added comments to it.

00:09:25.900 --> 00:09:35.780
So this way, you have a very structural, very secure way of knowing that these two people, even though they're separated by centuries and continents, they're interacting.

00:09:35.780 --> 00:09:37.540
They're talking to each other somehow.

00:09:38.380 --> 00:09:42.460
And right now, I'm looking at a commentary tradition of 140 commentaries.

00:09:42.460 --> 00:09:49.080
So there's a lot of things to sort out, and usually they do not name the person that they're referring to.

00:09:49.080 --> 00:09:59.060
So you have to do a lot of triangulations, basically, to get to understand what is the movement of the discussion over the centuries.

00:09:59.060 --> 00:09:59.620
I see.

00:09:59.620 --> 00:10:08.000
So you almost can study how, like, thought around this common idea has evolved over time, or one thinker influenced the other, something like this.

00:10:08.000 --> 00:10:09.560
That's exactly what I'm after.

00:10:09.560 --> 00:10:16.880
So it's also to sort of fight off essentialism or some sort of idea that they're all talking about the same thing.

00:10:16.980 --> 00:10:21.500
No, usually there is a very subtle movement in the discussion.

00:10:21.500 --> 00:10:27.200
And for me right now, this is about the imagination, about what does it mean that we can imagine?

00:10:27.200 --> 00:10:30.720
Are imagined things real or are they not real?

00:10:31.200 --> 00:10:34.760
And this is also then for these people placed in a religious context.

00:10:34.760 --> 00:10:41.020
Does the imagination play a role in prophecies and, for example, mystical experiences, say?

00:10:41.020 --> 00:10:41.380
Okay.

00:10:41.380 --> 00:10:42.880
Yeah, that sounds really interesting.

00:10:42.880 --> 00:10:46.760
And obviously, you need to understand the manuscripts and lots of them.

00:10:46.760 --> 00:10:49.400
I think at this keynote, I don't remember where it was.

00:10:49.400 --> 00:10:51.540
You'll have to maybe let everyone know.

00:10:51.540 --> 00:10:52.360
And we'll link to it.

00:10:52.460 --> 00:10:54.380
It's on YouTube, and it's really interesting.

00:10:54.380 --> 00:11:04.120
You had talked about how there's a common set of papers and manuscripts and stuff that people have studied over and over and are recorded and understood.

00:11:04.120 --> 00:11:08.860
But like 99% of the writings that people do just kind of vanish, right?

00:11:08.860 --> 00:11:09.360
Yeah.

00:11:09.360 --> 00:11:19.200
So in a manuscript world, when somebody thinks of a profound idea and he writes it down, he uses a pen, ink, paper to put it into writing.

00:11:19.980 --> 00:11:23.760
Now, you know, the idea has a population of one.

00:11:23.760 --> 00:11:24.360
Yeah.

00:11:24.360 --> 00:11:34.460
And so you better hope that somebody comes along and says, oh, I'll take the time to copy it and, you know, copy it onto another piece of paper and then sort of distribute it from there.

00:11:34.460 --> 00:11:44.820
So it's very different from the print world where in one swoop you can have, you know, a hundred or a thousand copies of a text and distribute it across a whole continent.

00:11:44.820 --> 00:11:49.740
So this way, copying in the manuscript world is incredibly important.

00:11:49.740 --> 00:11:53.160
So that's why I try to sort of encapsulate my methodology.

00:11:53.160 --> 00:11:58.660
Instead of looking at all writings of one author, let's look at all authors of one text.

00:11:58.660 --> 00:11:59.040
Yeah.

00:11:59.040 --> 00:12:00.160
That's an interesting twist.

00:12:00.160 --> 00:12:10.740
So one of the things I really like about your story and that I think will become apparent is this interesting juxtaposition of the very old and the very new, right?

00:12:10.740 --> 00:12:18.900
Like we're talking about applying artificial intelligence to understand manuscripts from 900 years ago or something like this, right?

00:12:18.900 --> 00:12:26.200
These are both extremely cutting edge and like we don't even really use manuscripts in modern day things and bringing them together.

00:12:26.660 --> 00:12:35.360
And I think another interesting one is that you said that you are a friar of the order of preachers, which I think is pretty interesting.

00:12:35.360 --> 00:12:40.840
And also I think it's just another interesting aspect to what you're doing around technology.

00:12:41.080 --> 00:12:41.560
Well, yes.

00:12:41.560 --> 00:12:44.420
The order of preachers, also known as the Dominicans.

00:12:44.420 --> 00:12:49.100
It's a religious order of the Catholic Church founded in the 13th century.

00:12:49.100 --> 00:13:01.660
And in fact, I stand in a long tradition of Dominicans who have introduced technology and have sort of thought about what technology means. When the printing press came along, the Dominicans were there.

00:13:01.660 --> 00:13:07.980
And now also in the digital world, there is a network of friars.

00:13:07.980 --> 00:13:09.180
It's called OPTIC.

00:13:09.180 --> 00:13:16.540
And they think a lot about the ethical aspects of sort of the digitization of our lives.

00:13:16.540 --> 00:13:17.720
What does that mean to us?

00:13:17.720 --> 00:13:21.540
How can we still be – what does it mean to be a human in a digital world really?

00:13:21.540 --> 00:13:24.860
So kind of my work is sort of close to that.

00:13:24.860 --> 00:13:36.120
But of course, I also have to sort of smile sometimes when I – late at night, I'm coding and then I look down and I notice that I'm wearing my habit.

00:13:36.120 --> 00:13:39.340
Yeah, it's a very interesting juxtaposition.

00:13:39.620 --> 00:13:52.880
And I think there probably is not exactly the same but in a similar sense, a kind of a culture clash when you think about using computer technology and computer programming to do digital humanities.

00:13:53.440 --> 00:14:01.420
So maybe let's try to define that term broadly because I think we've defined it – started to define it for what you're doing.

00:14:01.420 --> 00:14:04.040
But in a broad sense, what is digital humanities?

00:14:04.040 --> 00:14:15.100
Yes, it's a term that refers to the use of computer technology in humanities research where humanities research relies on human artifacts.

00:14:15.100 --> 00:14:22.880
That's basically the sort of the raw material of humanities research like architecture, art, texts, all of that.

00:14:23.240 --> 00:14:31.400
And so usually our laboratory has always been the library and we just, you know, stack up a whole bunch of books and do our thing.

00:14:31.400 --> 00:14:35.220
And now we're seeing like, okay, how can we use computer technology?

00:14:35.220 --> 00:14:39.780
How can we unleash that computing power that we know is literally at our fingertips?

00:14:39.780 --> 00:14:49.300
And this is especially important because sort of unbeknownst to ourselves, so much of our workflow has already been coming into the digital world.

00:14:49.400 --> 00:14:51.900
Of course, we don't write our books by hand anymore.

00:14:51.900 --> 00:14:53.120
We write them on a computer.

00:14:53.120 --> 00:14:58.060
And most of our journal articles, we take them from online databases.

00:14:58.060 --> 00:15:03.140
But a lot of the issues with it have not been thought out exactly.

00:15:03.140 --> 00:15:11.180
And the people who have sort of gone into this direction usually have gone all the way, so to say.

00:15:11.180 --> 00:15:14.460
They have really made digital humanities into a field of its own.

00:15:14.460 --> 00:15:25.000
And their main purpose now is to really push the technological boundaries to see what kind of new technology could be possibly applied to the humanities.

00:15:25.000 --> 00:15:30.780
But without really coming back with sort of real results for other colleagues in the humanities.

00:15:30.780 --> 00:15:37.380
So that sort of caused a rift or a divide between what you can then call classical humanities and digital.

00:15:37.380 --> 00:15:38.900
Yeah, I can definitely see that.

00:15:38.900 --> 00:15:41.840
You know, you can just keep going down the computer side of things.

00:15:42.080 --> 00:15:53.080
But, you know, here you are trying to answer these very traditional sort of philosophical questions using source material that's, like I said, 900 years old, written by hand.

00:15:53.080 --> 00:15:56.900
And yet you're bringing in some really cool technology to do it.

00:15:56.900 --> 00:16:05.460
Maybe let's just set the stage quick by talking about some of the things you're actually studying, like these scrolls with their flaps and so on.

00:16:05.460 --> 00:16:07.900
And some of the technology that you're applying to it.

00:16:07.900 --> 00:16:08.620
Right, yeah.

00:16:08.620 --> 00:16:18.800
So I think sort of the most interesting aspect of my Python work applied to manuscript studies has been the automated image analysis of codices.

00:16:19.340 --> 00:16:23.220
And this comes because we have digitized so many manuscripts.

00:16:23.220 --> 00:16:30.260
Actually, we didn't mention this before, but this is really a key element to all of this.

00:16:30.260 --> 00:16:38.780
Libraries around the world have digitized thousands, hundreds of thousands of manuscripts, meaning they take photos of the entire manuscript of every page.

00:16:38.780 --> 00:16:39.520
They take a photo.

00:16:39.520 --> 00:16:43.200
And now you have a folder with like 300 images.

00:16:43.340 --> 00:16:48.580
And so I took, instead of one folder, I took like a whole collection of folders.

00:16:48.580 --> 00:16:53.260
I had at my disposal 2,500 manuscripts digitized.

00:16:53.260 --> 00:16:58.300
And so I took only the first image, which is of the cover of the binding, right?

00:16:58.300 --> 00:16:58.620
Right.

00:16:58.620 --> 00:17:01.160
It's like kind of a leather wrapper of it.

00:17:01.160 --> 00:17:02.860
And it has like a foldable flap.

00:17:02.860 --> 00:17:12.020
And it sounded to me like knowing some details about that would tell you about maybe the origin or the timing or something of that nature.

00:17:12.060 --> 00:17:12.700
Exactly.

00:17:12.700 --> 00:17:22.020
So in the Islamic world, these manuscripts always have a flap that goes back onto the front to really encapsulate the entire codex, the entire manuscript.

00:17:22.020 --> 00:17:34.680
And I thought it would be interesting to analyze the shape of that codex and particularly the angle that the flap makes, which is truly then a new question.

00:17:34.680 --> 00:17:37.940
It shows that digital humanities can ask these new questions.

00:17:37.940 --> 00:17:41.760
Nobody thought of measuring the angle of the flap before.

00:17:41.760 --> 00:17:46.100
And if you can do it for one, then you can just iterate it over.

00:17:46.100 --> 00:17:47.420
It's a for.

00:17:47.420 --> 00:17:49.300
Yeah, if you can do it for one, it's a for loop.

00:17:50.440 --> 00:17:50.840
Exactly.

00:17:50.840 --> 00:17:52.420
That's the beauty of it.

00:17:52.420 --> 00:17:54.740
And yeah, it gives you another data point, right?

00:17:54.740 --> 00:18:05.620
If you think of it as a data point, then it gives you another, perhaps a small argument if, say, all manuscripts from the 17th century have this kind of angle.

00:18:06.080 --> 00:18:10.500
And now you have an undated manuscript that has kind of that angle on the flap.

00:18:10.500 --> 00:18:14.460
Well, that's another argument in favor of dating it back to the 17th century.

00:18:15.940 --> 00:18:19.260
This portion of Talk Python To Me is brought to you by Command Line Heroes.

00:18:19.260 --> 00:18:22.580
Season 3 of Command Line Heroes is all about programming languages.

00:18:22.580 --> 00:18:27.860
Episode 7 covers the history of AI, starting with its first language, Lisp.

00:18:27.860 --> 00:18:31.220
It was created to teach machines how to learn like humans.

00:18:31.220 --> 00:18:34.200
And at the start, there was a wave of interest and investment.

00:18:34.200 --> 00:18:38.960
But ultimately, the timing and hardware wasn't quite right for AI to work.

00:18:38.960 --> 00:18:43.060
Decades later, AI is now on the verge of changing everything.

00:18:43.500 --> 00:18:49.800
Get the full story and subscribe wherever you get your podcasts or just visit talkpython.fm/heroes.

00:18:49.800 --> 00:18:57.480
Just to give folks who have not watched your keynote and probably, you know, visually it's a little hard to understand.

00:18:57.480 --> 00:19:05.720
Think of like a folded out sheet of paper where there's like a triangle bit, maybe some kind of fastener on the end of it, and then like a big rectangle.

00:19:06.220 --> 00:19:14.240
And you're using OpenCV and Python and various things to like understand like that last triangle part, right?

00:19:14.240 --> 00:19:14.940
That's right.

00:19:14.940 --> 00:19:25.880
And once I had completed it, basically, once I really had it figured out, I understood that this is a really good example of applying technology to humanities research.

00:19:26.000 --> 00:19:33.720
Because OpenCV and Python are now sort of easy enough to use to learn it as you go, basically.

00:19:33.720 --> 00:19:37.780
And especially, this is especially a good thing about OpenCV.

00:19:37.780 --> 00:19:39.780
It's a very powerful library.

00:19:39.780 --> 00:19:47.620
And that I'm able to use it already now shows actually just how highly developed OpenCV is.

00:19:48.020 --> 00:19:48.280
Right.

00:19:48.280 --> 00:19:53.880
And it seems like the understanding that, like I have a picture and that I want to find this triangle in it.

00:19:53.880 --> 00:19:55.440
That seems like a really hard problem.

00:19:55.440 --> 00:19:59.520
But it sounds like the tools have come along to get really good, huh?

00:19:59.520 --> 00:20:00.000
Yes.

00:20:00.000 --> 00:20:07.800
But at the same time, as like the tools are relatively easy to use, you also see that now you give it a humanities problem.

00:20:07.800 --> 00:20:12.220
And all of a sudden, like you're really kind of pushing the technology to its very limits.

00:20:12.220 --> 00:20:24.220
Because OpenCV was developed for, you know, receipts, for automated analysis of receipts, of chess games, of video feeds, of, you know, gas stations, like how many people are in the store.

00:20:24.220 --> 00:20:32.560
Those are kinds of relatively easy tasks for OpenCV. Apply that to manuscripts, to photos of manuscripts,

00:20:32.560 --> 00:20:36.800
and all of a sudden it has a much harder time coming up with anything meaningful.

00:20:36.800 --> 00:20:45.360
You have to take a lot of in-between steps to reduce it, to reduce the image in a way and shape it in a way that you get out of it what you want.

00:20:45.360 --> 00:20:45.680
Yeah.

00:20:45.680 --> 00:20:48.040
I can definitely imagine that.

00:20:48.040 --> 00:20:51.900
And, you know, people should watch the videos and just see some of those manuscripts.

00:20:51.900 --> 00:21:01.720
Because it, like I said at the beginning, it doesn't seem at first glance that like, oh, I could just take that image and point that at some sort of algorithm and get super meaningful stuff out.

00:21:01.800 --> 00:21:04.940
But we're going to dig into some of the cool things that you did do.

00:21:04.940 --> 00:21:09.480
Before we move on from this kind of introductory stuff, though, I do want to circle back just a little bit.

00:21:09.480 --> 00:21:18.840
So it sounds like in this digital humanities world, there's kind of like this scientific philosophy of trying to understand stuff like sort of in a computational way.

00:21:18.840 --> 00:21:24.280
And then there's the humanities part where, you know, it's sort of more traditional studies.

00:21:24.280 --> 00:21:29.400
And you said there's a couple of possible paths organizations and academics and stuff might take.

00:21:29.480 --> 00:21:37.520
Like one would be to like build up a team of some researchers and some programmers or, you know, maybe you could teach yourself or things like that.

00:21:37.520 --> 00:21:39.060
Do you want to kind of walk us through that?

00:21:39.060 --> 00:21:39.640
Yes.

00:21:39.640 --> 00:21:48.440
So this starts from the observation that there's kind of a, it's been called the two cultures problem, sort of the sciences versus the arts.

00:21:48.440 --> 00:21:54.660
And there's sort of really a paradigm difference between them. Goethe has this very good sort of summary of it.

00:21:54.660 --> 00:22:00.700
He says in one of his books, he says, gray is all theory, but green is the golden tree of life.

00:22:00.700 --> 00:22:06.860
With OpenCV, this is all too real because the first thing you do with a color image is reduce it to grayscale.

00:22:06.860 --> 00:22:22.020
So the sciences, people coming from the sciences, trying to work on humanities problems, all too often, way too quickly, think that they can rely on a calculation and then be very sure about the outcome of it.

00:22:22.020 --> 00:22:31.380
And sort of their assurance of their methodology trips them up when it comes to humanities problems, which don't always have a yes or no answer to them.

00:22:31.640 --> 00:22:42.800
But the humanities, then, their really sort of deeply grounded problem is that they just can't really understand what technology could give them.

00:22:42.800 --> 00:22:47.820
So oftentimes the things that they think should be so easy

00:22:47.820 --> 00:22:49.740
are incredibly hard for technology.

00:22:49.740 --> 00:22:57.160
When I talk about digitized manuscripts, obviously they're saying, oh, so like OCR, like you can read the text out of it.

00:22:57.160 --> 00:22:58.020
No, that's not.

00:22:58.020 --> 00:23:01.260
Have you seen this text?

00:23:01.260 --> 00:23:02.940
Exactly.

00:23:02.940 --> 00:23:04.400
No, that's really interesting.

00:23:04.400 --> 00:23:08.480
You know, honestly, I can see that that community would have that perspective.

00:23:08.480 --> 00:23:12.500
But at the same time, a lot of people do who are not actual programmers.

00:23:12.500 --> 00:23:15.540
A lot of things when you're doing programming, they're like, that must be hard.

00:23:15.540 --> 00:23:17.380
You're like, no, actually, that's like three lines of code.

00:23:17.380 --> 00:23:19.720
And they think this other thing is really easy.

00:23:19.720 --> 00:23:22.820
And you're like, no, that's like two more weeks of programming.

00:23:22.820 --> 00:23:23.960
That's really hard.

00:23:23.960 --> 00:23:24.340
That's the thing.

00:23:24.340 --> 00:23:26.280
You're just, I know it looks like a small step.

00:23:26.280 --> 00:23:27.720
It is not a small step, right?

00:23:27.720 --> 00:23:29.460
It's just how software is.

00:23:29.860 --> 00:23:30.300
Exactly.

00:23:30.300 --> 00:23:30.700
Yes.

00:23:30.700 --> 00:23:34.480
It's like there's a whole research team working on it for five years now.

00:23:34.480 --> 00:23:35.840
They haven't gotten it yet.

00:23:35.840 --> 00:23:36.360
Exactly.

00:23:36.360 --> 00:23:37.240
You're getting close.

00:23:37.240 --> 00:23:40.680
It sort of brings this huge divide between the two, right?

00:23:40.680 --> 00:23:47.500
And so for most purposes, people try to solve this like, okay, let's work in teams with an engineer and a humanities scholar.

00:23:47.760 --> 00:23:51.720
But then you get this sort of Babylonian problem of that.

00:23:51.720 --> 00:23:54.060
You know, it's very hard for them to talk to each other.

00:23:54.440 --> 00:24:02.260
And so I am much more in favor of learning the tech yourself, which I think is really doable at this particular point in time.

00:24:02.260 --> 00:24:05.040
The tech is high level enough.

00:24:05.040 --> 00:24:08.720
And it also gives you very fast return on investment.

00:24:08.720 --> 00:24:11.220
You can really learn on the job, so to say.

00:24:11.220 --> 00:24:17.840
And so I would really like to see people sort of occupy that middle space in between the humanities and the digital.

00:24:17.840 --> 00:24:18.160
Right.

00:24:18.200 --> 00:24:26.500
Without completely sort of abandoning that side, the humanities side and just going, I'm going to become a programmer, but sort of continue doing your work, but actually gain some of these digital skills.

00:24:26.500 --> 00:24:27.140
That's right.

00:24:27.140 --> 00:24:39.560
And the more I thought about this and the more I wrote about this on my website, digitalorientalist.com, I thought, okay, actually, it's not just one middle position, but it's kind of we should think of it as a spectrum.

00:24:39.560 --> 00:24:42.040
You know, where are you on the DH spectrum?

00:24:42.040 --> 00:24:44.020
That's what I'd like to think about it.

00:24:44.020 --> 00:24:47.580
And I sort of distinguished six archetypes in the spectrum.

00:24:47.880 --> 00:24:48.740
You saw it in the keynote.

00:24:48.740 --> 00:24:49.620
Yeah, yeah, yeah.

00:24:49.620 --> 00:24:57.520
So you started out with the believer who is somebody who knows, like, actually, I've seen what you've done.

00:24:57.520 --> 00:24:58.480
I agree with it.

00:24:58.480 --> 00:24:59.160
This is amazing.

00:24:59.160 --> 00:25:00.700
Let's go down that path, right?

00:25:00.700 --> 00:25:08.680
Those who are really doing programming themselves, who are kind of 100% of their time doing programming as part of their humanities research somehow.

00:25:08.680 --> 00:25:08.940
Yeah.

00:25:08.940 --> 00:25:15.300
And just to the left of them, you have the sour one, which are also programmers, you know, full on.

00:25:15.380 --> 00:25:21.560
But they really, they want to share their newfound, you know, things, their newfound results.

00:25:21.560 --> 00:25:24.460
They want to share it so much with the rest of humanities.

00:25:24.460 --> 00:25:36.800
But a lot of colleagues, you know, they find it very hard to understand, just as we have also found it hard to talk about this example that we discussed about analyzing the codices, the angle of it.

00:25:36.940 --> 00:25:38.760
And so they've become really turned off.

00:25:38.760 --> 00:25:44.600
And I've had these Skype conversations with people where I was reaching out to them, very enthusiastic.

00:25:44.600 --> 00:25:51.820
And I quickly realized that I was asking questions that they heard like a million times and they thought I was up to no good or something.

00:25:51.980 --> 00:25:57.560
And so very quickly, these conversations escalate in a way that's really not necessary.

00:25:57.560 --> 00:25:57.940
Yeah.

00:25:57.940 --> 00:26:00.560
I wonder about the sour one category.

00:26:00.560 --> 00:26:11.000
I feel like if you had tried to solve these problems 10 years ago in the same way that you're solving them now, it might have been much harder and much more fruitless.

00:26:11.260 --> 00:26:19.440
Right. Like before OpenCV and simple Python packages and stuff, you might be like, well, we tried this, but we're not like a tech company.

00:26:19.440 --> 00:26:21.600
We're not going to be able to do this.

00:26:21.600 --> 00:26:22.720
So it's never going to work.

00:26:22.720 --> 00:26:24.760
And they just maybe didn't come back or something.

00:26:24.760 --> 00:26:25.240
I don't know.

00:26:25.240 --> 00:26:25.740
What do you think?

00:26:25.740 --> 00:26:27.740
Oh, yes, there's definitely part of that.

00:26:27.740 --> 00:26:37.240
Or they just slaved and labored for years and years and sort of got the same result that I now got in, you know, with like you said, like just a couple of lines of code, you know.

00:26:37.240 --> 00:26:37.960
Yeah.

00:26:37.960 --> 00:26:40.420
Like, yeah, this is how we're supposed to do this.

00:26:40.940 --> 00:26:44.220
I mean, that's just a challenge of things moving so fast, moving on.

00:26:44.220 --> 00:26:44.420
All right.

00:26:44.420 --> 00:26:46.680
So the next one of your archetype is the spider.

00:26:46.680 --> 00:26:47.380
Right.

00:26:47.380 --> 00:26:53.220
And so the spider is usually a full professor who is the leader of a research team.

00:26:53.220 --> 00:26:56.140
They have this magic ability to attract grants.

00:26:56.140 --> 00:26:57.360
It's a very useful skill.

00:26:57.360 --> 00:26:59.300
Very strong currency in academics.

00:26:59.300 --> 00:27:01.020
Like, you have papers and you have grants.

00:27:01.020 --> 00:27:01.640
These are the two.

00:27:01.640 --> 00:27:02.740
Yes.

00:27:02.740 --> 00:27:03.360
Yes.

00:27:03.360 --> 00:27:05.000
Money is hard to come by.

00:27:05.000 --> 00:27:10.620
And, you know, these professors, they wouldn't program themselves, but they have sort of reading knowledge.

00:27:10.620 --> 00:27:13.580
And they understand at least the concepts of it.

00:27:13.580 --> 00:27:22.420
But they have the ability to collect these people around them, put them in connection, and let the people who are able to do it really shine.

00:27:22.880 --> 00:27:25.200
So I think a lot of good can come from the spider.

00:27:25.200 --> 00:27:25.940
Yeah, absolutely.

00:27:25.940 --> 00:27:26.360
I agree.

00:27:26.360 --> 00:27:29.180
I've seen that in other contexts, although I didn't call them that.

00:27:29.180 --> 00:27:32.520
But yeah, I think that's a very positive place to be in academics.

00:27:32.520 --> 00:27:33.180
Absolutely.

00:27:33.180 --> 00:27:33.740
Yes.

00:27:34.220 --> 00:27:46.720
So then you have like a team structure that actually functions quite well because there's not a big divide between engineers and scholars, researchers of the humanities, but it's integrated into one basically.

00:27:47.100 --> 00:27:52.600
Now, sort of a smaller scale of that you have in what I call the blind and the lame.

00:27:53.340 --> 00:27:56.180
So now you don't have a team, but you just have two persons.

00:27:56.500 --> 00:28:07.940
And the lame one is the professor again, but maybe like an associate professor or somebody who has heard like, oh, DH is all the fashion.

00:28:07.940 --> 00:28:09.640
So I need to do something with this.

00:28:09.700 --> 00:28:11.160
But I have absolutely no idea.

00:28:11.160 --> 00:28:15.260
So I'm going to hire a student or an engineer to do the work for me.

00:28:15.260 --> 00:28:22.740
And this is definitely headed nowhere, really, because it's so hard to talk to each other, to come to a meaningful conclusion.

00:28:22.740 --> 00:28:34.660
And even if you have everything kind of working, then usually the result is like, well, how representative is this for the claim that the data is trying to make?

00:28:35.120 --> 00:28:47.620
Or the other way around: if it's corresponding with the erudition of scholars of many generations before, is this just proving the erudition, or is the erudition actually proving the software?

00:28:47.620 --> 00:28:53.920
Like is the software now validated because it's in accordance with the opinions of scholars before?

00:28:53.920 --> 00:28:54.500
Yeah, of course.

00:28:54.500 --> 00:29:02.500
So what I think is really the best place to be in the DH spectrum is to reduce that two-person team into one person.

00:29:02.900 --> 00:29:05.020
So I call that the centaur.

00:29:05.020 --> 00:29:11.660
Yeah, that's the person, or the creature, whose top half is a human and the bottom half is a horse, right?

00:29:11.660 --> 00:29:12.300
That's right.

00:29:12.300 --> 00:29:15.460
So it's got still a human head, a humanistic head.

00:29:15.460 --> 00:29:24.440
But the legs are already digitizing basically, able to move by him or herself in the digital sphere, making use of digital tools.

00:29:24.440 --> 00:29:27.120
I see that with PhD students now.

00:29:27.120 --> 00:29:33.960
This is now kind of the generation who's really investing real time into acquiring these kinds of skills.

00:29:34.220 --> 00:29:37.620
What is your experience with people who are in the humanities?

00:29:37.620 --> 00:29:42.780
So they probably were not super drawn to like computer science type of subjects.

00:29:42.780 --> 00:29:50.920
What is their experience when you say, hey, you know, it really makes sense for you to learn a little bit of programming or to explore this technical side of things?

00:29:50.920 --> 00:29:52.120
Are they excited?

00:29:52.120 --> 00:29:52.880
Are they resistant?

00:29:52.880 --> 00:29:54.300
Are they nervous?

00:29:54.700 --> 00:30:05.180
It really varies because even just putting up numbers in your article or your book, I've really had comments like, what are you trying to make me do here?

00:30:05.180 --> 00:30:05.960
Mathematics?

00:30:05.960 --> 00:30:08.120
Like, what are these numbers for?

00:30:09.380 --> 00:30:10.340
Is this a graph?

00:30:10.340 --> 00:30:10.980
Are you kidding me?

00:30:10.980 --> 00:30:11.540
Is this a graph?

00:30:11.540 --> 00:30:12.580
Did that come from Plotly?

00:30:12.580 --> 00:30:13.580
Yeah.

00:30:13.580 --> 00:30:24.840
It really is challenging because even now, a lot of students in the humanities, they kind of use the computer just as a typewriter in a way.

00:30:24.840 --> 00:30:32.780
And, you know, I'm also really kind of concerned about this sort of the digital literacy of our society in general as our technology is.

00:30:32.780 --> 00:30:34.420
It's all so polished.

00:30:34.420 --> 00:30:35.520
You know what I mean?

00:30:35.520 --> 00:30:37.740
Like Facebook and all these websites.

00:30:37.740 --> 00:30:39.180
All the mobile apps.

00:30:39.180 --> 00:30:41.200
Yeah, they look so good and smooth.

00:30:41.200 --> 00:30:42.300
It just works.

00:30:42.300 --> 00:30:43.240
Yeah, I get that.

00:30:43.240 --> 00:30:45.200
That's a great, you know, marketing statement.

00:30:45.200 --> 00:30:51.540
But it also makes you not really think twice about how it works and why it works that way.

00:30:51.540 --> 00:30:55.140
So you really have to show them immediate benefits.

00:30:55.140 --> 00:30:57.120
And it has to be low-hanging fruit.

00:30:57.120 --> 00:31:02.700
You can't force them to like, okay, well, just learn Python for half a year and then come back.

00:31:02.700 --> 00:31:06.520
It has to be a very, very quick return on investment.

00:31:06.520 --> 00:31:08.880
Yeah, and that's probably why Python is working well.

00:31:08.880 --> 00:31:10.680
In this situation, right?

00:31:10.680 --> 00:31:13.680
Because you can be productive with a partial understanding of Python.

00:31:13.680 --> 00:31:18.740
You never have to know what a function, a class, a generator, a database, any of these things are.

00:31:18.740 --> 00:31:21.780
And you can still have outcomes, right?

00:31:21.780 --> 00:31:22.660
Just top to bottom.

00:31:22.660 --> 00:31:23.220
It does this.

00:31:23.220 --> 00:31:24.600
Here comes the analysis or whatever.

00:31:24.740 --> 00:31:25.740
That's right.

00:31:25.740 --> 00:31:26.140
That's right.

00:31:26.140 --> 00:31:31.100
And to sort of to beef that up, I recommend to people, well, just listen to podcasts like this one.

00:31:31.500 --> 00:31:36.180
And just do that in your spare time when you're exercising or whatnot.

00:31:36.180 --> 00:31:37.760
You don't have to understand it.

00:31:37.760 --> 00:31:38.560
That's not the point.

00:31:38.560 --> 00:31:40.820
Just hear the words they're saying.

00:31:40.820 --> 00:31:45.800
And if words are repeated and it sounds interesting, it's maybe an interesting context.

00:31:45.800 --> 00:31:48.440
Then maybe then look it up and see what it is.

00:31:48.500 --> 00:31:49.400
You know, it's really interesting.

00:31:49.400 --> 00:31:57.800
When I first started this podcast like four and a half years ago, my expectation of who my audience member was, was a dedicated expert.

00:31:58.000 --> 00:32:08.580
Somebody who loves Python and programming enough that not only do they do it as their hobby or their job, but they also do it in their spare time and they listen to the background stories of it, right?

00:32:08.580 --> 00:32:18.620
And what I found out is that many of the listeners, maybe over half, are like beginners who are more trying to do a tech immersion type of experience, right?

00:32:18.620 --> 00:32:21.340
They don't, they'll write me and say, hey, I love your show.

00:32:21.340 --> 00:32:22.040
It's really entertaining.

00:32:22.040 --> 00:32:24.380
A lot of stuff you're talking about doesn't make sense.

00:32:24.680 --> 00:32:28.360
But after a couple of months, a lot more of it is making sense.

00:32:28.360 --> 00:32:29.600
And I know what it means now.

00:32:29.600 --> 00:32:30.900
And it's really wonderful.

00:32:30.900 --> 00:32:32.960
And I just never expected it.

00:32:32.960 --> 00:32:36.480
But it's kind of like a language immersion almost.

00:32:36.480 --> 00:32:39.220
To all those people, you're doing it the right way.

00:32:39.220 --> 00:32:41.040
This is the way to do it.

00:32:41.040 --> 00:32:41.860
Just keep listening.

00:32:41.860 --> 00:32:44.280
Yeah, I totally agree the more that I think about it.

00:32:44.280 --> 00:32:54.060
Yeah, so your pitch or your idea is that this centaur, this person who is both doing the work, but also becoming this programmer, adding a little bit of like superpower.

00:32:54.660 --> 00:32:58.120
To what you're doing, as I like to put it, is sort of the recommended path.

00:32:58.120 --> 00:32:59.160
And that makes a lot of sense.

00:32:59.160 --> 00:33:00.280
I definitely agree with that one.

00:33:00.280 --> 00:33:03.820
Now, let's dig into some of the technology that you're using.

00:33:03.820 --> 00:33:09.200
Because the problems you're solving and how you're solving them are really interesting.

00:33:09.200 --> 00:33:13.720
So we talked a little bit about this flap thing and understanding its angle and its orientation.

00:33:14.100 --> 00:33:20.240
But there's also other things that you're looking at that you're going to have to guide me a little here, because I don't know a ton about.

00:33:20.240 --> 00:33:23.300
But there's like these stamps, like these seals, these decorative seals.

00:33:23.720 --> 00:33:31.140
And those are like kind of like a signature or proof of authenticity of various authors and thinkers and stuff.

00:33:31.140 --> 00:33:34.620
And so they're all over these manuscripts in very ornate ways.

00:33:34.620 --> 00:33:41.940
And so you were trying to like discover those and also trying to actually digitize the calligraphy that I talked about.

00:33:42.060 --> 00:33:48.660
So maybe talk us through some of the libraries you're using and just this whole analysis that you're doing, because it's really deep, like I said.

00:33:48.660 --> 00:33:49.580
Yeah, I'd love to.

00:33:49.580 --> 00:33:50.060
I'd love to.

00:33:50.060 --> 00:33:52.400
Like, let's start with the stamps and seals.

00:33:52.400 --> 00:33:58.660
So these are small little imprints. Maybe some of your listeners will know what an ex libris is.

00:33:58.660 --> 00:34:06.500
It's basically a sticker or a stamp that you place in your book, and it just says ex libris from the books of, meaning I own this book.

00:34:06.500 --> 00:34:07.360
This book is mine.

00:34:08.140 --> 00:34:17.740
And people in the past did that all the time, and they usually had quite lovely stamps that they placed in their manuscript to say, like, this is my manuscript.

00:34:17.740 --> 00:34:32.740
Now, if you have a manuscript that originated in the 12th century, and it was sold often, and now it ends up in a research library, you might just have a whole history of ownership in the manuscript.

00:34:32.740 --> 00:34:33.980
Isn't that amazing?

00:34:34.160 --> 00:34:39.800
So you can actually trace how the manuscript traveled, not only through time, but through space.

00:34:39.800 --> 00:34:42.960
You can see, like, oh, it was in Tabriz in the 14th century.

00:34:42.960 --> 00:34:44.980
It was in Istanbul in the 16th century.

00:34:44.980 --> 00:34:47.200
It was in Morocco in the 18th century.

00:34:47.200 --> 00:34:49.260
It's like the blockchain of manuscripts.

00:34:49.260 --> 00:34:52.820
I don't know too much about blockchain.

00:34:52.820 --> 00:34:53.900
I'm just kidding.

00:34:53.900 --> 00:34:54.360
Keep going.

00:34:54.920 --> 00:34:59.820
And so I said before that OCR, that's kind of out of the question for now for manuscripts.

00:34:59.820 --> 00:35:04.680
Yes, people are working on handwriting recognition, but it's not at any production level.

00:35:05.400 --> 00:35:11.160
But these stamps, they are kind of ideal for OCR kind of recognition, you know.

00:35:11.160 --> 00:35:14.200
And so, again, OpenCV is so powerful.

00:35:14.200 --> 00:35:27.820
You have to find a very pristine example of the seal, and then, well, you give it to OpenCV, and then you just give a whole bunch of images, and you say to OpenCV, try, find me this image, basically.

00:35:27.920 --> 00:35:34.540
And it will find seals, imprints that have a huge chunk out of them, and it will still detect them very well.
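
NOTE
A minimal sketch of the "find me this image" step, assuming it works roughly like template matching.
The conversation does not say which OpenCV routine was used. In OpenCV the closest counterpart to this
sketch would be cv2.matchTemplate with cv2.TM_CCOEFF_NORMED, and feature-based matching is another option.
The name match_template, the array sizes, and the synthetic data are illustrative assumptions, and plain
NumPy stands in for OpenCV so the example runs on its own.
```python
import numpy as np
def match_template(page, seal):
    """Slide `seal` over `page` and return (best_score, (row, col)).
    Scores are zero-mean normalized cross-correlations in [-1, 1]."""
    ph, pw = page.shape
    sh, sw = seal.shape
    seal_zm = seal - seal.mean()
    seal_norm = np.sqrt((seal_zm ** 2).sum())
    best_score, best_pos = -1.0, (0, 0)
    for r in range(ph - sh + 1):
        for c in range(pw - sw + 1):
            window = page[r:r + sh, c:c + sw]
            win_zm = window - window.mean()
            denom = seal_norm * np.sqrt((win_zm ** 2).sum())
            if denom == 0:
                continue  # flat window (blank paper), nothing to correlate
            score = float((win_zm * seal_zm).sum() / denom)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos
# Synthetic data: a random 12x12 "seal" stamped onto a blank 40x40 "page".
rng = np.random.default_rng(0)
seal = rng.random((12, 12))
page = np.zeros((40, 40))
page[12:24, 25:37] = seal
# Damage the imprint: wipe out one quarter of it.
page[12:18, 25:31] = 0.0
score, pos = match_template(page, seal)
print(pos, round(score, 2))
```
Even with a quarter of the imprint wiped out, the best score should still sit where the seal was stamped,
which is the behaviour described above for seals with a chunk missing.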

00:35:34.540 --> 00:35:51.420
So now you can imagine if you scale this up, now if you have a couple of thousand of manuscripts that kind of belong to each other, but you don't know exactly how, now you can all of a sudden, with this kind of code, you can reconstruct the collection of an owner from the 17th century.

00:35:51.420 --> 00:35:54.540
You can say all of these manuscripts actually belong to the same owner.

00:35:54.540 --> 00:35:55.900
They all have the same seal.

00:35:55.900 --> 00:36:07.400
Yeah, so you can ask really interesting questions like, show me who has ever owned, show me all the places that this person or this scholar has ever owned, and then was edited by this other person.

00:36:07.400 --> 00:36:18.160
Or something like that. These questions, over many thousands of permutations, sound really difficult, but now all of a sudden it's, you know, a sub-millisecond database query.

00:36:18.160 --> 00:36:33.580
This is not a reality yet, but this is what I'm working towards, and it will undoubtedly lay bare all kinds of interesting things, because then you can really see what's not spoken of, what's just kind of in the evidence.

00:36:33.580 --> 00:36:35.460
But it doesn't make it sense.

00:36:35.460 --> 00:36:39.020
These invisible connections that nobody has yet actually highlighted, right?

00:36:39.160 --> 00:36:49.740
Like, all of a sudden you find that a particular kind of type of text, a text of logic, they all came through a very religious place of some kind.

00:36:49.740 --> 00:36:55.040
Or you see that, you know, two people were constantly buying and selling from each other.

00:36:55.040 --> 00:36:58.700
Those kinds of things you can now quite easily lay bare.
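
NOTE
A rough sketch of the kind of database query discussed above, assuming seal detections have already been
turned into rows. The table name, the column names, and the owner and manuscript labels are all
hypothetical; only the place names come from the conversation. It uses SQLite from the Python standard
library, which is an assumption and not necessarily the database used in the actual project.
```python
import sqlite3
# Hypothetical records: one row per detected seal (manuscript, owner, place, century).
detections = [
    ("MS-1", "Owner A", "Tabriz", 14),
    ("MS-1", "Owner B", "Istanbul", 16),
    ("MS-2", "Owner A", "Tabriz", 14),
    ("MS-2", "Owner B", "Istanbul", 16),
    ("MS-3", "Owner C", "Morocco", 18),
]
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE seal (ms TEXT, owner TEXT, place TEXT, century INTEGER)")
db.executemany("INSERT INTO seal VALUES (?, ?, ?, ?)", detections)
# Reconstruct one owner's collection: every manuscript carrying that seal.
collection = [row[0] for row in db.execute(
    "SELECT ms FROM seal WHERE owner = ? ORDER BY ms", ("Owner A",))]
# Owner pairs whose seals appear in the same manuscript, earlier seal first.
transfers = db.execute("""
    SELECT a.owner, b.owner, COUNT(*) AS n
    FROM seal a JOIN seal b ON a.ms = b.ms AND a.century < b.century
    GROUP BY a.owner, b.owner
    ORDER BY n DESC
""").fetchall()
print(collection)
print(transfers)
```
A pair that shows up in many manuscripts is the sort of "two people constantly buying and selling from
each other" pattern mentioned above.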

00:36:58.700 --> 00:37:00.120
That's one thing.

00:37:00.120 --> 00:37:08.540
Another thing is, what if you've taken, so you have 300 photos of a manuscript, meaning 300 photos in which you see a page spread.

00:37:08.540 --> 00:37:12.500
You see two pages, basically the book opens to two pages.

00:37:12.500 --> 00:37:18.800
And then the next photo is one page is flipped over, and so you see, you know, the next two pages, right?

00:37:18.800 --> 00:37:22.640
What if you make it so that you only extract the ink?

00:37:22.640 --> 00:37:26.620
You kind of have OpenCV and NumPy.

00:37:26.620 --> 00:37:28.900
You're going to have to use NumPy for this as well.

00:37:28.900 --> 00:37:30.980
You sort of lock on to the ink.

00:37:30.980 --> 00:37:32.240
You delete everything.

00:37:32.240 --> 00:37:39.080
And now you stack those layers, those very tiny sort of wafer-thin layers of ink.

00:37:39.080 --> 00:37:40.660
You stack them on top of each other.

00:37:40.660 --> 00:37:44.100
So now you've got 300 layers on top of each other.

00:37:44.100 --> 00:37:51.600
And now all of a sudden you see hills and valleys, places where there's always ink and places where there's never ink.

00:37:51.600 --> 00:37:54.200
So it's kind of like an x-ray of a manuscript.

00:37:54.200 --> 00:38:04.600
And so the result is you see where in the manuscript there's ink, but in a way that it represents the entire manuscript, right?

00:38:04.600 --> 00:38:06.600
It's not a representation of one page.

00:38:06.600 --> 00:38:09.220
It's a representation of all pages together.

00:38:09.520 --> 00:38:17.920
And so, for example, very clearly then the text block becomes visible because obviously back then they were very neat writers.

00:38:17.920 --> 00:38:23.340
They always began at the very same place and the lining is very clear.
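
NOTE
The stacking idea described above can be sketched with NumPy alone. This is an illustration under
assumptions, not the actual pipeline: the name ink_heatmap, the thresholds, and the synthetic pages are
made up, and real photos would first have to be aligned and cropped to the same size.
```python
import numpy as np
def ink_heatmap(pages, ink_threshold=0.5):
    """Stack grayscale pages (0 = black ink, 1 = white paper) and return,
    per pixel, the fraction of pages that have ink there."""
    masks = [page < ink_threshold for page in pages]  # True where there is ink
    return np.mean(masks, axis=0)
# Synthetic "manuscript": 300 pages of 20x30 pixels, text block in rows 4-15, cols 5-24.
rng = np.random.default_rng(1)
pages = []
for _ in range(300):
    page = np.ones((20, 30))
    block = rng.random((12, 20)) < 0.6  # about 60% of text-block pixels get ink
    page[4:16, 5:25][block] = 0.0
    pages.append(page)
heat = ink_heatmap(pages)
# Pixels that carry ink on many pages outline the text block.
rows, cols = np.where(heat > 0.3)
print(rows.min(), rows.max(), cols.min(), cols.max())
```
High values in the heat map are the "hills" where there is almost always ink, and zeros are the
"valleys" where there never is.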

00:38:23.340 --> 00:38:25.080
Yeah, that's super cool.

00:38:25.080 --> 00:38:39.060
And you even show how in order to get some of the OCR stuff to work better, you might use OpenCV to go and delete some of the extraneous punctuation and characters and things like commas and whatnot, right?

00:38:39.060 --> 00:38:42.320
Yeah, so now we're moving to 19th century prints.

00:38:42.320 --> 00:38:48.420
In the 19th century, the printing press did come along in the Islamic world and they started to print.

00:38:48.420 --> 00:38:49.940
They were very late to it.

00:38:49.940 --> 00:38:54.660
But in the very beginning, these printed productions were quite poor in a way.

00:38:54.660 --> 00:38:58.500
So if you just use OCR, say Tesseract, you get rubbish basically.

00:38:58.500 --> 00:38:59.460
Is that from Google?

00:38:59.460 --> 00:39:00.840
It's now by Google, yes.

00:39:00.840 --> 00:39:03.540
I believe, you know, that's amazing about these things.

00:39:03.660 --> 00:39:05.900
You know, I'm just a user of technology in a way.

00:39:05.900 --> 00:39:07.780
And so I stumble upon a new library.

00:39:07.780 --> 00:39:10.300
And then, you know, you quickly look into it.

00:39:10.300 --> 00:39:12.940
It's like, oh, it's been developing for the last 20 years.

00:39:12.940 --> 00:39:13.340
Yeah, yeah.

00:39:13.340 --> 00:39:14.180
I just didn't know about it.

00:39:14.180 --> 00:39:15.080
Yeah, I haven't really heard of it.

00:39:15.080 --> 00:39:17.980
So I think it originated at HP, if I'm correct.

00:39:17.980 --> 00:39:20.580
So it's one of the leading OCR packages.

00:39:20.580 --> 00:39:21.560
There are other ones.

00:39:21.680 --> 00:39:25.980
But this one works, whereas others, I can't get them to work.

00:39:25.980 --> 00:39:29.300
So I can't invest more time in it to make it work.

00:39:29.300 --> 00:39:29.540
Cool.

00:39:29.540 --> 00:39:33.800
So you're basically going through and you'll like clean, you'll use OpenCV to actually clean

00:39:33.800 --> 00:39:37.800
up and purify the stuff before you send it off to these OCR places.

00:39:37.800 --> 00:39:39.880
So they actually have a chance of success, right?

00:39:39.880 --> 00:39:40.500
Yes.

00:39:40.500 --> 00:39:46.080
And really, the results are amazing, you know, because when you want to just extract the plain

00:39:46.080 --> 00:39:50.940
text, you don't want commas because commas have been inserted by the editors anyway.

00:39:51.060 --> 00:39:51.900
They're probably wrong.

00:39:51.900 --> 00:39:54.800
You don't want page numbers because you can insert them digitally.

00:39:54.800 --> 00:39:56.560
You know, you've got that under control.

00:39:56.560 --> 00:40:00.440
So you remove all of that, you get the plain text, and then Tesseract

00:40:00.440 --> 00:40:02.040
does a really wonderful job.
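
NOTE
The conversation does not say how commas and page numbers are picked out before OCR. One plausible
heuristic, shown here purely as an assumption, is to erase ink blobs that are too small to be letters.
The function remove_small_blobs and the toy grid are hypothetical. With OpenCV, the blob sizes could
come from cv2.connectedComponentsWithStats instead of this hand-written search. Size alone is a crude
filter, since the dots that distinguish letters in Arabic script are small too.
```python
from collections import deque
def remove_small_blobs(grid, min_size):
    """Return a copy of a binary grid (1 = ink) with every 4-connected
    ink blob smaller than `min_size` pixels erased."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected blob.
            blob, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) < min_size:
                for y, x in blob:
                    out[y][x] = 0
    return out
# Toy line of print: a 3x3 letter-sized blob and a 1-pixel "comma".
line = [
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
]
cleaned = remove_small_blobs(line, min_size=4)
print(cleaned[2])
```
The cleaned grid keeps the letter-sized blob and drops the stray pixel, and the cleaned image is what
would then be handed to the OCR engine.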

00:40:02.040 --> 00:40:06.420
This portion of Talk Python To Me is brought to you by Linode.

00:40:06.420 --> 00:40:10.160
Are you looking for hosting that's fast, simple, and incredibly affordable?

00:40:10.160 --> 00:40:15.260
Well, look past that bookstore and check out Linode at talkpython.fm/Linode.

00:40:15.260 --> 00:40:17.180
That's L-I-N-O-D-E.

00:40:17.180 --> 00:40:20.520
Plans start at just $5 a month for a dedicated server.

00:40:20.680 --> 00:40:21.580
With a gig of RAM.

00:40:21.580 --> 00:40:23.800
They have 10 data centers across the globe.

00:40:23.800 --> 00:40:27.620
So no matter where you are or where your users are, there's a data center for you.

00:40:27.620 --> 00:40:32.120
Whether you want to run a Python web app, host a private Git server, or just a file server,

00:40:32.120 --> 00:40:38.920
you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24-7 friendly

00:40:38.920 --> 00:40:42.060
support, even on holidays, and a seven-day money-back guarantee.

00:40:42.060 --> 00:40:43.660
Need a little help with your infrastructure?

00:40:43.840 --> 00:40:48.400
They even offer professional services to help you with architecture, migrations, and more.

00:40:48.400 --> 00:40:51.340
Do you want a dedicated server for free for the next four months?

00:40:51.340 --> 00:40:54.400
Just visit talkpython.fm/Linode.

00:40:55.600 --> 00:41:01.320
I don't know that people really appreciate how much work that is unless you see the original

00:41:01.320 --> 00:41:01.740
manuscript.

00:41:01.740 --> 00:41:03.980
Like I said, this is not a printed book, right?

00:41:03.980 --> 00:41:08.440
This is annotated, handwritten stuff.

00:41:08.720 --> 00:41:09.020
Yes.

00:41:09.020 --> 00:41:15.460
I think the best way to approach this is to first find something that you can find.

00:41:15.460 --> 00:41:21.600
Say, like I said, doing sort of the x-ray analysis, now you kind of know where text ought

00:41:21.600 --> 00:41:22.020
to be.

00:41:22.020 --> 00:41:28.320
And so then you can use that knowledge and come back to one image and analyze that one image

00:41:28.320 --> 00:41:31.360
with that knowledge of where the text ought to be.

00:41:31.360 --> 00:41:36.220
And so now if you find ink outside of those text blocks, then you know, oh, there's a

00:41:36.220 --> 00:41:36.940
marginal note.

00:41:36.940 --> 00:41:38.020
This might be interesting.
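
NOTE
A small sketch of the marginal-note idea, assuming a per-pixel ink frequency map for the whole
manuscript is already available. The heat map below is a hypothetical stand-in, and marginal_ink, the
threshold, and the page contents are illustrative names and values, not the project's code.
```python
import numpy as np
def marginal_ink(page_ink, heat, block_threshold=0.3):
    """Return coordinates of ink pixels on one page that fall outside the
    text block inferred from the stacked heat map."""
    rows, cols = np.where(heat > block_threshold)
    top, bottom = rows.min(), rows.max()
    left, right = cols.min(), cols.max()
    outside = page_ink.copy()
    outside[top:bottom + 1, left:right + 1] = False  # ignore the text block itself
    return np.argwhere(outside)
# Hypothetical heat map: ink is common in rows 4-15, cols 5-24.
heat = np.zeros((20, 30))
heat[4:16, 5:25] = 0.6
# One page: ink inside the block plus a short note in the right margin.
page_ink = np.zeros((20, 30), dtype=bool)
page_ink[4:16, 5:25] = True
page_ink[8, 27:29] = True
print(marginal_ink(page_ink, heat).tolist())
```
Only the two margin pixels are reported, so a note like this can be captured separately from the main
text.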

00:41:38.020 --> 00:41:38.460
Right.

00:41:38.460 --> 00:41:40.200
Maybe that's the most interesting part, right?

00:41:40.200 --> 00:41:41.600
But you want to capture it separately.

00:41:41.600 --> 00:41:44.580
You can't just like try to cram it together or whatever.

00:41:44.580 --> 00:41:44.960
Yeah.

00:41:44.960 --> 00:41:45.680
How interesting.

00:41:45.680 --> 00:41:51.740
It's got to be inspiring to a lot of people to just see all of this cool technology being

00:41:51.740 --> 00:41:56.620
applied to stuff that I suspect it was not initially designed to be applied to.

00:41:56.620 --> 00:42:00.980
And things that when I first look at it, like I don't really know how much technology is

00:42:00.980 --> 00:42:03.420
going to directly answer these questions.

00:42:03.420 --> 00:42:08.740
But this idea of saying like I'm going to create like an ownership chain in say a graph

00:42:08.740 --> 00:42:13.760
database or something like that of these manuscripts and use that to explore these hidden

00:42:13.760 --> 00:42:14.000
links.

00:42:14.000 --> 00:42:14.760
That's really interesting.

00:42:14.760 --> 00:42:18.240
I think it's unexpected and very cool.

00:42:18.240 --> 00:42:18.700
Yes.

00:42:18.700 --> 00:42:21.420
But there are a lot of challenges ahead of us.

00:42:21.420 --> 00:42:21.800
Okay.

00:42:21.800 --> 00:42:22.080
Yeah.

00:42:22.080 --> 00:42:22.940
What are some of them?

00:42:22.940 --> 00:42:28.900
To start with the data that we need is not accessible right now in the way that we would

00:42:28.900 --> 00:42:29.340
like it.

00:42:29.340 --> 00:42:35.460
Right now, digitized manuscripts are often just on a website of a university library.

00:42:35.460 --> 00:42:38.820
You can only kind of look at them with your human eyes, right?

00:42:38.820 --> 00:42:41.660
But you can't access them with an API.

00:42:41.660 --> 00:42:45.260
So you can't just sort of programmatically bring them in.

00:42:45.260 --> 00:42:47.260
So this really lost opportunity.

00:42:47.260 --> 00:42:48.800
But, you know, I'm working on it.

00:42:48.800 --> 00:42:50.060
I'm talking a lot with librarians.

00:42:50.060 --> 00:42:52.760
So hopefully we'll see some of those there.

00:42:52.760 --> 00:42:53.860
That's really cool.

00:42:53.860 --> 00:42:57.540
You know, sometimes when there's not an API, there is.

00:42:57.540 --> 00:42:58.580
It's just not intended.

00:42:58.580 --> 00:42:59.160
Right.

00:42:59.160 --> 00:43:03.720
Like there's Selenium or there's Beautiful Soup and requests or right.

00:43:03.720 --> 00:43:08.560
There's ways to go through and actually extract that data, even if it wasn't prepared to be

00:43:08.560 --> 00:43:09.640
presented in that way.
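
NOTE
As a hedged illustration of pulling data from a page that has no API: the conversation names Selenium,
Beautiful Soup, and requests, but this sketch uses only html.parser from the standard library and a
made-up HTML snippet, so it runs without network access. The markup, the URL, and the class name
ImageLinkParser are hypothetical, and a real library site will be structured differently. Whether the
images may be downloaded at all depends on the licensing questions raised in the conversation.
```python
from html.parser import HTMLParser
from urllib.parse import urljoin
class ImageLinkParser(HTMLParser):
    """Collect the `src` of every <img> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(urljoin(self.base_url, src))
# Hypothetical viewer page for one digitized manuscript.
html = """
<div class="viewer">
  <img src="/scans/ms-1/page-001.jpg" alt="folio 1">
  <img src="/scans/ms-1/page-002.jpg" alt="folio 2">
</div>
"""
parser = ImageLinkParser("https://library.example.org/ms-1/")
parser.feed(html)
print(parser.image_urls)
```
The resulting list of absolute URLs is what a download step would loop over.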

00:43:09.640 --> 00:43:12.100
You know, how much is that possible?

00:43:12.100 --> 00:43:17.840
And how much are there like licensing or usage restrictions that make that kind of research

00:43:17.840 --> 00:43:18.420
impossible?

00:43:18.420 --> 00:43:24.920
Well, the sort of the licensing copyright issue is murky territory at the moment.

00:43:24.920 --> 00:43:27.340
So this is also something that needs to be worked out.

00:43:27.340 --> 00:43:28.260
But you're right.

00:43:28.260 --> 00:43:29.880
I mean, this is, of course, possible.

00:43:29.880 --> 00:43:30.220
Yeah.

00:43:30.220 --> 00:43:33.620
But you don't want to go through and like grab all this data and say, well, you can't use

00:43:33.620 --> 00:43:36.840
any of it because you're going to be in trouble for X, Y and Z.

00:43:36.840 --> 00:43:37.140
Right.

00:43:37.240 --> 00:43:37.480
Yeah.

00:43:37.480 --> 00:43:43.100
In these kinds of fields, the world is very small, you know, so you can't make enemies.

00:43:43.100 --> 00:43:44.720
Yeah, of course.

00:43:44.720 --> 00:43:46.960
You might be able to make proof of concepts, though.

00:43:46.960 --> 00:43:47.480
Right.

00:43:47.480 --> 00:43:52.060
Like maybe as a researcher, like a young grad student or something, you could say, well, if

00:43:52.060 --> 00:43:54.460
I had this data, what questions could I ask and answer?

00:43:54.460 --> 00:43:58.120
And then maybe do something and present it back to that organization.

00:43:58.120 --> 00:44:02.620
Say like, look, if you would somehow provide us this information, like these are the types of

00:44:02.620 --> 00:44:03.600
things that we can do.

00:44:03.600 --> 00:44:06.880
We spent a week to show you. How can we help make that happen?

00:44:06.880 --> 00:44:07.120
Right.

00:44:07.120 --> 00:44:07.720
That's right.

00:44:07.720 --> 00:44:11.700
I feel like I myself, at least I'm building up towards that.

00:44:11.700 --> 00:44:15.100
I've already been saying these kinds of things to libraries.

00:44:15.100 --> 00:44:22.380
And perhaps after this project, I might just be able to do like a full DH project, so to

00:44:22.380 --> 00:44:25.960
say, and really come at this with full force.

00:44:25.960 --> 00:44:29.840
That's the other challenge for us in the humanities.

00:44:29.840 --> 00:44:33.400
Like we can't spend all of our time on programming.

00:44:33.400 --> 00:44:35.780
We can't spend all of our time on these kinds of things.

00:44:35.780 --> 00:44:41.540
So we have to divide our time and always make this difficult decision, like how much of my

00:44:41.540 --> 00:44:44.560
personal development is going to go into this.

00:44:44.560 --> 00:44:51.900
And if I stumble upon the problem that I can't just fix with Google and Stack Overflow, then...

00:44:51.900 --> 00:44:56.020
Was it a dead end that like wasted your tenure track possibilities or something like that,

00:44:56.020 --> 00:44:56.200
right?

00:44:58.160 --> 00:44:59.720
So you don't want to chance that.

00:44:59.720 --> 00:45:00.000
Yeah.

00:45:01.700 --> 00:45:08.000
So in the harder sciences, we have the Journal of Open Source Software.

00:45:08.880 --> 00:45:09.840
Are you familiar with this?

00:45:09.840 --> 00:45:11.940
I wasn't until you showed this to me.

00:45:11.940 --> 00:45:12.260
Yeah.

00:45:12.260 --> 00:45:16.620
So I had these folks on the show a while ago and it's pretty interesting.

00:45:16.620 --> 00:45:22.160
It's a place to publish and cite the software side of your research.

00:45:22.160 --> 00:45:27.220
And I did say it was more for the hard sciences, but it sounds to me like the stuff that you all

00:45:27.220 --> 00:45:33.060
are doing in your community would also be potentially something you could publish there and then cite as

00:45:33.060 --> 00:45:34.920
some kind of publication and whatnot.

00:45:34.920 --> 00:45:40.620
So maybe that's a little bit of a release valve to get a little bit more credit for the software side of

00:45:40.620 --> 00:45:40.880
things.

00:45:40.880 --> 00:45:41.360
Definitely.

00:45:41.360 --> 00:45:47.320
Because our main challenge is exactly that, like you said: tenure-track, career-wise,

00:45:47.320 --> 00:45:54.100
it kind of still depends on print publications and building software kind of doesn't count,

00:45:54.100 --> 00:45:55.140
it seems, right?

00:45:55.140 --> 00:45:55.860
Yes.

00:45:55.860 --> 00:45:57.320
Very difficult for us.

00:45:57.320 --> 00:45:58.040
Yeah.

00:45:58.460 --> 00:46:03.380
But there have been discussions about this where people in digital humanities have been

00:46:03.380 --> 00:46:06.080
saying, well, actually we're builders, we're not writers.

00:46:06.080 --> 00:46:08.160
We should just come to terms.

00:46:08.160 --> 00:46:08.720
Yeah.

00:46:08.720 --> 00:46:13.840
It's not a unique problem to digital humanities, but it is, I can imagine, a little bit harder

00:46:13.840 --> 00:46:14.100
there.

00:46:14.100 --> 00:46:14.420
Yes.

00:46:14.420 --> 00:46:20.040
Because, let me ask you this: one problem that presents itself is

00:46:20.040 --> 00:46:23.000
that every piece of software is kind of unique.

00:46:23.000 --> 00:46:26.920
So how are you going to compare it with other pieces of software?

00:46:26.920 --> 00:46:30.720
How are you going to say like, oh, this is very good or this is not so good.

00:46:30.720 --> 00:46:35.880
This is good enough to make tenure, or this is bad enough to fire him.

00:46:35.880 --> 00:46:36.220
Yeah.

00:46:36.220 --> 00:46:37.740
I have no idea how to do that.

00:46:37.740 --> 00:46:39.460
That is really tricky.

00:46:39.460 --> 00:46:40.380
I see the problem.

00:46:40.380 --> 00:46:41.500
Yeah.

00:46:41.500 --> 00:46:42.620
Interesting.

00:46:42.620 --> 00:46:47.540
Now, we're getting kind of short on time here, but I do want to give you a chance to talk about

00:46:47.540 --> 00:46:49.960
your two new projects or your book.

00:46:50.060 --> 00:46:55.520
You have a book called Among Digitized Manuscripts and another one, a project that you're thinking

00:46:55.520 --> 00:46:57.240
about called Digital Literacy.

00:46:57.240 --> 00:46:59.480
Do you want to touch on those real quick before we wrap things up?

00:46:59.480 --> 00:46:59.860
Sure.

00:46:59.860 --> 00:47:00.340
Thank you.

00:47:00.340 --> 00:47:06.120
So over the last two years, I've been really focused on digitized manuscripts and how to

00:47:06.120 --> 00:47:07.820
incorporate that into a workflow.

00:47:07.820 --> 00:47:12.980
And out of it came this handbook, which I called Among Digitized Manuscripts.

00:47:12.980 --> 00:47:18.820
You can go to github.com/among and you will find the repository and find some more

00:47:18.820 --> 00:47:19.780
information about it.

00:47:19.780 --> 00:47:26.900
It's basically a conceptual and practical toolkit for those who want to work with manuscripts

00:47:26.900 --> 00:47:30.540
in a digital file format when they're looking at digital photos.

00:47:31.180 --> 00:47:36.280
And so it's a very broad discussion about the concepts of it, about the challenges, but

00:47:36.280 --> 00:47:38.760
also a whole range of tools.

00:47:38.760 --> 00:47:45.360
So we're starting just with how to make vector images, which is a crucial skill to have.

00:47:45.360 --> 00:47:51.020
If you can replicate glyphs from manuscripts in a vector format, you can manipulate them much

00:47:51.020 --> 00:47:52.480
more easily on a computer, of course.

00:47:53.060 --> 00:47:57.100
And then we go up until a chapter that really introduces Python.

00:47:57.100 --> 00:48:02.800
And I go through this example of measuring the angle of the flap and really go through

00:48:02.800 --> 00:48:07.120
that entire code and explain it step by step what we're doing here.

00:48:07.120 --> 00:48:12.840
All the while, of course, pointing out to people that if they want to really know more about

00:48:12.840 --> 00:48:19.160
Python and JavaScript, which I also cover, then there are plenty of resources on the internet

00:48:19.160 --> 00:48:21.920
or in other books available to do that.

00:48:21.920 --> 00:48:28.120
So it's also really to get people who are not yet in, who are not familiar with technology,

00:48:28.120 --> 00:48:29.740
to get them up and running.

00:48:29.740 --> 00:48:30.080
Okay.

00:48:30.080 --> 00:48:30.320
Yeah.

00:48:30.320 --> 00:48:31.360
That sounds really interesting.

00:48:31.360 --> 00:48:33.760
And I'm sure it's a pretty unique book.

00:48:33.760 --> 00:48:37.280
There's probably not that many books on digital manuscripts and Python.

00:48:37.280 --> 00:48:40.360
So it sounds like it's going to be a good resource for people.

00:48:40.360 --> 00:48:42.280
I don't want to pat myself on the back here.

00:48:42.280 --> 00:48:44.880
Yes, it's a virgin field.

00:48:44.880 --> 00:48:50.780
So, you know, I would say that others really need to think through these problems

00:48:50.780 --> 00:48:53.800
and push things further.

00:48:53.800 --> 00:48:54.400
Yeah, absolutely.

00:48:54.400 --> 00:48:55.940
And then the other one, digital literacy.

00:48:55.940 --> 00:48:56.740
Well, yes.

00:48:56.740 --> 00:49:00.020
This is sort of a concern of mine that I've had for many years.

00:49:00.020 --> 00:49:05.200
And it's sort of part of the reason why I started digitalorientalist.com, a sort of online

00:49:05.200 --> 00:49:12.480
magazine about how to use computers in your day to day workflow as somebody in Islamic studies

00:49:12.480 --> 00:49:16.860
and in neighboring fields like Sinology, Japan studies, Africana, etc.

00:49:18.260 --> 00:49:24.700
I kind of feel that a generation before me and perhaps also my generation, I'm in my early 30s.

00:49:24.700 --> 00:49:30.960
We still grew up with computers as something that you had to kind of make work.

00:49:30.960 --> 00:49:32.540
Otherwise, it wouldn't do anything.

00:49:32.540 --> 00:49:36.120
And people even before that, they didn't grow up with computers.

00:49:36.120 --> 00:49:38.260
They don't know what's going on basically.

00:49:38.720 --> 00:49:44.600
But people younger than us, they're growing up now with all of this, you know, polished technology.

00:49:44.600 --> 00:49:46.520
It's just a black box basically.

00:49:47.400 --> 00:49:57.260
And I'm kind of worried that people don't really get to understand the foundational concepts of computers,

00:49:57.260 --> 00:50:00.840
how knowledge is stored on a computer in bits and bytes.

00:50:00.840 --> 00:50:04.360
It might not, you know, it might not come up on a daily basis.

00:50:04.360 --> 00:50:09.280
But knowing this, I think, is very important in handling computers well and understanding

00:50:09.280 --> 00:50:13.860
how computers are also used at scale, of course, by governments and whatnot.
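
NOTE
Editor's sketch, not part of the conversation: a small illustration of the
"bits and bytes" point, showing how a short Arabic word is stored as UTF-8.
The word chosen (kitab, "book") and the describe helper are illustrative
assumptions, not something taken from the guest's book or workshop.
```python
# "kitab" (book) written in Arabic script: four letters.
word = "\u0643\u062a\u0627\u0628"
def describe(text):
    """Return (character, code point, UTF-8 bytes as bit strings) per character."""
    rows = []
    for char in text:
        encoded = char.encode("utf-8")
        bits = [format(byte, "08b") for byte in encoded]
        rows.append((char, f"U+{ord(char):04X}", bits))
    return rows
if __name__ == "__main__":
    for char, code_point, bits in describe(word):
        print(char, code_point, " ".join(bits))
    # Each of these Arabic letters takes two bytes in UTF-8.
    print(len(word), "characters,", len(word.encode("utf-8")), "bytes")
```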

00:50:13.860 --> 00:50:14.460
Yeah.

00:50:14.460 --> 00:50:21.860
Well, I definitely think that the world is absolutely full of consumers of technology.

00:50:21.860 --> 00:50:27.640
But there's definitely room for more producers or creators with technology, right?

00:50:27.640 --> 00:50:28.360
That's right.

00:50:28.360 --> 00:50:33.240
I feel that in the humanities, we ought to be self-sufficient and self-reliant.

00:50:33.240 --> 00:50:41.500
And if we can use some programming skills ourselves, we can already solve, you know, 90% of our problems.

00:50:41.500 --> 00:50:41.900
Yeah, absolutely.

00:50:41.900 --> 00:50:42.760
Yeah.

00:50:42.880 --> 00:50:44.760
Programming is a superpower for whatever you're doing.

00:50:44.760 --> 00:50:45.320
That's for sure.

00:50:45.320 --> 00:50:46.460
All right.

00:50:46.460 --> 00:50:50.560
Well, I think we're going to leave it there for the main topic because we're out of time.

00:50:50.560 --> 00:50:53.160
But before you get out of here, let me ask you the last two questions.

00:50:53.160 --> 00:50:59.300
So if you're going to write some Python code, do some of this cool image analysis, what editor do you use?

00:50:59.300 --> 00:51:01.000
I use PyCharm CE.

00:51:01.000 --> 00:51:01.620
Yes.

00:51:01.620 --> 00:51:02.400
Yeah.

00:51:02.400 --> 00:51:03.600
I just like it.

00:51:04.060 --> 00:51:05.060
I've been using now.

00:51:05.060 --> 00:51:11.200
I've been getting into Jupyter Notebooks, actually, thanks to your podcast with the Azure Notebooks episode.

00:51:12.160 --> 00:51:15.600
Because I'm working on a workshop in the fall for PhD students.

00:51:15.600 --> 00:51:20.380
And I was not planning on installing Python and pip on 30 machines.

00:51:20.380 --> 00:51:23.020
So this is the way to go for that thing.

00:51:23.020 --> 00:51:23.320
Yes.

00:51:23.320 --> 00:51:23.560
Yeah.

00:51:23.560 --> 00:51:27.020
I think it's great for teachers and for running classes and workshops.

00:51:27.020 --> 00:51:29.620
It's just open a browser, go here, you're ready.

00:51:29.620 --> 00:51:30.020
Yeah.

00:51:30.020 --> 00:51:30.500
Yeah.

00:51:30.500 --> 00:51:30.780
Yeah.

00:51:30.780 --> 00:51:34.200
It's not like, well, it won't install this package on my computer.

00:51:34.200 --> 00:51:35.040
Like, oh, no.

00:51:35.040 --> 00:51:35.940
Here we go.

00:51:35.940 --> 00:51:37.740
So, yeah, that's great.

00:51:37.740 --> 00:51:39.000
Excellent.

00:51:39.000 --> 00:51:41.380
And then notable PyPI package.

00:51:41.380 --> 00:51:45.640
It sounds like you've got some interesting experience and exposure to them.

00:51:45.640 --> 00:51:46.120
Yes.

00:51:46.120 --> 00:51:48.220
Well, I mentioned OpenCV, of course.

00:51:48.220 --> 00:51:54.220
Something you can manipulate PDFs with is PyPDF2.

00:51:54.220 --> 00:52:03.180
It's something I've been using. In my experience, there's really a need for more libraries handling PDFs.

00:52:03.180 --> 00:52:09.660
I don't know how it is in other fields, but in the humanities, PDFs are a very important file format.

00:52:09.660 --> 00:52:12.500
A lot of information is contained just in PDFs.

00:52:12.500 --> 00:52:16.340
So I hope we'll see more development in that.

00:52:16.340 --> 00:52:22.900
For image, like drawing and manipulation, you're also going to need something like Pillow, which is a fork, I believe,

00:52:23.080 --> 00:52:24.600
of the Python Imaging Library.

00:52:24.600 --> 00:52:28.760
Beyond that, I'd also like to mention Text-Fabric.

00:52:28.760 --> 00:52:31.520
You can pip install Text-Fabric.

00:52:31.520 --> 00:52:36.520
And that's really just for analysis of a text that you have in plain text format.

00:52:36.520 --> 00:52:42.960
I myself work with a developer to get the Quran in the Text-Fabric format, so to say.

00:52:42.960 --> 00:52:46.760
And you can do real deep, like syntactic or semantic analysis.

00:52:47.480 --> 00:52:55.680
And kind of just outside of sort of the Python ecosystem, I also use something.

00:52:55.680 --> 00:52:56.920
It's called Pandoc.

00:52:56.920 --> 00:53:00.500
And it's incredibly useful for me.

00:53:00.500 --> 00:53:02.180
So it's a command line app.

00:53:02.180 --> 00:53:04.200
You can brew install Pandoc.

00:53:04.460 --> 00:53:09.340
And you can change from one file format to another.

00:53:09.340 --> 00:53:18.640
And for example, in my case, when you have different kinds of scripts, like English and Arabic mixed in, and you've got footnotes and endnotes and all of that.

00:53:18.840 --> 00:53:20.540
And you've got it in a Word doc.

00:53:20.540 --> 00:53:23.680
And Word is horrible, of course, to do anything with.

00:53:23.680 --> 00:53:26.680
And you want to go to just plain text.
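
NOTE
Editor's sketch, not part of the conversation: how the Word-to-plain-text
conversion described here might be scripted from Python. It assumes the pandoc
command-line tool is installed and on the PATH; the file name article.docx and
both helper functions are hypothetical. Only pandoc's -f, -t and -o options
(input format, output format, output file) are used.
```python
import shutil
import subprocess
from pathlib import Path
def build_pandoc_command(source, target):
    """Build a pandoc call that converts a Word file to plain text."""
    return ["pandoc", str(source), "-f", "docx", "-t", "plain", "-o", str(target)]
def convert_to_plain_text(source, target):
    """Run pandoc if it is installed; return True when the conversion ran."""
    if shutil.which("pandoc") is None:
        return False
    subprocess.run(build_pandoc_command(source, target), check=True)
    return True
if __name__ == "__main__":
    # "article.docx" is a hypothetical file name.
    source = Path("article.docx")
    if source.exists():
        converted = convert_to_plain_text(source, source.with_suffix(".txt"))
        print("converted" if converted else "pandoc not found on PATH")
    else:
        print("no article.docx here; command would be:")
        print(" ".join(build_pandoc_command(source, "article.txt")))
```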

00:53:26.680 --> 00:53:27.240
Yeah.

00:53:27.240 --> 00:53:28.560
It's amazing.

00:53:28.560 --> 00:53:29.240
Yeah.

00:53:29.240 --> 00:53:31.040
I hadn't heard about that, really.

00:53:31.040 --> 00:53:32.020
I haven't used it anyway.

00:53:32.020 --> 00:53:33.160
And it looks super cool.

00:53:33.300 --> 00:53:40.820
Like, it'll convert to and from OpenOffice, Microsoft Word, reStructuredText, Markdown, all kinds of stuff in here.

00:53:40.820 --> 00:53:41.740
That's great.

00:53:41.740 --> 00:53:42.400
Mm-hmm.

00:53:42.400 --> 00:53:43.160
Mm-hmm.

00:53:43.160 --> 00:53:43.400
Yeah.

00:53:43.400 --> 00:53:48.660
And then also on our shared notes, you have this Mac app that I'm pretty excited to try out called Go2Shell.

00:53:48.660 --> 00:53:50.340
You know, it's a good example.

00:53:50.340 --> 00:53:51.080
Go2Shell.

00:53:51.080 --> 00:53:54.320
So it's macOS only, and it sits in your Finder.

00:53:54.320 --> 00:54:03.120
And if you're in a folder, just a little button at the top, like in the toolbar, you click it, and a terminal opens, and you're immediately in that folder.

00:54:03.120 --> 00:54:03.560
Yeah.

00:54:03.560 --> 00:54:04.020
In the terminal.

00:54:04.020 --> 00:54:21.760
It's these kinds of things that are tremendously important for me as just a scholar who's, yes, I program, but I understand that some others would, you know, would sort of scoff at this and say, like, oh, no, the only right way to do that is to open a terminal and, you know, cd into it.

00:54:21.760 --> 00:54:22.840
No, no, this is cool.

00:54:22.840 --> 00:54:23.940
It's easy for me, you know?

00:54:23.940 --> 00:54:29.580
Yeah, a lot of times you're in a terminal or a Finder, and you're like, I just want to go here, but how do I copy here?

00:54:29.580 --> 00:54:32.060
There are some tricks that are absolutely not obvious.

00:54:32.060 --> 00:54:41.480
Like, you can go to the file or the folder, either one really, but let's say the folder, and you can Command-C it, and then you can Command-V that into a terminal window.

00:54:41.480 --> 00:54:43.460
But it's just like push a button.

00:54:43.460 --> 00:54:45.260
It opens the terminal in the right place.

00:54:45.260 --> 00:54:46.200
That's cool.

00:54:46.200 --> 00:54:46.880
I'm going to check this out.

00:54:46.880 --> 00:54:47.760
There you go.

00:54:47.760 --> 00:54:48.020
Yeah.

00:54:48.020 --> 00:54:48.500
All right.

00:54:48.500 --> 00:54:51.260
Well, that's pretty much it for our time.

00:54:51.260 --> 00:54:52.240
Final call to action.

00:54:52.720 --> 00:54:59.580
Maybe speak to the folks out there doing humanities research or maybe not even using the digital side of things yet.

00:54:59.580 --> 00:55:00.680
How do they get started?

00:55:00.680 --> 00:55:01.260
Exactly.

00:55:01.260 --> 00:55:10.420
Well, so if you're listening in the humanities and you're interested, you want to go for it, just, you know, keep listening to podcasts like this.

00:55:10.420 --> 00:55:16.920
Pick up my book when it comes out later this year and get in contact with me.

00:55:16.920 --> 00:55:22.540
You know, there's sort of a network at digitalorientalist.com where you can also share your experience.

00:55:22.540 --> 00:55:25.860
I'd be very happy to help you get along.

00:55:26.140 --> 00:55:28.940
And this also sort of kind of crosses both ways.

00:55:28.940 --> 00:55:35.520
If there are Python developers out there who kind of think like, oh, this would be so exciting to work on humanities projects.

00:55:35.520 --> 00:55:42.720
There are all kinds of project ideas that I just have, you know, but I can't do anything with them because of limited time.

00:55:42.720 --> 00:55:49.160
So connect with me, and it would be very great to, you know, see things go forward.

00:55:49.160 --> 00:55:50.060
Yeah, absolutely.

00:55:50.540 --> 00:55:56.280
Well, Cornelius, I really love this look into what you're doing and I think it's fascinating how all these pieces are coming together.

00:55:56.280 --> 00:55:58.360
So thanks for taking the time and being on the show.

00:55:58.360 --> 00:56:02.020
Thanks to you and the community for making Python what it is.

00:56:02.020 --> 00:56:03.020
That's what I would say.

00:56:03.020 --> 00:56:04.220
Absolutely.

00:56:04.220 --> 00:56:04.540
All right.

00:56:04.540 --> 00:56:05.240
Well, talk to you later.

00:56:05.240 --> 00:56:05.560
Thanks.

00:56:05.560 --> 00:56:05.820
All right.

00:56:05.820 --> 00:56:06.020
Bye.

00:56:06.020 --> 00:56:09.600
This has been another episode of Talk Python To Me.

00:56:09.600 --> 00:56:15.360
Our guest in this episode was Cornelius van Litt, and it's been brought to you by Command Line Heroes and Linode.

00:56:15.360 --> 00:56:19.080
Command Line Heroes is a podcast telling the story of developers.

00:56:19.200 --> 00:56:23.940
This season is all about programming languages and starts off with Python, of course.

00:56:23.940 --> 00:56:27.260
Subscribe at talkpython.fm/heroes.

00:56:27.260 --> 00:56:31.040
Linode is your go-to hosting for whatever you're building with Python.

00:56:31.040 --> 00:56:34.580
Get four months free at talkpython.fm/Linode.

00:56:34.580 --> 00:56:36.460
That's L-I-N-O-D-E.

00:56:36.460 --> 00:56:39.260
Want to level up your Python?

00:56:39.260 --> 00:56:44.100
If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

00:56:44.380 --> 00:56:52.260
Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python.

00:56:52.260 --> 00:56:56.900
And of course, if you're interested in more than one of these, be sure to check out our everything bundle.

00:56:56.900 --> 00:56:58.820
It's like a subscription that never expires.

00:56:59.400 --> 00:57:01.120
Be sure to subscribe to the show.

00:57:01.120 --> 00:57:03.540
Open your favorite podcatcher and search for Python.

00:57:03.540 --> 00:57:04.760
We should be right at the top.

00:57:04.760 --> 00:57:13.740
You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

00:57:14.280 --> 00:57:15.840
This is your host, Michael Kennedy.

00:57:15.840 --> 00:57:17.340
Thanks so much for listening.

00:57:17.340 --> 00:57:18.400
I really appreciate it.

00:57:18.400 --> 00:57:20.160
Now get out there and write some Python code.

00:57:20.160 --> 00:57:20.560
Thank you.

