#230: Python in digital humanities research Transcript
00:00 Michael Kennedy: You've often heard me talk about Python as a super power. It can amplify whatever you're interested in or what you've specialized in for your career. This episode is an amazing example of this. You'll meet Cornelius van Lit. He's a scholar of Medieval Islamic philosophy and works at Utrecht University in the Netherlands. What he's doing with Python is pretty awesome. Even if you aren't interested in digital humanities and that type of research, the example set by Cornelius is a blueprint for bringing Python into your world and for those around you. I think you'll enjoy this conversation. This is Talk Python To Me, Episode 230 recorded August 27th, 2019. Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython. This episode is brought to you by the podcast Command Line Heroes from Red Hat and Linode. Please check out what they're offering during their segments. It really helps support the show. Hey folks, before we get to the interview, I have some exciting news. We've teamed up with Humble Bundle to launch a great bundle of Python educational goodness. For a couple of weeks, you can get three of our courses, along with great content from Real Python, PyBites and many others for as little as just $1. If you've been on the fence about trying one of our courses, here's a chance to get three of them, along with a bunch of other stuff. Just visit talkpython.fm/hb2019, that's hb2019, and be sure to check it out before time runs out. Now, let's get to that interview. Cornelis, welcome to Talk Python To Me.
01:49 Cornelius van Lit: Thanks for having me.
01:50 Michael Kennedy: It's great to have you here. I know we're going to have a fun conversation talking about digital humanities, which, honestly, I didn't know a whole lot about before we started talking, but it's really a cool intersection of, well, humanities and software.
02:03 Cornelius van Lit: That's right, and it's only growing and growing, and soon it will just consume the entirety of the humanities. That's our goal.
02:10 Michael Kennedy: There was this article recently written that Python is eating the world and it may be true. We'll see about that as we go. Before we get into all that though, you know, let's start with your story. How'd you get into programming in Python?
02:20 Cornelius van Lit: Well, this goes back many years. It was actually, if you will believe it, the last year of my elementary school. I programmed the website of the school. This was 1999, so I just did it in Notepad.
02:32 Michael Kennedy: That's amazing.
02:33 Cornelius van Lit: Yeah, just HTML tags. There weren't really IDEs, at least not that I know of. There was, like, FrontPage and Dreamweaver, but you know, that was kind of cheating in a way.
02:43 Michael Kennedy: Yeah, it was like you write in Word and then like you publish it as a web page. That was weird, right?
02:48 Cornelius van Lit: Everybody knew that that was not the right way to do it.
02:50 Michael Kennedy: That's right, yes, they did. So wow, that's really young. How did you even get that opportunity to do that?
02:56 Cornelius van Lit: I don't know exactly how I got this book. It was some sort of, you know, HTML For Dummies kind of book and that's what I used. Then I guess I just showed it to one of the teachers and he was like, "Well, actually this school doesn't even have a website yet, so."
03:09 Michael Kennedy: That's really cool.
03:09 Cornelius van Lit: After that, my dad gave me this brick of a book on Visual Basic, thinking that, well, okay, it's got visual elements to it, you know, but that was way too hard. And it's really so great to see that there are so many resources out there now for children to learn programming. That's really cool.
03:26 Michael Kennedy: It's really amazing. I talk to people about what their experience was, who are our age and generally that age. And it's like, "Well, I got a magazine, and then I would type in like C code into something, and then I would make that run. And that's how I learned programming." And then I see my daughters and stuff and you know, there's like adventure games, where you program your way through dungeons and you know, robots, and all sorts of stuff. And I'm like, "Wow, that is a long way from, you know, typing in Hello World," you know, 10 Hello World, 20 go to 10.
03:58 Cornelius van Lit: That's right. That's right. But I guess I did feel very attracted to that kind of thing. So eventually, I learned of Flash, which is dying as we speak because it's horrible. So I got into that when it was still developed by Macromedia, Flash 4, and this was a really great combination for me, to be both the designer and developer, so I did this for them.
04:21 Michael Kennedy: You can do really interesting stuff, and Flash is probably the whipping boy these days. It gets beat up and is getting ostracized, and that's probably a good thing. But you know, the time frame that you're talking about learning, it was really powerful and it was really unique, and you know, we didn't have HTML5 and JavaScript that worked well everywhere and you could do interesting stuff with that. A lot of things were done with it.
04:42 Cornelius van Lit: I've only kind of recently come back into programming and I was just stunned by what JavaScript had become. You know, in 1999 you could do an alert box, that was kind of it.
04:53 Michael Kennedy: Yeah, exactly. We can validate this text or whatever, yeah.
04:56 Cornelius van Lit: It has been such a huge change in that world. And that was so recently, like 2016 or so. That's when I got sort of reintroduced to real programming, and then Python quickly came on my radar and I've been kind of incorporating it into my work as a Swiss army knife. I use it for all kinds of things and I also noticed that I reach for it faster and faster. More quickly, I think of Python as a good solution to my problem.
05:24 Michael Kennedy: Yeah, that's really interesting. And when you say you're incorporating, I know the stuff that you're doing that we're going to talk about, it's deep and meaningful stuff. It's not like a little quick little automation of like some Excel file or something like that. There's real stuff that I think is really powerful programming that you're doing. So maybe that's a good way to segue into, well, what do you do day-to-day as your main job?
05:46 Cornelius van Lit: By trade and profession, I'm a scholar of Islamic studies, so that means that I spend most of my time alone. I do research, I'm a postdoctoral researcher at Utrecht University, and I've got my own project funded by the Dutch Research Council. And this is really about philosophy from the 12th century, from the Islamic world, up until it could be the 19th century even. And so I read a whole discussion of all kinds of people through the centuries who were all talking; in my project specifically, I work on the imagination. And a lot of these texts are in printed editions, but some of them are in manuscripts. So there's a lot of just sitting down with the texts and reading them. That's what my job is supposed to be. But then came Python, basically.
06:34 Michael Kennedy: Yeah, of course. Well, I look at these manuscripts, and I haven't read old Islamic ones because I don't read any of those languages, but I do remember trying to read some of the mathematical ones from like Newton and stuff when I was studying that. And these do not seem like writing that would easily be understood by computers. It's not like neatly printed text or something like that, right? It's handwritten, kind of one-off unique stuff. The calligraphy was really beautiful back then. People could write really well, but still it doesn't seem like, at first blush, a computer should be able to just take it on.
07:10 Cornelius van Lit: Working with manuscripts is a real skill, and it's a really complicated skill that needs a lot of training for very specific periods, very specific types of hands, of scripts basically. And as far as what you just mentioned, you called it the calligraphy and rightly so, just think of it: back then when you wanted to put down knowledge, you just really had to take a pen, take ink, paper, and then write it down by hand. And the fastest way to do that is to not release your hand, not release your pen from the paper. So that's why cursive is all connected, because that's simply the fastest way to put this down. I'm looking at texts that are sometimes spanning like 600 folios of handwritten text. It's amazing to think that somebody 500, 600 years ago did this.
08:03 Michael Kennedy: It's really amazing, those old manuscripts. So maybe before we get into any of the technical side of things, it might be worth just setting out some of the questions that you try to answer, right, as part of your research, you know, put the tech aside for a minute. Like, what are some of the things, some of the outcomes you're looking for and stuff?
08:22 Cornelius van Lit: In my real research?
08:24 Michael Kennedy: Yes, yeah.
08:25 Cornelius van Lit: So the basic sort of, the very overarching thesis that I am approaching is: the late medieval, early modern period in the Islamic world, is that a period of sort of intellectual decay and sort of darkness, or is there actually much activity going on? And the former has been argued for and also sort of made into a political argument, and plays a significant role in a few political discussions right now. So there's sort of the societal relevance of my research. But then to do that, I really go down to the actual evidence that we have, the texts, and where other people kind of discarded it, not looked at it, I said, "Okay, let's look at what these people are talking about." And I do that mainly through what are called commentary traditions. This was a very much used device in those centuries, that you wouldn't really write a text of your own, but you took a text from before and then you copied it and then you added comments to it. So this way you have a very structural, very secure way of knowing that these two people, even though they're separated by centuries and continents, they're interacting, they're talking to each other somehow. And right now, I'm looking at a commentary tradition of 140 commentaries. So there are a lot of things to sort out, and usually they do not name the person that they're referring to. So you have to do a lot of triangulation, basically, to get to understand what is the movement of the discussion over the centuries.
09:59 Michael Kennedy: I see. So you almost can study how like thought around this common idea has evolved over time or one thinker influenced the other, something like this.
10:08 Cornelius van Lit: That's exactly what I'm after. So it's also to sort of fight off essentialism, or some sort of idea that they're all talking about the same thing. No, usually, there is a very subtle movement in the discussion. And for me right now, this is about imagination, about what does it mean that we can imagine. Are our imaginative things real or are they not real? And this is also then for these people placed in a religious context: does the imagination play a role in prophecies and, for example, mystical experiences, say.
10:41 Michael Kennedy: Okay, yeah, that sounds really interesting. And obviously, you need to understand the manuscripts and lots of them. I think at this keynote, I don't remember where it was, you'll have to maybe let everyone know and we'll link to it. It's on YouTube. And it's really interesting, you had talked about there's a common set of papers and manuscripts and stuff that people have studied over and over and are recorded and understood, but like 99% of the writings that people do just kind of vanish, right?
11:08 Cornelius van Lit: Yeah, so in a manuscript world, when somebody thinks of a profound idea and he writes it down, he uses a pen, ink, paper to put it into writing. Now you have, you know, the idea has a population of one, and so you better hope that somebody comes along and says, "Oh, I'll take the time to copy it," and you know, copy it onto another piece of paper and then sort of distribute it from there. So it's very different from the print world, where in one swoop you can have, you know, 100 or 1,000 copies of a text and then distribute them across a whole continent. So this way, copying in the manuscript world is incredibly important. So that's why I tried to sort of encapsulate my methodology: instead of looking at all writings of one author, let's look at all authors of one text.
11:58 Michael Kennedy: Yeah, that's an interesting twist. So one of the things I really like about your story and that I think will become apparent is this interesting juxtaposition of the very old and the very new, right? Like, where we're talking about applying artificial intelligence to understand manuscripts from 900 years ago or something like this, right? These are both extremely cutting edge and like we don't even really use manuscripts in modern day things and bringing them together. And I think another interesting one is at that keynote you'd said that you're a Friar of the Order of Preachers, which I think is pretty interesting. And also I think it's just another interesting aspect to what you're doing around technology.
12:41 Cornelius van Lit: Well, yes, the Order of Preachers, also known as the Dominicans, is a religious order of the Catholic church founded in the 13th century. And in fact, I stand in the long tradition of Dominicans who have introduced technology and have sort of thought about what technology means. When the printing press came along, the Dominicans were there. And so now also in the digital world, there is a network of Friars, it's called OPTIC. And they think a lot about the ethical aspects of sort of the digitalization of our lives. What does that mean to us? How can we still be the... What does it mean to be a human in a digital world, really? So kind of my work is sort of close to that. But of course, I also have to sort of smile sometimes, you know, late at night I'm coding and then, you know, I look down and I notice that I'm wearing my habit.
13:37 Michael Kennedy: Yeah, it's a very interesting juxtaposition and I think there probably is, not exactly the same, but in a similar sense, a kind of a culture clash when you think about using computer technology and computer programming to do digital humanities. So maybe let's try to define that term broadly, because I think we've defined it, started to define it for what you're doing, but in a broad sense, what is digital humanities?
14:04 Cornelius van Lit: Yes, it's a term that refers to the use of computer technology in humanities research, where humanities research relies on human artifacts. That's basically the sort of raw material of humanities research, like architecture, art, texts, all of that. And so usually our laboratory has always been the library, and we just, you know, stack up a whole bunch of books and do our thing. And now we're saying like, "How can we use computer technology? How can we unleash that computing power that we know is literally at our fingertips?" And this is especially important because, sort of unbeknownst to ourselves, so much of our workflow has already been coming into the digital world. Of course, we don't write our books by hand anymore. We write them on a computer. And most of our journal articles, we take them from online databases. But the issues with it have not been thought out exactly. And the people who have sort of gone into this direction usually have gone all the way, so to say, that they have really made digital humanities into a field of its own. And their main purpose is to really push the technological boundaries, to see what kind of new technology could be possibly applied to the humanities, but without really coming back with sort of real results for other colleagues in the humanities. So that sort of caused a rift or a divide in what you can then call classical humanities and digital.
15:37 Michael Kennedy: Yeah, I can definitely see that. You know, you can just keep going down the computer side of things, but you know, here you are, trying to answer these very traditional sort of philosophical questions using source material that's, like I said, 900 years old, written by hand, and yet you're bringing in some really cool technology to do it. Maybe let's just set the stage quick by talking about some of the things you're actually studying, like these scrolls with their flaps and so on, and some of the technology that you're applying to it.
16:08 Cornelius van Lit: Right, yeah. So I think sort of the most interesting aspect of my Python work applied to manuscript studies has been the automated image analysis of codices. And this comes because we have digitized so many manuscripts. Actually we didn't mention this before, but this is really a key element to all of this: libraries around the world have digitized hundreds of thousands of manuscripts. Meaning they take photos of the entire manuscript, of every page. They take a photo and now you have a folder with like 300 images. And so instead of one folder, I took like a whole collection of folders. I had at my disposal 2,500 manuscripts digitized. And so I took only the first image, which is of the cover, of the binding, right?
16:58 Michael Kennedy: Right, it's like kind of a leather wrapper of it and it has like a foldable flap and it sounded to me like knowing some details about that would tell you about maybe the origin or the timing or something of that nature.
17:12 Cornelius van Lit: Exactly, so in the Islamic world, these manuscripts always have a flap that goes back onto the front to really encapsulate the entire codex, the entire manuscript, and I thought it would be interesting to analyze the shape of that codex and particularly the angle that the flap makes. This is truly a new question. This shows that digital humanities can ask these new questions; nobody thought of measuring the angle of the flap before. And if you can do it for one, then you can just iterate it over.
17:46 Michael Kennedy: Yeah, if you can do it for one, it's a for loop.
17:49 Cornelius van Lit: Exactly, that's the beauty of it. And yeah, it gives you another data point, right? If you think of it as a data point, then it gives you another, perhaps a small argument: if, say, all manuscripts from the 17th century have this kind of angle and now you have an undated manuscript that has kind of that angle on the flap, well, that's another argument in favor of dating it back to the 17th century.
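The "do it for one, then it's a for loop" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `flap_angle`, the manuscript identifiers, and the corner coordinates are all hypothetical, and the coordinates stand in for whatever an image-analysis step would actually report. It is not the method used in the research discussed here.

```python
import math

def flap_angle(tip, upper, lower):
    """Return the angle in degrees at `tip` between the edges toward `upper` and `lower`.

    Each point is an (x, y) pair in pixel coordinates (hypothetical input).
    """
    ax, ay = upper[0] - tip[0], upper[1] - tip[1]
    bx, by = lower[0] - tip[0], lower[1] - tip[1]
    dot = ax * bx + ay * by      # proportional to the cosine of the angle
    cross = ax * by - ay * bx    # proportional to the sine of the angle
    return math.degrees(math.atan2(abs(cross), dot))

# Hypothetical flap corners for two made-up manuscripts: (tip, upper, lower)
manuscripts = {
    "ms_001": ((0, 100), (100, 0), (100, 200)),
    "ms_002": ((0, 100), (50, 0), (50, 200)),
}

# Once it works for one manuscript, the whole collection is a loop.
angles = {}
for name, (tip, upper, lower) in manuscripts.items():
    angles[name] = round(flap_angle(tip, upper, lower), 2)

print(angles)
```

Each computed angle then becomes one more data point that can be set next to dated manuscripts.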
18:16 Michael Kennedy: This portion of Talk Python To Me is brought to you by Command Line Heroes. Season 3 of Command Line Heroes is all about programming languages. Episode seven covers the history of AI starting with its first language Lisp. It was created to teach machines how to learn like humans. And at the start, there was a wave of interest and investment, but ultimately, the timing and hardware wasn't quite right for AI to work. Decades later, AI is now on the verge of changing everything. Get the full story and subscribe wherever you get your podcasts, or just visit talkpython.fm/heroes. Just to give folks who have not watched your keynote, and probably, you know, visually, it's a little hard to understand. Think of like a folded out sheet of paper where there's like a triangle bit, maybe some kind of fastener on the end of it and then like a big rectangle, and you are using OpenCV and Python and various things to like understand that last triangle part, right?
19:14 Cornelius van Lit: That's right. And once I had completed it, basically, once I really had it figured out, I understood that this is a really good example of applying technology to humanities research, because OpenCV and Python are now sort of easy enough to use to learn it as you go, basically. And this is an especially good thing about OpenCV. It's a very powerful library, and that I'm able to use it already now actually shows just how highly developed OpenCV is.
19:47 Michael Kennedy: Right. And it seems like the understanding of that, like, I have a picture and I want to find this triangle in it, that seems like a really hard problem, but it sounds like the tools have come along to get really good at it?
19:59 Cornelius van Lit: Yes, but at the same time, as the tools are relatively easy to use, you also see that now you give it a humanities problem and all of a sudden you're really kind of pushing the technology to the very limits. OpenCV was developed for, you know, receipts, for automated analysis of receipts, of chess games, of video feeds of, you know, gas stations, like how many people are in the store. Those kinds of relatively easy tasks for OpenCV. Apply that to manuscripts or photos of manuscripts and all of a sudden it has a much harder time coming up with anything meaningful. You have to take a lot of in-between steps to reduce the image in a way and shape it in a way that you get out of it what you want.
20:45 Michael Kennedy: Yeah, I can definitely imagine that. And you know, people should watch the videos and just see some of those manuscripts, because, like I said at the beginning, it doesn't seem at first glance like, "Oh, I could just take that image and point that at some sort of algorithm and get super meaningful stuff out." But we're going to dig into some other cool things that you did do. Before we move on from this kind of introductory stuff, I do want to circle back just a little bit. So it sounds like in this digital humanities world, there's kind of like this scientific philosophy of trying to understand stuff like sort of in a computational way. And then there's the humanities part where, you know, it's the sort of more traditional studies, and you said there's a couple possible paths organizations and academics and stuff might take, like one would be to like build up a team of some researchers and some programmers or, you know, maybe you could teach yourself or things like that. Do you want to kind of walk us through that?
21:39 Cornelius van Lit: Yeah, so this starts from the observation that there's kind of a... It's been called the two cultures problem, where it's sort of the sciences versus the arts. And there's sort of really a paradigm difference between them. Goethe has this very good sort of summary of it. In one of his books he says, "Gray is all theory, but green is the golden tree of life." With OpenCV, this is all too real, because the first thing you do with a color image is reduce it to grayscale. So people coming from the sciences trying to work on humanities problems, all too often they way too quickly think that they can rely on a calculation and then be very sure about the outcome of it, and sort of their assurance of their methodology trips them up when it comes to humanities problems, which don't always have a yes or no answer to them. But the humanities, then, their really sort of deeply grounded problem is that they just can't really understand what technology could give them. So oftentimes, the things that they think should be so easy are incredibly hard for technology. When I talk about digitized manuscripts, obviously they're saying, "Oh, so like OCR, like you can read the text out of it?" No, it's not.
23:00 Michael Kennedy: Have you seen this text? Exactly. You know, that's real interesting. You know, honestly I can see that that community would have that perspective, but at the same time, a lot of people do who are not actual programmers. A lot of things when you're doing programming, they're like, "That must be really hard." You're like, "No, actually that's like three lines of code," and they think this other thing is really easy. And you're like, "No. That's like two more weeks of programming. That's really hard. I know it looks like a small step. It is not a small step," right? It's just how the software is.
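The grayscale reduction mentioned here can be sketched without any library. A minimal illustration, assuming 8-bit RGB pixels and the standard luma weights (0.299, 0.587, 0.114) that OpenCV documents for its RGB-to-gray conversion; the function names and the tiny 2x2 sample image are hypothetical.

```python
def to_gray(pixel):
    """Reduce one (r, g, b) pixel with 0-255 channels to a single gray level."""
    r, g, b = pixel
    # Weighted sum: green contributes most, blue least.
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def image_to_gray(image):
    """Apply to_gray to every pixel of an image given as a list of rows."""
    return [[to_gray(pixel) for pixel in row] for row in image]

# Hypothetical 2x2 image: white, black, pure red, pure green
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 255, 0)],
]

gray = image_to_gray(image)
print(gray)
```

Three color channels collapse into one number per pixel, which is why the step discards information even as it makes later analysis simpler.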
23:29 Cornelius van Lit: Exactly, yes. There's like, there's a whole research team working on it for five years now. They haven't gotten it yet, so.
23:36 Michael Kennedy: Exactly. You're getting close.
23:37 Cornelius van Lit: It sort of brings this huge divide between the two, right? And so for most purposes, people tried to solve this like, "Okay, let's work in teams with an engineer and a humanities scholar." But then you get this sort of Babylonian problem that, you know, it's very hard for them to talk to each other. And so I am much more in favor of learning the tech yourself, which I think is really doable at this particular point in time. The tech is high level enough and it also gives you very fast return on investment. You can really learn on the job, so to say. And so I would really like to see people sort of occupy that middle space in between the humanities and the digital.
24:18 Michael Kennedy: Right, without completely sort of abandoning that humanities side and just going, "I'm going to become a programmer," but sort of continue doing your work but actually gain some of these digital skills.
24:26 Cornelius van Lit: That's right. And the more I thought about this and the more I wrote about this on my website, digitalorientalist.com, I thought, okay, actually it's not just one middle position, but we should think of it as a spectrum, you know, where are you on the DH spectrum? That's how I like to think about it. And I sort of distinguished six archetypes in the spectrum. You saw it in the keynote.
24:48 Michael Kennedy: Yeah, yeah, yeah. So you started out with the believer who is somebody who knows... Actually I've seen what you've done. I agree with it. This is amazing. Let's go down that path, right.
25:00 Cornelius van Lit: Those who are really doing programming themselves, who are kind of 100% of their time doing programming as part of their humanities research somehow. And just to the left of them, you have the sour ones, which are also programmers, you know, full on, but really, they want to share their newfound, you know, things, their newfound results. They want to share it so much with the rest of the humanities. But a lot of colleagues, you know, they find it very hard to understand, just as we have also found it hard to talk about this example that we discussed, about analyzing the codices, the angle of it. And so they'd be turned off. And I've had these Skype conversations with people where I was reaching out to them, very enthusiastic, and I quickly realized that I was asking questions that they heard like a million times, and they thought I was up to no good or something. And so very quickly these conversations escalate in a way that's really not necessary.
25:57 Michael Kennedy: Yeah, I wonder about the sour one category. I feel like if you had tried to solve these problems 10 years ago in the same way that you're solving them now, it might have been much harder and much more fruitless, right? Like, before OpenCV and simple Python packages and stuff, you might be like, "Well, we tried this, but we're not like a tech company. We're not going to be able to do this, so it's never going to work." And they just maybe didn't come back or something. I don't know. What do you think?
26:26 Cornelius van Lit: Oh yes, there's definitely part of that. Or they just slaved and labored for years and years and sort of get the same results that I now got. And you know, just a couple of lines of code, you know?
26:38 Michael Kennedy: I mean that's just a challenge of things moving so fast. Moving on. Alright, so the next one, your archetype is the spider.
26:46 Cornelius van Lit: Right, and so the spider is usually a full professor who is the leader of a research team. They have this magic ability to attract grants. It is a very useful skill.
27:00 Michael Kennedy: Very strong currency in academics. You have papers and you have grants. These are the two.
27:02 Cornelius van Lit: Yes, yes. Money is hard to come by. And you know, these professors, they wouldn't program themselves, but they have sort of reading knowledge of it and they understand at least the concepts of it, and they have the ability to collect these people around them, put them in connection and let the people who are able to do it, you know, really shine. So I think a lot of good can come from the spider.
27:25 Michael Kennedy: Yeah, absolutely. I agree. I've seen that in other contexts, although I didn't call them that. But yeah, I think that's a very positive place to be in academics.
27:32 Cornelius van Lit: Absolutely, yes. So then you have like a team structure that actually functions quite well, because there's not a big divide between engineers and scholars, researchers of the humanities, but it's integrated into one, basically. Now, at sort of a smaller scale of that, you have what I call the blind and the lame. So now you don't have a team; you just have two persons, and the lame one is the professor again. But maybe like an associate professor. So somebody who has heard, like, "Oh, DH is all the fashion, so I need to do something with this, but I have absolutely no idea. So I'm going to hire a student or an engineer to do the work for me." And this is definitely headed nowhere, really, because it's so hard to talk to each other, to come to a meaningful conclusion. And even if you have everything kind of working, then usually the result is like, "Well, how representative is this to make the claim that the data is trying to make?" Or the other way around: if it corresponds with the erudition of scholars of, you know, many generations before, is this just proving the erudition, or is the erudition actually proving the software? Like, is the software now validated because it's in accord with the opinions of scholars before?
28:53 Michael Kennedy: Yeah, of course.
28:54 Cornelius van Lit: So what I think is really the best place to be in the DH spectrum is to reduce that two person team into one person. And so I call that the centaur.
29:05 Michael Kennedy: Yeah, that's the person, or the creature, whose top half is a human and the bottom half is a horse, right?
29:11 Cornelius van Lit: That's right. So it's got still a human head, a humanistic head, but the legs are already digitizing, basically, able to move by him or herself in the digital sphere, making use of digital tools. I see that with PhD students now, or this is now kind of the generation who's really investing real time into acquiring these kinds of skills.
29:34 Michael Kennedy: What is your experience with people who are in the humanities? So they probably were not super drawn to like computer science type of subjects. What is their experience when you say, "Hey, you know, it really makes sense for you to learn a little bit of programming or to explore this technical side of things." Are they excited? Are they resistant? Are they nervous?
29:54 Cornelius van Lit: It really varies, because even just putting up numbers in your article or your book, you already get the comment, like, "What are you trying to make me do here, mathematics? What are numbers for?"
30:09 Michael Kennedy: Is this a graph? Are you kidding me? Is this a graph? Did that come from Plotly?
30:13 Cornelius van Lit: Yeah, it really is challenging because even now a lot of students in the humanities they kind of use the computer just as a typewriter, in a way. And you know, I'm also really kind of concerned about this, sort of the digital literacy of our society in general, as our technology is all so polished, you know what I mean? Like Facebook and all these websites they...
30:38 Michael Kennedy: All the mobile apps and stuff. Yeah, they look so good and smooth.
30:41 Cornelius van Lit: It just works. Yeah, I get that. That's a great, you know, marketing statement, but it also makes you not really think twice about how it works and why it works that way. So you really have to show them immediate benefits and it has to be low-hanging fruit. You can't force them to like, "Okay, well, just learn Python for half a year and then come back." It has to be a very, very quick return on investment.
31:06 Michael Kennedy: And that's probably why Python is working well in this situation, right? Because you can be productive with a partial understanding of Python. You never have to know what a function, a class, a generator, a database, any of these things are, and you can still have outcomes, right? Just top to bottom. It does this, here comes the analysis or whatever.
31:24 Cornelius van Lit: That's right. That's right. And to sort of, to be fed up, I recommend to people, well, just listen to podcasts like this one and just do that in your spare time when you're exercising or whatnot. You don't have to understand it. That's not the point. Just hear the words they're saying. And if words are repeated and it sounds interesting, it's maybe an interesting context, then maybe then look it up and see what it is.
31:48 Michael Kennedy: You know, it's really interesting. When I first started this podcast like four and a half years ago, my expectation of who my audience member was, was a dedicated expert. Somebody who loves Python and programming enough that not only do they do it as their hobby or their job, but they also do it in their spare time and they listen to the background stories of it, right. And what I found out is that many of the listeners, maybe over half are like beginners who are more trying to do a tech immersion, type of experience, right. They'll write me and say, "Hey, I love your show. It's really entertaining. A lot of stuff you're talking about doesn't make sense. But after a couple of months, a lot more of it is making sense. And I know what it means now and it's really wonderful and I just never expected it, but it's kind of like a language immersion almost."
32:36 Cornelius van Lit: To all those people, you're doing it the right way. This is the way to do it, just keep listening.
32:41 Michael Kennedy: Yeah, I totally agree. The more that I think about it. Yeah, so your pitch or your idea is that this centaur, this person who is both doing the work but also becoming this programmer, adding a little bit of like superpower to what you're doing, as I like to put it, is sort of the recommended path. That makes a lot of sense. I definitely agree with that one. Now let's dig into some of the technology that you're using, some of the problems you're solving and how you're solving them, because they're really interesting. So we talked a little bit about this flap thing and understanding its angle and its orientation, but there's also other things that you're looking at that you're going to have to guide me a little here because I don't know a ton about, but there's like these stamps, like these seals, these decorative seals, and those are kind of like a signature or a proof of authenticity of various authors and thinkers and stuff. And so they're all over these manuscripts in very ornate ways, and so you were trying to like discover those and also trying to actually digitize the calligraphy that I talked about. So maybe talk us through some of the libraries you're using and just this whole analysis that you're doing. Because it's really deep, like I said.
33:48 Cornelius van Lit: Yeah, I'd love to. I'd love to, like let's start with the stamps and seals. So these are small little imprints. Maybe some of your listeners will know what an Ex Libris is: it's basically a sticker or a stamp that you place in your book and it just says Ex Libris, from the books of, meaning, I own this book, this book is mine. And the people in the past did that all the time and they usually have quite lovely stamps that they placed in their manuscript to say like, "This is my manuscript." Now, if you have a manuscript that originated in the 12th century and it was sold often and now it ends up in a research library, you might just have a whole history of ownership in the manuscript. Isn't that amazing? So you can actually trace how the manuscript traveled, not only through time but through space. You can see like, "Oh, it was in Tabriz in the 14th century, it was in Istanbul in the 16th century, it was in Morocco in the 18th century."
34:47 Michael Kennedy: It's like the blockchain of manuscripts.
34:49 Cornelius van Lit: I don't know too much about blockchains, so.
34:54 Michael Kennedy: I was just kidding. Keep going.
34:55 Cornelius van Lit: And so I said before that OCR, that's kind of out of the question for now for manuscripts. Yes, people are working on handwriting recognition, but it's not at any production level. But these stamps, they are kind of ideal for OCR kind of recognition, you know? And so again, OpenCV is so powerful, you have to find a very pristine example of the seal and then, well, you give it to OpenCV and then you just give a whole bunch of images and you say to OpenCV, find me this image, basically. And it will find seal imprints that have a huge chunk out of them and it will still detect them very well. So now you can imagine if you scale this up, now, if you have a couple of thousand manuscripts that kind of belong together, but you don't know exactly how, now you can all of a sudden with this kind of code, you can reconstruct the collection of an owner from the 17th century. You can say, all of these manuscripts actually belonged to the same owner. They all have the same seal.
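A minimal sketch of the seal search described here, not the code used in the project. It assumes NumPy only and slides a small "pristine" seal over a page, scoring each position with normalized correlation; this is a stand-in for the template matching that OpenCV provides, and the 5x5 seal, the toy page and the function name `find_seal` are all made up for illustration.

```python
import numpy as np

# A tiny made-up "pristine" seal: 1.0 is ink, 0.0 is paper.
SEAL = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
], dtype=float)


def find_seal(page, seal):
    """Slide `seal` over `page`; return (row, col, score) of the best match.

    The score is the normalized correlation between the seal and the page
    window, so 1.0 means a perfect match and lower values a weaker one.
    """
    page_h, page_w = page.shape
    seal_h, seal_w = seal.shape
    seal_centered = seal - seal.mean()
    seal_norm = np.linalg.norm(seal_centered)
    best = (0, 0, -1.0)
    for r in range(page_h - seal_h + 1):
        for c in range(page_w - seal_w + 1):
            window = page[r:r + seal_h, c:c + seal_w]
            window_centered = window - window.mean()
            denom = np.linalg.norm(window_centered) * seal_norm
            if denom == 0:
                continue  # blank window, nothing to compare
            score = float((window_centered * seal_centered).sum() / denom)
            if score > best[2]:
                best = (r, c, score)
    return best


# Toy page with a clean imprint at row 7, column 12.
page_clean = np.zeros((20, 30))
page_clean[7:12, 12:17] = SEAL

# Same page, but the imprint has its top-left corner missing.
damaged_seal = SEAL.copy()
damaged_seal[:2, :2] = 0
page_damaged = np.zeros((20, 30))
page_damaged[7:12, 12:17] = damaged_seal

clean_hit = find_seal(page_clean, SEAL)
damaged_hit = find_seal(page_damaged, SEAL)
print("clean imprint:", clean_hit)
print("damaged imprint:", damaged_hit)
```

On this toy page the damaged imprint is still located at the same position, only with a lower score. Real scans would need a library such as OpenCV for speed and for dealing with scale, rotation and noise.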
35:56 Michael Kennedy: Yeah, so you can ask really interesting, like show me who has ever owned or show me all the places that this person or this scholar has ever owned and then was edited by this other person or something like you can...
36:08 Cornelius van Lit: Exactly
36:09 Michael Kennedy: These questions seem over many thousands of permutations, sound really difficult, but now all of a sudden, now it's a, you know, sub-millisecond database query.
36:17 Cornelius van Lit: This is not a reality yet, but this is what I'm working towards and it will undoubtedly lay bare all kinds of interesting things because then you can really see what's not spoken of, what's just kind of in the evidence. But it doesn't make it.
36:35 Michael Kennedy: These invisible connections that nobody has yet actually highlighted, right?
36:38 Cornelius van Lit: Exactly, like all of a sudden you find that a particular kind of type of texts, say texts of logic, they all came through a very religious place of some kind, or you see that, you know, two people were constantly buying and selling from each other. Those kinds of things you can now quite easily lay bare. That's one thing. Another thing is, what if you take... So you have 300 photos of a manuscript, meaning 300 photos in which you see a page break, you see two pages. Basically the book opens to two pages and then the next photo is one page flipped over and so you see, you know, the next two pages, right? What if you make it so that you only extract the ink, you kind of have OpenCV and NumPy. You're going to have to use NumPy for this as well. You sort of lock onto the ink, you delete everything else, and now you stack those layers, those very tiny sort of wafer-thin layers of ink. You stack them on top of each other. So now you've got 300 layers on top of each other and now all of a sudden you see hills and valleys, places where there's always ink and places where there's never ink. So it's kind of like an x-ray of a manuscript. And so the result is you see where in the manuscript there's ink, but in a way that it represents the entire manuscript, right? It's not a representation of one page, it's a representation of all pages together. And so for example, very clearly then the text block becomes visible because obviously back then they were very neat writers. They always began at the very same place and the lining is very clear.
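The "x-ray" idea can be sketched in a few lines of NumPy. This is an illustration under simple assumptions, not the project's code: every photo is taken to be a grayscale array of the same size, ink is anything darker than a fixed threshold, and the names `ink_mask`, `stack_ink` and `make_photo` are invented for the example.

```python
import numpy as np


def ink_mask(photo, threshold=100):
    """Return True wherever a grayscale pixel is dark enough to count as ink."""
    return photo < threshold


def stack_ink(photos, threshold=100):
    """Stack the ink layers of all photos.

    The result holds, for each pixel position, the share of photos
    (0.0 to 1.0) that have ink there: the hills and valleys.
    """
    masks = np.stack([ink_mask(photo, threshold) for photo in photos])
    return masks.mean(axis=0)


def make_photo(extra_ink=()):
    """Build a tiny synthetic photo: light paper with a dark text block."""
    photo = np.full((4, 6), 220, dtype=np.uint8)  # paper
    photo[1:3, 1:5] = 30                          # text block, always inked
    for row, col in extra_ink:
        photo[row, col] = 30                      # stray mark on this page only
    return photo


photos = [make_photo(), make_photo([(0, 5)]), make_photo()]
heatmap = stack_ink(photos)
print(np.round(heatmap, 2))
```

Positions inside the text block come out as 1.0 (always ink), the stray mark as one third, and the untouched margin as 0.0. With real photographs the pages would first have to be aligned and a sensible threshold chosen per scan.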
38:23 Michael Kennedy: Yeah, that's super cool. And you even show how in order to get some of the OCR stuff to work better, you might use OpenCV to go and delete some of the extraneous like punctuation and characters and things like commas and whatnot, right?
38:39 Cornelius van Lit: Yeah, so now we're moving to 19th-century prints. In the 19th century, the printing press had come along in the Islamic world. And they started to print, they were very late. But in the very beginning, these printed productions are quite poor in a way. So if you just use OCR, say Tesseract, you get rubbish basically.
38:58 Michael Kennedy: Is that from Google?
38:59 Cornelius van Lit: It's now by Google, yes. I believe, you know, that's amazing about these things. You know, I'm just a user of technology in a way. And so I stumbled upon a new library and then you know, you quickly look into it. It's like, "Oh, it's been developing for the last 20 years."
39:13 Michael Kennedy: Yeah, I just didn't know about it. Yeah, I haven't really heard of it.
39:15 Cornelius van Lit: So I think it originated at HP if I'm correct. So it's one of the leading OCR packages. There are other ones, but this one works, whereas others I can't get them to work so I can't invest more time in it to make it work.
39:29 Michael Kennedy: Cool, so you're basically going through and you'll use OpenCV to actually clean up and purify this up before you send it off to these OCR places so they actually have a chance of success, right?
39:40 Cornelius van Lit: Yes, and really the results are amazing, because when you want to just extract a plain text, you don't want commas because commas have been inserted by the editors anyway. They're probably wrong and you don't want page numbers because you can insert them digitally. You know, you got that under control. So you remove all of that, you get the plain text and then Tesseract does a really wonderful job.
40:01 Michael Kennedy: This portion of Talk Python To Me is brought to you by Linode. Are you looking for hosting that's fast, simple and incredibly affordable? Well, look past that bookstore and check out Linode at talkpython.fm/linode, that's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server with a gig of RAM. They have 10 data centers across the globe, so no matter where you are or where your users are, there's a data center for you. Whether you want to run a Python web app, host a private Git server or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24/7 friendly support even on holidays and a seven-day money back guarantee. You need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations and more. Do you want a dedicated server for free for the next four months? Just visit talkpython.fm/linode. I don't know that people really appreciate how much work that is, unless you see the original manuscript. Like I said, this is not a printed book, right? This is annotated, handwritten stuff.
41:08 Cornelius van Lit: Yes, I think the best way to approach it is to first find something that you like, something that you can find. Say, like I said, doing sort of the x-ray analysis. Now, you kind of know where the text block ought to be and so then you can use that knowledge and come back to one image and analyze that one image with that knowledge of where the text ought to be. And so now, if you find ink outside of those text blocks, then you know, "Oh, there is a marginal note. This might be interesting."
41:38 Michael Kennedy: Right, maybe that's the most interesting part. Right, but you want to capture it separately. You can't just like try to cram it together or whatever. Yeah, how interesting. It's going to be inspiring to a lot of people to just see all of this cool technology being applied to stuff I suspect it was not initially designed to be applied to. And when I first looked at it, I was like, I don't really know how much technology is going to directly answer these questions, but this idea of saying like, I'm going to create like an ownership chain in say a graph database or something like that of these manuscripts and use that to explore these hidden links. That's really interesting. I think it's unexpected and very cool.
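The step from the stacked "x-ray" back to a single page could look roughly like this. It is a sketch under assumptions, not the author's method: the text block is taken to be the bounding box of positions that carry ink on at least half of all pages, and `text_block_bbox` and `marginal_ink` are hypothetical helper names.

```python
import numpy as np


def text_block_bbox(heatmap, min_share=0.5):
    """Bounding box (top, bottom, left, right) of positions that are
    inked on at least `min_share` of all pages."""
    rows, cols = np.nonzero(heatmap >= min_share)
    return int(rows.min()), int(rows.max()), int(cols.min()), int(cols.max())


def marginal_ink(page_mask, bbox):
    """Keep only the ink of one page that lies outside the text block."""
    top, bottom, left, right = bbox
    outside = page_mask.copy()
    outside[top:bottom + 1, left:right + 1] = False
    return outside


# Made-up stacked result: a solid text block plus one rare stray mark.
heatmap = np.zeros((6, 8))
heatmap[1:5, 2:6] = 1.0
heatmap[0, 7] = 0.1

bbox = text_block_bbox(heatmap)

# One single page: the usual text block plus a note in the left margin.
page_mask = heatmap >= 0.5
page_mask[3, 0] = True

notes = marginal_ink(page_mask, bbox)
print("text block:", bbox)
print("ink outside the text block at:", np.argwhere(notes).tolist())
```

Anything that survives in `notes` is a candidate marginal note that can be cropped and examined separately from the main text.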
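To make the ownership-chain idea concrete, here is a small sketch using only the Python standard library rather than a real graph database. All records are invented for illustration (the manuscript IDs and owner labels are placeholders, and the places simply reuse the cities mentioned earlier), and the function names are assumptions.

```python
from collections import defaultdict
from typing import NamedTuple


class SealSighting(NamedTuple):
    """One detected seal: which manuscript, whose seal, where, and when."""
    manuscript: str
    owner: str
    place: str
    century: int


# Invented example data, not real provenance records.
SIGHTINGS = [
    SealSighting("MS-A", "owner-1", "Tabriz", 14),
    SealSighting("MS-A", "owner-2", "Istanbul", 16),
    SealSighting("MS-A", "owner-3", "Morocco", 18),
    SealSighting("MS-B", "owner-2", "Istanbul", 16),
    SealSighting("MS-C", "owner-2", "Istanbul", 16),
    SealSighting("MS-C", "owner-3", "Morocco", 18),
]


def provenance(sightings, manuscript):
    """Ownership chain of one manuscript, oldest seal first."""
    chain = [s for s in sightings if s.manuscript == manuscript]
    return sorted(chain, key=lambda s: s.century)


def collections_by_owner(sightings):
    """Reconstruct each owner's collection from the seals that were found."""
    result = defaultdict(set)
    for s in sightings:
        result[s.owner].add(s.manuscript)
    return dict(result)


def transfers(sightings):
    """Count how often a manuscript passed from one owner to the next."""
    counts = defaultdict(int)
    for manuscript in {s.manuscript for s in sightings}:
        chain = provenance(sightings, manuscript)
        for earlier, later in zip(chain, chain[1:]):
            counts[(earlier.owner, later.owner)] += 1
    return dict(counts)


route = [s.place for s in provenance(SIGHTINGS, "MS-A")]
owner_2_books = collections_by_owner(SIGHTINGS)["owner-2"]
handovers = transfers(SIGHTINGS)
print(route)
print(sorted(owner_2_books))
print(handovers)
```

The same three questions (where did a manuscript travel, what did one owner hold, who passed books to whom) map naturally onto nodes and edges if the data later moves into a graph database.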
42:18 Cornelius van Lit: Yes, but there are a lot of challenges ahead of us.
42:21 Michael Kennedy: Okay, yeah. What are some of them?
42:23 Cornelius van Lit: To start with, the data that we need is not accessible right now in the way that we would like it. Right now digitized manuscripts are often just on a website of a university library. You can only kind of look at them with your human eyes, right. But you can't access them with an API so you can just sort of programmatically bring them in. So there's really a lost opportunity. But you know, I'm working on it and I'm talking a lot with librarians, so, hopefully, we'll see progress there.
42:53 Michael Kennedy: That's really cool. You know, sometimes when there's not an API, there is, it's just not intended, right? Like there's Selenium, or there's Beautiful Soup and requests, right? There's ways to go through and actually extract that data even if it wasn't prepared to be presented in that way. How much is that possible and how much are there like licensing and/or usage restrictions that make that kind of research impossible?
43:18 Cornelius van Lit: Well, sort of the licensing copyright issue is murky territory at the moment. So this is also something that needs to be worked out. But you're right, I mean, this is of course possible.
43:30 Michael Kennedy: Yeah, but you don't want to go through and like grab all this data and then say, well, you can't use any of it because you can be in trouble for X, Y, and Z, right?
43:37 Cornelius van Lit: Yeah, in these kinds of fields, the world is very small, you know, so you can't make enemies basically.
43:43 Michael Kennedy: Yeah, of course, you might be able to make proof of concepts though, right? Like maybe as a researcher, like a young grad student or something, you could say, "Well, if I had this data, what questions could I ask and answer?" and then maybe do something and present it back to that organization, say like, "Look, if you would somehow provide us this information, these are the types of things that we can do. We spent a week to show you, how can we help make that happen?" Right?
44:07 Cornelius van Lit: That's right. I feel like I, myself, at least I'm building up towards that. I've already been saying these kinds of things to libraries and then perhaps after this project, I might just be able to do like a full DH project, so to say, and really come at this with full force. That's the other challenge for us in the humanities. Like we can't spend all of our time on programming. We can't spend all of our time on these kinds of things. So we have to divide our time and always make this difficult decision. Like how much of my personal development is going to go into this, and if I stumble upon a problem that I can't just fix with Google and Stack Overflow, then...
44:51 Michael Kennedy: Was it a dead end that like wasted your tenure track possibilities or something like that? Right?
44:58 Cornelius van Lit: Okay. So you don't want to chance that.
45:01 Michael Kennedy: Yeah, yeah, yeah. So in the harder sciences, we have the Journal of Open Source Software. Are you familiar with this?
45:09 Cornelius van Lit: I wasn't until you showed this to me. This is very...
45:13 Michael Kennedy: Yeah, so I had these folks on the show a while ago and it's pretty interesting. It's a place to publish and cite the software side of your research, and I did say it was more for the hard sciences, but it sounds to me like the stuff that you all are doing in your community would also be potentially something that you could publish there and then cite as some kind of publication and whatnot. So maybe that's a little bit of a release valve to get a little bit more credit for the software side of things.
45:40 Cornelius van Lit: Definitely, because a main challenge is that much of our, exactly like you said, tenure track, career-wise, it kind of still depends on print publications, and building software kind of doesn't count, it seems, right.
45:56 Michael Kennedy: Yes.
45:57 Cornelius van Lit: Very well for us. But there have been discussions about this where people in digital humanities may say, "Well, actually, we're builders, we're not writers." We should just...
46:08 Michael Kennedy: Yeah, it's not a unique problem to digital humanities, but it is a, I can imagine a little bit harder there.
46:14 Cornelius van Lit: Yes, because, let me ask you this, the one very sort of one problem that presents itself is every piece of software is kind of unique. So how are you going to compare it with other pieces of software? How are you going to say like, "Oh, this is very good, or this is not so good. This is good enough to make tenure or this is bad enough to fire him?"
46:35 Michael Kennedy: Yeah. I have no idea how to do that. That is really tricky. I see the problem. Interesting. Now, we're getting kind of short on time here, but I do want to give you a chance to talk about your two new projects or your book. You have a book called Among Digitized Manuscripts and then another one, a project that you're thinking about called Digital Literacy. Do you want to touch on those real quick before we wrap things up?
46:59 Cornelius van Lit: Sure, thank you. So over the last two years I've been really focused on digitized manuscripts and how to incorporate that into a workflow. And out of it came this handbook, which I called Among Digitized Manuscripts. You can go to github.com/among and you will find the repository and find some more information about it. It's basically a conceptual and practical toolkit for those who want to work with manuscripts in a digital file format, when what they're looking at are still photos. And so it's very broad, especially about the concept of it, about the challenges, but also a whole range of tools. So we're starting just with how to make vector images, which is a crucial skill to have. If you can replicate the glyphs from manuscripts in vector format, you can manipulate them much easier on a computer, of course. And then we go up until a chapter that really introduces Python and I go through this example of measuring the angle of the flap and really go through that entire code and explain it step by step, what we're doing here. All the while, of course, pointing out to people that if they want to really know more about Python and JavaScript, which I also cover, then there are plenty of resources on the internet or in other books available to do that. So it's also to get people who are not yet in, who are not familiar with technology, to get them up and running.
48:29 Michael Kennedy: Okay. Yeah, that sounds really interesting. And I'm sure it's a pretty unique book. There's probably not that many books on digital manuscripts and Python, so it sounds like it's going to be a good resource for people.
48:40 Cornelius van Lit: I don't want to pat myself on the back here but yes, it's a virgin field. So, you know, I would say to others, you know, you really need to think through these problems and push things further.
48:53 Michael Kennedy: So the other one, Digital Literacy.
48:56 Cornelius van Lit: Well, yes, this is sort of a concern of mine that I've had for many years and it's sort of part of the reason why I started digitalorientalist.com, sort of an online magazine about how to use computers in your day-to-day workflow as somebody in Islamic studies and in neighboring fields like Sinology, Japan studies, Africana, et cetera. Because I kind of feel that the generation before me and perhaps also my generation, I'm in my early 30s, we still grew up with computers as something that you had to kind of make work, otherwise it wouldn't do anything. And people even before that, they didn't grow up with computers. They don't know what's going on basically, but people younger than us, they're growing up now with all of this, you know, polished technology. It's just a black box basically. And I kind of worry that people don't really get to understand exactly what's really the foundational concepts of computers, how knowledge is stored on a computer in bits and bytes. It might not come up on a daily basis. But knowing this, I think, is very important for handling computers well and understanding how computers are also used at scale, of course, by governments and whatnot.
50:13 Michael Kennedy: Yeah, well I definitely think that the world is absolutely full of consumers of technology, but there's definitely room for more producers or creators with technology, right?
50:27 Cornelius van Lit: That's right. I feel that in the humanities we ought to be self-sufficient and self-reliant. And if we can use some programming skills ourselves, we can already solve 90% of our problems.
50:41 Michael Kennedy: Yeah, absolutely. Yeah, programming's a super power for whatever you do and that's for sure. Alright, well I think we're going to leave it there for the main topic because we're out of time. But before you get out of here, let me ask you the last two questions. So if you're going to write some Python code, do some of this cool image analysis. What editor do you use?
50:59 Cornelius van Lit: I use PyCharm, CE. Yes. I just like it. I've been using now, I've been getting into Jupyter notebooks. Actually thanks to your podcast, Azure notebooks episode because I'm working on a workshop in the fall for PhD students and I was not planning on installing Python and so this is the way to go for that thing, yes.
51:22 Michael Kennedy: Yeah, I think it's great for teachers and for running classes and workshops. It's just, "Open a browser and go here, you're ready." It's not like, "Well, it won't install this package on my computer. Oh no, here we go." So, yeah, that's great. Excellent, and then a notable PyPI package? It sounds like you've got some interesting experience and exposure to them.
51:45 Cornelius van Lit: Yes. Well, I mentioned OpenCV of course. Something with which you can manipulate PDFs is PyPDF2, something I've been using. In my experience there's really a need for more libraries handling PDFs. I don't know how it is in other fields, but in the humanities PDFs are a very important file format. A lot of information is contained just in PDFs. So I hope we'll see more development in that. For images, like drawing and manipulation, you're also going to need something like Pillow, which is a fork, I believe, of the Python Imaging Library. Beyond that, I'd also like to mention Text-Fabric, and you can pip install text-fabric. And that's really just for analysis of a text that you have in plain text format. I myself work with a developer to get the Quran in the Text-Fabric format, so to say. And you can do real deep like syntactic or semantic analysis. And kind of just outside of the Python ecosystem, I also use something that's called Pandoc and it's incredibly useful for me. So it's a command line app, you can brew install pandoc and you can change from one file format to another. And for example, in my case, when you have different kinds of scripts like English and Arabic mixed in and you've got footnotes and end notes and all of that and you've got it in a Word doc, and Word is horrible, of course, to do anything with. You want to go to just plain text, Pandoc is amazing, so.
53:29 Michael Kennedy: Yeah, I hadn't heard about that really. I haven't used it anyway. And it looks super cool. Like it'll convert to and from Open Office, Microsoft doc, reStructuredText, Markdown, all kinds of stuff in here. That's great. Yeah, and also on our shared notes, you have this Mac app that I'm pretty excited to try out called Go2Shell.
53:49 Cornelius van Lit: You know, it's a good example, Go2Shell. So it's macOS only and it sits in your Finder. And if you're in a folder, there's just a little button at the top, like in the toolbar, you click it and the terminal opens and you're immediately in that folder in the terminal. It's these kinds of things that are tremendously important for me as just a scholar who's, yes, I program but I understand that some others would, you know, would sort of scoff at this and say like, "Oh no, the only right way to do that is to open a terminal and, you know, cd into it."
54:22 Michael Kennedy: No, no, this is cool.
54:23 Cornelius van Lit: But it's easy for me, you know?
54:24 Michael Kennedy: Yeah, a lot of times you're in a terminal or a finder and you're like, "I just want to go here, but how do I copy here?" There are some tricks that are absolutely not obvious. Like you can go to the file or the folder, either one really, but just see the folder and you can command C it and then you can command V that into a terminal window. But it's like push a button, it opens the terminal in the right place. That's cool, I'm going to check this out.
54:47 Cornelius van Lit: There you go.
54:47 Michael Kennedy: Alright, well, that's pretty much it for our time. Final call to action. Maybe speak to the folks out there doing humanities research or maybe not even using the digital side of things yet. How do they get started?
55:00 Cornelius van Lit: Exactly. Well, so if you're listening in the humanities and you're interested, you want to go for it. Just, you know, keep listening to podcasts like this. Pick up my book when it comes out later this year and come in contact with me. You know, there's sort of a network at digitalorientalist.com where you can also share your experience. I'd be very happy to help you get along. And this also sort of kind of crosses both ways. If there are Python developers out there who kind of think like, "Oh, this would be so exciting to work on humanities projects." There are all kinds of project ideas that I just have, you know, but can't do anything with them because of limited time, so connect with me and it'd be very great to, you know, see things go forward.
55:49 Michael Kennedy: Yeah, absolutely. Well, Cornelius, I really love this look into what you're doing. And I think it's fascinating how all these pieces are coming together. So thanks for taking the time and being on the show.
55:58 Cornelius van Lit: Thanks to you and the community for making Python what it is. That's what I would say.
56:03 Michael Kennedy: Yeah, absolutely. All right, well I'll talk to you later. Thanks.
56:07 Cornelius van Lit: All right, bye.
56:08 Michael Kennedy: This has been another episode of Talk Python To Me. Our guest in this episode was Cornelius van Lit and it's been brought to you by Command Line Heroes and Linode. Command Line Heroes is a podcast telling the story of developers. This season is all about programming languages and starts off with Python, of course. Subscribe at talkpython.fm/heroes. Linode is your go-to hosting for whatever you're building with Python. Get four months free at talkpython.fm/linode, that's L-I-N-O-D-E. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course, or if you're looking for something more advanced, check out our new Async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our everything bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.