Python in digital humanities research

Episode #230, published Wed, Sep 18, 2019, recorded Tue, Aug 27, 2019


You've often heard me talk about Python as a superpower. It can amplify whatever you're interested in or what you have specialized in for your career.

This episode is an amazing example of this. You'll meet Cornelis van Lit. He is a scholar of medieval Islamic philosophy and works at Utrecht University in the Netherlands. What he is doing with Python is pretty amazing.

Even if you aren't interested in digital humanities and that type of research, the example set by Cornelis is a blueprint for bringing Python into your world and for those around you. I think you'll enjoy this conversation.

Episode Deep Dive

Guest Introduction and Background

Cornelis van Lit is a scholar of medieval Islamic philosophy working at Utrecht University in the Netherlands. He has long focused on studying manuscripts and texts from centuries past but has more recently incorporated Python programming into his research. Cornelis has used tools like OpenCV and Tesseract to perform image-based and OCR-based analyses on historic documents. He also founded and contributes to digitalorientalist.com to help other scholars apply technology to research in the humanities.

What to Know If You're New to Python

If you are fairly new to Python and want to follow along, here are a few basics you should understand before diving into Cornelis’s conversation:

Installation and Virtual Environments: Ensure Python is installed and learn to create and activate virtual environments for isolating dependencies.
OpenCV Concepts: Know that OpenCV is a powerful library for computer vision (analyzing images/video).
Package Management: Practice using pip to install common libraries like numpy, pandas, or PyPDF2.
Tesseract Usage: If you plan to do OCR (text recognition), Tesseract is a widely used command-line tool you can integrate with Python via bindings like pytesseract.

Key Points and Takeaways

Python as a Superpower in Digital Humanities: Cornelis emphasized how Python amplifies traditional humanities research. While many scholars rely on purely manual analysis of texts, coding skills help process images and classify manuscripts at scale. Python fills in the “missing piece” between raw digitized images and meaningful conclusions about history and philosophy.
Analyzing Manuscripts with Image Processing: One of Cornelis’s major projects involves measuring the angle of the flap on bound manuscripts and detecting ornamental stamps. With a few well-placed OpenCV functions, he could measure details across thousands of digitized covers. This kind of computation allows researchers to draw connections between book-binding styles and the eras or regions in which they were produced.
- Links and Tools:
  - OpenCV on PyPI
  - PyPDF2 on PyPI
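The episode does not spell out the code behind the flap measurement, so the following is only a minimal sketch of the geometry involved. It assumes the three corner points of the flap triangle have already been located in the image (for example from a contour found with OpenCV); the function name `angle_at_tip` and the pixel coordinates are hypothetical.

```python
import math


def angle_at_tip(tip, corner_a, corner_b):
    """Return the angle in degrees at `tip` between the rays to the two corners."""
    # Vectors from the tip of the flap to each of the other two corners.
    ax, ay = corner_a[0] - tip[0], corner_a[1] - tip[1]
    bx, by = corner_b[0] - tip[0], corner_b[1] - tip[1]
    dot = ax * bx + ay * by
    cross = ax * by - ay * bx
    # atan2 of |cross| and dot gives the unsigned angle between the vectors.
    return math.degrees(math.atan2(abs(cross), dot))


# Hypothetical pixel coordinates for one cover: the flap tip on the left and
# the two corners where the flap meets the board on the right.
tip = (0, 100)
upper = (100, 0)
lower = (100, 200)
print(round(angle_at_tip(tip, upper, lower), 1))
```

Once a function like this works for one cover, the same call can be applied to every cover image in a collection and the resulting angles compared across dated manuscripts.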
Stamps, Seals, and Ownership Trails: Many historic Islamic texts have owner stamps or “ex libris” entries that reveal a manuscript’s journey across time. By detecting stamped images with OpenCV’s pattern matching, Cornelis can trace a book’s movement from one scholar or region to another. These ownership chains can illuminate broader intellectual networks and how ideas spread through the centuries.
- Links and Tools:
  - digitalorientalist.com
  - opencv.org
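To make the idea of pattern matching concrete, here is a toy sketch in plain Python. It slides a small "stamp" over a "page" and scores each position by the sum of squared differences. This is an illustration of the technique only, not Cornelis's code; the function name `find_stamp` and the tiny images are made up. On real scans you would more likely reach for OpenCV's `cv2.matchTemplate`, which performs this kind of search far faster.

```python
def find_stamp(page, stamp):
    """Slide `stamp` over `page` and return (row, col, score) of the best match.

    Both arguments are 2D lists of grayscale values. The score is the sum of
    squared differences, so 0 means a pixel-perfect match.
    """
    page_h, page_w = len(page), len(page[0])
    stamp_h, stamp_w = len(stamp), len(stamp[0])
    best = None
    for top in range(page_h - stamp_h + 1):
        for left in range(page_w - stamp_w + 1):
            score = sum(
                (page[top + dy][left + dx] - stamp[dy][dx]) ** 2
                for dy in range(stamp_h)
                for dx in range(stamp_w)
            )
            # Keep the first position with the lowest score seen so far.
            if best is None or score < best[2]:
                best = (top, left, score)
    return best


# Toy data: a white 5x6 "page" (value 255) with a dark 2x2 "stamp"
# pasted in at row 2, column 3.
stamp = [[0, 50], [50, 0]]
page = [[255] * 6 for _ in range(5)]
for dy in range(2):
    for dx in range(2):
        page[2 + dy][3 + dx] = stamp[dy][dx]

print(find_stamp(page, stamp))  # (2, 3, 0)
```

Real stamps are faded, rotated, and partly covered by text, so an exact score of 0 is unrealistic; in practice you would accept the best match only if its score falls under some threshold you choose by inspecting examples.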
OCR and Tesseract for Text Extraction: While complex cursive and calligraphy remain challenging, Cornelis found that Tesseract performs well for certain segments of printed text. Before handing off pages to Tesseract, he uses Python to remove page numbers, margins, and noise so the OCR engine can focus on the words themselves. This pre-processing step greatly improves recognition rates for older prints.
- Links and Tools:
  - Tesseract OCR
  - pytesseract on PyPI
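As a rough illustration of one pre-processing step, the sketch below crops a grayscale page to the bounding box of its ink, which removes blank margins. It is a simplified stand-in under stated assumptions: the image is a 2D list where 0 is black, the threshold of 128 is arbitrary, and the name `trim_margins` is hypothetical. It does not remove page numbers or noise, which need extra rules of their own. The cropped result is what you would then hand to the OCR engine, for example through `pytesseract.image_to_string` after converting it to an image object.

```python
def trim_margins(image, ink_threshold=128):
    """Crop a grayscale image (2D list, 0 = black) to the bounding box of its ink."""
    # Rows and columns that contain at least one pixel darker than the threshold.
    ink_rows = [
        y for y, row in enumerate(image)
        if any(value < ink_threshold for value in row)
    ]
    if not ink_rows:
        return []  # blank page: nothing to hand to the OCR engine
    ink_cols = [
        x for x in range(len(image[0]))
        if any(row[x] < ink_threshold for row in image)
    ]
    top, bottom = ink_rows[0], ink_rows[-1]
    left, right = ink_cols[0], ink_cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy 4x5 page: white everywhere except two dark pixels.
page_with_margins = [
    [255, 255, 255, 255, 255],
    [255, 0, 255, 255, 255],
    [255, 255, 255, 0, 255],
    [255, 255, 255, 255, 255],
]
print(trim_margins(page_with_margins))  # [[0, 255, 255], [255, 255, 0]]
```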
Bridging the Two Cultures of Humanities and Tech: Cornelis mentioned the “two cultures” dilemma: humanities scholars often see code as foreign, and many programmers lack domain knowledge in the humanities. Instead of outsourcing everything to engineers, he advocates being a “centaur”: a researcher who learns enough code to integrate software tools directly into their academic workflow.
- Links and Tools:
  - digitalorientalist.com
Incremental Learning and Quick Wins: Many first-time coders in the humanities get intimidated by Python’s breadth. Cornelis suggests starting with small tasks like renaming files or extracting metadata with PyPDF2. Even one or two scripts that save hours of manual labor show the immediate return on investment of coding in a research environment.
- Links and Tools:
  - PyPDF2 on PyPI
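A file-renaming script is about as small as a useful first project gets. The sketch below uses only the standard library's `pathlib`; the naming scheme (`<prefix>_001.jpg`, `<prefix>_002.jpg`, ...) and the function name `rename_scans` are assumptions chosen for illustration, not something described in the episode.

```python
from pathlib import Path


def rename_scans(folder, prefix):
    """Rename every .jpg in `folder` to `<prefix>_001.jpg`, `<prefix>_002.jpg`, ...

    Files are numbered in sorted order of their current names. Returns the
    new names so the caller can log or check them.
    """
    folder = Path(folder)
    originals = sorted(folder.glob("*.jpg"))
    plan = [
        (source, source.with_name(f"{prefix}_{number:03d}.jpg"))
        for number, source in enumerate(originals, start=1)
    ]
    # Check the whole plan first so nothing is overwritten halfway through.
    for source, target in plan:
        if target != source and target.exists():
            raise FileExistsError(f"{target} already exists")
    for source, target in plan:
        source.rename(target)
    return [target.name for _, target in plan]
```

Calling `rename_scans` on a folder of camera-named images with a shelfmark-style prefix would give every page a predictable name. Note that plain sorting puts `IMG_10.jpg` before `IMG_9.jpg`, so it is worth checking the order on a copy of the folder before running it on the originals.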
Critical Role of Metadata and Data Formats: Libraries often digitize manuscripts without standardized APIs or open access points. That lack of uniform data can limit how effectively Python can batch process or cross-reference manuscripts. Cornelis’s work involves encouraging libraries to offer better programmatic access, so humanities scholars can easily run large-scale digital analyses.
- Links and Tools:
  - digitalorientalist.com
Text Summation and Aggregation Techniques: Beyond single-page analysis, Cornelis has attempted to “stack” text layers and detect patterns of frequent note-taking or marginalia. By layering all scanned pages, he can see the densest regions of ink across the entire manuscript. This reveals interesting “meta” features that might be lost when reading page by page.
- Links and Tools:
  - numpy.org
  - pandas.pydata.org
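The stacking idea can be sketched in a few lines: count, for every pixel position, on how many pages that position holds ink. The version below is plain Python for readability and assumes all pages are grayscale 2D lists of identical size, with an arbitrary darkness threshold of 128; the name `ink_density` is hypothetical. With NumPy arrays the same count is typically a thresholded comparison followed by a sum along the page axis.

```python
def ink_density(pages, threshold=128):
    """Count, per pixel position, how many pages are darker than `threshold` there.

    `pages` is a list of equally sized grayscale images (2D lists, 0 = black).
    """
    height, width = len(pages[0]), len(pages[0][0])
    density = [[0] * width for _ in range(height)]
    for page in pages:
        for y, row in enumerate(page):
            for x, value in enumerate(row):
                if value < threshold:
                    density[y][x] += 1
    return density


# Three toy 2x3 "pages": the top-left pixel always has ink,
# the bottom row only sometimes.
pages = [
    [[0, 255, 255], [255, 255, 0]],
    [[0, 255, 255], [255, 0, 0]],
    [[0, 255, 255], [255, 255, 255]],
]
print(ink_density(pages))  # [[3, 0, 0], [0, 1, 2]]
```

High counts mark the places where there is always ink and zeros mark the places where there never is, which is the "hills and valleys" picture described in the episode.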
Practical Career Advice for Scholar-Developers: A recurring theme is balancing time spent on coding with time spent on traditional research outputs like publications. Though the humanities often prize published text over software, open-source contributions and bridging the technical gap can drastically enhance research impact. Cornelis encourages writing and citing your own code as an important scholarly output.
- Links and Tools:
  - Journal of Open Source Software
Digital Literacy and the Future of Manuscript Research: Cornelis believes that the next generation of humanities scholars must become digitally literate to keep pace with growing digital archives. The question is no longer whether to use technology in research but how effectively you can adopt it. Tools like Python give researchers agency to explore new types of historical data and previously untapped relationships among manuscripts.
- Links and Tools:
  - Among Digitized Manuscripts (GitHub Repo)
  - digitalorientalist.com

Interesting Quotes and Stories

"You’ve often heard me talk about Python as a superpower. It can amplify whatever you're interested in or what you've specialized in for your career." - Host

"I also noticed that I reach for it faster and faster. More quickly, I think Python has a good solution to my problem." - Cornelis van Lit

"All of a sudden, you see hills and valleys, places where there's always ink and places where there's never ink. So it's kind of like an x-ray of a manuscript." - Cornelis van Lit

Key Definitions and Terms

Digital Humanities: The intersection of computational methods and traditional humanities research, using software tools to analyze historical, literary, and cultural artifacts.
OCR (Optical Character Recognition): Technology that converts images of text into machine-readable text.
OpenCV: A popular open-source computer vision library used in Python for image processing and analysis.
Ex Libris: A label or stamp meaning “from the books of,” used to denote ownership in manuscripts.
Transparent Reactive Programming: A programming style where state and data flow are managed automatically, usually found in frameworks that reduce manual callback complexity.

Learning Resources

Python for Absolute Beginners: A foundational course to get you comfortable with Python’s syntax and core concepts.
Getting started with pytest: Testing is a great way to ensure your code remains trustworthy, essential when analyzing large collections of data.
Build An Audio AI App: While not directly humanities-focused, it demonstrates text analysis and AI integration with Python, which can parallel digital humanities approaches.
Among Digitized Manuscripts: Cornelis's upcoming handbook and toolkit offering code examples and conceptual guides for working with digitized manuscripts.

Overall Takeaway

This conversation shows how mastering even modest Python skills can dramatically expand the horizons of humanities research. Through coding, old manuscripts become rich datasets for pattern recognition, analysis, and discovery. Cornelis’s experience highlights that while bridging disciplines can be challenging, the rewards (new historical insights, more efficient workflows, and broader reach) make it well worth the effort.

Links from the show

Cornelis’ portfolio: lwcvl.com
Cornelis on Twitter: @LWCvL
Repo for Among Digitized Manuscripts: github.com/among
The Digital Orientalist: digitalorientalist.com
Keynote on ‘Getting Ready for the CV Revolution’: youtube.com
Go2Shell macOS App: zipzapmac.com
Episode #230 deep-dive: talkpython.fm/230
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript

00:00 You've often heard me talk about Python as a superpower.

00:02 It can amplify whatever you're interested in or what you've specialized in for your career.

00:07 This episode is an amazing example of this.

00:10 You'll meet Cornelis van Lit.

00:11 He's a scholar of medieval Islamic philosophy and works at Utrecht University in the Netherlands.

00:17 What he's doing with Python is pretty awesome.

00:19 Even if you aren't interested in digital humanities and that type of research,

00:22 the example set by Cornelis is a blueprint for bringing Python into your world and for those around you.

00:27 I think you'll enjoy this conversation.

00:29 This is Talk Python To Me, episode 230, recorded August 27th, 2019.

00:34 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

00:53 This is your host, Michael Kennedy.

00:55 Follow me on Twitter where I'm @mkennedy.

00:57 Keep up with the show and listen to past episodes at talkpython.fm.

01:01 And follow the show on Twitter via @talkpython.

01:03 This episode is brought to you by the podcast Command Line Heroes from Red Hat and Linode.

01:08 Please check out what they're offering during their segments.

01:11 It really helps support the show.

01:12 Hey, folks.

01:14 Before we get to the interview, I have some exciting news.

01:16 We've teamed up with Humble Bundle to launch a great bundle of Python educational goodness.

01:22 For a couple of weeks, you can get three of our courses along with great content from Real Python, PyBites, and many others for as little as just $1.

01:30 If you've been on the fence about trying one of our courses, here's a chance to get three of them along with a bunch of other great stuff.

01:36 Just visit talkpython.fm/HB2019.

01:41 That's HB2019.

01:43 And be sure to check it out before time runs out.

01:45 Now, let's get to that interview.

01:47 Cornelis, welcome to Talk Python To Me.

01:49 Thanks for having me.

01:50 It's great to have you here.

01:51 I know we're going to have a fun conversation talking about digital humanities, which, honestly, I didn't know a whole lot about before we started talking.

01:59 But it's really a cool intersection of, well, humanities and software.

02:03 That's right.

02:04 And it's only growing and growing.

02:06 And soon it will just consume the entirety of the humanities.

02:09 That's our goal.

02:10 There was this article recently written that Python is eating the world.

02:13 And it may be true.

02:15 We'll see about that as we go.

02:16 Before we get into all that, though, let's start with your story.

02:19 How did you get into programming in Python?

02:20 Well, this goes back many years.

02:22 It was actually, if you will believe it, the last year of my elementary school, I programmed the website of the school.

02:29 This is 1999.

02:31 So I just did it in Notepad.

02:32 That's amazing.

02:33 Yep.

02:33 Just HTML tags.

02:35 There weren't really any IDEs, at least not that I know of.

02:38 There was like FrontPage and Dreamweaver.

02:40 But, you know, that was kind of cheating in a way.

02:43 Yeah.

02:43 It was like you write in Word and then like you publish it as a web page.

02:46 That was weird, right?

02:47 Everybody knew that that was not the right way to do it.

02:50 That's right.

02:50 Yes, they did.

02:51 So, wow, that's really young.

02:53 How did you even get that opportunity to do that?

02:56 I don't know exactly how I got this book.

02:58 It was some sort of, you know, HTML for dummies kind of book.

03:01 And that's what I used.

03:02 And I guess I just showed it to one of the teachers and he was like, well, actually, this school doesn't even have a website yet.

03:08 So that's really cool.

03:09 After that, my dad gave me this brick of a book on Visual Basic, thinking that, well, okay, it has visual elements to it, you know, but that was way too hard.

03:19 And it's really so great to see that there's so many resources out there now for children to learn programming.

03:24 That's really cool.

03:25 It's really amazing.

03:26 I talk to people who, like, what their experience was who are our age and generally that age.

03:32 And it's like, well, I got a magazine and then I would type in, like, the C code into something.

03:38 And then I would make that run.

03:39 And that's how I learned programming.

03:40 And then I see my daughters and stuff.

03:43 And, you know, there's, like, adventure games where you program your way through dungeons and, you know, robots and all sorts of stuff.

03:51 And I'm like, wow, that is a long way from, you know, typing in, hello world, you know, 10, hello world, 20, go to 10.

03:57 That's right.

03:59 That's right.

04:00 But I guess I did feel very attracted to that kind of thing.

04:03 So, eventually, I learned of Flash, which is dying as we speak.

04:09 It's horrible.

04:10 So, I got in that when it was still developed by Macromedia, Flash 4.

04:14 And this was a really great combination for me to be both a designer and developer.

04:19 So, I did this for my daughters.

04:21 You can do really interesting stuff.

04:22 And, like, Flash is probably the whipping boy these days.

04:26 It gets beat up and is getting ostracized.

04:28 And that's probably a good thing.

04:29 But, you know, the timeframe that you're talking about learning it, it was really powerful.

04:33 And it was really unique.

04:35 And it, you know, we didn't have HTML5 and JavaScript that worked well everywhere.

04:39 And you could do interesting stuff with that.

04:41 A lot of things were done with it.

04:42 I've only kind of recently come back into programming.

04:45 And I was just stunned by what JavaScript had become.

04:50 You know, in 1999, you could do an alert box.

04:52 That was kind of it.

04:53 Yeah, exactly.

04:54 We can validate this text or whatever, yeah.

04:56 It's been such a huge change in that world.

04:59 And that was so recently, like 2016 or so, that's when I got sort of reintroduced to real programming.

05:06 And Python quickly came on my radar.

05:09 And I've been kind of incorporating it into my work as a Swiss army knife.

05:14 I use it for all kinds of things.

05:15 And I also noticed that I reach for it faster and faster.

05:20 More quickly, I think Python has a good solution to my problem.

05:24 Yeah, that's really interesting.

05:25 And when you say you're incorporating it, I know the stuff that you're doing that we're going to talk about.

05:28 It's deep and meaningful stuff.

05:31 It's not like a little quick little automation of like some Excel file or something.

05:35 There's real stuff that I think is really powerful programming that you're doing.

05:41 So maybe that's a good way to segue into what do you do day to day as your main job.

05:46 By trade and profession, I'm a scholar of Islamic studies.

05:50 So that means that I spend most of my time alone.

05:53 I do research.

05:54 I'm a postdoctoral researcher at Utrecht University.

05:57 And I've got my own project funded by the Dutch Research Council.

06:01 And this is really about philosophy from the 12th century, from the Islamic world up until could be the 19th century even.

06:10 And so I read a whole discussion of all kinds of people through these centuries

06:15 who are all talking. In my project specifically,

06:17 I work on the imagination.

06:18 And a lot of these texts are in printed editions, but some of them are in manuscripts.

06:24 So there's a lot of just sitting down with the text and reading them.

06:29 That's what my job is supposed to be.

06:31 But then came Python, basically.

06:34 Yeah, of course.

06:35 Well, I look at these manuscripts and I haven't read old Islamic ones because I don't read any of those languages.

06:41 But I do remember trying to read some of the mathematical ones from like Newton and stuff when I was studying it.

06:46 And these do not seem like writing that would easily be understood by computers.

06:53 It's not like neatly printed text or something like that, right?

06:57 It's handwritten, kind of one-off, unique stuff.

07:01 The calligraphy was really beautiful back then.

07:03 People could write really well.

07:04 But still, it doesn't seem like, at first blush,

07:08 a computer should be able to just take it on.

07:09 But working with manuscripts is a real skill.

07:12 And it's a real complicated skill that needs a lot of training for very specific periods, very specific types of hands, of scripts, basically.

07:22 And as far as what you just mentioned about, you called it calligraphy, and rightly so.

07:28 And just think of it.

07:29 Back then, when you wanted to put down knowledge, you just really had to take a pen, take ink, a paper, and then write it down by hand.

07:38 And the fastest way to do that is to not release your hand, not release your pen from the paper.

07:45 So that's why cursive is all connected, because that's simply the fastest way to put this down.

07:51 I'm looking at texts that are sometimes spanning like 600 folios of handwritten text.

07:57 It's amazing to think that somebody 500, 600 years ago did this.

08:03 It's really amazing, those old manuscripts.

08:05 So maybe before we get into any of the technical side of things, it might be worth just setting out some of the questions that you try to answer, right?

08:15 As part of your research, you know, put the tech aside for a minute.

08:18 Like, what are some of the things, some of the outcomes you're looking for and stuff?

08:22 In my real research, you're looking for, so the basic, sort of the very overarching thesis that I am approaching is the late medieval, early modern period in Islam, in the Islamic world.

08:35 Is that a period of sort of intellectual decay and sort of darkness, or is there actually much activity going on?

08:43 And the former has been argued for and also sort of made into a political argument and plays a significant role in geopolitical discussions right now.

08:53 So there's sort of the societal relevance of my research.

08:57 But then to do that, I really go down to the actual evidence that we have, the texts, and where other people kind of discarded, not looked at it.

09:08 I say, okay, let's look at what are these people talking about.

09:11 And I do that mainly through what are called commentary traditions.

09:14 This was a very much-used device in those centuries that you didn't really write a text of yourself, but you took a text from before, and then you copied it, and then you added comments to it.

09:25 So this way, you have a very structural, very secure way of knowing that these two people, even though they're separated by centuries and continents, they're interacting.

09:35 They're talking to each other somehow.

09:38 And right now, I'm looking at a commentary tradition of 140 commentaries.

09:42 So there's a lot of things to sort out, and usually they do not name the person that they're referring to.

09:49 So you have to do a lot of triangulations, basically, to get to understand what is the movement of the discussion over the centuries.

09:59 I see.

09:59 So you almost can study how, like, thought around this common idea has evolved over time, or one thinker influenced the other, something like this.

10:08 That's exactly what I'm after.

10:09 So it's also to sort of fight off essentialism or some sort of idea that they're all talking about the same.

10:16 No, usually there is a very subtle movement in the discussion.

10:21 And for me right now, this is about the imagination, about what does it mean that we can imagine?

10:27 Are imagined things real or are they not real?

10:31 And this is also then for these people placed in a religious context.

10:34 Does the imagination play a role in prophecies and, for example, mystical experiences, say?

10:41 Okay.

10:41 Yeah, that sounds really interesting.

10:42 And obviously, you need to understand the manuscripts and lots of them.

10:46 I think at this keynote, I don't remember where it was.

10:49 You'll have to maybe let everyone know.

10:51 And we'll link to it.

10:52 It's on YouTube, and it's really interesting.

10:54 You had talked about how there's a common set of papers and manuscripts and stuff that people have studied over and over and are recorded and understood.

11:04 But like 99% of the writings that people do just kind of vanish, right?

11:08 Yeah.

11:09 So in a manuscript world, when somebody thinks of a profound idea and he writes it down, he uses a pen, ink, paper to put it into writing.

11:19 Now you have, you know, the idea has a population of one.

11:23 Yeah.

11:24 And so you better hope that somebody comes along and says, oh, I'll take the time to copy it and, you know, copy it onto another piece of paper and then sort of distribute it from there.

11:34 So it's very different from the print world where in one swoop you can have, you know, a hundred or a thousand copies of a text and distribute it across a whole continent.

11:44 So this way, copying in the manuscript world is incredibly important.

11:49 So that's why I try to sort of encapsulate my methodology.

11:53 Instead of looking at all writings of one author, let's look at all authors of one text.

11:58 Yeah.

11:59 That's an interesting twist.

12:00 So one of the things I really like about your story and that I think will become apparent is this interesting juxtaposition of the very old and the very new, right?

12:10 Like we're talking about applying artificial intelligence to understand manuscripts from 900 years ago or something like this, right?

12:18 These are both extremely cutting edge and like we don't even really use manuscripts in modern day things and bringing them together.

12:26 And I think another interesting one is that you said that you are a friar of the order of preachers, which I think is pretty interesting.

12:35 And also I think it's just another interesting aspect to what you're doing around technology.

12:41 Well, yes.

12:41 The order of preachers, also known as the Dominicans.

12:44 It's a religious order of the Catholic Church founded in the 13th century.

12:49 And in fact, I stand in a long tradition of Dominicans who have introduced technology and have sort of thought about what technology means. When the printing press came along, the Dominicans were there.

13:01 And now also in the digital world, there is a network of friars.

13:07 It's called OPTIC.

13:09 And they think a lot about the ethical aspects of sort of the digitization of our lives.

13:16 What does that mean to us?

13:17 How can we still be – what does it mean to be a human in a digital world really?

13:21 So kind of my work is sort of close to that.

13:24 But of course, I also have to sort of smile sometimes when I – late at night, I'm coding and then I look down and I notice that I'm wearing my habit.

13:36 Yeah, it's a very interesting juxtaposition.

13:39 And I think there probably is not exactly the same but in a similar sense, a kind of a culture clash when you think about using computer technology and computer programming to do digital humanities.

13:53 So maybe let's try to define that term broadly because I think we've defined it – started to define it for what you're doing.

14:01 But in a broad sense, what is digital humanities?

14:04 Yes, it's a term that refers to the use of computer technology in humanities research where humanities research relies on human artifacts.

14:15 That's basically the sort of the raw material of humanities research like architecture, art, texts, all of that.

14:23 And so usually our laboratory has always been the library and we're just, you know, stack up a whole bunch of books and do our thing.

14:31 And now we're seeing like, okay, how can we use computer technology?

14:35 How can we unleash that computing power that we know is literally at our fingertips?

14:39 And this is especially important because sort of unbeknownst to ourselves, so much of our workflow has already been coming into the digital world.

14:49 Of course, we don't write our books by hand anymore.

14:51 We write them on a computer.

14:53 And most of our journal articles, we take them from online databases.

14:58 But a lot of the issues with it have not been thought out exactly.

15:03 And the people who have sort of gone into this direction usually have gone all the way, so to say.

15:11 They have really made digital humanities into a field of its own.

15:14 And their main purpose now is to really push the technological boundaries to see what kind of new technology could be possibly applied to the humanities.

15:25 But without really coming back with sort of real results for other colleagues in the humanities.

15:30 So that sort of caused a rift or a divide between what you can then call classical humanities and digital.

15:37 Yeah, I can definitely see that.

15:38 You know, you can just keep going down the computer side of things.

15:42 But, you know, here you are trying to answer these very traditional sort of philosophical questions using source material that's, like I said, 900 years old, written by hand.

15:53 And yet you're bringing in some really cool technology to do it.

15:56 Maybe let's just set the stage quick by talking about some of the things you're actually studying, like these scrolls with their flaps and so on.

16:05 And some of the technology that you're applying to it.

16:07 Right, yeah.

16:08 So I think sort of the most interesting aspects of my Python work applied to manuscript studies has been the automated image analysis of codices.

16:19 And this comes because we have digitized so many manuscripts.

16:23 Actually, we didn't mention this before, but this is really a key element to all of this.

16:30 Libraries around the world have digitized thousands, hundreds of thousands of manuscripts, meaning they take photos of the entire manuscript of every page.

16:38 They take a photo.

16:39 And now you have a folder with like 300 images.

16:43 And so I took, instead of one folder, I took like a whole collection of folders.

16:48 I had at my disposal 2,500 manuscripts digitized.

16:53 And so I took only the first image, which is of the cover of the binding, right?

16:58 Right.

16:58 It's like kind of a leather wrapper of it.

17:01 And it has like a foldable flap.

17:02 And it sounded to me like knowing some details about that would tell you about maybe the origin or the timing or something of that nature.

17:12 Exactly.

17:12 So in the Islamic world, these manuscripts always have a flap that goes back onto the front to really encapsulate the entire codex, the entire manuscript.

17:22 And I thought it would be interesting to analyze the shape of that codex and particularly the angle that the flap makes, which is a thing that this is truly then a new question.

17:34 It shows that digital humanities can ask these new questions.

17:37 There's nobody thought of measuring the angle of the flap before.

17:41 And if you can do it for one, then you can just iterate it over.

17:46 It's a for.

17:47 Yeah, if you can do it for one, it's a for loop.

17:50 Exactly.

17:50 That's the beauty of it.

17:52 And yeah, it gives you another data point, right?

17:54 If you think of it as data point, then it gives you another, perhaps a small argument if, say, all manuscripts from the 17th century have this kind of angle.

18:06 And now you have an undated manuscript that has kind of that angle on the flap.

18:10 Well, that's another argument in favor for dating it back to the 17th century.

18:15 This portion of Talk Python To Me is brought to you by Command Line Heroes.

18:19 Season 3 of Command Line Heroes is all about programming languages.

18:22 Episode 7 covers the history of AI, starting with its first language, Lisp.

18:27 It was created to teach machines how to learn like humans.

18:31 And at the start, there was a wave of interest and investment.

18:34 But ultimately, the timing and hardware wasn't quite right for AI to work.

18:38 Decades later, AI is now on the verge of changing everything.

18:43 Get the full story and subscribe wherever you get your podcasts or just visit talkpython.fm/heroes.

18:49 Just to give folks who have not watched your keynote and probably, you know, visually it's a little hard to understand.

18:57 Think of like a folded out sheet of paper where there's like a triangle bit, maybe some kind of fastener on the end of it, and then like a big rectangle.

19:06 And you're using OpenCV and Python and various things to like understand like that last triangle part, right?

19:14 That's right.

19:14 And once I had completed it, basically, once I really had it figured out, I understood that this is a really good example of applying technology to humanities research.

19:26 Because OpenCV and Python are now sort of easy enough to use to learn it as you go, basically.

19:33 And especially, this is especially a good thing about OpenCV.

19:37 It's a very powerful library.

19:39 And that I'm able to use it already now shows actually just how highly developed OpenCV is.

19:48 Right.

19:48 And it seems like the understanding that, like I have a picture and that I want to find this triangle in it.

19:53 That seems like a really hard problem.

19:55 But it sounds like the tools have come along to get really good, huh?

19:59 Yes.

20:00 But at the same time, as like the tools are relatively easy to use, you also see that now you give it a humanities problem.

20:07 And all of a sudden, like you're really kind of pushing the technology to its very limits.

20:12 Because OpenCV was developed for, you know, receipts, for automated analysis of receipts of chess games, of video feeds, of, you know, gas stations, like how many people are in the store.

20:24 Those are the kinds of relatively easy tasks for OpenCV. Apply that to manuscripts, to photos of manuscripts,

20:32 and all of a sudden it has a much harder time coming up with anything meaningful.

20:36 You have to take a lot of in-between steps to reduce it, to reduce the image in a way and shape it in a way that you get out of it what you want.

20:45 Yeah.

20:45 I can definitely imagine that.

20:48 And, you know, people should watch the videos and just see some of those manuscripts.

20:51 Because it, like I said at the beginning, it doesn't seem at first glance that like, oh, I could just take that image and point that at some sort of algorithm and get super meaningful stuff out.

21:01 But we're going to dig into some of the cool things that you did do.

21:04 Before we move on from this kind of introductory stuff, though, I do want to circle back just a little bit.

21:09 So it sounds like in this digital humanities world, there's kind of like this scientific philosophy of trying to understand stuff like sort of in a computational way.

21:18 And then there's the humanities part where, you know, it's sort of more traditional studies.

21:24 And you said there's a couple of possible paths organizations and academics and stuff might take.

21:29 Like one would be to like build up a team of some researchers and some programmers or, you know, maybe you could teach yourself or things like that.

21:37 Do you want to kind of work us through that?

21:39 Yes.

21:39 So this starts from the observation that there's what's been called the two cultures problem, sort of the sciences versus the arts.

21:48 And there's really a paradigm difference between them. Goethe has this very good sort of summary of it.

21:54 He says in one of his books, he says, gray is all theory, but green is the golden tree of life.

22:00 With OpenCV, this is all too real because the first thing you do with a color image is reduce it to a gray scale.

22:06 So people coming from the sciences, trying to work on humanities problems, all too often think way too quickly that they can rely on a calculation and then be very sure about the outcome of it.

22:22 And sort of their assurance of their methodology trips them up when it comes to humanities problems, which don't always have a yes or no answer to them.

22:31 But the humanities, then, their really sort of deeply grounded problem is that they just can't really understand what technology could give them.

22:42 So oftentimes the things that they think should be so easy,

22:47 they're incredibly hard for technology.

22:49 When I talk about digitized manuscripts, obviously they're saying, oh, so like OCR, like you can read the text out of it.

22:57 No, that's not.

22:58 Have you seen this text?

23:01 Exactly.

23:02 No, that's really interesting.

23:04 You know, honestly, I can see that that community would have that perspective.

23:08 But at the same time, a lot of people do who are not actual programmers.

23:12 A lot of things when you're doing programming, like that must be hard.

23:15 You're like, no, actually, that's like three lines of code.

23:17 And they think this other thing is really easy.

23:19 And you're like, no, that's like two more weeks of programming.

23:22 That's really hard.

23:23 That's the thing.

23:24 You're just, I know it looks like a small step.

23:26 It is not a small step, right?

23:27 It's just how software is.

23:29 Exactly.

23:30 Yes.

23:30 It's like there's a whole research team working on it for five years now.

23:34 They haven't gotten it yet.

23:35 Exactly.

23:36 You're getting close.

23:37 It sort of brings this huge divide between the two, right?

23:40 And so for most purposes, people try to solve this like, okay, let's work in teams with an engineer and a humanities scholar.

23:47 But then you get this sort of Babylonian problem of that.

23:51 You know, it's very hard for them to talk to each other.

23:54 And so I am much more in favor of learning the tech yourself, which I think is really doable at this particular point in time.

24:02 The tech is high level enough.

24:05 And it also gives you very fast return on investment.

24:08 You can really learn on the job, so to say.

24:11 And so I would really like to see people sort of occupy that middle space in between the humanities and the digital.

24:17 Right.

24:18 Without completely sort of abandoning that side, the humanities side and just going, I'm going to become a programmer, but sort of continue doing your work, but actually gain some of these digital skills.

24:26 That's right.

24:27 And the more I thought about this and the more I wrote about this on my website, digitalorientalist.com, I thought, okay, actually, it's not just one middle position, but it's kind of we should think of it as a spectrum.

24:39 You know, where are you on the DH spectrum?

24:42 That's how I like to think about it.

24:44 And I sort of distinguished six archetypes in the spectrum.

24:47 You saw it in the keynote.

24:48 Yeah, yeah, yeah.

24:49 So you started out with the believer who is somebody who knows, like, actually, I've seen what you've done.

24:57 I agree with it.

24:58 This is amazing.

24:59 Let's go down that path, right?

25:00 Those who are really doing programming themselves, who are kind of 100% of their time doing programming as part of their humanities research somehow.

25:08 Yeah.

25:08 And just to the left of them, you have the sour one, which are also programmers, you know, full on.

25:15 But they really, they want to share their newfound, you know, things, their newfound results.

25:21 They want to share it so much with the rest of humanities.

25:24 But a lot of colleagues, you know, they find it very hard to understand, just as we have also found it hard to talk about this example that we discussed about analyzing the codices, the angle of it.

25:36 And so they've become really turned off.

25:38 And I've had these Skype conversations with people where I was reaching out to them, very enthusiastic.

25:44 And I quickly realized that I was asking questions that they heard like a million times and they thought I was up to no good or something.

25:51 And so very quickly, these conversations escalate in a way that's really not necessary.

25:57 Yeah.

25:57 I wonder about the sour one category.

26:00 I feel like if you had tried to solve these problems 10 years ago in the same way that you're solving them now, it might have been much harder and much more fruitless.

26:11 Right. Like before OpenCV and simple Python packages and stuff, you might be like, well, we tried this, but we're not like a tech company.

26:19 We're not going to be able to do this.

26:21 So it's never going to work.

26:22 And they just maybe didn't come back or something.

26:24 I don't know.

26:25 What do you think?

26:25 Oh, yes, there's there's definitely part of that.

26:27 Or they just slaved and labored for years and years and sort of got the same result that I now got in, you know, with like you said, like just a couple of lines of code, you know.

26:37 Yeah.

26:37 Like, yeah, this is how we're supposed to do this.

26:40 I mean, that's just a challenge of things moving so fast, moving on.

26:44 All right.

26:44 So the next one of your archetype is the spider.

26:46 Right.

26:47 And so the spider is usually a full professor who is the leader of a research team.

26:53 They have this magic ability to attract grants.

26:56 It's a very useful skill.

26:57 Very strong currency in academics.

26:59 Like papers and you have grants.

27:01 These are the two.

27:01 Yes.

27:02 Yes.

27:03 Money is hard to come by.

27:05 And, you know, these professors, they wouldn't program themselves, but they have sort of reading knowledge.

27:10 And they understand at least the concepts of it.

27:13 And but they have the ability to collect these people around them, put them in connection and let the people who are able to do it, you know, really shine.

27:22 So I think a lot of good can come from the spider.

27:25 Yeah, absolutely.

27:25 I agree.

27:26 I've seen that in other contexts, although I didn't call them that.

27:29 But yeah, I think that's a very positive place to be in academics.

27:32 Absolutely.

27:33 Yes.

27:34 So then you have like a team structure that actually functions quite well because there's not a big divide between engineers and scholars, researchers of the humanities, but it's integrated into one basically.

27:47 Now, sort of a smaller scale of that you have in what I call the blind and the lame.

27:53 So now you don't have a team, but you just have two persons.

27:56 And the lame one is the professor again, but maybe like an associate professor or somebody who has heard like, oh, DH is all the fashion.

28:07 So I need to do something with this.

28:09 But I have absolutely no idea.

28:11 So I'm going to hire a student or an engineer to do the work for me.

28:15 And this is definitely headed nowhere, really, because it's so hard to talk to each other, to come to a meaningful conclusion.

28:22 And even if you have everything kind of working, then usually the result is like, well, how representative is this for the claim that the data is trying to make?

28:35 Or the other way around: if it corresponds with the erudition of scholars of many generations before, is this just proving the erudition, or is the erudition actually proving the software?

28:47 Like is the software now validated because it's in accordance with the opinions of scholars before?

28:53 Yeah, of course.

28:54 So what I think is really the best place to be in the DH spectrum is to reduce that two-person team into one person.

29:02 So I call that the centaur.

29:05 Yeah, that's the person, or the creature, whose top half is a human and the bottom half is a horse, right?

29:11 That's right.

29:12 So it's got still a human head, a humanistic head.

29:15 But the legs are already digitizing basically, able to move by him or herself in the digital sphere, making use of digital tools.

29:24 I see that with PhD students now.

29:27 This is now kind of the generation who's really investing real time into acquiring these kinds of skills.

29:34 What is your experience with people who are in the humanities?

29:37 So they probably were not super drawn to like computer science type of subjects.

29:42 What is their experience when you say, hey, you know, it really makes sense for you to learn a little bit of programming or to explore this technical side of things?

29:50 Are they excited?

29:52 Are they resistant?

29:52 Are they nervous?

29:54 It really varies, because even just putting numbers in your article or your book already draws comments like, what are you trying to make me do here?

30:05 Mathematics?

30:05 Like, what are these numbers for?

30:09 Is this a graph?

30:10 Are you kidding me?

30:10 Is this a graph?

30:11 Did that come from Plotly?

30:12 Yeah.

30:13 It really is challenging because even now, a lot of students in the humanities, they kind of use the computer just as a typewriter in a way.

30:24 And, you know, I'm also really kind of concerned about the digital literacy of our society in general, as our technology is

30:32 all so polished.

30:34 You know what I mean?

30:35 Like Facebook and all these websites.

30:37 All the mobile apps.

30:39 Yeah, they look so good and smooth.

30:41 It just works.

30:42 Yeah, I get that.

30:43 That's a great, you know, marketing statement.

30:45 But it also makes you not really think twice about how it works and why it works that way.

30:51 So you really have to show them immediate benefits.

30:55 And it has to be low-hanging fruit.

30:57 You can't force them to like, okay, well, just learn Python for half a year and then come back.

31:02 It has to be a very, very quick return on investment.

31:06 Yeah, and that's probably why Python is working well.

31:08 In this situation, right?

31:10 Because you can be productive with a partial understanding of Python.

31:13 You never have to know what a function, a class, a generator, a database, any of these things are.

31:18 And you can still have outcomes, right?

31:21 Just top to bottom.

31:22 It does this.

31:23 Here comes the analysis or whatever.

31:24 That's right.

31:25 That's right.

31:26 And to sort of to beef that up, I recommend to people, well, just listen to podcasts like this one.

31:31 And just do that in your spare time when you're exercising or whatnot.

31:36 You don't have to understand it.

31:37 That's not the point.

31:38 Just hear the words they're saying.

31:40 And if words are repeated and it sounds interesting, it's maybe an interesting context.

31:45 Then maybe then look it up and see what it is.

31:48 You know, it's really interesting.

31:49 When I first started this podcast like four and a half years ago, my expectation of who my audience member was, was a dedicated expert.

31:58 Somebody who loves Python and programming enough that not only do they do it as their hobby or their job, but they also do it in their spare time and they listen to the background stories of it, right?

32:08 And what I found out is that many of the listeners, maybe over half, are like beginners who are more trying to do a tech immersion type of experience, right?

32:18 They don't, they'll write me and say, hey, I love your show.

32:21 It's really entertaining.

32:22 A lot of stuff you're talking about doesn't make sense.

32:24 But after a couple of months, a lot more of it is making sense.

32:28 And I know what it means now.

32:29 And it's really wonderful.

32:30 And I just never expected it.

32:32 But it's kind of like a language immersion almost.

32:36 To all those people, you're doing it the right way.

32:39 This is the way to do it.

32:41 Just keep listening.

32:41 Yeah, I totally agree the more that I think about it.

32:44 Yeah, so your pitch or your idea is that this centaur, this person who is both doing the work, but also becoming this programmer, adding a little bit of like superpower.

32:54 To what you're doing, as I like to put it, is sort of the recommended path.

32:58 And that makes a lot of sense.

32:59 I definitely agree with that one.

33:00 Now, let's dig into some of the technology that you're using.

33:03 Some of the problems you're solving and how you're solving them are really interesting.

33:09 So we talked a little bit about this flap thing and understanding its angle and its orientation.

33:14 But there are also other things that you're looking at that you're going to have to guide me through a little here, because I don't know a ton about them.

33:20 But there's like these stamps, like these seals, these decorative seals.

33:23 And those are like kind of like a signature or proof of authenticity of various authors and thinkers and stuff.

33:31 And so they're all over these manuscripts in very ornate ways.

33:34 And so you were trying to like discover those and also trying to actually digitize the calligraphy that I talked about.

33:42 So maybe talk us through some of the libraries you're using and just this whole analysis that you're doing, because it's really deep, like I said.

33:48 Yeah, I'd love to.

33:49 I'd love to.

33:50 Like, let's start with the stamps and seals.

33:52 So these are small little imprints that maybe some of your listeners will know what an ex libris is.

33:58 It's basically a sticker or a stamp that you place in your book, and it just says ex libris from the books of, meaning I own this book.

34:06 This book is mine.

34:08 And people in the past did that all the time, and they usually had quite lovely stamps that they placed in their manuscript to say, like, this is my manuscript.

34:17 Now, if you have a manuscript that originated in the 12th century, and it was sold often, and now it ends up in a research library, you might just have a whole history of ownership in the manuscript.

34:32 Isn't that amazing?

34:34 So you can actually trace how the manuscript traveled, not only through time, but through space.

34:39 You can see, like, oh, it was in Tabriz in the 14th century.

34:42 It was in Istanbul in the 16th century.

34:44 It was in Morocco in the 18th century.

34:47 It's like the blockchain of manuscripts.

34:49 I don't know too much about blockchain.

34:52 I'm just kidding.

34:53 Keep going.

34:54 And so I said before that OCR, that's kind of out of the question for now for manuscripts.

34:59 Yes, people are working on handwriting recognition, but it's not at any production level.

35:05 But these stamps, they are kind of ideal for OCR kind of recognition, you know.

35:11 And so, again, OpenCV is so powerful.

35:14 You have to find a very pristine example of the seal, and then, well, you give it to OpenCV, and then you just give a whole bunch of images, and you say to OpenCV, try, find me this image, basically.

35:27 And it will find seals, imprints that have a huge chunk out of them, and it will still detect them very well.
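
The search described here (one clean seal, many page images, find it even when part is missing) can be sketched as a sliding-window comparison. This is a simplified stand-in in plain Python on tiny binary grids, not the code from the project. OpenCV has ready-made routines for this kind of search (template matching via `cv2.matchTemplate` is one), and the transcript does not say which one was used. The names `match_score` and `find_seal` and the toy data are assumptions.

```python
def match_score(page, template, top, left):
    """Fraction of template pixels that agree with the page window at (top, left)."""
    height, width = len(template), len(template[0])
    agree = sum(
        page[top + y][left + x] == template[y][x]
        for y in range(height)
        for x in range(width)
    )
    return agree / (height * width)


def find_seal(page, template):
    """Slide the template over every position and return (best_score, top, left)."""
    height, width = len(template), len(template[0])
    candidates = (
        (match_score(page, template, top, left), top, left)
        for top in range(len(page) - height + 1)
        for left in range(len(page[0]) - width + 1)
    )
    return max(candidates, key=lambda candidate: candidate[0])


# A pristine ring-shaped "seal" (1 = ink).
template = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]

# A page where the same seal sits at row 1, column 2 with one corner missing.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

score, top, left = find_seal(page, template)
print(score, top, left)
```

Because the score is a fraction rather than an exact match, a damaged imprint still wins as long as most of it survives. Real photos also vary in scale and rotation, which this sketch ignores.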

35:34 So now you can imagine if you scale this up, now if you have a couple of thousand of manuscripts that kind of belong to each other, but you don't know exactly how, now you can all of a sudden, with this kind of code, you can reconstruct the collection of an owner from the 17th century.

35:51 You can say all of these manuscripts actually belong to the same owner.

35:54 They all have the same seal.

35:55 Yeah, so you can ask really interesting questions like, show me all the manuscripts that this person or this scholar has ever owned and that were then edited by this other person.

36:07 Or something like that. These questions, over many thousands of permutations, sound really difficult, but now all of a sudden it's a, you know, a sub-millisecond database query.
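
To give a feel for the kind of query meant here, the following sketch stores detected seals as ownership records in an in-memory SQLite database and asks two questions of it. Everything in it is made up for illustration: the table layout, the manuscript and owner labels, and the helper `manuscripts_passed_between`. Cornelis notes that such a database is not a reality yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ownership (manuscript TEXT, owner TEXT, place TEXT, century INTEGER)"
)

# Hypothetical records: one row per seal found in a manuscript.
rows = [
    ("MS A", "Owner X", "Tabriz", 14),
    ("MS A", "Owner Y", "Istanbul", 16),
    ("MS B", "Owner X", "Tabriz", 14),
    ("MS B", "Owner Z", "Morocco", 18),
    ("MS C", "Owner Y", "Istanbul", 16),
]
conn.executemany("INSERT INTO ownership VALUES (?, ?, ?, ?)", rows)


def manuscripts_passed_between(conn, first_owner, later_owner):
    """Manuscripts owned by first_owner and, in a later century, by later_owner."""
    query = """
        SELECT a.manuscript
        FROM ownership AS a
        JOIN ownership AS b ON a.manuscript = b.manuscript
        WHERE a.owner = ? AND b.owner = ? AND a.century < b.century
        ORDER BY a.manuscript
    """
    return [row[0] for row in conn.execute(query, (first_owner, later_owner))]


# Reconstruct one owner's collection: every manuscript carrying that seal.
collection_x = [
    row[0]
    for row in conn.execute(
        "SELECT manuscript FROM ownership WHERE owner = ? ORDER BY manuscript",
        ("Owner X",),
    )
]
print(collection_x)
print(manuscripts_passed_between(conn, "Owner X", "Owner Y"))
```

The self-join is what turns "owned by one person, later by another" into a single query instead of a manual comparison across thousands of catalog entries.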

36:18 This is not a reality yet, but this is what I'm working towards, and it will undoubtedly lay bare all kinds of interesting things, because then you can really see what's not spoken of, what's just kind of in the evidence.

36:33 But it doesn't make it sense.

36:35 These invisible connections that nobody has yet actually highlighted, right?

36:39 Like, all of a sudden you find that a particular kind of type of text, a text of logic, they all came through a very religious place of some kind.

36:49 Or you see that, you know, two people were constantly buying and selling from each other.

36:55 Those kinds of things you can now quite easily lay bare.

36:58 That's one thing.

37:00 Another thing is, what if you've taken, so you have 300 photos of a manuscript, meaning 300 photos in which you see a page spread.

37:08 You see two pages, basically the book opens to two pages.

37:12 And then the next photo is one page is flipped over, and so you see, you know, the next two pages, right?

37:18 What if you make it so that you only extract the ink?

37:22 You kind of have OpenCV and NumPy.

37:26 You're going to have to use NumPy for this as well.

37:28 You sort of lock on to the ink.

37:30 You delete everything.

37:32 And now you stack those layers, those very tiny sort of wafer-thin layers of ink.

37:39 You stack them on top of each other.

37:40 So now you've got 300 layers on top of each other.

37:44 And now all of a sudden you see hills and valleys, places where there's always ink and places where there's never ink.

37:51 So it's kind of like an x-ray of a manuscript.

37:54 And so the result is you see where in the manuscript there's ink, but in a way that it represents the entire manuscript, right?

38:04 It's not a representation of one page.

38:06 It's a representation of all pages together.

38:09 And so, for example, very clearly then the text block becomes visible because obviously back then they were very neat writers.

38:17 They always began at the very same place and the lining is very clear.
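
The stacking idea can be sketched in a few lines: each page becomes a binary ink mask, the masks are added up, and the positions that carry ink on most pages outline the text block. The sketch uses plain Python lists and three tiny made-up pages so it runs anywhere; with real photos, NumPy arrays would do the same addition much faster. The function names and the 50% cutoff are assumptions for illustration.

```python
def stack_ink(masks):
    """Add binary ink masks pixel by pixel; each count is the number of pages with ink there."""
    height, width = len(masks[0]), len(masks[0][0])
    heat = [[0] * width for _ in range(height)]
    for mask in masks:
        for y in range(height):
            for x in range(width):
                heat[y][x] += mask[y][x]
    return heat


def text_block(heat, page_count, min_share=0.5):
    """Bounding box (top, left, bottom, right) of pixels inked on at least min_share of pages."""
    hits = [
        (y, x)
        for y, row in enumerate(heat)
        for x, count in enumerate(row)
        if count / page_count >= min_share
    ]
    ys = [y for y, _ in hits]
    xs = [x for _, x in hits]
    return min(ys), min(xs), max(ys), max(xs)


# Three hypothetical pages (1 = ink). Page 1 has a note in the bottom-right
# margin and page 3 has a stray mark in the top-left corner.
pages = [
    [[0, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 1, 1, 0, 0], [0, 0, 0, 0, 1]],
    [[0, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 1, 0, 1, 0], [0, 0, 0, 0, 0]],
    [[1, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 1, 1, 1, 0], [0, 0, 0, 0, 0]],
]

heat = stack_ink(pages)
block = text_block(heat, page_count=len(pages))
print(heat)
print(block)
```

The one-off marks in the margins stay in the "valleys" of the heat map, so they do not distort the estimated text block.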

38:23 Yeah, that's super cool.

38:25 And you even show how in order to get some of the OCR stuff to work better, you might use OpenCV to go and delete some of the extraneous punctuation and characters and things like commas and whatnot, right?

38:39 Yeah, so now we're moving to 19th century prints.

38:42 In the 19th century, the printing press did come along in the Islamic world and they started to print.

38:48 They were very late to it.

38:49 But in the very beginning, these printed productions were quite poor in a way.

38:54 So if you just use OCR, say Tesseract, you get rubbish basically.

38:58 Is that from Google?

38:59 It's now by Google, yes.

39:00 I believe, you know, that's amazing about these things.

39:03 You know, I'm just a user of technology in a way.

39:05 And so I stumble upon a new library.

39:07 And then, you know, you quickly look into it.

39:10 It's like, oh, it's been developing for the last 20 years.

39:12 Yeah, yeah.

39:13 I just didn't know about it.

39:14 Yeah, I haven't really heard of it.

39:15 So I think it originated at HP, if I'm correct.

39:17 So it's one of the leading OCR packages.

39:20 There are other ones.

39:21 But this one works, whereas others, I can't get them to work.

39:25 So I can't invest more time in it to make it work.

39:29 Cool.

39:29 So you're basically going through and you'll like clean, you'll use OpenCV to actually clean

39:33 up and purify the stuff before you send it off to these OCR places.

39:37 So they actually have a chance of success, right?

39:39 Yes.

39:40 And really, the results are amazing, you know, because when you want to just extract the plain

39:46 text, you don't want commas because commas have been inserted by the editors anyway.

39:51 They're probably wrong.

39:51 You don't want page numbers because you can insert them digitally.

39:54 You know, you've got that under control.

39:56 So you remove all of that, you get the plain text, and then Tesseract

40:00 does a really wonderful job.
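
One plausible way to strip small marks such as commas before OCR is to find connected blobs of ink and erase the ones below a size limit. The sketch below does that on a tiny binary grid in plain Python. How the cleanup was actually done in the project is not stated in the episode, so the approach, the name `remove_small_blobs`, and the size limit are all assumptions.

```python
def remove_small_blobs(mask, min_pixels):
    """Erase connected groups of ink pixels smaller than min_pixels (4-connectivity)."""
    height, width = len(mask), len(mask[0])
    cleaned = [row[:] for row in mask]
    seen = set()
    for y in range(height):
        for x in range(width):
            if mask[y][x] == 0 or (y, x) in seen:
                continue
            # Collect one blob with an explicit stack (iterative flood fill).
            blob, stack = [], [(y, x)]
            seen.add((y, x))
            while stack:
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            if len(blob) < min_pixels:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned


# A hypothetical snippet of a printed line: one large glyph and a two-pixel comma.
line = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]

cleaned = remove_small_blobs(line, min_pixels=3)
print(cleaned)
```

Size alone is a blunt filter: in Arabic script the dots that distinguish letters are also small blobs, so a real pipeline would need an extra rule (for example position relative to the line) to keep them.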

40:55 I don't know that people really appreciate how much work that is unless you see the original

41:01 manuscript.

41:01 Like I said, this is not a printed book, right?

41:03 This is annotated, handwritten stuff.

41:08 Yes.

41:09 I think the best way to approach this is to first find something that you can find.

41:15 Say, like I said, doing sort of the x-ray analysis, now you kind of know where text ought

41:21 to be.

41:22 And so then you can use that knowledge and come back to one image and analyze that one image

41:28 with that knowledge of where the text ought to be.

41:31 And so now if you find ink outside of those text blocks, then you know, oh, there's a

41:36 marginal note.

41:36 This might be interesting.
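
Once the text block is known, spotting marginal notes comes down to asking which ink pixels on a single page fall outside it. A minimal sketch, with the text block hard-coded as an assumed result of an earlier whole-manuscript analysis and a made-up page:

```python
def ink_outside_block(mask, block):
    """Return (row, col) of every ink pixel that falls outside the text block."""
    top, left, bottom, right = block
    return [
        (y, x)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value == 1 and not (top <= y <= bottom and left <= x <= right)
    ]


# Assumed text block as (top, left, bottom, right), inclusive.
block = (1, 1, 2, 3)

# Hypothetical page mask (1 = ink) with one mark in the bottom-right margin.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
]

marginal = ink_outside_block(page, block)
print(marginal)
```

The returned coordinates can then be cropped and stored separately from the main text, which matches the point that marginal notes should be captured on their own.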

41:38 Right.

41:38 Maybe that's the most interesting part, right?

41:40 But you want to capture it separately.

41:41 You can't just like try to cram it together or whatever.

41:44 Yeah.

41:44 How interesting.

41:45 It's got to be inspiring to a lot of people to just see all of this cool technology being

41:51 applied to stuff that I suspect it was not initially designed to be applied to.

41:56 And things that when I first look at it, like I don't really know how much technology is

42:00 going to directly answer these questions.

42:03 But this idea of saying like I'm going to create like an ownership chain in say a graph

42:08 database or something like that of these manuscripts and use that to explore these hidden

42:13 links.

42:14 That's really interesting.

42:14 I think it's unexpected and very cool.
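
The ownership chain described here is naturally a graph: owners are nodes and every change of hands is an edge. Before reaching for a graph database, the idea can be tried with a dictionary and a counter. The chains below are invented, and `transfer_counts` is a hypothetical helper; the sketch finds the pair of owners who traded with each other most often.

```python
from collections import Counter

# Hypothetical provenance chains: for each manuscript, its owners in order.
chains = {
    "MS A": ["Owner X", "Owner Y", "Owner X"],
    "MS B": ["Owner X", "Owner Y"],
    "MS C": ["Owner Z", "Owner Y"],
}


def transfer_counts(chains):
    """Count how often a manuscript passed between each unordered pair of owners."""
    counts = Counter()
    for owners in chains.values():
        # Pair each owner with the next one in the chain.
        for seller, buyer in zip(owners, owners[1:]):
            counts[frozenset((seller, buyer))] += 1
    return counts


counts = transfer_counts(chains)
busiest_pair, times = counts.most_common(1)[0]
print(sorted(busiest_pair), times)
```

Using `frozenset` makes the direction of the sale irrelevant, which fits the question of who was "constantly buying and selling from each other".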

42:18 Yes.

42:18 But there are a lot of challenges ahead of us.

42:21 Okay.

42:21 Yeah.

42:22 What are some of them?

42:22 To start with the data that we need is not accessible right now in the way that we would

42:28 like it.

42:29 Right now, digitized manuscripts are often just on a website of a university library.

42:35 You can only kind of look at them with your human eyes, right?

42:38 But you can't access them with an API.

42:41 So you can't just sort of programmatically bring them in.

42:45 So this is really a lost opportunity.

42:47 But, you know, I'm working on it.

42:48 I'm talking a lot with librarians.

42:50 So hopefully we'll see some of those there.

42:52 That's really cool.

42:53 You know, sometimes when there's not an API, there is.

42:57 It's just not intended.

42:58 Right.

42:59 Like there's Selenium, or there's Beautiful Soup and requests, right?

43:03 There's ways to go through and actually extract that data, even if it wasn't prepared to be

43:08 presented in that way.

43:09 You know, how much is that possible?

43:12 And how much are there like licensing or usage restrictions that make that kind of research

43:17 impossible?

43:18 Well, the sort of the licensing copyright issue is murky territory at the moment.

43:24 So this is also something that needs to be worked out.

43:27 But you're right.

43:28 I mean, this is, of course, possible.
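
As a rough illustration of pulling data out of a page that offers no API, the sketch below extracts image links from a piece of viewer HTML. It uses only the standard library's `html.parser` on a hard-coded sample, so no request is sent anywhere; Beautiful Soup and requests would be the more common choice for real pages. The HTML, the `example.org` address, and the class name `ImageLinkCollector` are placeholders, and, as discussed here, a library's terms of use should be checked before collecting anything at scale.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(urljoin(self.base_url, src))


# Hypothetical viewer page; example.org is a placeholder, not a real library site.
sample_html = """
<div class="viewer">
  <img src="/scans/ms-001/page-001.jpg" alt="folio 1">
  <img src="/scans/ms-001/page-002.jpg" alt="folio 2">
  <img alt="decorative spacer">
</div>
"""

collector = ImageLinkCollector("https://example.org/manuscripts/ms-001")
collector.feed(sample_html)
print(collector.image_urls)
```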

43:29 Yeah.

43:30 But you don't want to go through and like grab all this data and say, well, you can't use

43:33 any of it because you're going to be in trouble for X, Y and Z.

43:36 Right.

43:37 Yeah.

43:37 In these kinds of fields, the world is very small, you know, so you can't make enemies.

43:43 Yeah, of course.

43:44 You might be able to make proof of concepts, though.

43:46 Right.

43:47 Like maybe as a researcher, like a young grad student or something, you could say, well, if

43:52 I had this data, what questions could I ask and answer?

43:54 And then maybe do something and present it back to that organization.

43:58 Say like, look, if you would somehow provide us this information, like these are the types of

44:02 things that we can do.

44:03 We spent a week to show you. How can we help make that happen?

44:06 Right.

44:07 That's right.

44:07 I feel like I myself, at least I'm building up towards that.

44:11 I've already been saying these kinds of things to libraries.

44:15 And perhaps after this project, I might just be able to do like a full DH project, so to

44:22 say, and really come at this with full force.

44:25 That's the other challenge for us in the humanities.

44:29 Like we can't spend all of our time on programming.

44:33 We can't spend all of our time on these kinds of things.

44:35 So we have to divide our time and always make this difficult decision, like how much of my

44:41 personal development is going to go into this.

44:44 And if I stumble upon the problem that I can't just fix with Google and Stack Overflow, then...

44:51 Was it a dead end that like wasted your tenure track possibilities or something like that,

44:56 right?

44:58 So you don't want to chance that.

44:59 Yeah.

45:00 Yeah.

45:01 Yeah.

45:01 So in the harder sciences, we have the Journal of Open Source Software.

45:08 Are you familiar with this?

45:09 I wasn't until you showed this to me.

45:11 Yeah.

45:12 So I had these folks on the show a while ago and it's pretty interesting.

45:16 It's a place to publish and cite the software side of your research.

45:22 And I did say it was more for the hard sciences, but it sounds to me like the stuff that you all

45:27 are doing in your community would also be potentially something you could publish there and then cite as

45:33 some kind of publication and whatnot.

45:34 So maybe that's a little bit of a release valve to get a little bit more credit for the software side of

45:40 things.

45:40 Definitely.

45:41 Because our main challenge is that much of our exactly, like you said, tenure track career wise,

45:47 it kind of still depends on print publications and building software kind of doesn't count,

45:54 it seems, right?

45:55 Yes.

45:55 Very difficult for us.

45:57 Yeah.

45:58 But there have been discussions about this where people in digital humanities have been

46:03 saying, well, actually we're builders, we're not writers.

46:06 We should just come to terms.

46:08 Yeah.

46:08 It's not a unique problem to digital humanities, but it is, I can imagine, a little bit harder

46:13 there.

46:14 Yes.

46:14 Because let me ask you this, like the one very sort of one problem that presents itself is

46:20 that every piece of software is kind of unique.

46:23 So how are you going to compare it with other pieces of software?

46:26 How are you going to say like, oh, this is very good or this is not so good.

46:30 This is good enough to make tenure or this is bad enough to fire him.

46:35 Yeah.

46:36 I have no idea how to do that.

46:37 That is really tricky.

46:39 I see the problem.

46:40 Yeah.

46:41 Interesting.

46:42 Now, we're getting kind of short on time here, but I do want to give you a chance to talk about

46:47 your two new projects or your book.

46:50 You have a book called Among Digitized Manuscripts and another one, a project that you're thinking

46:55 about called Digital Literacy.

46:57 Do you want to touch on those real quick before we wrap things up?

46:59 Sure.

46:59 Thank you.

47:00 So over the last two years, I've been really focused on digitized manuscripts and how to

47:06 incorporate that into a workflow.

47:07 And out of it came this handbook, which I called Among Digitized Manuscripts.

47:12 You can go to github.com/among and you will find the repository and find some more

47:18 information about it.

47:19 It's basically a conceptual and practical toolkit for those who want to work with manuscripts

47:26 in a digital file format when they're looking at digital photos.

47:31 And so it's a very broad discussion about the concepts of it, about the challenges, but

47:36 also a whole range of tools.

47:38 So we're starting just with how to make vector images, which is a crucial skill to have.

47:45 If you can replicate glyphs from manuscripts in a vector format, you can manipulate them much

47:51 easier on a computer, of course.

47:53 And then we go up until a chapter that really introduces Python.

47:57 And I go through this example of measuring the angle of the flap and really go through

48:02 that entire code and explain it step by step what we're doing here.

48:07 All the while, of course, pointing out to people that if they want to really know more about

48:12 Python and JavaScript, which I also cover, then there are plenty of resources on the internet

48:19 or in other books available to do that.

48:21 So it's also to really to get people who are not yet in, who are not familiar with technology

48:28 to get them up and running.
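
The book's own code is not reproduced in the episode, but the geometric core of measuring an angle in a triangle is small. Assuming the three corner points of the flap have already been located in the image (the coordinates below are invented), the angle at the tip follows from the two edge vectors:

```python
import math


def angle_at(vertex, point_a, point_b):
    """Angle in degrees at `vertex` between the rays toward point_a and point_b."""
    ax, ay = point_a[0] - vertex[0], point_a[1] - vertex[1]
    bx, by = point_b[0] - vertex[0], point_b[1] - vertex[1]
    cross = ax * by - ay * bx  # proportional to the sine of the angle
    dot = ax * bx + ay * by    # proportional to the cosine of the angle
    return abs(math.degrees(math.atan2(cross, dot)))


# Hypothetical pixel coordinates of the flap's tip and its two other corners.
tip = (0, 0)
upper_corner = (4, 4)
lower_corner = (4, -4)

flap_angle = angle_at(tip, upper_corner, lower_corner)
print(flap_angle)
```

Using `atan2` on the cross and dot products avoids the rounding trouble that `acos` can run into for nearly straight or nearly closed angles.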

48:29 Okay.

48:30 Yeah.

48:30 That sounds really interesting.

48:31 And I'm sure it's a pretty unique book.

48:33 There's probably not that many books on digital manuscripts and Python.

48:37 So it sounds like it's going to be a good resource for people.

48:40 I don't want to pat myself on the back here.

48:42 Yes, it's a virgin field.

48:44 So, you know, I would say to others, you know, we really need to think through

48:50 these problems and push things further.

48:53 Yeah, absolutely.

48:54 And then the other one, digital literacy.

48:55 Well, yes.

48:56 This is sort of a concern of mine that I've had for many years.

49:00 And it's sort of part of the reason why I started digitalorientalist.com, a sort of online

49:05 magazine about how to use computers in your day to day workflow as somebody in Islamic studies

49:12 and in neighboring fields like Sinology, Japan studies, Africana, etc.

49:18 I kind of feel that a generation before me and perhaps also my generation, I'm in my early 30s.

49:24 We still grew up with computers as something that you had to kind of make work.

49:30 Otherwise, it wouldn't do anything.

49:32 And people even before that, they didn't grow up with computers.

49:36 They don't know what's going on basically.

49:38 But people younger than us, they're growing up now with all of this, you know, polished technology.

49:44 It's just a black box basically.

49:47 And I'm kind of worried that people don't really get to understand the foundational concepts of computers,

49:57 how knowledge is stored on a computer in bits and bytes.

50:00 It might not, you know, it might not come up on a daily basis.

50:04 But knowing this, I think, is very important in handling computers well and understanding

50:09 how computers are also used at scale, of course, by governments and whatnot.
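As an illustration of the "bits and bytes" point above, here is a minimal sketch of how a single character ends up stored as bytes. The helper name `to_bits` and the sample letters are assumptions chosen for this example; they are not from the episode. It assumes UTF-8, which is only one of several possible encodings.

```python
# Illustrative sketch: how text is stored as bytes, and bytes as bits.
# The function name and the sample characters are hypothetical examples.

def to_bits(text: str, encoding: str = "utf-8") -> list:
    """Return one 8-character bit string for each byte of the encoded text."""
    return [format(byte, "08b") for byte in text.encode(encoding)]


latin = "A"
arabic = "\u0628"  # Arabic letter beh

# A Latin letter fits in a single byte under UTF-8.
print(latin.encode("utf-8"), to_bits(latin))

# An Arabic letter needs two bytes under UTF-8.
print(arabic.encode("utf-8"), to_bits(arabic))
```

Seeing that the same "one letter" can take one byte or two is a small, concrete case of why encodings matter when working with texts in mixed scripts.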

50:13 Yeah.

50:14 Well, I definitely think that the world is absolutely full of consumers of technology.

50:21 But there's definitely room for more producers or creators with technology, right?

50:27 That's right.

50:28 I feel that in the humanities, we ought to be self-sufficient and self-reliant.

50:33 And if we can use some programming skills ourselves, we can already solve, you know, 90% of our problems.

50:41 Yeah, absolutely.

50:41 Yeah.

50:42 Programming is a superpower for whatever you're doing.

50:44 That's for sure.

50:45 All right.

50:46 Well, I think we're going to leave it there for the main topic because we're out of time.

50:50 But before you get out here, let me ask you the last two questions.

50:53 So if you're going to write some Python code, do some of this cool image analysis, what editor do you use?

50:59 I use PyCharm CE.

51:01 Yes.

51:01 Yeah.

51:02 I just like it.

51:04 I've been using now.

51:05 I've been getting into Jupyter Notebooks, actually, thanks to your podcast with the Azure Notebooks episode.

51:12 Because I'm working on a workshop in the fall for PhD students.

51:15 And I was not planning on installing Python and pip on 30 machines.

51:20 So this is the way to go for that thing.

51:23 Yes.

51:23 Yeah.

51:23 I think it's great for teachers and for running classes and workshops.

51:27 It's just open a browser, go here, you're ready.

51:29 Yeah.

51:30 Yeah.

51:30 It's not like, well, it won't install this package on my computer.

51:34 Like, oh, no.

51:35 Here we go.

51:35 So, yeah, that's great.

51:37 Excellent.

51:39 And then notable PyPI package.

51:41 It sounds like you've got some interesting experience and exposure to them.

51:45 Yes.

51:46 Well, I mentioned OpenCV, of course.

51:48 Something that you can manipulate PDFs is PyPDF2.

51:54 Something I've been using, in my experience, there's a lack of, or there's really a need for more libraries handling PDFs.

52:03 I don't know how it is in other fields, but in the humanities, PDFs are a very important file format.

52:09 A lot of information is contained just in PDFs.

52:12 So I hope we'll see more development in that.

52:16 For image, like drawing and manipulation, you're also going to need something like Pillow, which is a fork, I believe, of

52:23 the Python Imaging Library.

52:24 Beyond that, I'd also like to mention Text-Fabric.

52:28 You can pip install Text-Fabric.

52:31 And that's really just for analysis of a text that you have in plain text format.

52:36 I myself work with a developer to get the Quran in the Text-Fabric format, so to say.

52:42 And you can do real deep, like syntactic or semantic analysis.

52:47 And kind of just outside of sort of the Python ecosystem, I also use something.

52:55 It's called Pandoc.

52:56 And it's incredibly useful for me.

53:00 So it's a command line app.

53:02 You can brew install Pandoc.

53:04 And you can change from one file format to another.

53:09 And for example, in my case, when you have different kinds of scripts, like English and Arabic mixed in, and you've got footnotes and endnotes and all of that.

53:18 And you've got it in a Word doc.

53:20 And Word is horrible, of course, to do anything with.

53:23 And you want to go to just plain text.

53:26 Yeah.

53:27 It's amazing.

53:28 Yeah.

53:29 I hadn't heard about that, really.

53:31 I haven't used it anyway.

53:32 And it looks super cool.

53:33 Like, it'll convert to and from OpenOffice, Microsoft Doc, reStructuredText, Markdown, all kinds of stuff in here.

53:40 That's great.
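The Word-to-plain-text conversion described above can be sketched from Python by calling the Pandoc command-line tool. This is only a rough sketch under stated assumptions: Pandoc must already be installed and on the PATH, the file names `article.docx` and `article.txt` are hypothetical, and the helper names `pandoc_command` and `convert` are made up for this example. The `-t` (output format) and `-o` (output file) options are standard Pandoc flags.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(source: str, target: str, to_format: str = "plain") -> list:
    """Build the argument list for a Pandoc conversion (nothing is run here)."""
    return ["pandoc", source, "-t", to_format, "-o", target]


def convert(source: str, target: str, to_format: str = "plain") -> bool:
    """Run Pandoc if it is installed; return True when the conversion succeeds."""
    if shutil.which("pandoc") is None:
        print("pandoc was not found on PATH; install it first")
        return False
    result = subprocess.run(pandoc_command(source, target, to_format))
    return result.returncode == 0


if __name__ == "__main__":
    # 'article.docx' is a hypothetical input file used only for illustration.
    if Path("article.docx").exists():
        convert("article.docx", "article.txt")
```

Swapping `to_format` for another Pandoc output format such as `markdown` or `rst` would target the other formats mentioned in the conversation.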

53:41 Mm-hmm.

53:42 Mm-hmm.

53:43 Yeah.

53:43 And then also on our shared notes, you have this Mac app that I'm pretty excited to try out called Go2Shell.

53:48 You know, it's a good example.

53:50 Go2Shell.

53:51 So it's macOS only, and it sits in your Finder.

53:54 And if you're in a folder, just a little button at the top, like in the toolbar, you click it, and a terminal opens, and you're immediately in that folder.

54:03 Yeah.

54:03 In the terminal.

54:04 It's these kinds of things that are tremendously important for me as just a scholar who's, yes, I program, but I understand that some others would, you know, would sort of scoff at this and say, like, oh, no, the only right way to do that is to open a terminal and, you know, CD into it.

54:21 No, no, this is cool.

54:22 It's easy for me, you know?

54:23 Yeah, a lot of times you're in a terminal or a Finder, and you're like, I just want to go here, but how do I copy here?

54:29 There are some tricks that are absolutely not obvious.

54:32 Like, you can go to the file or the folder, either one really, but let's say the folder, and you can Command-C it, and then you can Command-V that into a terminal window.

54:41 But it's just like push a button.

54:43 It opens the terminal in the right place.

54:45 That's cool.

54:46 I'm going to check this out.

54:46 There you go.

54:47 Yeah.

54:48 All right.

54:48 Well, that's pretty much it for our time.

54:51 Final call to action.

54:52 Maybe speak to the folks out there doing humanities research or maybe not even using the digital side of things yet.

54:59 How do they get started?

55:00 Exactly.

55:01 Well, so if you're listening in the humanities and you want to, you're interested, you want to, you want to go for it, just, you know, keep listening to podcasts like this.

55:10 Pick up my book if it comes out later this year and come in contact with me.

55:16 You know, there's sort of a network at digitalorientalist.com where you can also share your experience.

55:22 I'd be very happy to help you get along.

55:26 And this also sort of kind of crosses both ways.

55:28 If there are Python developers out there who kind of think like, oh, this would be so exciting to work on humanities projects.

55:35 There are all kinds of project ideas that I just have, you know, but I can't do anything with them because of limited time then.

55:42 So connect with me, and it'd be very great to, you know, see things go forward.

55:49 Yeah, absolutely.

55:50 Well, Cornelis, I really love this look into what you're doing and I think it's fascinating how all these pieces are coming together.

55:56 So thanks for taking the time and being on the show.

55:58 Thanks to you and the community for making Python what it is.

56:02 That's what I would say.

56:03 Absolutely.

56:04 All right.

56:04 Well, talk to you later.

56:05 Thanks.

56:05 All right.

56:05 Bye.

56:06 This has been another episode of Talk Python To Me.

56:09 Our guest in this episode was Cornelis van Lit and it's been brought to you by Command Line Heroes and Linode.

56:15 Command Line Heroes is a podcast telling the story of developers.

56:19 This season is all about programming languages and starts off with Python, of course.

56:23 Subscribe at talkpython.fm/heroes.

56:27 Linode is your go-to hosting for whatever you're building with Python.

56:31 Get four months free at talkpython.fm/Linode.

56:34 That's L-I-N-O-D-E.

56:36 Want to level up your Python?

56:39 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

56:44 Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python.

56:52 And of course, if you're interested in more than one of these, be sure to check out our everything bundle.

56:56 It's like a subscription that never expires.

56:59 Be sure to subscribe to the show.

57:01 Open your favorite podcatcher and search for Python.

57:03 We should be right at the top.

57:04 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

57:14 This is your host, Michael Kennedy.

57:15 Thanks so much for listening.

57:17 I really appreciate it.

57:18 Now get out there and write some Python code.
