Python in Digital Humanities

Digital humanities sounds niche, until you realize it can mean a searchable archive of U.S. amendment proposals, Irish folklore, or pigment science in ancient art. Today I’m talking with David Flood from Harvard’s DARTH team about an unglamorous problem: what happens when the grant ends but the website can’t? His answer: static sites, client-side search, and sneaky Python. Let’s dive in.

Episode Deep Dive

Guest Introduction

David Flood is a developer on Harvard's DARTH (Digital Arts and Humanities) team, formally known as Arts & Humanities Research Computing. David's background is in music and the humanities, with a PhD focused on textual criticism of the New Testament. He had no programming experience before 2019, when he began self-teaching Python during his doctoral studies in Edinburgh, Scotland. His need to use computational tools for comparing manuscript variants led him to learn Python by watching YouTube tutorials and listening to podcasts -- including Talk Python To Me. The pandemic gave him extra time to deepen his skills, and he eventually moved from writing one-off scripts to building reusable tools, publishing packages on PyPI, and creating web applications for fellow scholars. That journey led directly to his current role at Harvard, where he builds production web apps for faculty-led humanities research projects.

What to Know If You're New to Python

This episode covers web development, static site generation, and data infrastructure topics. Here is a quick primer to help you get the most out of the discussion:

Django is Python's most popular full-featured web framework. This episode focuses heavily on Django-based projects and tools like Django Bakery, Django REST Framework, and the Django ORM.
Static vs. dynamic websites: A dynamic site needs a running server and database to respond to requests; a static site is just HTML, CSS, and JavaScript files that can be hosted anywhere cheaply or even for free.
Elasticsearch and PostgreSQL are common backend tools for search and data storage. Understanding that these services cost money to run is essential context for the archival challenge discussed throughout this episode.
WebAssembly (Wasm) is a technology that allows compiled code to run in the browser at near-native speed. The episode explores exciting possibilities around using it to run Python and even Django entirely on the client side.

Key Points and Takeaways

1. Static Sites as the Gold Standard for Archiving Grant-Funded Web Projects

The central thesis of this episode is a deceptively simple question: what happens when the grant money runs out but the research website needs to live on? David and his team at Harvard have developed a "gold standard" answer: convert dynamic web applications into static sites. By baking a Django app out into plain HTML, CSS, and JavaScript files, the resulting site can be hosted for free on platforms like GitHub Pages or dropped into an S3 bucket. The trade-off is that you lose some features like full-text vector search and the ability to easily add new data, but you keep the research accessible to the public indefinitely. This approach acknowledges that the data itself is more valuable than any particular hosting arrangement, and that a folder of HTML files is the most portable, durable format for long-term preservation. David emphasized that they now design some projects from the beginning with this archival endpoint in mind.

pages.github.com -- GitHub Pages for free static site hosting
archiveprogram.github.com -- GitHub's Arctic Code Vault / Archive Program

2. PageFind: Replacing Elasticsearch with Client-Side Search

PageFind emerged as one of the most exciting tools discussed in the episode. It is a fully static, client-side search library that can replace server-side search solutions like Elasticsearch on archived sites. PageFind works by crawling your HTML files and building a search index that it chops into many small fragments organized roughly alphabetically. When a user types a search query, only the relevant fragment is pulled over the network, making search nearly instant. David demonstrated that Harvard's Amendments Project, which searches across 22,000+ full texts, can be powered entirely by PageFind after sunsetting its Postgres full-text search. PageFind supports filtering and faceting, which David considers even more important than keyword search for data discovery. Critically, PageFind has a Python API that lets you build indexes programmatically from database dumps rather than only from HTML files. Harvard's DARTH team also maintains an open-source Vue.js component library for PageFind that they reuse across projects. Michael Kennedy noted that on his personal site, "you can't type fast enough to outrun the results."
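
To make the fragment idea concrete, here is a minimal sketch in plain Python. It is not PageFind's actual index format or API: it assumes a toy index sharded by the first letter of each word, and the function names (`build_fragments`, `search`) and the sample pages are made up for illustration. The point is only that a query needs one small fragment rather than the whole index.

```python
import re
from collections import defaultdict

# Made-up sample pages (URL -> text), purely for illustration.
PAGES = {
    "/amendments/1/": "Proposed amendment on term limits",
    "/amendments/2/": "Proposed amendment on voting age",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_fragments(pages: dict[str, str]) -> dict[str, dict[str, list[str]]]:
    """Build {first letter -> {word -> sorted page URLs}}.

    Each top-level key stands in for one small index file that a
    static host could serve on its own.
    """
    fragments: dict[str, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))
    for url, text in pages.items():
        for word in tokenize(text):
            fragments[word[0]][word].add(url)
    return {
        letter: {word: sorted(urls) for word, urls in words.items()}
        for letter, words in fragments.items()
    }


def search(fragments: dict[str, dict[str, list[str]]], query: str) -> list[str]:
    """Look up a single word, touching only the fragment for its first letter."""
    word = query.strip().lower()
    fragment = fragments.get(word[:1], {})  # the only "file" a client would fetch
    return fragment.get(word, [])


FRAGMENTS = build_fragments(PAGES)
print(search(FRAGMENTS, "voting"))
```

In a real deployment each fragment would be a separate file on the static host, so the browser downloads only the piece that matches what the user typed.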

pagefind.app -- PageFind static search library
github.com/artshumrc/pagefind-vue -- Harvard DARTH's Vue.js component library for PageFind

3. Django Bakery and Frozen Flask: Baking Dynamic Sites into Static Files

Django Bakery is a library that lets you "bake" a Django site into flat static files. The key requirement is that you must opt into using Django Bakery's own class-based views from the start of your project, which makes it difficult to add retroactively. Harvard's DARTH team used it for the Water Stories project, a companion site for a Radcliffe Institute art installation, where stories submitted on iPads went into a Django database and were then baked into static HTML. Even after archiving, when the faculty requested changes, David could spin up the full app locally with Docker Compose, make edits in the Django admin, and rebake the site. Michael also mentioned Frozen Flask, which provides similar static-baking functionality for Flask applications. Both tools represent a practical middle ground: develop with the full power of a dynamic framework, then freeze the result for long-term hosting.
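
The "bake" step itself is conceptually simple, and a standard-library sketch can show the shape of it. This is not Django Bakery's or Frozen Flask's API; the `bake` function, the `PAGE` template, and the story fields (`slug`, `title`, `body`) are hypothetical stand-ins for a view, a template, and database rows. It renders every record once and writes each result to its own `index.html`.

```python
from html import escape
from pathlib import Path
from string import Template

# Hypothetical page template standing in for a framework template.
PAGE = Template(
    "<!doctype html>\n"
    "<html><head><title>$title</title></head>\n"
    "<body><h1>$title</h1><p>$body</p></body></html>\n"
)


def bake(stories: list[dict], build_dir: Path) -> list[Path]:
    """Render each story to build_dir/stories/<slug>/index.html plus a listing page."""
    written = []
    links = []
    for story in stories:
        page_dir = build_dir / "stories" / story["slug"]
        page_dir.mkdir(parents=True, exist_ok=True)
        out = page_dir / "index.html"
        # Escape user-submitted text before it goes into the HTML.
        out.write_text(
            PAGE.substitute(title=escape(story["title"]), body=escape(story["body"])),
            encoding="utf-8",
        )
        written.append(out)
        links.append(f'<a href="stories/{story["slug"]}/">{escape(story["title"])}</a>')

    # The listing page links to every baked story.
    index = build_dir / "index.html"
    index.write_text(
        PAGE.substitute(title="Stories", body="<br>".join(links)),
        encoding="utf-8",
    )
    written.append(index)
    return written
```

Calling `bake(stories, Path("build"))` leaves a folder that can be copied to any static host. Rebaking after an edit means running the same function again over the updated records.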

github.com/datadesk/django-bakery -- Django Bakery for baking Django sites into static files
github.com/Frozen-Flask/Frozen-Flask -- Frozen Flask for converting Flask apps to static files

4. WebAssembly: Running Django Entirely in the Browser

One of the most forward-looking ideas discussed was running a full Django application in the browser via WebAssembly. Thanks to Pyodide (which runs Python in the browser via Wasm) and projects like PGLite (which runs Postgres in the browser), it is now possible to host Django in a service worker with a local SQLite or Postgres database -- all client-side. David pointed to a proof-of-concept project called Django WebAssembly that lets you log into the Django admin entirely within your own browser, with no backend server at all. This approach could preserve the full functionality of a live site as what is essentially a static deployment. The long-term durability question around WebAssembly standards was raised, but David noted that even in a worst case, having everything Dockerized and in a public repo means someone can always rescue the project.

pyodide.org -- Run Python in the browser via WebAssembly
pyscript.net -- PyScript for running Python in web pages
pglite.dev -- PGLite: run PostgreSQL in the browser via Wasm
github.com/m-butterfield/django_webassembly -- Proof of concept for Django running in the browser

5. Harvard's DARTH Team: A Digital Agency Inside a University

DARTH (Digital Arts and Humanities) operates as a small, agency-like team within Harvard's much larger IT organization (500+ people). They consult with faculty who have funded research projects and build custom web applications for them. David describes three main categories of projects: virtual research environments (platforms for doing research with visualization, data entry, and Postgres), public-facing exploration and search interfaces, and data extraction and transformation tools. The team works on two or three greenfield projects per year and also "puts to bed" a few projects each year as grants expire. They work with a designer and evaluate each project from first principles rather than relying on a single cookie-cutter template, because each project's requirements are sufficiently different. Their 404 page features Darth Vader with the message "I find your lack of nav disturbing."

digitalhumanities.fas.harvard.edu -- Harvard DARTH official website
github.com/artshumrc -- DARTH's GitHub organization

6. The Grant Funding Lifecycle and the "What Now?" Problem

A recurring theme throughout the episode is the tension between grant-funded development and long-term sustainability. During active development, projects run on AWS using containers (ECS), RDS Postgres, and Elasticsearch clusters. These are reliable and robust, but the costs add up. When a grant ends, someone has to decide who pays even $100 a month for hosting, and who handles upgrades and maintenance over time. The DARTH team has no DevOps person on call for weekends, so reliability during the active phase matters too. This unglamorous but critical problem drives much of the team's innovation around static site archival, PageFind, and WebAssembly-based solutions. David noted that they now strategize about end-of-life from the very beginning of some projects, rather than scrambling when funding dries up.

7. Showcase of Real-World Digital Humanities Projects

David walked through several compelling projects that illustrate the range of digital humanities work at Harvard. The Amendments Project is a searchable database of over 22,000 proposed amendments to the U.S. Constitution that never passed, built with Postgres full-text vector search and currently being transitioned to a static site with PageFind. The Fionn Folklore Database catalogs hundreds of years of Celtic storytelling traditions around the hero Fionn MacCumhaill, with audio recordings and written documents in English, Scottish Gaelic, and Irish Gaelic -- requiring deep internationalization down to the database level. Mapping Color in History is a pigments database where researchers perform spectral analysis on Asian artwork to track how pigments were made over time, with a deep-zoom image viewer for pinpointing analysis locations. The Tsumeb Mine Notebook documents mineral specimens from the historic Tsumeb mine in Namibia, using an Astro-based static site with PageFind search. Water Stories was a companion to a Radcliffe Institute art installation, built with Django Bakery for easy archival.

digitalhumanities.fas.harvard.edu/projects/amend/ -- The Amendments Project
digitalhumanities.fas.harvard.edu/projects/fionn-cycle/ -- Fionn Folklore Database
mappingcolor.fas.harvard.edu -- Mapping Color in History
tmn.fas.harvard.edu -- Tsumeb Mine Notebook
waterstories.fas.harvard.edu -- Water Stories

8. Data Wrangling and ETL in the Humanities

David confirmed the old data science adage that 80% of the work is data wrangling. Most projects begin with researchers handing over Excel spreadsheets (sometimes several), and the first task is ingesting that data into a proper Postgres database. Figuring out the right data model and relationships is the number one challenge of the early stage, but it benefits everyone because it forces the researchers to think about their data in a more organized way. Cleaning the data involves handling fuzzy dates like "summer of 2020" or "July of 2020," and the team writes extensive test suites around the ingest process. Michael recalled the famous example of biologists who had to rename a gene because Excel kept parsing its name as a date. Once the data is cleaned and loaded, the team builds a web-based interface so researchers can continue entering new data in a structured way rather than going back to spreadsheets.
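
As a rough illustration of the fuzzy-date problem, the sketch below turns strings like "summer of 2020" into an explicit date range. It is not the DARTH team's ingest code. The function name `parse_fuzzy_date` and the season boundaries (Northern Hemisphere, whole months) are assumptions, and winter is left out because it spans two calendar years.

```python
import calendar
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ("january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"),
        start=1,
    )
}

# Assumption: seasons mapped to whole months, Northern Hemisphere.
SEASONS = {"spring": (3, 5), "summer": (6, 8), "fall": (9, 11), "autumn": (9, 11)}

PATTERN = re.compile(r"\s*(?:(\w+)\s+(?:of\s+)?)?(\d{4})\s*")


def parse_fuzzy_date(text: str) -> tuple[date, date]:
    """Return the (earliest, latest) dates a fuzzy string could refer to."""
    match = PATTERN.fullmatch(text.lower())
    if not match:
        raise ValueError(f"unrecognized date: {text!r}")
    word, year = match.group(1), int(match.group(2))

    if word is None:  # bare year, e.g. "2020"
        return date(year, 1, 1), date(year, 12, 31)
    if word in MONTHS:
        first = last = MONTHS[word]
    elif word in SEASONS:
        first, last = SEASONS[word]
    else:
        raise ValueError(f"unrecognized date: {text!r}")

    last_day = calendar.monthrange(year, last)[1]  # handles leap years
    return date(year, first, 1), date(year, last, last_day)
```

Storing a range rather than a single guessed day keeps the uncertainty visible in the database, and raising `ValueError` on anything unrecognized gives a test suite something concrete to assert against.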

9. Search Interfaces as a Superpower for Research Discovery

David made a compelling case that a powerful search interface is often the single most effective way to demonstrate the value of digital tools to humanities researchers. Many scholars have their data locked behind terrible search interfaces or are scrolling through Excel spreadsheets trying to find what they need. Putting that same data behind Elasticsearch with good filtering and faceting gives them fast, structured access to their research data in a way they never had before. For the public-facing side of projects, search and filtering are what enable discovery. David noted that faceting and filtering -- by state, by Congress, by author -- are often more useful than keyword search alone. The DARTH team maintains reusable search components across projects and has built their own Vue.js PageFind component library to standardize the search experience.
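
Faceting is easy to picture with a few lines of Python. The sketch below is a toy, not the DARTH search component or PageFind's filter API. The records are invented, and the field names (`state`, `congress`) simply echo the facets mentioned above.

```python
from collections import Counter

# Invented sample records, purely for illustration.
RECORDS = [
    {"title": "Sample A", "state": "Ohio", "congress": 90},
    {"title": "Sample B", "state": "Ohio", "congress": 91},
    {"title": "Sample C", "state": "Texas", "congress": 90},
]


def apply_filters(records: list[dict], **filters) -> list[dict]:
    """Keep only records whose fields match every given filter value."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in filters.items())
    ]


def facet_counts(records: list[dict], field: str) -> Counter:
    """Count how many records fall under each value of a field."""
    return Counter(record[field] for record in records)


# Counts shown next to each facet option, before any filter is applied.
print(facet_counts(RECORDS, "state"))

# Narrow by one facet, then recompute the counts for another.
narrowed = apply_filters(RECORDS, congress=90)
print(facet_counts(narrowed, "state"))
```

The counts beside each option are what make facets useful for discovery: a researcher can see how the data is distributed before typing a single keyword.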

10. From Humanities PhD to Full-Time Developer: Python as a Career Catalyst

David's personal story is a powerful example of how Python can open unexpected career paths. In 2019, he didn't know what Python was or the name of any programming language. He was doing textual criticism of New Testament manuscripts and needed computational tools that turned out to be software libraries, not desktop apps. Learning Python to glue these tools together during his PhD led him into web development, PyPI, GitHub, and eventually a full-time development role at Harvard. He found that the ethos of open-source software mapped naturally onto academic culture. David credited Talk Python To Me as a form of "language immersion" that helped him understand the developer landscape even before he fully understood every technical detail. His advice to others in similar positions: keep listening, keep learning, and let the technical vocabulary become familiar through repeated exposure.

11. AI in Digital Humanities: A Double-Edged Sword

David reflected on AI's impact from two angles. On the practical side, his team uses AI for data extraction to make search interfaces more powerful, and he personally uses Claude to help with complex mathematical operations in audio processing libraries like Librosa. Harvard faculty are beginning to "vibe code" with AI tools, and the DARTH team is teaching them to use dedicated tools like GitHub Copilot and Cursor rather than copy-pasting from ChatGPT. On the philosophical side, David raised a thought-provoking concern: if AI coding had been available when he started his PhD, he could have accomplished his immediate research goals without ever acquiring deep technical skills. That means he would never have discovered the career path that led him to Harvard's DARTH team. Michael agreed this is a real risk: AI might knock people off the more technical path, preventing the kind of serendipitous career discovery that David experienced. David emphasized that AI is more powerful in the hands of someone with real software engineering skills than in the hands of a complete beginner.

librosa.org -- Librosa: Python library for audio and music analysis

12. AWS Containers, Compliance, and Infrastructure Trade-offs

David shared a candid story about trying to reduce costs by moving from AWS ECS (Elastic Container Service) container clusters to a single EC2 instance. He priced it out, deployed a proof of concept, and confirmed it would be significantly cheaper. However, he discovered that managing your own VM at a large university introduces a host of compliance requirements: you need to ensure the latest OS patches are applied, run specific observability tools, and meet security standards that are automatically handled when running in managed container clusters. The ECS approach costs more but requires much less compliance overhead and no on-call DevOps staff. All of their infrastructure is defined as code using AWS CDK. This is a practical lesson for anyone at an institution with strict IT governance: the cheapest option on paper may not be the cheapest when compliance costs are factored in.

13. Open Source and Long-Term Preservation in Academia

Projects at DARTH typically start as closed-source during development and then transition to open source after consultation with faculty. Making projects open source is especially important when they reach end-of-life, as it allows others to run, fork, or adapt the code. David noted that everything they build is Dockerized, so in the worst case someone can take the Docker image and run the project themselves. He also mentioned GitHub Codespaces as an archival strategy: for one Ruby on Rails project that needed to be archived urgently, he set it up so anyone could boot it with a single command in Codespaces. Universities typically have their own archival systems for important research data, and Michael mentioned GitHub's Arctic Code Vault as an additional layer of preservation. The broader point is that open source, containerization, and static site generation work together as a multi-layered preservation strategy for digital humanities work.

github.com/copier-org/copier -- Copier: project templating with update capability
github.com/cookiecutter/cookiecutter -- Cookiecutter: project scaffolding from templates
joss.theoj.org -- Journal of Open Source Software

Interesting Quotes and Stories

"Just applying these technical tools to old questions, that is the core of digital humanities." -- David Flood

"If AI coding had been around the way it is now when I was learning, I wouldn't be doing digital humanities at Harvard. I wouldn't have been able to get into this field, I wouldn't have known about it." -- David Flood

"AI is more powerful in my hands now [as a software engineer] than it would have been then [as a beginner]. So I'm thankful for that." -- David Flood

"Can this become a static website? Can we bake this out into all HTML files and acknowledge that there will be some trade-offs?" -- David Flood

"It goes on GitHub Pages and it can live hopefully forever. I mean, it feels like GitHub will last forever, but it'll last longer than funding will anyways." -- David Flood

"I just want to hear people talk about deployment to get a sense of what actual deployment sounds like." -- David Flood, on using podcasts as language immersion for learning software development

"You can't type fast enough to outrun the results." -- Michael Kennedy, describing PageFind's search speed

"Programming is a superpower, not a replacement for your job." -- Michael Kennedy

"Talk Python has been kind of like that conversation with the open source community that's been always in my ear." -- David Flood

"Page not found. I find your lack of nav disturbing." -- Harvard DARTH's 404 page

David shared the story of using phylogenetic software -- tools designed for evolutionary biologists to track how DNA mutations spread -- and swapping in Greek textual variants for DNA letters to track how ancient manuscripts changed over time and group them into families. This cross-pollination of computational biology and textual criticism is a vivid example of Python enabling unexpected research connections.

Michael recalled the famous biology story where a human gene had to be renamed because Excel kept parsing its name as a date, perfectly illustrating the data wrangling challenges that humanities researchers face when their data lives in spreadsheets.

Key Definitions and Terms

Digital Humanities: An interdisciplinary field that applies computational tools and methods to traditional humanities research. David describes it as "applying technical tools to old questions."
Textual Criticism: The scholarly discipline of comparing multiple copies of the same text to determine the original wording and trace how it changed over time. David's PhD work focused on New Testament manuscripts.
Critical Apparatus: In textual criticism, a structured display of variant readings alongside a base text, showing how different copies of a text differ from one another.
Static Site: A website made up of pre-built HTML, CSS, and JavaScript files that require no server-side processing. Can be hosted cheaply or for free on platforms like GitHub Pages.
Django Bakery: A Django library that "bakes" a dynamic Django site into flat static HTML files for cheap, long-term hosting.
PageFind: A client-side search library that builds a pre-computed search index split into small fragments, enabling fast, server-free full-text search on static websites.
Pyodide: A Python distribution compiled to WebAssembly that runs CPython directly in web browsers.
PGLite: A WebAssembly build of PostgreSQL that runs entirely in the browser.
ECS (Elastic Container Service): AWS's managed container orchestration service, one step below Kubernetes, used by DARTH for hosting production Django apps.
AWS CDK (Cloud Development Kit): A framework for defining cloud infrastructure as code using familiar programming languages.
Vibe Coding: Using AI tools to generate code through natural language prompts with minimal manual coding, a practice David noted is emerging among Harvard faculty.
Phylogenetic Software: Tools from computational biology that track how DNA sequences mutate over time; repurposed in textual criticism to track how manuscript texts diverge.
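
The idea of swapping textual variants in for DNA letters can be sketched in a few lines. This is a deliberately simplified illustration, not the software David used. The witnesses and readings are invented, and real phylogenetic tools use far richer models than the simple disagreement ratio computed here. Each manuscript ("witness") is represented as a sequence of readings, one per place where the copies differ.

```python
from itertools import combinations

# Invented witnesses: each list holds the reading chosen at four variation units.
WITNESSES = {
    "A": ["r1", "r1", "r2", "r1"],
    "B": ["r1", "r1", "r2", "r2"],
    "C": ["r2", "r2", "r1", "r1"],
}


def distance(a: list[str], b: list[str]) -> float:
    """Fraction of variation units where two witnesses disagree."""
    if len(a) != len(b):
        raise ValueError("witnesses must cover the same variation units")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(witnesses: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Pairwise distances between all witnesses, keyed by sorted name pairs."""
    return {
        (m, n): distance(witnesses[m], witnesses[n])
        for m, n in combinations(sorted(witnesses), 2)
    }


def closest_pair(witnesses: dict[str, list[str]]) -> tuple[str, str]:
    """The two witnesses that agree most often, a first hint at a family."""
    matrix = distance_matrix(witnesses)
    return min(matrix, key=matrix.get)


print(distance_matrix(WITNESSES))
print(closest_pair(WITNESSES))
```

A pairwise matrix like this is the usual starting point for distance-based tree building, which is one way to group manuscripts into families.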

Learning Resources

Here are resources from Talk Python Training to go deeper on the topics covered in this episode:

Django: Getting Started: David's team is a Django shop, and this episode revolves around Django-based projects, Django Bakery, Django REST Framework, and the Django admin. This course teaches you how to build your first Django project and guides you through the key parts of the framework.
HTMX + Django: Modern Python Web Apps, Hold the JavaScript: David mentioned the value of using vanilla Django templates for archivability rather than JavaScript-heavy frontends. This course shows how to build interactive Django apps using HTMX with minimal JavaScript.
Static Sites with Sphinx and Markdown: The archival strategy at the heart of this episode is converting dynamic sites to static ones. This free course teaches static site generation with Python tools.
Getting started with pytest: David described writing extensive test suites around data ingest processes to catch issues like date parsing errors. This course covers pytest fundamentals for building reliable test coverage.
Python for Absolute Beginners: David's story of going from zero programming knowledge to a full-time developer role is inspiring. If you are just starting out, this is the premier course for beginning developers, covering the big ideas from CS 101 all the way through building applications.

Overall Takeaway

This episode reveals a problem hiding in plain sight across academia: grant-funded research websites are built to be powerful, but they are not built to survive their own funding. David Flood and Harvard's DARTH team have turned that constraint into a design philosophy, proving that static sites, client-side search with PageFind, and even browser-based Django via WebAssembly can keep research alive long after the last dollar is spent. But the technical story is only half of it. David's personal journey from a humanities scholar who didn't know what Python was in 2019 to a full-time developer at Harvard is a testament to Python's unique power as a gateway language. It meets people where they are, lets them solve real problems, and opens doors they didn't know existed. Whether you work in academia, run a small agency, or just want your projects to outlive their infrastructure, the lesson is the same: build for today, but archive for tomorrow.

Links from the show

Guest
David Flood: davidaflood.com

DARTH: digitalhumanities.fas.harvard.edu
Amendments Project: digitalhumanities.fas.harvard.edu
Fionn Folklore Database: fionnfolklore.org
Mapping Color in History: iiif.harvard.edu
Apatosaurus: apatosaurus.io
Criticus: github.com
github.com/palewire/django-bakery: github.com
sigsim.acm.org/conf/pads/2026/blog/artifact-evaluation: sigsim.acm.org
Hugo: gohugo.io
Water Stories: waterstories.fas.harvard.edu
Tsumeb Mine Notebook: tmn.fas.harvard.edu
Dharma and Punya: dharmapunya2019.org
Pagefind library: pagefind.app
django_webassembly: github.com
Astro Static Site Generator: astro.build
PageFind Python Lib: pypi.org
Frozen-Flask: frozen-flask.readthedocs.io

Watch this episode on YouTube: youtube.com
Episode #538 deep-dive: talkpython.fm/538
Episode transcripts: talkpython.fm

Theme Song: Developer Rap
🥁 Served in a Flask 🎸: talkpython.fm/flasksong

---== Don't be a stranger ==---
YouTube: youtube.com/@talkpython

Bluesky: @talkpython.fm
Mastodon: @talkpython@fosstodon.org
X.com: @talkpython

Michael on Bluesky: @mkennedy.codes
Michael on Mastodon: @mkennedy@fosstodon.org
Michael on X.com: @mkennedy

Episode Transcript

00:00 Digital humanities sounds niche until you realize that it can mean a searchable archive of U.S.

00:05 amendment proposals, Irish folklore, or pigment science in ancient art. Today I'm talking with

00:11 David Flood from Harvard's DARTH team about an unglamorous problem: what happens when the grant

00:18 ends but the website can't? His answer? Static sites, client-side search, and sneaky Python.

00:24 Let's dive in. This is Talk Python To Me, episode 538, recorded January 22nd, 2026.

00:48 Welcome to Talk Python To Me, the number one Python podcast for developers and data scientists.

00:53 This is your host, Michael Kennedy.

00:55 I'm a PSF fellow who's been coding for over 25 years.

00:59 Let's connect on social media.

01:00 You'll find me and Talk Python on Mastodon, BlueSky, and X.

01:04 The social links are all in your show notes.

01:06 You can find over 10 years of past episodes at talkpython.fm.

01:10 And if you want to be part of the show, you can join our recording live streams.

01:14 That's right.

01:14 We live stream the raw uncut version of each episode on YouTube.

01:18 Just visit talkpython.fm/youtube to see the schedule of upcoming events.

01:23 Be sure to subscribe there and press the bell so you'll get notified anytime we're recording.

01:27 This episode is brought to you by Sentry.

01:29 Don't let those errors go unnoticed.

01:31 Use Sentry like we do here at Talk Python.

01:33 Sign up at talkpython.fm/sentry.

01:37 And it's brought to you by CommandBook, a native macOS app that I built that gives long-running

01:42 terminal commands a permanent home.

01:44 No more juggling six terminal tabs every morning.

01:46 Carefully craft a command once, run it forever with auto-restart, URL detection, and a full

01:51 CLI.

01:51 Download it for free at talkpython.fm/commandbookapp.

01:56 Hello, David. Welcome to Talk Python To Me. Amazing to have you here.

01:59 I'm glad to be here. Talk Python has been part of my story up to this point.

02:03 Has it? Okay. Well, you are about to write the next chapter in the story. So that's pretty excellent.

02:10 I have a sense of what's coming. We planned out what we're going to talk about and that sort of thing.

02:15 And I'm really excited about this topic. So it's going to be a good one.

02:21 Honestly, I think one of the real powers of the Python community and the reason the language has such staying power is there's such a diversity of use cases, technology, like technology standpoints, right?

02:34 Like I build software for this group or I build these types of apps and it's not just, you know, like Ruby on Rails, which, you know, it's been very popular, but it's, it's for websites, right?

02:44 You know what I mean?

02:45 Yeah, absolutely.

02:47 I mean, web development has dominated my use of it, but my entry into it, which I suppose I'll mention in a moment, was through all those little tools.

02:57 Let's hear it.

02:58 Who are you, David Flood?

03:00 Tell us, introduce yourself real quick and tell us about how you got into it.

03:04 So my background is in music and the humanities.

03:09 I mean, in 2019, I didn't know what Python was or the name of any programming language.

03:16 And I've been doing textual criticism, which is, you know, there's lots of criticisms in the academy.

03:22 This is the one where if you have lots and lots of versions of the same text,

03:27 you are comparing them to work out what the initial text was and like how it changed over time.

03:33 Okay, give us an example.

03:35 Okay, so one of the famous examples, hope I can remember it off the top of my head,

03:40 is from Shakespeare.

03:42 We're all familiar with the line, "To be or not to be,

03:45 that is the question." Well, there's a variant of it. One of the early copies

03:53 written by Shakespeare himself has... Somebody's going to be able to type into the chat exactly

03:58 what it is. They'll know this anecdote. But it's something more like, "To be or not to be, I."

04:04 That's the question. And so, which one is the original one? Why did he change it? That's kind

04:09 of one example. I work mainly in the New Testament, which is especially complicated because

04:15 no other corpus from ancient history has as many copies of the same text as that corpus does. So it's

04:23 quite complicated, and our techniques have grown because of that and perhaps

04:29 become more advanced than... Now, I mean, that many variations over that huge span of time, over

04:37 different groups with different, maybe not intentions, but certainly colored by different

04:43 worldviews and philosophies and so on. And yeah, I see the trouble.

04:47 No, yeah. And they were people of the book. So copying it is something that happened a lot. And

04:54 they copied the monks, like the medieval monks copied everything. They copied our Greek classics.

05:01 So that's what I was interested in. And because of the wealth of data that we have,

05:07 Computer tools are more and more important in that field.

05:11 So when I started my PhD in 2019, I knew that I wanted to use some of these cutting-edge tools.

05:17 Some of them may be surprising.

05:19 For example, we've been using phylogenetic software.

05:24 This is software that evolutionary biologists are using or computational biologists are using to track, for example, how COVID strains mutate over time.

05:35 Oh, interesting.

05:36 What they're comparing are the DNA letters.

05:40 And so you have the sequence of letters and you're comparing how those change over time.

05:44 Well, you can swap in textual variants for DNA letters.

05:48 And now we can track how texts change over time and group them into families, things like that.

05:56 It's like a time series, but of words or letters or something.

05:59 Yeah, I mean, yeah, there's lots of important algorithms for comparing

06:06 sequences of things. And so if we can just swap in Greek words and Greek text instead,

06:12 then we can maybe apply it to textual criticism. So I was pretty interested in those things. That

06:16 wasn't actually the method that brought me into it, but something like that, kind of computer

06:21 intensive tools. What I learned is that these tools weren't actually available to me. They

06:27 weren't desktop applications. And for the most part, they weren't public web applications. They

06:36 PyPI or something like that, right?

06:38 Yeah, exactly.

06:39 Exactly.

06:39 Or Java.

06:41 And I needed to glue them together.

06:43 So the long story short on that is during the first year of my PhD, I was picking up Python,

06:50 watching YouTube videos while I was doing the dishes.

06:52 And then the pandemic hit while I was living in Edinburgh in Scotland, probably not far

06:57 from Will McGugan.

06:59 And so the pandemic gave me the excuse to spend even a few more hours each day picking up these

07:06 new, these new technical skills. And so I did it, I was able to use these advanced tools in my

07:13 work. But what was really important to me was sharing, like making that available to my colleagues.

07:18 So I had to move from writing these, like, bad top-to-bottom Python scripts into things that

07:23 could be reused by other people. And that led me into the web, because the web is where that's how

07:29 I can share with anybody. It's really wild how much the web is kind of the last bastion of

07:36 app freedom. It's so bizarre because, you know, I've many times told the stories of the insane

07:42 battles of just getting our apps that just playback video of content that's already on the web

07:48 into the app store. I mean, weeks of fighting about the weirdest, most nonsensical things with

07:54 both Google and Apple. But we also now have the Mac platform and the Windows platform very

08:01 aggressively looking for digital code certificates and all sorts of signing and other kinds of proof

08:07 like, you can't even just send somebody an executable anymore. It won't run. It's crazy.

08:13 It's down to like, okay, put it on the web, I guess. That's right. I played the game of

08:19 distributing desktop apps. That's how I initially distributed things.

08:25 And at this point I just require people to install Python and then install my desktop app from PyPI

08:30 because it's too hard otherwise for me.

08:33 I mean, I could pay for the code signing from Apple and do all of that,

08:37 but it's just, it's too much work for the time that I have.

08:40 Yeah, I'm about to do another round of it.

08:42 I'm working on an app and my developer account is still active.

08:45 So we might have a fresh round of fun.

08:47 Hopefully it goes through this time.

08:50 Anyway, I do think it's such a challenge.

08:52 And are you leveraging?

08:53 I don't know if the timing was right.

08:55 Like maybe this was too early, but these days, are you leveraging things like uvx

09:00 to run, or are you just pip install this thing and then run it?

09:04 Yeah, I haven't updated the readme in a while, so I think it just asks for pip.

09:08 But certainly, if somebody asked me today, I would say, yeah, just install this with uv.

09:14 Because then they don't even need Python.

09:16 Exactly.

09:17 And that's brilliant.

09:18 And that's a really, it is another barrier reduced in distributing these applications,

09:23 right?

09:23 Like, if you can get uv installed on a machine, then you don't even have to say install, just

09:28 The way you run it is uvx my thing and it's all transparent to you, right?

09:33 Which is beautiful.
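For a command like "uvx my thing" to work, the package on PyPI has to expose a console entry point. Here is a minimal sketch, assuming a hypothetical package named my-thing; the module path, command name, and argument are all invented for illustration.

```python
# Hypothetical module my_thing/cli.py. The package's pyproject.toml is assumed to
# contain, under [project.scripts]:  my-thing = "my_thing.cli:main"
import argparse
from typing import Optional, Sequence


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Entry point that an installer exposes as the `my-thing` command."""
    parser = argparse.ArgumentParser(prog="my-thing")
    parser.add_argument("collation", help="path to a collation file")
    args = parser.parse_args(argv)
    print(f"Would open {args.collation}")
    return 0
```

With an entry point like this declared, the same command is available after a pip install of the package, or can be run through uvx without a separate install step.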

09:33 So what was it like?

09:35 Yeah.

09:35 So what was it like coming from what sounds like a not super screen focus, super

09:43 techie aspect and having to dive into this world and someday you're probably

09:47 like, how is it that I'm publishing stuff to PyPI?

09:49 What has happened to me?

09:51 Yeah.

09:51 well, yeah, I remember when I, when I first signed up for GitHub, because

09:56 you know, whatever YouTube tutorial I was working through at the time, you know, said that I needed

10:02 to do that. You know, I think it all started making a lot of sense. I didn't have any technical

10:08 background, but the world kind of open source software, it just kind of made sense. It felt

10:17 like it fit really well into my academic, you know, circle. I think a lot of the attitudes are

10:23 similar. I agree. I think they are actually. And I think that's, I think that's a pretty neat thing.

10:28 Yeah. Very cool. All right. Well, let's talk about what you're doing with digital humanities.

10:34 You're actually at a really interesting project or organization, I guess, that does many projects,

10:40 right? Yeah. Yeah. So fast, fast forwarding, I did, I finished my PhD in the humanities.

10:45 Sorry. I had so much fun. No, that's fine. That's fine. I had so much fun writing like these tools

10:50 and then just solving the distribution problem to share them with other scholars.

10:56 That was so fun that I was open to this kind of opportunity

11:00 where now I'm doing this full time.

11:02 And so, yes, so I'm on the, we call it affectionately DARTH,

11:07 which is Digital Arts and Humanities at Harvard.

11:11 There has to be a lot of Star Wars memes and references,

11:14 I'm sure.

11:14 If you can pull up a 404, I think there will be a Darth Vader reference.

11:19 Seriously, I'm here for it.

11:22 Yes, page not found. I find your lack of nav disturbing.

11:27 You know what? I think that is beautiful. And I really, I really think that people should embrace

11:33 the 404, the fun 404 page, you know, more, right? There should really be something going on that

11:40 like makes it, you know, something hasn't worked out, but you can just, you can make people laugh.

11:46 Yeah. I appreciate that.

11:48 I've heard people push back against it.

11:50 Like if you're on a, if you're on like your medical website and you're maybe about to get bad news and then you get like a picture of a kitten.

12:00 Dr. Kitten doesn't know where your results went.

12:02 Like I get that.

12:02 That's not funny.

12:04 But I mean, most things are not that serious.

12:06 Yeah.

12:07 Mostly.

12:08 Okay.

12:09 So what kind of things does DARTH do?

12:12 You've described this as kind of a web or tech agency within Harvard.

12:17 Yeah, it is very much.

12:18 So, you know, Harvard has a gigantic IT group.

12:21 I don't know how many hundreds of people work, but more than 500 people in IT.

12:28 We are a small team and we operate very much like a small agency.

12:33 So usually what happens is a faculty member has a funded research project that's going to last for an amount of time.

12:42 And then we consult with them to build it.

12:44 And most of the time, I kind of think of these as I kind of have these different categories of these kinds of projects that I think of.

12:54 I lost in my notes what I call them.

12:56 But they are there.

12:57 You have like a one is like a virtual research environment.

13:01 So the focus is this is this is a platform that we're building for the research to be done on.

13:07 Like the reason the research should be done in like a web app would be because you have access to visualization, to Postgres, to Pandas.

13:17 So we can kind of build up this platform to do the actual research on and some of the data entry.

13:23 So like a full on research application.

13:26 Yeah, exactly.

13:27 I guess you can also kind of see your work through the different stages of research projects and academic research and so on.

13:36 And we'll get to maybe end of life in a sense further down in the conversation.

13:42 But so this would be we have a grant or we just work here and we're going to work on some form of research.

13:49 What do you give them?

13:50 Right. And I think that's a super interesting challenge because one of the real common answers would be Jupyter, Jupyter Lab, Marimo, whatever.

13:59 But that's still pretty code heavy for people who are possibly philosophers or something, you know.

14:05 Oh, exactly. That's why in digital humanities, I won't even, maybe I won't even attempt to define

14:13 it in any narrow sense, because I'll get in trouble with somebody. But you have two groups

14:20 that are interfacing with each other. And one is digital humanities as a field, like as a subfield,

14:26 all of its own. And these are people who have humanities domain, like knowledge,

14:31 and technical skills, and they're bringing them together. And in a lot of cases, the audience for

14:36 that kind of work is other people working in the digital humanities. But far more common,

14:42 and this is what we work with, is people who have humanities domain expertise, and they want to

14:49 publish or do research or share with other people who have that same humanities domain expertise,

14:55 and they are now interested in adding a technical component to it.

14:59 How can we supercharge what they have?

15:03 This portion of Talk Python is brought to you by Sentry.

15:06 I've been using Sentry personally on almost every application

15:10 and API that I've built for Talk Python and beyond over the last few years.

15:14 They're a core building block for keeping my infrastructure solid.

15:18 They should be for yours as well.

15:19 Here's why.

15:20 Sentry doesn't just catch errors.

15:22 It catches all the stuff that makes your app feel broken,

15:25 the random slowdown, the freeze you can't reproduce, that bug that only shows up once real users hit it.

15:30 And when something goes wrong, Sentry gives you the whole chain of events in one place.

15:34 Errors, traces, replays, logs, dots connected.

15:38 You can see what's led to the issue without digging through five different dashboards.

15:42 Seer, Sentry's AI debugging agent, builds on this data, taking the full context,

15:47 explaining why the issue happened, pointing to the code responsible, drafting a fix,

15:52 and even flagging if your PR is about to introduce a new problem.

15:56 The workflow stays simple.

15:58 Something breaks, Sentry alerts you, the dashboard shows you the full context,

16:02 Seer helps you fix it and catch new issues before they ship.

16:06 It's totally reasonable to go from an error occurred to fixed in production in just 10 minutes.

16:12 I truly appreciate the support that Sentry has given me

16:15 to help solve my bugs and issues in my apps, especially those tricky ones that only appear in production.

16:21 I know you will too if you try them out.

16:22 So get started today with Sentry.

16:24 Just visit talkpython.fm/sentry and get $100 in Sentry credits.

16:30 Please use that link.

16:31 It's in your podcast player show notes.

16:32 If you're signing up some other way, you can use our code talkpython26, all one word,

16:38 talkpython26, to get $100 in credits.

16:41 Thank you to Sentry for supporting the show.

16:44 Maybe just take a moment and speak to, maybe, I don't know if this venue will actually speak

16:49 directly to anybody who I was imagining here, but people who work with folks, what would you tell

16:54 somebody who works with a group who have some technical skill, who could create some of these

16:58 things that we're going to talk about, but the people who they've created for don't necessarily

17:02 think they need it or know that they need it. I've gone often on rants about how programming is a

17:09 superpower, not a replacement for your job, right? Yeah. That's a problem for a lot of people,

17:15 especially because you might use some new computer tools to supercharge your research.

17:20 But the article that you publish or the research output of that, the audience, they may not

17:25 be interested in hearing about that at all.

17:28 And so for most people who are working in this space, the tools, you have to use them

17:33 in such a way that you can talk about the research output without talking about the

17:37 tool.

17:38 And we have other venues to talk about the tools themselves, like the Journal of Open

17:43 Source Software, and you can kind of get some of it out there. But that is a, that's the significant

17:48 challenge is convincing people that it, that it could be useful and then convincing the audience

17:53 that they should be interested in kind of the methods behind how some of the new research comes

17:57 up. Also, I think I'm a big believer that presenting stuff in the right order is really,

18:03 really important. If you present your research and it's beautiful and powerful and oh, look,

18:07 we've also, by the way, covered a hundred times more data than any prior research. Surprise,

18:12 I wonder how I did that.

18:14 And then people are like, this is amazing.

18:16 Then after you kind of hook them with the inspiration and what's possible,

18:19 then you're like, let me tell you about the tool.

18:21 And all of a sudden you're like, that's a cool tool, right?

18:22 This is not just like geekery, like programmer, you know,

18:26 Charlie Brown speak, wah, wah, wah, wah, wah.

18:28 You know, it's like, no, I'm listening.

18:29 Tell me now.

18:30 Yeah, exactly.

18:31 I mean, one of the things I think that really opens people's eyes

18:35 is a really powerful search interface.

18:38 You have all of this research data.

18:40 just put it behind Elasticsearch with some really good filtering on it. And all of a sudden you have

18:45 fast, rapid access to the data in a way you never had before. Like you were never scrolling through

18:51 the Excel spreadsheets and finding exactly what you wanted, like you were with this new search

18:55 interface. And that by itself is like so simple. We're so used to that in web development that

19:00 like everything needs to have a fantastic search now. But so many people have their data locked

19:05 behind, you know, a terrible search interface.
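To make "search with really good filtering" concrete, here is a toy in-memory index. It is a stand-in for illustration only, not Elasticsearch and not the team's code; the record fields (title, century) and the sample rows are invented.

```python
from collections import defaultdict

# Invented sample records, the kind of rows that often start life in a spreadsheet.
records = [
    {"id": 1, "title": "Blue pigment in a court painting", "century": 17},
    {"id": 2, "title": "Red pigment in a temple mural", "century": 15},
    {"id": 3, "title": "Blue and red pigment in a manuscript", "century": 15},
]


def build_index(rows):
    """Map each lowercased word to the set of record ids that contain it."""
    index = defaultdict(set)
    for row in rows:
        for word in row["title"].lower().split():
            index[word].add(row["id"])
    return index


def search(index, rows, query, **filters):
    """Return rows matching every query word and every exact-match filter."""
    words = query.lower().split()
    if words:
        ids = set.intersection(*(index.get(word, set()) for word in words))
    else:
        ids = {row["id"] for row in rows}  # empty query: filters only
    return [
        row for row in rows
        if row["id"] in ids and all(row.get(key) == value for key, value in filters.items())
    ]


index = build_index(records)
blue_15th = search(index, records, "blue pigment", century=15)
```

A real search engine adds ranking, stemming, and typo tolerance on top, but the combination of free-text matching with structured filters is the part researchers tend to notice first.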

19:07 Yeah, just a few things to sort of expose that.

19:10 So this, give us a sense of what these data exploration web apps might look like.

19:14 These are probably kind of mostly stuck to the inside, kind of internal to the research

19:20 lab research team groups and so on.

19:22 These are probably not that public facing, right?

19:24 Almost everything we work on does end up having a public facing component.

19:28 So maybe the research itself is done, locked behind a user login.

19:34 That's just for the researchers.

19:36 But then they expose that research to the public, usually with a good search interface

19:41 and different pages for exploring their data and visualizations and things like that.

19:47 So yeah, everything we do ends up becoming a production public web app in the end.

19:52 And then another one of your categories, you put it was virtual research environments

19:57 like data entry, publishing, authoring, collaboration.

20:00 Tell us about that.

20:01 Yeah, so a good example of this maybe is one of the projects that... Well, actually, the best example of it is the project I worked on

20:08 during my PhD. It's called Apatosaurus. The short story behind the name is that it sounds like

20:16 apparatus. In textual criticism, when you are displaying and visualizing variant readings to

20:24 a base text, that form of visualizing it is a critical apparatus. A critical apparatus is a

20:32 pretty boring website name, but Apatosaurus, dinosaurs, might make textual criticism sound fun.

20:37 Yeah, I do love dinosaurs. No, that's really cool. So this, this comes out as a web app. And I know

20:43 you also have some, you talked about some desktop apps as well.

20:46 Yep. Yep. That's right. So, yeah. So, so there's this people, people upload their,

20:50 their collation to this and then they can visualize it. And like there, there's a public

20:55 component of this as well, but really the backend is editing, editing a collation,

21:00 and adding notes to all of the different readings and stuff.

21:03 So I could show what the backend looks like, but we can also move on.

21:08 Let's move on just because most people will not totally hear, but just give us a sense of like,

21:14 like what do people, what do you create for people so that they're like, yeah, I can use this app, right?

21:21 Like give us a sense of some of the features, I guess is what I'm getting to.

21:25 Yeah, so another good example is we have a project at Harvard called Mapping Color in History.

21:33 And this is a collaboration with a lab.

21:37 This lab brings in pieces of artwork and they do spectral analysis on the pigments

21:42 so they can identify what was used to make a particular color of this red

21:48 or what was made to make this color of blue.

21:51 And then the idea is tracking how did people make those pigments over time,

21:57 over time and specifically in Asian art.

22:02 Is this the Dharmra, Puna, Puna?

22:05 No, this is Mapping Color in History.

22:08 I don't think it's up here.

22:09 Sorry about that.

22:10 Somewhere.

22:10 That's all right.

22:11 I'll find it.

22:12 Keep talking.

22:13 Okay.

22:14 So the front end is great.

22:16 You know, like the public end, this is people can explore by pigments

22:21 and then see the images that contain those pigments.

22:24 Now in the back end, what the researchers will be able to do is correlate exactly which

22:30 point of a painting the analysis was done on.

22:34 So they have this deep zoom image viewer where they'll zoom in and they'll select the point

22:39 where that was taken from.

22:41 So how else would you do that other than a digital interface to indicate on an image of

22:47 a painting where that spectral analysis was performed?
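One way to record where on a painting a sample was taken is to store the point as a fraction of the image's width and height, so it stays valid at any zoom level. This is a minimal sketch under that assumption; the class and field names are hypothetical and not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplePoint:
    """A spot on a painting, stored as fractions of image width and height."""
    x: float  # 0.0 = left edge, 1.0 = right edge
    y: float  # 0.0 = top edge, 1.0 = bottom edge
    pigment: str

    @classmethod
    def from_pixels(cls, px: int, py: int, width: int, height: int, pigment: str) -> "SamplePoint":
        """Convert a click on an image of known size into a resolution-independent point."""
        return cls(px / width, py / height, pigment)

    def to_pixels(self, width: int, height: int) -> tuple[int, int]:
        """Locate the same spot on a rendition of a different size."""
        return round(self.x * width), round(self.y * height)


# A click on a hypothetical 12000 x 8000 scan, then the same spot on a 600 x 400 thumbnail.
point = SamplePoint.from_pixels(3000, 1000, width=12000, height=8000, pigment="indigo")
thumbnail_xy = point.to_pixels(600, 400)
```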

22:52 Sounds almost like astronomy in a weird way.

22:55 Oh, yeah.

22:55 We zoomed into here and we took a different spectrum of the painting and we realized that it's actually identical to this, you know, something crazy like that, right?

23:04 Yeah, yeah, yeah, that's right.

23:06 Yeah, so it's essentially a pigments, like a pigments database.

23:10 So the third category of these digital humanities projects that you put down was like data extraction, transformation.

23:19 In data science, they often say, you know, 80% of the work is the data wrangling, which is like cleaning, organization, just getting it so you could possibly start asking questions about it.

23:29 I'm sure you all do a lot of that.

23:31 Absolutely.

23:32 So often, the very beginning of a project might be an Excel sheet or several spreadsheets.

23:41 And the first task is to ingest these into, you know, a proper database.

23:46 Not so much MongoDB for us.

23:48 It's going into Postgres.

23:50 We're a Django shop.

23:51 We're a Django shop.

23:52 So it's going into Postgres.

23:55 And yeah, no, that is probably the number one challenge of the early stage is figuring out what the right data model is, what the right relationships are to model the data.

24:07 Doing that work is advantageous to everybody because, you know, it helps both the researchers who brought the data to think about it in a more organized way.

24:17 I mean, they've been trying to do that.

24:18 And they have the spreadsheets.

24:20 But now we're modeling out the data so that we can add it to database tables and then to use later.

24:27 So that works out well for everybody.
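The step from flat spreadsheet to related tables can be illustrated with a small sketch. It uses the standard library's sqlite3 module so it runs anywhere; the conversation describes Postgres behind Django, and the table names, columns, and rows here are invented.

```python
import sqlite3

# Invented flat rows: the artist's name is repeated on every line, as in a spreadsheet.
spreadsheet_rows = [
    ("Court scene", "Artist One", "indigo"),
    ("Garden scene", "Artist One", "vermilion"),
    ("Temple mural", "Artist Two", "indigo"),
]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE painting (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        pigment TEXT NOT NULL,
        artist_id INTEGER NOT NULL REFERENCES artist(id)
    );
    """
)

for title, artist_name, pigment in spreadsheet_rows:
    # Each artist is stored once; paintings point at it through a foreign key.
    conn.execute("INSERT OR IGNORE INTO artist (name) VALUES (?)", (artist_name,))
    (artist_id,) = conn.execute(
        "SELECT id FROM artist WHERE name = ?", (artist_name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO painting (title, pigment, artist_id) VALUES (?, ?, ?)",
        (title, pigment, artist_id),
    )

paintings_per_artist = conn.execute(
    """
    SELECT artist.name, COUNT(*) FROM painting
    JOIN artist ON artist.id = painting.artist_id
    GROUP BY artist.name ORDER BY artist.name
    """
).fetchall()
```

Deciding that "artist" deserves its own table is exactly the kind of modeling question that forces everyone to think about the data in a more organized way.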

24:30 And yeah, absolutely.

24:31 Cleaning the data, getting dates, working with fuzzy dates, being able to parse July of 2020 or summer of 2020 and handling kind of all of those cases so that we do get dates in the end.
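Handling "July of 2020" or "summer of 2020" usually means turning a vague phrase into an explicit date range. Here is a rough sketch of that idea; the function name is hypothetical, the season boundaries are an assumption (Northern Hemisphere meteorological seasons), and winter is left out because it spans two calendar years.

```python
import calendar
import re
from datetime import date

MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4, "may": 5, "june": 6,
    "july": 7, "august": 8, "september": 9, "october": 10, "november": 11, "december": 12,
}
# Assumption: seasons as (first month, last month), Northern Hemisphere.
SEASONS = {"spring": (3, 5), "summer": (6, 8), "autumn": (9, 11), "fall": (9, 11)}


def parse_fuzzy_date(text: str) -> tuple[date, date]:
    """Turn 'July of 2020', 'summer of 2020', or '2020' into an inclusive date range."""
    match = re.fullmatch(r"\s*(?:(\w+)\s+(?:of\s+)?)?(\d{4})\s*", text.lower())
    if not match:
        raise ValueError(f"Unrecognized date: {text!r}")
    word, year = match.group(1), int(match.group(2))
    if word is None:
        first_month, last_month = 1, 12  # a bare year covers the whole year
    elif word in MONTHS:
        first_month = last_month = MONTHS[word]
    elif word in SEASONS:
        first_month, last_month = SEASONS[word]
    else:
        raise ValueError(f"Unrecognized date: {text!r}")
    last_day = calendar.monthrange(year, last_month)[1]
    return date(year, first_month, 1), date(year, last_month, last_day)
```

Storing both ends of the range keeps the uncertainty visible in the database instead of pretending the source recorded an exact day.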

24:45 One of the crazy stories from data parsing history is one of the, I can't remember exactly what it was, you talked about biology tools or genetics tools earlier.

24:56 One of the groups that names genes had to change the name of a gene because it kept getting parsed by Excel into a date.

25:04 Yeah, I remember that.

25:05 I remember that.

25:06 That's right.

25:07 Yes.

25:08 So these are the weird edge cases I'm sure you run into.

25:11 Like it's not even supposed to be a date.

25:13 Why is this a date?

25:13 I don't know.

25:14 Why is it helping out here?

25:16 The code keeps crashing.

25:18 Like pandas parsed it as a date and it's not or whatever.

25:21 Absolutely.

25:21 Yeah.

25:22 Yeah.

25:22 So yeah, usually lots of test suites around that ingest process until we've got it.
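A test around the ingest step might catch this kind of silent conversion. The sketch below flags values shaped like spreadsheet-converted dates (for example 1-Mar, which is what the old gene symbol MARCH1 can turn into) in a column that should hold identifiers. The function name and the sample column are hypothetical.

```python
import re

# Shapes a spreadsheet tends to produce when it converts text to a date, e.g. "1-Mar".
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$", re.IGNORECASE
)


def find_mangled_values(values: list[str]) -> list[tuple[int, str]]:
    """Return (row index, value) for entries that look like converted dates."""
    return [(i, v) for i, v in enumerate(values) if DATE_LIKE.match(v.strip())]


# Hypothetical identifier column after a round trip through a spreadsheet.
gene_column = ["TP53", "1-Mar", "BRCA1", "2-Sep"]
suspects = find_mangled_values(gene_column)
```

A check like this can run inside the ingest test suite and fail loudly, which is better than discovering the problem after the rows are already in the database.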

25:27 Now, once we've got it in, usually the research is ongoing and then we're able to provide

25:32 them now a new cleaned interface to do the additional data entry as the project is going.

25:38 And that's usually a win-win for everybody.

25:40 Sure.

25:40 And so this sort of ETL ingestion side of everything is it's like, don't worry,

25:46 DARTH has got it for you.

25:47 And then we'll provide you like a database connection to start working.

25:51 Or do you give them the tools and then they kind of iterate on them?

25:54 And how much is this you and how much is this you providing like CLI tools and stuff

26:00 or notebooks over to people?

26:03 I'd say most of the people that we're working with are aware of the technical tools,

26:08 but they don't want a database connection.

26:10 So we are giving them, we're doing the ingest and then building a platform where they can begin interacting with their data.

26:17 Yeah, I'm sure they don't want one.

26:20 Maybe you give them an app though, right?

26:22 With like Elasticsearch and other things that they can.

26:25 No, absolutely.

26:25 Yeah, that's what we do.

26:26 Yeah, okay.

26:27 Yeah, we give them a web platform to begin exploring, to begin publishing.

26:34 So I was thinking that you said you're a Django shop, which is cool.

26:38 It sounds, though, to me like describing what you're doing, just imagining how this is.

26:43 You're probably creating these projects often.

26:46 How often does one of these projects actually last?

26:49 Or how many of them do you iterate?

26:53 I'm trying to get a sense.

26:53 Do you work on stuff for a year or is it like every two weeks we're on a new project?

26:58 It's why I think of us as like an agency.

27:00 Because we get to work on greenfield projects fairly often, like you're imagining.

27:04 Which would not be the case normally at a big university IT department.

27:09 So, you know, maybe two or three projects a year, two or three big ones a year.

27:15 And then we have to put to bed a few a year as well.

27:18 Because these things, they're funded with grant money.

27:21 And then the grant money runs out and it's time.

27:24 And then we have to figure out what do we do with it now?

27:26 We don't want to lose the data and this way of presenting it.

27:31 But we can't keep paying for Elasticsearch.

27:33 Yeah, of course.

27:34 I'm certainly, we're going to dive into that because that is, but let's save that for the

27:37 end.

27:37 It seems like that's the arc of the story of these things.

27:40 But I certainly think it's something that you don't think about that much, right?

27:44 Like you said, it was only a hundred dollars a month for this.

27:47 And we got a big grant.

27:48 There's a bunch of, no big deal.

27:49 But like when the grant's out, who's on the hook for a hundred dollars a month and making

27:53 sure it survives upgrades and all that kind of business.

27:56 No, that's right.

27:57 Yeah.

27:57 So my original question when I started on this path was thinking like, do you, how do you

28:02 get started on these?

28:03 Do you have like a big framework or a cookie cutter sort of thing or something like this

28:07 is how we do it because it plugs into all this other automation and tools we built for

28:11 the last 10 projects.

28:13 You know, that's kind of a unique position.

28:14 A lot of companies build one website for themselves and that's their app or they're

28:19 an agency that goes across so many, so much variation.

28:21 They can't do that kind of stuff.

28:22 Right.

28:23 That's right.

28:24 That's right.

28:25 That's a good question.

28:26 We have things that we reuse.

28:29 Some of them are open source, different search components and things that we maintain that

28:36 we'll use across projects.

28:37 And we have tried to do the cookie cutter Django project.

28:41 The truth is, each project is different enough that really we like to evaluate it from first

28:47 principles as we're evaluating it and thinking, what is the best technology to use?

28:55 Yeah.

28:56 So yeah, we don't have a cookie cutter.

28:59 We don't have a kind of a meta framework for bootstrapping them because they're sufficiently

29:04 different from each other that we...

29:05 I find that too.

29:07 I find that too.

29:08 The idea of how we could just grab this cookie cutter or copier.

29:12 Are you familiar with copier?

29:14 People out there might be familiar with that.

29:15 It's a little bit like cookie cutter with the bonus that you can update it later if you

29:21 change your mind about something, like actually change this project to use Postgres rather

29:24 than SQLite or something, which is pretty cool. But every time that I do, every time I try to work

29:30 with one of those projects, even ones that I've created for myself, I'm not hating on anyone.

29:34 I'm like, oh, it's like 75% awesome and 25%. I just got to take this stuff out. You know,

29:39 I'll just, I'll just do it from scratch. It's not, how hard is this? I'll just create a few folders

29:43 and put a few things in there and I'll copy the one, like the pyproject.toml or like the one thing

29:48 that's like, how do I do this again? I'll just copy that and we're good to go. Yeah. I mean,

29:52 That's what I find.

29:53 That's what I find.

29:53 I find it, it seems like a really brilliant idea, but in practice, it hasn't saved us time yet.

30:00 No, I mean, maybe it's a case study.

30:02 Like, okay, let's see what they're doing for this one.

30:04 Oh, that is interesting how they're integrating this other thing maybe,

30:07 but as a true foundation, I find it in theory awesome.

30:11 In practice, I just end up not doing it for various reasons.

30:14 Don't know why.

30:14 I'm gonna save this for later.

30:16 Because the question I'm about to ask you is gonna send us just down a rat hole.

30:21 So instead, before we go down the rat hole, maybe we could, not that one, maybe we could

30:27 talk about, I mean, you talked about some, but let's maybe just feature some of the projects

30:32 that are maybe more well-known that you guys have done.

30:35 Sure.

30:35 Yeah, good.

30:36 So yeah, one of them is called the Amendments Project.

30:40 And this is, I didn't know this until I started working on this project, that there are, there

30:46 There have been thousands of, I think it's 22, at least 22,000 proposed amendments to

30:52 the United States Constitution that never went anywhere.

30:56 And so kind of the goal of this project is to show that there have been lots of attempts

31:02 to amend the Constitution, but actually the Constitution is frozen.

31:06 I mean, it's not actually amendable anymore, at least not in the politics of any time recently.

31:12 So this is a database.

31:14 I cannot imagine a situation where the U.S. Constitution gets amended.

31:19 It has to be unanimous across all the states, right?

31:21 Is that right?

31:22 I can't remember.

31:23 I don't know.

31:23 I don't remember off the top of my head if it has to be unanimous,

31:25 but it certainly has to be across party lines.

31:28 Yeah, it's got to be pretty darn close if it's not at all.

31:32 It's like time travel or travel at the speed of light.

31:36 Could be theoretically possible.

31:38 Probably not going to happen.

31:40 No, it's hard to see.

31:41 It's hard to see.

31:42 Yeah.

31:42 So this is from a historian at Harvard.

31:46 And so it's a database of all and the full text from all of these amendments.

31:53 And, you know, it's from the public's point of view, it's a Postgres full text vector search interface for finding and filtering through on all of the different amendments that have been proposed.

32:08 I love it.

32:08 Yeah, this is a nice looking site.

32:10 We work with a designer.

32:12 She's very good. Yeah, of course, like an agency would, right? Yep, yep. Nice. So we'll

32:17 get a really pretty, rich search interface and then off you go. I have no idea even

32:22 what I would search for, but yeah. Well, you can always search for something

32:25 religious, something abortion related. There's gonna be lots of things there. I

32:29 thought all those, also like guns, but like, I don't want to go down, I'm not sure I

32:32 even want to go down there, right? Awesome, though. This looks super useful. Maybe

32:37 someday we'll have a functional government again. We'll see. Let's

32:41 change it, or maybe we'll go down and it's folklore. Like, look at you. So, all right. So yeah, so another

32:45 really great project, at least from a content point of view, that's interesting, the research

32:51 that it's doing, is the Fionn Folklore Database. So in Celtic storytelling, you know,

33:01 moms have been telling stories to daughters, and people have been

33:08 telling stories for a very long time, hundreds or a thousand years, about Fionn mac Cumhaill, who is a

33:14 hero from Irish mythology, some of it based in, you know, historical events, but it

33:21 goes back so far. So there are many hundreds or thousands of these

33:29 stories that have been spread, and versions of these stories that have been told. And

33:33 And so some of them are audio recordings where somebody like some researcher has gone out to an island off the coast of Scotland and recorded somebody telling their version of the hero of Finn and his band of heroes.

33:47 You know, they defend Scotland and Ireland from invaders and attackers.

33:53 Very exciting stories and stuff and a team of characters.

33:59 So there's audio recordings and then there's documents, like written documents that contain

34:05 these.

34:05 And so this is a database of kind of all of those all in one place with, on the public

34:11 side, a nice search interface for discovering them, you know, either using the map view or

34:18 searching.

34:18 Yeah, that's cool.

34:19 I got my map view for some random thing I searched about here.

34:22 Amazing.

34:23 But this is pretty interesting, all these different tellings and stuff.

34:26 Oh, and yeah, one of the big challenges with this project is that it's fully internationalized.

34:33 So it's available in English.

34:35 Everything is available in English, Scottish Gaelic, and Irish Gaelic, but that extends

34:40 into the database.

34:41 So usually people have multiple names recorded for them.

34:45 And so, yeah, you may have one person with any number of names in different languages,

34:51 sometimes more than one Scottish name, that kind of thing.

34:54 And so the data model on this one is quite messy, but sensible.

35:00 But yeah, it's quite a lot of different kinds of data to wrangle.

35:03 And then with all of the translations for each thing.

35:05 Yeah, that's wild.

35:06 It's not just, we need the user interface of this thing to translate about.

35:12 That's way more, right?

35:13 Yeah, yeah, it is that.

35:14 It is that.

35:15 And then it is also, yes, all the items in the database have a translation or can.
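The "one person, any number of names in different languages" shape can be sketched as a mapping from language code to a list of names, with a fallback when a translation is missing. This is a hypothetical illustration, not the project's data model; the names are placeholders, and the language codes are the ISO 639-1 codes en (English), gd (Scottish Gaelic), and ga (Irish).

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """One person with any number of recorded names per language."""
    names: dict[str, list[str]] = field(default_factory=dict)

    def add_name(self, language: str, name: str) -> None:
        self.names.setdefault(language, []).append(name)

    def display_name(self, language: str, fallback: str = "en") -> str:
        """Prefer the requested language, then the fallback, then any name at all."""
        for code in (language, fallback):
            if self.names.get(code):
                return self.names[code][0]
        for recorded in self.names.values():
            if recorded:
                return recorded[0]
        return ""


# Placeholder names: one English name and two Scottish Gaelic names for the same person.
storyteller = Person()
storyteller.add_name("en", "English name")
storyteller.add_name("gd", "Gaelic name A")
storyteller.add_name("gd", "Gaelic name B")
```

In a relational database the same idea usually becomes a separate names table with a language column and a foreign key back to the person, which is part of why the model gets messy but stays sensible.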

35:22 This portion of Talk Python To Me is brought to you by us.

35:25 I'm thrilled to announce a brand new app built for developers created by yours truly.

35:30 It's called Command Book.

35:32 You know that thing you do every morning?

35:34 Open up six terminal tabs, CD into this directory, activate that virtual environment,

35:39 run the server with --reload.

35:40 Now, CD somewhere else, start the background worker, another tab for Docker,

35:45 another one to tail production logs.

35:46 Every tab just says Python, Python, Python, Docker tail.

35:50 and you're clicking through them going, which Python was that again?

35:53 Where my app is running?

35:55 Then sometime later, your dev server silently dies because it tried to reload

35:59 while you're in the middle of a code edit, unmatched brace, a half-written import or something.

36:04 Now you're hunting through tabs to figure out which process crashed

36:07 and how to restart it.

36:08 My app, Command Book, gives all of these long-running commands a permanent home.

36:13 You save a command once, the working directory, the environment,

36:17 pre-commands like git pull, and from then on, you just click run.

36:20 You can even group commands together to start and stop everything for a project

36:24 with a single click.

36:25 It also has what I call honey badger mode, auto restart on crash.

36:29 So when your dev server goes down mid-reload, Command Book just brings it right back up

36:34 and does so over and over until the code is fixed.

36:37 It also detects URLs from your output so you're never scrolling through thousands of lines of logs

36:42 just to figure out how to reopen your web app.

36:44 And it shows you uptime, memory usage, and all sorts of cool things about your process.

36:49 The whole thing is a native macOS app.

36:51 No Electron, no Chromium, just 21 megs.

36:54 And it comes with a full CLI.

36:55 So anything you've configured in the UI, you can fire off from your terminal

36:59 with just a single command.

37:00 Right now it's macOS only, but if there's enough interest,

37:04 I'll build a Windows version too.

37:05 So let me know.

37:07 Please check it out at talkpython.fm slash command book app.

37:11 Download it for free, level up your developer workflow.

37:14 The link is in your podcast player show notes.

37:16 That's talkpython.fm/command book.

37:19 I really hope you enjoy this new app that I built.

37:22 You want to work in the native language of the people who did that part of the folklore

37:26 or whatever, right?

37:27 Yeah, well, and people are still speaking those languages.

37:30 So people who would use this to, you know, like somebody may have heard a story from

37:34 their mom or dad and would now like to find other versions of that story.

37:38 And they live in a part of Scotland where they speak Scottish Gaelic as their first language.

37:42 They can still access the site.

37:43 And then that Mapping Color in History one, that's another one of the public ones that you said is pretty major.

37:49 Yeah, that's right.

37:50 Yeah.

37:50 So, yeah, that's a pigments database.

37:53 You can search by English color names like blue and find all of these Asian paintings that have blue, or by the particular kind of pigment they used to make the blue.

38:04 Yeah, nice.

38:05 So what's the open source story?

38:08 You're creating all these apps, maybe some of these frameworks.

38:11 There's got to be some tools.

38:12 Is there a big desire or already an effort to have a lot of these things open source, or is it too niche, or is this just the advantage Harvard has that other universities don't get?

38:27 No, it's something we talk about quite a bit.

38:30 Usually these things start, usually they start closed source during development.

38:35 And then we work with the faculty and we talk about how we can take, you know, like the repo for the web app, how we can take that public.

38:45 And so we've done that for a number of projects.

38:48 Not all of them are.

38:50 But the ideal is that they all make their way into the open, and especially when they become archived.

38:56 Sure.

38:56 Yeah, that's a good way to help them live on.

38:58 And they might even go into GitHub's Arctic Vault, which is crazy.

39:03 I don't know if people know about that out there, but GitHub has, quite a while ago, started taking copies of all of the repos and backing them up and storing them in the Arctic vault.

39:14 It's kind of cool.

39:15 I really, really, really hope we never need that, but it's kind of neat.

39:18 Yeah, me too.

39:20 Usually universities have their own archival system, so any important research data is usually part of that system as well.

39:30 I see.

39:30 Okay.

39:31 Yeah.

39:32 Obviously, right?

39:32 Like I'm just, I can't remember where it was.

39:34 It was somewhere, I think it was South Korea or Taiwan where like seven years of government

39:40 data got lost or something like that.

39:41 It was really, really bad recently.

39:43 There was a fire and I think they had backups, but maybe just into the building, you know,

39:47 like we'll put that out.

39:48 We'll back it up to the hard drive over here.

39:50 Not good.

39:51 No, not good.

39:52 You definitely want this stuff to survive.

39:54 I mean, academia has this history of like tomes that have survived the past and really,

40:00 really long lived information.

40:02 Right.

40:02 Besides the Library of Alexandria or something like that, maybe.

40:05 That's what we want.

40:06 That's what we want.

40:07 We want it to, yeah, we want it to last.

40:09 Absolutely.

40:10 So maybe that's a good time to sort of talk about the trailing end.

40:14 I think there's a lot of interesting things going on here.

40:18 Just like you've run out of money, not because you actually run out of money.

40:23 The grant is done and you've either spent or given back or whatever

40:26 with the remaining little bits of money.

40:28 It's always a weird balance with research.

40:30 It's like, oh, we got $3,000 left on this research grant.

40:33 What are we going to do with it?

40:34 It's not like, oh, we're going to give it back.

40:35 We just didn't need it.

40:36 It's like, we're going to find a way to like fund a student to do a little more work or

40:41 whatever.

40:41 But eventually the grant is over.

40:43 That's right.

40:44 You've got some expensive app access to a big database because it needs a big search or

40:49 a lot of compute or something.

40:50 That's right.

40:52 Everything during, like, I mean, anything, anything that's a, that's a Django app.

40:56 We deploy to AWS using containers, which isn't the cheapest way to host anything.

41:05 But that's for the most part the Harvard way.

41:10 And it is robust and is reliable.

41:12 And we don't have a DevOps person on call on the weekend to rescue one of these apps.

41:22 So having them reliable is good.

41:25 Okay, so it's on AWS and paying for the containers, paying for that Elasticsearch cluster,

41:33 the RDS Postgres database.

41:36 Okay, well, even if somebody wants to start paying for that out-of-pocket,

41:40 all of those little services, they add up to enough that we need to do something

41:44 when the project hits end of life.

41:46 And so our gold standard that we've developed so far is asking, can this become a static website?

41:55 Can we bake this out into all HTML files and acknowledge that there will be some trade-offs?

42:01 We will trade off some searching.

42:04 You know, it's not gonna have Elasticsearch.

42:06 Doesn't mean that it won't have any search though.

42:08 So we'll trade out Elasticsearch and it'll be very difficult to add new data,

42:13 but that's okay because it's being archived.

42:15 So can we get it into a static site?

42:18 And that's challenging depending on how you've set it up.

42:20 So we now have projects where we set them up from the beginning to be archivable like this.

42:26 And one of them is called Water Stories.

42:29 And it was a companion to an art installation at the Radcliffe Institute on the Harvard campus.

42:36 And so this was this live site during the duration of the art installation where people could come in and add stories that they had about water onto an iPad.

42:46 And then those went up to our database.

42:49 We built that with something called Django Bakery, which, if you opt in and you use all of their

42:54 class-based views the way that they're meant to be used, then you can bake this out into static files

43:00 when you're done. Very low effort. That was perfect. That is such a cool idea, and mad props to them for

43:05 ASCII art logos. Come on now. I feel like that should be in the view source if it's not. But

43:11 this is such a cool idea because you can just take a working site. You guys are a Django

43:17 shop, so a lot of your sites are written in Django, and you just go make it static, right?

43:22 Essentially, yes. And what's really great about it is if they wanted to make

43:27 a change, and they have, since we made it static, they've asked for a

43:31 couple of changes. So locally, I just Docker Compose up this whole application, make the change

43:37 in the Django admin, and rebake the site. And so it can still be updated. Something,

43:42 if you've never tried this, like something like, Hey, can we just add one more menu item?

43:47 And you're like, no, no, no, we're not adding the menu item because you want that.

43:50 That means we're changing 7,300 pages because they all bake in the whole HTML.

43:56 Right?

43:56 Exactly.

43:57 Yeah, exactly.

43:58 But if that's in my, in my Django database and my SQLite file, then no problem at

44:02 all because then I just rebake it.

44:04 Yeah, yeah, exactly.

44:05 Absolutely.
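The bake-and-rebake loop described here can be illustrated without Django at all. What follows is a minimal sketch using only the standard library; it is not Django Bakery's API. The `stories` table, the `bake` function, and the `build/` layout are hypothetical names chosen for illustration. It renders each database row to its own HTML file, and shows why a menu change means rewriting every page, and why that is painless when the content still lives in a database.

```python
import sqlite3
import tempfile
from html import escape
from pathlib import Path

# Every baked page embeds the full menu, so a menu change touches every file.
PAGE = """<!doctype html>
<html><body>
<nav>{menu}</nav>
<h1>{title}</h1>
<p>{body}</p>
</body></html>
"""


def bake(conn, out_dir, menu_items):
    """Render every row of the (hypothetical) stories table to a static HTML file."""
    out_dir = Path(out_dir)
    menu = " | ".join(escape(item) for item in menu_items)
    written = []
    rows = conn.execute("SELECT id, title, body FROM stories ORDER BY id")
    for story_id, title, body in rows:
        page_dir = out_dir / "stories" / str(story_id)
        page_dir.mkdir(parents=True, exist_ok=True)
        target = page_dir / "index.html"
        html = PAGE.format(menu=menu, title=escape(title), body=escape(body))
        target.write_text(html, encoding="utf-8")
        written.append(target)
    return written


def demo():
    """Bake once, then rebake after adding a menu item."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stories (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO stories (title, body) VALUES (?, ?)",
        [("River", "A story about a river."), ("Rain", "A story about rain & storms.")],
    )
    build = Path(tempfile.mkdtemp()) / "build"
    first = bake(conn, build, ["Home", "Stories"])
    second = bake(conn, build, ["Home", "Stories", "About"])  # the "one more menu item" case
    return build, first, second
```

Calling `demo()` bakes two pages and then overwrites both with the extra menu item, which is the same trade-off at small scale: every page changes, but the rebake is one function call.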

44:06 So I think this is super neat.

44:09 There's also Frozen Flask.

44:13 If I could get rid of all the ads, I do not need a Yeti thing, whatever that is.

44:17 The glass, not the mythical thing. But Frozen Flask, which does a similar thing for Flask

44:25 apps, if you're a Flask person. It probably would work with Quart. Don't know for sure, but probably.

44:30 So that's a pretty interesting idea as well. I'll throw that in there. But also, what else?

44:37 Also you talked about search, right? That can be, can be such a problem. And I'm a huge fan of your

44:45 recommendation here with PageFind. Tell us about PageFind. So this has been, I think it's been a

44:50 bit of a game changer in how functional one of these archived sites can remain. So we're actually

44:56 in the process of that amendments website that searches across 22,000 full texts of amendments.

45:04 We are in the process of sunsetting that, and that will become a static site. And for that search,

45:09 we already have an internal demo that proves that we can replace that Postgres full-text search

45:16 with PageFind. You lose vector search. Yeah. You've kind of got to get really

45:22 true keyword matching. Yeah. Yeah, that's right. But you still get filtering. I mean,

45:27 and really faceting and filtering is when it comes to discovery of things, I mean, I find

45:34 that's really what's useful. So filtering these amendments by state or by the Congress that was

45:40 active at the time or by the person who co-wrote it. All of those are totally great in PageFind.

45:50 And the keyword search is just fine in PageFind. One of the things I really like about it is that

45:55 it takes your index and it chops it up into lots of little files that can just fly across the

46:00 network. So it's a very fast search. It's not a huge network load, even if your index is

46:07 initially very large. And it essentially cuts it up somewhat alphabetically. So if your search

46:14 starts with T, or I should say a better word for audio, if it starts with W, then it will load up

46:20 the index for words that start with W and fly that over the network instead of the whole thing.

46:26 So it's pretty slick and it has a great Python API.

46:29 So to do the proof of concept for the amendments search, I just took a database dump and then manually indexed with a Python script into PageFind.

46:40 Wait, there's a Python API for PageFind?

46:43 Yeah. So the way PageFind works, I should have said that, is the way most people will use it

46:48 is by normally PageFind consumes HTML. So you give it access to your dist folder.

46:56 Oh, okay.

46:57 And then it crawls through all of your HTML files.

47:00 And you can do great things like adding little HTML tags that are just for PageFind,

47:05 that give it the filtering ability, or that you want to sort by something.

47:09 And so that's great.

47:11 Or you can just call PageFind from Python or from TypeScript and just build that index manually.
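For the HTML-crawling route, a rough sketch of what those PageFind-only annotations can look like when the pages are generated from Python. The `data-pagefind-body`, `data-pagefind-meta`, `data-pagefind-filter`, and `data-pagefind-sort` attributes come from PageFind's documentation; the record fields (`title`, `state`, `congress`, `sponsors`, `text`) and the `render_amendment` function are hypothetical and not taken from the amendments project.

```python
from html import escape


def render_amendment(record):
    """Render one (hypothetical) amendment record as HTML annotated for PageFind."""
    # Each sponsor becomes its own filter value.
    sponsors = "\n".join(
        f'      <li data-pagefind-filter="sponsor">{escape(name)}</li>'
        for name in record["sponsors"]
    )
    return f"""<!doctype html>
<html lang="en">
<body>
  <main data-pagefind-body>
    <h1 data-pagefind-meta="title">{escape(record['title'])}</h1>
    <p>State: <span data-pagefind-filter="state">{escape(record['state'])}</span></p>
    <p>Congress: <span data-pagefind-filter="congress" data-pagefind-sort="congress">{int(record['congress'])}</span></p>
    <ul>
{sponsors}
    </ul>
    <p>{escape(record['text'])}</p>
  </main>
</body>
</html>
"""
```

Once files like this are written into the output folder, the usual next step is to run the PageFind CLI over that folder (for example `npx pagefind --site dist`, assuming the folder is named `dist`) so it can crawl the pages and build the index.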

47:18 Well, thanks a lot, David.

47:19 I have another thing I've got to go research.

47:21 This is awesome.

47:22 I'm a huge fan of PageFind, as I said.

47:24 My personal website, mkennedy.codes, is just a pure static site.

47:29 It starts in Markdown and ends up in HTML.

47:31 But if you add PageFind in, you get a super rich search. If you just want to know, say,

47:36 like what was about Docker, it shows you really nice results,

47:40 pulling out the different parts of the page and sections that talk about it,

47:43 like the headers and then what is said.

47:45 And it even does sub-word matching, you know, like you just type doc,

47:50 it finds all the words that match that.

47:52 And what I really like about it is a couple of things.

47:54 It's instant. It basically is like nearly instant. If you type a few things, it gets way faster

47:59 because it's pulling down. And if you go and look in the network console here and you type

48:05 something, you can see that it's actually pulling in these little tiny fragments, which this one's

48:10 coming off disk cache in three milliseconds, right? But it breaks your index into a bunch of very small

48:16 PageFind fragments. I think it's like anything that starts with the letters

48:21 DO. These are all the prebuilt results and stuff like that. Right?

48:25 That's right. That's right.

48:26 Yeah. That's super cool.
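The idea of chopping an index into small fragments so the browser only fetches the piece it needs can be sketched in a few lines. This is an illustration of the general technique only: PageFind's real index format and chunking rules are different, and the function names and one-JSON-file-per-first-letter layout here are assumptions made for the example.

```python
import json
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def build_shards(pages, out_dir):
    """Write one small JSON index file per first letter instead of one big blob."""
    shards = defaultdict(lambda: defaultdict(set))
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            shards[word[0]][word].add(url)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for letter, words in shards.items():
        payload = {word: sorted(urls) for word, urls in words.items()}
        (out_dir / f"{letter}.json").write_text(json.dumps(payload), encoding="utf-8")
    return sorted(shards)


def search(prefix, index_dir):
    """Load only the shard for the query's first letter, then prefix-match inside it."""
    prefix = prefix.lower()
    if not prefix:
        return []
    shard = Path(index_dir) / f"{prefix[0]}.json"
    if not shard.exists():
        return []
    words = json.loads(shard.read_text(encoding="utf-8"))
    hits = set()
    for word, urls in words.items():
        if word.startswith(prefix):
            hits.update(urls)
    return sorted(hits)
```

A query for `doc` reads only `d.json`, however large the rest of the index is, which is the property that keeps the network load small.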

48:27 Yeah. One of our open source projects that we maintain is a

48:34 Vue.js component library for PageFind so that we can style it and reuse it

48:39 across different projects.

48:41 Oh, that's awesome. I love it.

48:42 Yeah. I think this really unlocks it.

48:44 And I mean, you go to so many, so many sites, like their documentation or just

48:48 their web app in the search is so bad.

48:51 You type something and it's like thinking, spinning, spinning, spinning, spinning.

48:57 And then like five seconds later, it gives you kind of janky results.

49:00 And if you just throw PageFind in there, you can't type fast enough to

49:05 outrun the results.

49:05 You know what I mean?

49:06 No, that's right.

49:07 Yeah.

49:07 Too many static site search solutions, they use like a, like a JSON blob that you, that

49:12 you have to pull down and, and then iterate through.

49:15 You know, what's worse.

49:16 and I see this a lot, would be if you go to google.com

49:21 and then you would say effectively site colon whatever

49:24 and then you search Docker, right?

49:26 They basically pull that.

49:29 You know, they just say search this and you just get Google results for your site.

49:33 And obviously it's, I mean, Google's fine, but it's just.

49:36 No, I find that unusable, really.

49:38 I do too.

49:38 It really, you're like, ah, geez.

49:41 But now I'm super excited to realize I can do that from my dynamic content as well.

49:46 So with the Python integration.

49:48 OK, nice.

49:51 What about something truly static?

49:53 Have you looked at Hugo and some of the other type of things?

49:56 Sure.

49:57 So when I see you've even got the tab up for the Tsumeb project,

50:02 which is-- that's essentially a database of many, many specimens

50:09 taken from the Tsumeb mine.

50:11 So in the--

50:12 Oh, it is.

50:13 Yeah, yeah, it is.

50:13 So if you click on Minerals database, you open up that search interface and that's powered by PageFind.

50:19 Oh, this is?

50:21 Yes.

50:22 I forget what I was...

50:23 I see.

50:24 You guys even hooked into...

50:26 I was thinking just like pure static, like Hugo, like...

50:30 Oh, yes. Yes. Yes.

50:31 So this is an Astro site.

50:33 So for this website, we have this as an Astro site so that we have a little...

50:37 Because with Astro, they make it so easy to pull in like Vue components.

50:42 So like our PageFind is a custom Vue.js component library. With Astro,

50:47 you can use React components, you can use the Vue components, but what it does is it's just

50:52 a static site generator. Fantastic. So a little bit more designable

50:57 than like Hugo or something. Here's your markdown file. Good luck with that.

51:00 Yeah. I love Hugo though. Yeah. I use Hugo for different personal sites here and there,

51:05 and it's just so fast and easy to get up and running. But yeah, it's great.

51:08 Great, great when it's a good friend.

51:09 That's what my website's written in, it's in Hugo.

51:12 But if I'm integrating with anything else, I used to kind of like split it up,

51:15 like this part's Hugo and this part's like a Python app.

51:17 And it's pretty easy to get something that'll take a bunch of markdown files

51:21 and just turn them into HTML and just put a page template around that.

51:25 So I've kind of stepped away from mixing and matching that

51:29 as much as I used to.

51:30 So now if I got a static section of a dynamic site, but that doesn't address,

51:34 has nothing to do with the archival side of things, right?

51:38 Because the idea is that the thing that I'm describing is gone on purpose.

51:42 That's right.

51:42 So you've got some, we've got Django Bakery.

51:46 I threw out Frozen Flask, and I'm sure there's a ton more that neither of us are aware of at the moment.

51:52 So Django Bakery was really good for that purpose.

51:56 And we're keeping our eyes open for projects that it's a good fit for.

52:01 But that was a pretty simple website.

52:03 It needed a dynamic backend, but it was quite straightforward.

52:06 And for Django Bakery, you have to opt into inheriting from their class-based views.

52:11 I see.

52:12 So if you're doing, for example--

52:13 You've got to dig ahead of it, yeah.

52:15 Yeah, yeah, yeah, absolutely.

52:17 Yeah, hard to add retroactively.

52:18 Probably impossible.

52:20 Now, our other websites, like the Fionn example and the Mapping Color example, those are APIs.

52:27 That's a Django API, Django REST Framework for one, GraphQL for the other.

52:32 One has a Vue front end, one has a React front end.

52:34 OK, well, Django Bakery just isn't going to work very well for, like, serializing JSON.

52:39 Yeah, it's like awesome.

52:40 Here's your unrendered JavaScript front end code and it's just going to look empty or something.

52:45 Yeah.

52:46 So it is a good reason to consider using like vanilla Django templates when possible,

52:52 like for that reason.

52:53 But those were, those were inherited from the vendors, those two sites.

52:59 And we've made a lot of progress on those.

53:01 So, you know, what, what to do in that, like in that situation, Django Bakery isn't an option. And those projects are not end of life

53:10 yet. So we have some time, but we're, we're, we're, so what we're doing is strategizing, okay,

53:15 how will we rescue them? How will we keep them alive once, once somebody needs to stop paying

53:20 for hosting? And we have, we have ideas. We have, I think there's, there's clever, interesting

53:26 things out there. We'll have to keep looking into it. There are some pretty interesting ideas. And

53:34 that ran in a container, you could just have WebAssembly, but still have it go, right?

53:41 Sort of a local loopback type of thing.

53:43 Yeah, I'm really interested in this one because it enables essentially the full functionality

53:51 of the live site to exist as what is just a static site.

53:55 So because of Pyodide and projects like PyScript, we can run Python in the browser and we can

54:03 run SQLite in the browser. And now we can even run Postgres in the browser with PGlite. So if

54:09 we can run all those things in the browser, then couldn't we have Django hosted right in the browser?

54:15 And you can. So there's a proof of concept that proves it's possible called Django WebAssembly.

54:23 And if you load this up, it'll let you log in to the Django admin. And you're not logging into

54:29 anybody's backend, you're logging into your own browser where this is running in a service worker.

54:36 Awesome. Look at that. Oh, hold on. It told me what the password was. Very secure.

54:40 Matt, password.

54:42 Well, it can be entirely insecure because, yeah, you're just, it's running right in your own browser.

54:47 Yeah, that's awesome. And here we are, Django admin. Incredible.

54:50 Yeah, so I'm pretty interested in this. You've got to convert an RDS Postgres database

54:55 into either SQLite or something like PGlite, but I think that's all doable.

54:59 So I think it's an exciting possibility.

55:02 Yeah, for sure.

55:03 I do think, so maybe you have a rich query system that you're powering by your database

55:08 that's really heavy.

55:09 Exactly.

55:10 And it's got a bunch of data that's like, here's all of our working data

55:13 that you might ask questions about.

55:15 Maybe you just convert that to page find to help you find the pieces

55:18 and then just keep the operational data, maybe even in a SQLite file, and with the Django ORM,

55:23 you can just switch the connection, keep talking to it.

55:25 I mean, there's possibilities to just get something not too terrible

55:28 Well, it's not the same, but not that far off.

55:31 Yeah, exactly.
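The "just switch the connection" idea can be sketched as a small settings helper. The two engine strings are Django's standard backends; everything else (the `ARCHIVE_MODE` variable, the `DB_*` variable names, the `archive.sqlite3` filename, and the `database_settings` function itself) is hypothetical and only meant to show the shape of the switch.

```python
from pathlib import Path


def database_settings(env, base_dir="."):
    """Return a Django-style DATABASES dict for live or archived mode."""
    if env.get("ARCHIVE_MODE") == "1":
        # Archived: a single SQLite file that ships alongside the code.
        return {
            "default": {
                "ENGINE": "django.db.backends.sqlite3",
                "NAME": str(Path(base_dir) / "archive.sqlite3"),
            }
        }
    # Live: a managed Postgres instance, configured from the environment.
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": env.get("DB_NAME", "app"),
            "USER": env.get("DB_USER", "app"),
            "PASSWORD": env.get("DB_PASSWORD", ""),
            "HOST": env.get("DB_HOST", "localhost"),
            "PORT": env.get("DB_PORT", "5432"),
        }
    }
```

In a `settings.py` this would be used as `DATABASES = database_settings(os.environ, BASE_DIR)`. Moving the rows themselves is a separate step, and anything that relies on Postgres-only features would still need attention before the ORM code runs unchanged on SQLite.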

55:32 And then it goes on GitHub pages and it can live hopefully forever.

55:35 I mean, it feels like GitHub will last forever, but it'll last longer than funding will anyways.

55:41 It's definitely going to last longer than just something that we can't pay for anymore, right?

55:48 I don't know how long GitHub's going to be around for, I think a while, but you never know, right?

55:53 It seems like stuff's going to last forever, then it gets changed.

55:57 We had subversion.

55:59 Now it's completely gone, right?

56:00 Just 20 years, 15 years later, but still, I think 100% there.

56:05 Yeah.

56:05 But if somebody can, if something ever happened, somebody just needs to copy that,

56:09 that folder of HTML, CSS and JavaScript files and dump it into an S3 bucket or somewhere else.

56:15 And then it can continue living there.

56:17 So it's a good option.

56:19 It's a great option.

56:20 It's a really, really good option.

56:21 I mean, I guess one of the long-term concerns might be what if the WebAssembly standard changes so much that it's not supported anymore?

56:31 But you could probably bite-wise convert it if you had to, you know, like somebody would probably be able to create one.

56:37 Yeah, that would be unfortunate.

56:39 So I suppose if that happens, I mean, if that happens, yeah, we're booting up one of these projects is like booting up an emulator for some old DOS game.

56:49 Right, right.

56:49 Well, I mean, I guess let's think about this for a second.

56:52 Somebody got, oh gosh, what was the chain?

56:55 This is the whole JavaScript thing, the PyCon talk where someone got, like, Firefox

57:04 compiled into, not WASM, into asm.js or something like that.

57:10 So it was like Chrome was running Firefox, which was running, I think,

57:14 Doom, which was also asm.js.

57:17 If we can do that, we could get something that would run, that would read old

57:22 WebAssembly into new WebAssembly if it really mattered to the world.

57:24 Absolutely.

57:25 Yeah.

57:26 Especially if it's in a public repo that people who care about the data can,

57:30 can rescue it somehow.

57:31 Yeah.

57:32 What about like a virtual machine?

57:34 You know, I agree.

57:35 Yeah, absolutely.

57:36 Could have saved me some, take a snapshot of Ubuntu LTS, some version,

57:42 and just what are we going to do?

57:44 Everything we do is Dockerized.

57:46 Everything is in a container.

57:47 So in the worst case scenario, we could give somebody the image, and they could run it if

57:51 they have Docker.

57:53 I think that's a nice peace of mind to know that no matter what, something will be able

57:57 to run this container.

57:59 And even in, I don't know if you've used GitHub, what is it called, Codespaces.

58:05 I archived one project.

58:07 It was kind of dramatic and sudden that it needed to be archived, so without much time

58:12 to do anything.

58:13 And it was a Ruby on Rails project.

58:15 And I'm not a Rails developer, but I was able to get it archived in a way

58:19 that anybody could, with one command, go to the repo on GitHub and boot it up in Codespaces

58:27 and then have it live running from their Codespace.

58:30 And so that works too.

58:32 Very cool.

58:32 I think as WebAssembly grows, there'll be more possibilities for these types of things.

58:38 Yeah, amazing.

58:39 I'm pretty excited about PageFind having a Python API.

58:42 didn't realize that. So I'm going to be doing something with that for sure. So what else?

58:46 Let me ask you one more thing before I kind of let you wrap up with some final thoughts here.

58:51 What about AI? Oh, that's a good question. So AI, I mean, there's like, in my story,

58:58 there's like one interesting part of AI, which is that I got started and self-learned everything I

59:04 needed to about software development to begin doing this right before ChatGPT really came on and

59:10 was able to do real programming. Yeah, you had like four years of legit programming before, right? So I

59:17 think, I mean, when I was thinking about how I got into it, I thought,

59:21 what if I was four years later starting my PhD and wanting to do these tools? I would have been

59:28 able to accomplish what I needed to for my research without acquiring the technical skills. And that

59:34 would have been... That's a good thing? I'm not sure if that's good. It could be both. I

59:37 would have thought it was a good thing. I would have thought it's a good thing. But in my hands

59:43 now, like a software engineer, AI is more powerful in my hands now than it would have been then.

59:52 So I can make it work for me. Yeah, I can make it work for me in a way that I couldn't have been

59:57 able to then. So I'm thankful for that, but it's something I think of. I don't want to say it's

01:00:02 necessarily a bad thing, but it definitely marks a difference, a difference in time between other

01:00:07 people who are maybe wanting to get into digital humanities, they're humanities researchers. They

01:00:13 want to add some digital tools. You know, I think this will kind of, this will probably knock people

01:00:18 off of the more technical path because it's not needed. I think it will too. And I think that that

01:00:22 might be a negative. When you were telling me your story originally, I was thinking kind of like,

01:00:27 how neat is it that you didn't sign up for, and the people you're working with probably didn't

01:00:32 intend to sign you up for learning true software development.

01:00:36 But look at this cool and interesting job that you now have that you never

01:00:41 would have imagined.

01:00:42 I'm sure when you signed up for your PhD, you're like, you know what I'm

01:00:44 going to do when I get my PhD, I'm going to go X, Y, like, I'm going to

01:00:47 join the DARTH program.

01:00:48 Like, no, probably not.

01:00:49 Right.

01:00:50 But here you are.

01:00:51 And I think that's actually a really interesting knock on effect for a lot

01:00:54 of researchers and people in grad schools, they're kind of put into this

01:00:59 programming adjacent type of thing.

01:01:01 You know, and a lot of folks sort of are like, actually, that's pretty interesting.

01:01:04 I'm going to kind of lean into that.

01:01:06 And I think AI might knock, like you said, knock people off that path to some degree.

01:01:11 Yeah, yeah, definitely.

01:01:12 So that's just like one part of the AI story.

01:01:15 The other one is that, like how we use it.

01:01:18 It's great for data extraction, pulling data out of different, you know, to make these

01:01:25 search interfaces more powerful, to extract different data from them.

01:01:30 That's just one example where it's been handy.

01:01:33 We're looking for ways that it can really empower faculty.

01:01:39 We're still very much in the exploration phase of how we can use it and provide it to faculty as a digital humanities tool.

01:01:48 Sure. I was thinking pretty much when I asked the question of it, it's just like two parts.

01:01:52 One, how is it? Are you guys using it to help take projects?

01:01:56 Well, that would have been a month. No, actually, it's three days.

01:01:58 You know what I mean?

01:02:00 that. And then if people are asking, you know, a professor comes along and says, and we want our

01:02:05 own custom AI thing, or we're using Harvard's internal one that we're allowed to use, but we

01:02:13 won't be able to use it once the grant runs out. You know what I mean? Yeah. Yeah. I think one,

01:02:17 one good example of this type of thing is that what we're starting to get is faculty who are

01:02:23 vibe coding and now, and we are going to teach them. We're going to teach them how to do it.

01:02:28 You know, instead of having them.

01:02:31 Yeah, it's absolutely a skill.

01:02:32 Yeah, no, it is.

01:02:33 It is.

01:02:34 Instead of copy and pasting from ChatGPT into VS Code, having them learn Copilot, maybe even having them download Cursor.

01:02:43 Download some real dedicated tools to get this done to make them more productive.

01:02:48 So, yeah, educating about how to do it is one thing.

01:02:53 You asked if we're using it.

01:02:54 We have access to Copilot.

01:02:58 and that's great. I can't say that we've shipped anything in three days instead of a month yet,

01:03:04 but one anecdote is that right now I'm doing some really interesting processing of music audio files,

01:03:13 and somebody with a recording of a beatboxer asked if I could chop that file up so that all of the individual

01:03:19 sounds that the beatboxer makes are identified in a file. And so I'm using some music libraries,

01:03:26 Python library called Librosa. There's some complicated math in there. It's a little bit

01:03:32 too much for me. It's no problem for Claude. Claude knows how to do that math. And then,

01:03:36 and I use my expertise to string it together to get a good output.
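To give a feel for what "chopping a file up into individual sounds" involves, here is a deliberately simple sketch using only the standard library. It is not librosa's onset detection and not the approach described in the conversation; it swaps in a plain energy threshold: measure loudness in short frames and treat each run of loud frames as one sound. The frame size, threshold, and the synthetic test clip are all assumptions made for the example.

```python
import math


def frame_energy(samples, frame_size):
    """Root-mean-square energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return energies


def find_segments(samples, sample_rate, frame_size=256, threshold=0.1):
    """Return (start_seconds, end_seconds) for each run of frames louder than threshold."""
    energies = frame_energy(samples, frame_size)
    segments = []
    start_frame = None
    for i, energy in enumerate(energies):
        if energy >= threshold and start_frame is None:
            start_frame = i  # a sound begins
        elif energy < threshold and start_frame is not None:
            segments.append((start_frame * frame_size / sample_rate,
                             i * frame_size / sample_rate))
            start_frame = None  # the sound ended
    if start_frame is not None:
        # The clip ended while a sound was still playing.
        segments.append((start_frame * frame_size / sample_rate,
                         len(energies) * frame_size / sample_rate))
    return segments


def synthetic_clip(sample_rate=8192):
    """One second of silence with two short sine bursts standing in for beatbox hits."""
    samples = [0.0] * sample_rate
    for burst_start in (1024, 4096):
        for n in range(burst_start, burst_start + 1024):
            samples[n] = 0.8 * math.sin(2 * math.pi * 220 * n / sample_rate)
    return samples
```

Real recordings have background noise, overlapping sounds, and soft attacks, which is where a dedicated library with spectral onset detection earns its keep; the threshold approach only works on clean, well-separated hits like the synthetic clip here.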

01:03:39 Yeah. Awesome. You got time for one more quick question before we wrap things up?

01:03:44 For sure.

01:03:45 Raymond out there, Raymond Yees, asks: it'd be good to hear how Harvard uses containers on AWS

01:03:51 and its reliability. It's reliable, not the cheapest way to host things. Are you thinking about

01:03:56 moving that or is it not that much? Okay, I'll tell you about a failed experiment.

01:04:03 We were using ECS and we're still using ECS. So that's AWS's main, you know, it's not Kubernetes,

01:04:11 but it's one step down with their horizontal scaling container clusters. And I wanted to move

01:04:17 us onto a single EC2 instance because our projects are popular, but they're not so popular that we

01:04:23 actually have to worry about horizontal scaling.

01:04:25 Right.

01:04:26 It's not like it's front page in New York Times.

01:04:30 I guess it probably could be.

01:04:31 But even so, for the static sites, they probably still can take it.

01:04:35 Yeah.

01:04:35 So I priced it out and I got an example deployed, an example project deployed, and was able

01:04:42 to confirm that it would indeed be much cheaper.

01:04:45 And it was deployed in a similar way using AWS CDK.

01:04:49 So it's all infrastructure as code all the way down.

01:04:52 But it turns out there's all kinds of compliance.

01:04:54 When you are in charge of the VM at like a big university,

01:04:58 or I'm sure any corporate setting, if you are in charge of the VM and the OS on it,

01:05:04 then you have to know that you have the latest patches in.

01:05:07 You have to know that you have latest Ubuntu.

01:05:09 And then there's other things, different observability things

01:05:13 that you have to have in place that are not usually required

01:05:17 if you're running in a container cluster like ECS.

01:05:21 So it ends up being a lot less work and much easier to achieve compliance if we run containers

01:05:28 or some other serverless thing.

01:05:31 I run all my personal projects, they all run in a single virtual machine, but we're

01:05:37 running in containers.

01:05:38 Yeah.

01:05:39 And you've got all the SOC 2 stuff and all those different things, right?

01:05:42 Like there's layers.

01:05:43 Yeah, that's right.

01:05:44 Yeah.

01:05:44 I mean, I'll mention that, but what I didn't say is that in 2019, when I started learning

01:05:50 Python, I discovered Talk Python almost immediately. And one of the first episodes that I listened to

01:05:55 was the other digital humanities one. Cornelis van Lit. He was an awesome guest.

01:06:01 That's right. Yeah. And I thought that was great. And that was also a bit about manuscripts,

01:06:06 a little bit more on the image side than the text side. And I didn't understand everything

01:06:11 that everybody was saying, but I just, I kept tuning in. And I think because of that,

01:06:16 because Talk Python was like this, you know, I've been remote working for most of my time.

01:06:22 And Talk Python has been kind of like that conversation with the open source community

01:06:27 that's been always in my ear.

01:06:28 And I think that made, you know, a difference, making me feel like I understood the software

01:06:34 landscape and like the developer culture and what was going on.

01:06:37 And then the different Python libraries and what was possible.

01:06:41 So to people who are interested in taking things in a more technical direction, I think

01:06:47 it's helpful just to find a few things like that, that give you an insight into that world.

01:06:53 And the more you listen to it, the more you start to hear the same acronyms and the same

01:06:59 things said enough that you start to feel like, okay, now you're part of the club.

01:07:03 I really appreciate that.

01:07:05 That's cool.

01:07:06 I've certainly had people reach out to me and say things that at first didn't make any

01:07:09 sense to me.

01:07:10 Like I've been listening for six weeks now and it's starting to make sense what you're talking about.

01:07:14 Like, why have you been listening for six months when it made no sense?

01:07:16 That's insane.

01:07:17 But a lot of people use listening to the podcast, be it mine or others, as language immersion, right?

01:07:24 Like I could get Duolingo and I could learn Portuguese

01:07:28 or I could move to Brazil for a month.

01:07:30 You know what I mean?

01:07:31 And then I would really learn.

01:07:32 Yeah, exactly.

01:07:33 Right.

01:07:34 Exactly.

01:07:34 No, I think there's truth to that.

01:07:36 And some of the things I did was, you know, search through, like search the word deployment, because I'm trying to get my head around how to

01:07:43 deploy for the first time. And I just want to hear people talk about it. Like I could read about it.

01:07:47 I could read the tutorial, but I just want to hear people talk about deployment to get a sense of what

01:07:52 actual deployment sounds like. There's something really different when you're learning or trying,

01:07:57 even if you're maybe an experienced programmer, but not in this particular area, to hear a human

01:08:01 side of it, not just the docs, not a sterile "these are the four steps." But like, I love it.

01:08:08 I mean, it's probably why I created the show.

01:08:10 It's because I didn't hear those stories.

01:08:11 We got to tell those stories.

01:08:13 Awesome.

01:08:13 I appreciate that.

01:08:14 So super cool.

01:08:15 All right.

01:08:16 So if other people are listening, maybe one of your pieces of advice is keep listening.

01:08:21 You'll get there.

01:08:22 Yeah.

01:08:22 And if anybody is in the humanities and somehow found their way onto this episode with no technical experience,

01:08:30 I just would give the caution of, like, you know, the anecdote that if AI coding had been

01:08:37 around the way it is now when I was learning, I wouldn't be doing digital humanities at

01:08:43 Harvard.

01:08:43 I wouldn't have been able to get into this field.

01:08:46 I wouldn't have known about it.

01:08:47 So I guess just think about that when you're learning and applying new tools.

01:08:52 I don't really know what the right fix for that is.

01:08:55 That's a very challenging problem.

01:08:56 I mean, you can say I'm just literally not going to fire it up.

01:08:59 But I mean, we used to hunt through Stack Overflow and the web and over and over.

01:09:03 And if you're really stuck or you really don't understand, like they're good at explaining

01:09:06 stuff too.

01:09:07 You just got to really stay in a learner's mindset, not just press the easy button and

01:09:12 make this thing and move on.

01:09:13 Easier said than done.

01:09:14 Easier said than done.

01:09:15 So yeah, I want to leave this with kind of a thought about how much things like Python

01:09:22 and these tools and technology can really empower stuff that you wouldn't think is even

01:09:27 related, like understanding old manuscripts and how painting is connected or changed over time and

01:09:34 stuff, right? Those sound very much disjointed from tech and software, but they really are

01:09:40 superpowers that you can bring to your work, whatever your industry is, or your field of

01:09:45 study. I know there's some sociologists out in the audience and I'm sure others as well.

01:09:50 All right. Final thoughts, David, close it out. You said it great. I mean, you know,

01:09:55 Just applying these technical tools to old questions, that is the core of digital humanities.

01:10:02 When I first started hearing about this, I thought, I really don't know how this ties

01:10:05 together.

01:10:05 And after seeing it a few times, I definitely see the power of it.

01:10:08 And I thank you for your time coming on.

01:10:11 Thank you for sharing your work and the look inside of your team and inside of a small piece

01:10:16 of Harvard.

01:10:17 I really like these kinds of episodes because it's hard to see this from the outside, right?

01:10:23 Like you just see the results, but you don't see like the inner workings of the team

01:10:27 and the motivation and stuff.

01:10:28 So thank you so much for being here.

01:10:31 And yeah, bye everyone.

01:10:33 This has been another episode of Talk Python To Me.

01:10:36 Thank you to our sponsors.

01:10:37 Be sure to check out what they're offering.

01:10:38 It really helps support the show.

01:10:40 Take some stress out of your life.

01:10:42 Get notified immediately about errors and performance issues in your web

01:10:46 or mobile applications with Sentry.

01:10:48 Just visit talkpython.fm/sentry and get started for free.

01:10:53 Be sure to use our code, talkpython26.

01:10:56 That's Talk Python, the numbers two, six, all one word.

01:11:00 This episode is brought to you by CommandBook, a native macOS app that I built

01:11:05 that gives long-running terminal commands a permanent home.

01:11:08 No more juggling six terminal tabs every morning.

01:11:10 Carefully craft a command once, run it forever with auto-restart,

01:11:14 URL detection, and a full CLI.

01:11:16 Download it for free at talkpython.fm/commandbook app.

01:11:19 If you or your team needs to learn Python, we have over 270 hours of beginner and advanced courses on topics ranging from complete beginners to async code, Flask, Django, HTML, and even LLMs.

01:11:32 Best of all, there's no subscription in sight.

01:11:35 Browse the catalog at talkpython.fm.

01:11:37 And if you're not already subscribed to the show on your favorite podcast player, what are you waiting for?

01:11:42 Just search for Python in your podcast player.

01:11:44 We should be right at the top.

01:11:46 If you enjoy that geeky rap song, you can download the full track.

01:11:49 The link is actually in your podcast player's show notes.

01:11:51 This is your host, Michael Kennedy.

01:11:53 Thank you so much for listening.

01:11:54 I really appreciate it.

01:11:56 I'll see you next time.

01:12:08 I'm out.