Awesome Text Tricks with NLP and spaCy

Episode #477, published Fri, Sep 20, 2024, recorded Thu, Jul 25, 2024

Do you have text that you want to process automatically? Maybe you want to pull out key products or topics of conversation? Maybe you want to get the sentiment? The possibilities are many with this week's topic: NLP with spaCy and Python. Our guest, Vincent D. Warmerdam, has worked on spaCy and other tools at Explosion AI and he's here to give us his tips and tricks for working with text from Python.

Episode Deep Dive

Guests Introduction and Background

Vincent D. Warmerdam is a seasoned Python developer and data scientist who has worked extensively on spaCy and related tools at Explosion AI. He’s well-known in the Python community for co-founding PyData in Amsterdam, contributing to open-source projects like scikit-learn, and maintaining scikit-lego. Vincent also runs CalmCode, a site offering concise programming tutorials and tips (including ergonomic keyboard advice). During this episode, he shares practical insights about natural language processing (NLP), tips for handling large text datasets (like Talk Python transcripts), and thoughts on leveraging large language models versus specialized tools like spaCy.

What to Know If You're New to Python

  • Getting Started with NLP and spaCy: Helpful if you want a structured approach to text processing projects.
  • Understand how Python virtual environments work (e.g., venv or conda) to install tools like spaCy without conflicts.
  • Familiarize yourself with basic file handling and iterators in Python—generators are especially powerful when working with large text files.
  • Knowing simple data structures like lists and dictionaries will help you follow how tokens, lemmas, and named entities are stored and processed.
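
The generator advice above can be sketched roughly as follows. This is a minimal illustration rather than code from the episode: the function names `read_lines` and `non_empty` and the sample text are assumptions, and `io.StringIO` stands in for a large transcript file on disk.

```python
import io
from typing import Iterable, Iterator


def read_lines(handle: Iterable[str]) -> Iterator[str]:
    """Yield one stripped line at a time instead of loading the whole file."""
    for line in handle:
        yield line.strip()


def non_empty(lines: Iterable[str]) -> Iterator[str]:
    """Lazily drop blank lines; nothing is read until the caller iterates."""
    for line in lines:
        if line:
            yield line


# Hypothetical transcript text; in practice this could be an open file handle.
sample = io.StringIO("Welcome to the show.\n\nToday we talk about spaCy.\n")

# Chaining generators builds a pipeline that holds only one line in memory.
transcript_lines = list(non_empty(read_lines(sample)))
print(transcript_lines)
```

Because each stage pulls one line at a time, the same pipeline should work unchanged whether the source is a few lines or a very large file. The final `list(...)` call is only there to display the result.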

Key Points and Takeaways

  1. Generators for Handling Large Text Data: Vincent explained how using Python generators helps process transcripts line by line without loading everything into memory. This approach keeps the pipeline efficient and manageable, even when analyzing gigabytes of text.
  2. spaCy’s Named Entity Recognition (NER): Out of the box, spaCy can detect entities like products, places, and people. Vincent noted how the default English models can often catch Python packages by treating them as “products,” speeding up the initial NLP work.
  3. Customizing spaCy for Niche Use Cases: Pre-trained models are powerful but not perfect for specialized domains (e.g., Python project names). Vincent highlighted how quickly you can adapt or retrain spaCy with a small labeled dataset, especially after using an LLM or rule-based system for initial filtering.
  4. Working with Lemmas for Search: Lemmatization brings words down to their base form (“goose” → “goose,” “geese” → “goose”). This approach is helpful for building a robust search engine: Vincent noted you can make transcripts more searchable if you store these base forms.
  5. Rule-based NLP with spaCy Matchers: Not everything requires machine learning. Rule-based matchers in spaCy allow you to specify lexical or part-of-speech patterns. An example is detecting “Go” only if it’s labeled as a noun, thereby catching the programming language rather than the verb.
  6. LLMs vs. Specialized Libraries: Vincent calls out “LLM maximalism” as a pitfall. While large language models (like ChatGPT or Claude) are incredible for quick annotation or summarization, specialized libraries like spaCy run locally, are faster, and can outperform LLMs when given enough domain-specific training data.
  7. Using Disagreements to Improve Quality: Vincent proposed comparing an LLM’s output against a smaller custom spaCy model for the same NLP task. Whenever they disagree on a label or entity, that example is a high-priority candidate for human review and annotation, speeding up model improvements.
  8. Fun with Keyboard Ergonomics and Community Engagement: Vincent also touched on how physical ergonomics, such as testing different keyboards, improved his ability to code comfortably. This underscores a broader point: the Python community often shares personal stories, tips, and open-source contributions that shape how they work day to day.
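
The disagreement idea from the list above (comparing an LLM's output with a smaller spaCy model) might look something like the sketch below. Everything here is hypothetical: the two prediction dictionaries are hand-written stand-ins for real model output, and `find_disagreements` is an illustrative helper, not an API from spaCy or any LLM library.

```python
from typing import Dict, List

# Hypothetical predictions: for each sentence, the entity strings a model found.
llm_predictions: Dict[str, List[str]] = {
    "We use pandas and Go at work.": ["pandas", "Go"],
    "I like to go hiking.": [],
    "FastAPI is built on Starlette.": ["FastAPI", "Starlette"],
}
spacy_predictions: Dict[str, List[str]] = {
    "We use pandas and Go at work.": ["pandas"],
    "I like to go hiking.": [],
    "FastAPI is built on Starlette.": ["FastAPI", "Starlette"],
}


def find_disagreements(
    a: Dict[str, List[str]], b: Dict[str, List[str]]
) -> List[str]:
    """Return the texts where the two models found different entity sets."""
    return [text for text in a if set(a[text]) != set(b.get(text, []))]


# Sentences where the models disagree go to a human for annotation first.
review_queue = find_disagreements(llm_predictions, spacy_predictions)
print(review_queue)
```

Comparing sets rather than lists ignores the order in which entities were reported, so only genuine differences in what was found end up in the review queue.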

Interesting Quotes and Stories

  • Vincent’s take on LLM hype: “I am not in favor of LLM maximalism. I think they’re fascinating tools, but not everything should be run through ChatGPT.”
  • On learning NLP with a fun dataset: “If you’re a Python person, these transcripts are a perfect domain because you already have the subject matter expertise to fix mistakes.”
  • Workflow tip: “When those two models disagree, something interesting is happening—you’ve likely found the tricky examples worth annotating.”

Key Definitions and Terms

  • Token: A piece of text (often a word or punctuation) identified by a tokenizer.
  • Lemma/Lemmatization: A lemma is the dictionary (base) form of a word; lemmatization maps inflected variants to that form, so “geese” and “goose” are treated as the same term.
  • Named Entity Recognition (NER): Detecting and classifying “real world” concepts (people, places, products) in text.
  • LLM (Large Language Model): A model trained on very large amounts of text (e.g., the models behind ChatGPT) with broad, general-purpose language abilities.
  • Prompt Injection: Crafting input text that overrides an LLM’s intended instructions so that it produces results its developer did not expect.
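
To make the lemma definition concrete, here is a small sketch of a lemma-based search index. It is an illustration under stated assumptions: the `LEMMAS` lookup table is a hand-written stand-in for a real lemmatizer (such as the one in a spaCy pipeline), and `lemmatize` and `build_index` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"geese": "goose", "models": "model", "running": "run", "ran": "run"}


def lemmatize(word: str) -> str:
    """Lowercase the word, trim punctuation, and map it to a known base form."""
    word = word.lower().strip(".,!?")
    return LEMMAS.get(word, word)


def build_index(documents: List[str]) -> Dict[str, Set[int]]:
    """Map each lemma to the positions of the documents that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for position, text in enumerate(documents):
        for word in text.split():
            index[lemmatize(word)].add(position)
    return index


docs = ["The geese ran away.", "One goose is running.", "We trained two models."]
index = build_index(docs)

# Searching for an inflected form finds every document sharing its lemma.
hits = sorted(index[lemmatize("geese")])
print(hits)
```

Storing lemmas instead of raw words means a query for "geese" also matches text that only says "goose", which is the searchability benefit described in the takeaways.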

Overall Takeaway

This conversation highlights both the depth and accessibility of NLP with Python. Vincent emphasizes that while large language models are revolutionizing text processing, specialized tools like spaCy remain indispensable for fast, targeted projects. Anyone—especially Python devs—can become effective at text analysis by focusing on practical workflows (like using generators) and carefully combining rule-based systems with statistical models.

Links from the show

Course: Getting Started with NLP and spaCy: talkpython.fm

Vincent on X: @fishnets88
Vincent on Mastodon: @koaning

Programmable Keyboards on CalmCode: youtube.com
Sample Space Podcast: youtube.com

spaCy: spacy.io
Course: Build An Audio AI App: talkpython.fm
Lemma example: github.com
Code for spaCy course: github.com

Python Bytes transcripts: github.com
scikit-lego: github.com
Projects that import "this": calmcode.io
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
