How data scientists use Python

Episode #422, published Fri, Jul 7, 2023, recorded Wed, May 31, 2023

Whichever side of Python you sit on, software developer or data scientist, you surely know that data scientists and software devs seem to have different styles and priorities. But why? And what are the benefits, as well as the pitfalls, of this separation? That's the topic of conversation with our guest, Dr. Jodie Burchell, data science developer advocate at JetBrains.

Episode Deep Dive

Guests Introduction and Background

Dr. Jodie Burchell is a Data Science Developer Advocate at JetBrains. Originally an academic in psychology (with a PhD focusing on hurt feelings and relationship dynamics), she later moved into industry as a data scientist, then transitioned to her current advocacy role. Jodie draws on her experience merging psychology, statistics, and data science to help individuals build practical projects and strengthen community ties in the Python ecosystem.

What to Know If You're New to Python

If you’re newer to Python but want to follow this conversation more easily, here are a few points:

  • Python is especially popular for data exploration and machine learning because of its rich ecosystem of libraries (e.g., Pandas, NumPy).
  • Data scientists often use tools like Jupyter Notebooks to run Python code in smaller, more approachable chunks rather than building entire applications from the start.
  • Python’s flexible nature allows you to experiment, prototype, and then gradually learn more advanced concepts over time.
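
As a rough illustration of working in small, notebook-style chunks, here is a minimal sketch using Pandas. The column names and values are hypothetical, invented purely for this example, and each commented chunk stands in for what would typically be a separate notebook cell.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
df = pd.DataFrame(
    {
        "team": ["data", "data", "platform", "platform"],
        "response_ms": [120, 180, 95, 105],
    }
)

# Chunk 1: check how many rows and columns we have.
print(df.shape)

# Chunk 2: compare the average response time per team.
mean_by_team = df.groupby("team")["response_ms"].mean()
print(mean_by_team)
```

Each chunk answers one small question, which is roughly how exploration tends to proceed before any larger application structure exists.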

Key Points and Takeaways

  1. Differences Between Data Scientists and Software Developers
    Data science work often focuses on discovery, exploration, and modeling rather than long-term, production-level code. While software developers prioritize reliability, maintainability, and scalability from day one, data scientists first aim to see if an idea or data angle is even valid.

  2. Role of a Data Science Developer Advocate
    As a developer advocate, Jodie bridges the gap between product teams and the community. She helps inform JetBrains about data scientists’ needs and conversely educates data scientists on best practices and new tools. This involves speaking at conferences, creating educational material, and ensuring data-focused features are front and center.

  3. Jupyter Notebooks for Data Exploration
    Jupyter is central to many data scientists' workflows because it supports interactive coding, immediate visualization, and explanatory text in a single document. Jodie pointed out how its literate programming style (mix of Markdown and code) is excellent for reproducibility, though real-time collaboration is limited unless you use cloud-based solutions like JupyterLab or DataSpell’s hosted offerings.

  4. Iterative and Experimental Coding
    Data scientists frequently write “throwaway” or short-lived code as they refine hypotheses. The code is not always production-ready, yet it’s critical for finding patterns and relationships in data. If and when something proves valuable, an engineer (or ML engineer) may repackage it into a more robust system.

  5. Importance of Team Integration
    Separating data scientists and developers entirely can lead to friction when transitioning models to production. Embedding data scientists into engineering teams, or at least keeping close collaboration, ensures that concerns like latency, scaling, and production monitoring are addressed early on.

    • Links / Tools:
      • FastAPI (A frequent choice for productionizing Python models)
  6. Benefits of Python for Data Science
    Python’s biggest strength is its simplicity and readability, allowing non-traditional programmers (e.g., psychologists, statisticians) to quickly pick up data analysis. Jodie noted that, compared to compiled languages, Python’s “just enough” design is ideal for rapid iteration on machine learning and data science projects.

    • Links / Tools:
      • NumPy.org (Foundation for data arrays in Python)
  7. Working with Large Data (Performance + Tools)
    When teams hit memory or performance constraints, they can scale with frameworks like Dask or migrate to specialized libraries like Polars. The conversation highlighted how Dask extends a Pandas-like API to cluster computing, while Polars gets its speed from a Rust-based engine.

  8. Reproducibility and Maintainability
    While data science code might not need long-term support, reproducibility and good documentation are crucial so future teams (or your future self) can pick up where you left off. Practices and tools such as version control with Git, containerization, and cloud-based notebooks help preserve environment setups and data pipelines.

  9. Production Responsibilities and ML Engineering
    A critical question for any organization is “Who maintains the code that goes live?” Jodie emphasized that if a single “unicorn” data scientist handles both modeling and production, they may be pulled away from valuable research. Hence, many companies form dedicated ML engineering or MLOps teams to manage continuity.

    • Links / Tools:
      • FastAPI (for serving models behind a REST API)
      • MLflow or Kubeflow (not discussed in detail in the episode, but commonly used in MLOps)
  10. Powerful Open-Source Libraries
    The discussion touched on libraries like spaCy (for NLP) and Transformers (for large language models) to highlight how many specialized frameworks exist and how open-source communities are advancing rapidly. Even for complex topics like neural networks, user-friendly frontends (Keras, for instance) can help new practitioners dive in quickly.
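
Points 4 and 9 above describe exploratory code being repackaged into something sturdier once it proves valuable. The sketch below shows one way that hand-off might look, using only the standard library. The data and the function name `mean_ignoring_missing` are hypothetical, and this is an illustration of the idea rather than a workflow described in the episode.

```python
from statistics import mean
from typing import Optional, Sequence

# Exploratory, notebook-style version: quick and unvalidated.
# The scores are hypothetical values invented for this example.
scores = [3.2, 4.8, None, 5.1]
quick_mean = mean(s for s in scores if s is not None)


# Repackaged version: the same calculation as a reusable function
# with a docstring and explicit handling of the empty case.
def mean_ignoring_missing(values: Sequence[Optional[float]]) -> float:
    """Return the mean of the values that are not None."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no non-missing values to average")
    return mean(present)


print(mean_ignoring_missing(scores) == quick_mean)
```

The logic is unchanged. What the second version adds is a name, a contract, and a defined failure mode, which are the things a maintaining team usually needs first.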

Interesting Quotes and Stories

  • “Data science is for everyone. If you want it, it’s for you.”
    Jodie highlighted her belief that Python and machine learning are accessible fields, encouraging a diverse range of backgrounds to enter.
  • “I actually had a PhD in hurt feelings.”
    Jodie’s unique background in psychology underscores that many data scientists hail from disciplines outside of software engineering, showing how Python can bridge the gap.

Key Definitions and Terms

  • Data Wrangling: The process of cleaning, structuring, and enriching raw data into a more usable format.
  • Jupyter Notebook: A web-based interactive environment that combines live code with explanations and visualizations.
  • Vectorized Operations: NumPy-based computations that apply operations across entire arrays without explicit Python loops, often leading to faster performance.
  • ML Engineer: A specialized role that bridges data science and software engineering, focusing on deploying and maintaining models in production.
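
To make the vectorized operations definition concrete, here is a small sketch comparing an explicit Python loop with the equivalent NumPy array expression. The temperature readings are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical temperature readings in Celsius.
celsius = np.array([12.5, 18.0, 21.5, 30.0])

# Loop version: converts one element at a time in Python.
fahrenheit_loop = [c * 9 / 5 + 32 for c in celsius]

# Vectorized version: one expression applied to the whole array,
# with the per-element work done inside NumPy.
fahrenheit_vec = celsius * 9 / 5 + 32

# Both approaches should agree up to floating-point tolerance.
print(np.allclose(fahrenheit_loop, fahrenheit_vec))
```

On an array this small the speed difference is negligible. The benefit of the vectorized form generally shows up as arrays grow large.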

Overall Takeaway

The worlds of data science and software development often feel separate, but a shared foundation in Python can unify them. Whether you’re conducting early-stage explorations in Jupyter Notebooks or deploying large-scale models in production, communication and collaboration are vital. Jodie’s story demonstrates that creative and interdisciplinary backgrounds are powerful assets, and with Python’s flexible ecosystem, anyone motivated can solve complex data problems and contribute valuable insights.

Links from the show

Jodie on Twitter: @t_redactyl
Jodie's PyCon Talk: youtube.com
Deep Learning with Python book: manning.com
Keras: keras.io
scikit-learn: scikit-learn.org
Matplotlib: matplotlib.org
XKCD Matplotlib: matplotlib.org
Pandas: pandas.pydata.org
Polars: pola.rs
Polars on Talk Python: talkpython.fm
Jupyter: jupyter.org
Ponder: ponder.io
Dask: dask.org

Explosion AI's Prodigy discount code: get a personal license for 25% off using the discount code TALKPYTHON.
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
