How data scientists use Python
Episode Deep Dive
Guests Introduction and Background
Dr. Jodie Burchell is a Data Science Developer Advocate at JetBrains. Originally an academic in psychology (with a PhD focusing on hurt feelings and relationship dynamics), she later moved into industry as a data scientist, then transitioned to her current advocacy role. Jodie draws on her experience merging psychology, statistics, and data science to help individuals build practical projects and strengthen community ties in the Python ecosystem.
What to Know If You're New to Python
If you’re new to Python and want to follow this conversation more easily, here are a few points:
- Python is especially popular for data exploration and machine learning because of its rich ecosystem of libraries (e.g., Pandas, NumPy).
- Data scientists often use tools like Jupyter Notebooks to run Python code in smaller, more approachable chunks rather than building entire applications from the start.
- Python’s flexible nature allows you to experiment, prototype, and then gradually learn more advanced concepts over time.
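As a rough illustration of the kind of small, exploratory step these points describe, here is a minimal sketch using pandas. The survey data and column names are invented for this example and do not come from the episode.

```python
import pandas as pd

# Hypothetical survey data, invented purely for illustration.
raw = pd.DataFrame(
    {
        "participant": ["a", "b", "c", "d"],
        "group": ["control", "treatment", "control", "treatment"],
        "score": [3.0, 4.5, None, 5.5],
    }
)

# Drop rows with a missing score, then compare the average score per group.
clean = raw.dropna(subset=["score"])
summary = clean.groupby("group")["score"].mean()

print(summary)
```

In a notebook, each of these steps would typically live in its own cell so the intermediate results can be inspected along the way.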
Key Points and Takeaways
Differences Between Data Scientists and Software Developers
Data science work often focuses on discovery, exploration, and modeling rather than long-term, production-level code. While software developers prioritize reliability, maintainability, and scalability from day one, data scientists first aim to see if an idea or data angle is even valid.
- Links / Tools:
- JetBrains.com (Home of PyCharm, DataSpell)
Role of a Data Science Developer Advocate
As a developer advocate, Jodie bridges the gap between product teams and the community. She helps inform JetBrains about data scientists’ needs and conversely educates data scientists on best practices and new tools. This involves speaking at conferences, creating educational material, and ensuring data-focused features are front and center.
- Links / Tools:
- PyCharm (IDE mentioned for Python development)
- DataSpell by JetBrains
Jupyter Notebooks for Data Exploration
Jupyter is central to many data scientists' workflows because it supports interactive coding, immediate visualization, and explanatory text in a single document. Jodie pointed out how its literate programming style (mix of Markdown and code) is excellent for reproducibility, though real-time collaboration is limited unless you use cloud-based solutions like JupyterLab or DataSpell’s hosted offerings.
- Links / Tools:
- Jupyter.org (Project Jupyter)
- JupyterLab (Remote-friendly environment)
Iterative and Experimental Coding
Data scientists frequently write “throwaway” or short-lived code as they refine hypotheses. The code is not always production-ready, yet it’s critical for finding patterns and relationships in data. When or if something proves valuable, an engineer (or ML engineer) may repackage it into a more robust system.
- Links / Tools:
- scikit-learn.org (Machine learning library often used in prototypes)
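To give a feel for what a short-lived prototype can look like, here is a minimal sketch with scikit-learn. The synthetic dataset, the choice of logistic regression, and the split size are all assumptions made for illustration; the episode does not walk through a specific model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a quarter of the rows to check whether the idea has any signal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A simple baseline model: quick to fit and easy to throw away.
model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Code like this answers "is there anything here?" quickly. If the answer is yes, that is usually the point at which it gets rewritten with tests, error handling, and a proper interface.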
Importance of Team Integration
Separating data scientists and developers entirely can lead to friction when transitioning models to production. Embedding data scientists into engineering teams, or at least keeping close collaboration, ensures that concerns like latency, scaling, and production monitoring are addressed early on.
- Links / Tools:
- FastAPI (A frequent choice for productionizing Python models)
Benefits of Python for Data Science
Python’s biggest strength is its simplicity and readability, allowing non-traditional programmers (e.g., psychologists, statisticians) to quickly pick up data analysis. Jodie noted that, compared to compiled languages, Python’s “just enough” design is ideal for rapid iteration on machine learning and data science projects.
- Links / Tools:
- NumPy.org (Foundation for data arrays in Python)
Working with Large Data (Performance + Tools)
When teams hit memory or performance constraints, they can scale with frameworks like Dask or migrate to specialized libraries like Polars. The conversation highlighted how these projects extend Pandas-like APIs to cluster computing or Rust-based performance.
- Links / Tools:
- Dask.org (Parallel computing and big data)
- Polars.dev (Rust-based DataFrame library)
Reproducibility and Maintainability
While data science code might not need long-term support, reproducibility and good documentation are crucial so future teams (or your future self) can pick up where you left off. Tools such as Git, containerization, or cloud-based notebooks help preserve environment setups and data pipelines.
- Links / Tools:
- Git-SCM.com (Git version control)
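One lightweight complement to Git and containers is to record the environment alongside the results. The sketch below is an assumption about how that might look, not something described in the episode: the helper name `snapshot_environment`, the package list, and the seed value are all hypothetical.

```python
import json
import platform
import random
from importlib import metadata


def snapshot_environment(packages, seed):
    """Collect details that help someone rerun an analysis later."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly rather than failing.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "seed": seed,
        "packages": versions,
    }


SEED = 42  # arbitrary value chosen for this example
random.seed(SEED)  # fix the random state so reruns draw the same numbers

snapshot = snapshot_environment(["numpy", "pandas"], SEED)
snapshot_json = json.dumps(snapshot, indent=2, sort_keys=True)
print(snapshot_json)
```

Saving this JSON next to a notebook's outputs, and committing both, gives a future reader a starting point for rebuilding the same setup.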
Production Responsibilities and ML Engineering
A critical question for any organization is “Who maintains the code that goes live?” Jodie emphasized that if a single “unicorn” data scientist handles both modeling and production, they may be pulled away from valuable research. Hence, many companies form dedicated ML engineering or MLOps teams to manage continuity.
- Links / Tools:
- FastAPI again for RESTful deployment
- MLflow or Kubeflow (not specifically mentioned in detail, but commonly used in MLOps)
Powerful Open-Source Libraries
The discussion touched on libraries like spaCy (for NLP) and Transformers (for large language models) to highlight how many specialized frameworks exist and how open-source communities are advancing rapidly. Even for complex topics like neural networks, user-friendly frontends (Keras, for instance) can help new practitioners dive in quickly.
- Links / Tools:
- spaCy.io
- HuggingFace.co (Transformers library)
Interesting Quotes and Stories
- “Data science is for everyone. If you want it, it’s for you.”
Jodie highlighted her belief that Python and machine learning are accessible fields, encouraging people from a diverse range of backgrounds to enter.
- “I actually had a PhD in hurt feelings.”
Jodie’s unique background in psychology underscores that many data scientists hail from disciplines outside of software engineering, showing how Python can bridge the gap.
Key Definitions and Terms
- Data Wrangling: The process of cleaning, structuring, and enriching raw data into a more usable format.
- Jupyter Notebook: A web-based interactive environment that combines live code with explanations and visualizations.
- Vectorized Operations: NumPy-based computations that apply operations across entire arrays without explicit Python loops, often leading to faster performance.
- ML Engineer: A specialized role that bridges data science and software engineering, focusing on deploying and maintaining models in production.
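To make the "Vectorized Operations" definition concrete, here is a small sketch comparing an explicit Python loop with the equivalent NumPy expression. The temperature conversion and function names are chosen purely for illustration.

```python
import numpy as np


def celsius_to_fahrenheit_loop(values):
    """Convert one element at a time with an explicit Python loop."""
    return [v * 9 / 5 + 32 for v in values]


def celsius_to_fahrenheit_vectorized(values):
    """Apply the same arithmetic to the whole array in one expression."""
    arr = np.asarray(values, dtype=float)
    return arr * 9 / 5 + 32


temps_c = [0.0, 25.0, 100.0]
loop_result = celsius_to_fahrenheit_loop(temps_c)
vector_result = celsius_to_fahrenheit_vectorized(temps_c)

print(loop_result)
print(vector_result)
```

Both versions produce the same values. The vectorized form pushes the per-element work into NumPy's compiled code, which is where the speedup on large arrays usually comes from.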
Learning Resources
Here are some resources to go deeper:
- Data Science Jumpstart with 10 Projects: Covers practical projects to build real data science experience.
- Move from Excel to Python with Pandas: Great if you’re comfortable with spreadsheets and want to shift into Python-based analysis.
- Fundamentals of Dask: Learn parallel and distributed computing strategies in Python.
Overall Takeaway
The worlds of data science and software development often feel separate, but a shared foundation in Python can unify them. Whether you’re conducting early-stage explorations in Jupyter Notebooks or deploying large-scale models in production, communication and collaboration are vital. Jodie’s story demonstrates that creative and interdisciplinary backgrounds are powerful assets, and with Python’s flexible ecosystem, anyone motivated can solve complex data problems and contribute valuable insights.
Links from the show
Jodie's PyCon Talk: youtube.com
Deep Learning with Python book: manning.com
Keras: keras.io
scikit-learn: scikit-learn.org
Matplotlib: matplotlib.org
XKCD Matplotlib: matplotlib.org
Pandas: pandas.pydata.org
Polars: pola.rs
Polars on Talk Python: talkpython.fm
Jupyter: jupyter.org
Ponder: ponder.io
Dask: dask.org
Explosion AI's Prodigy discount code: get a personal license for 25% off using the discount code TALKPYTHON.
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm