
Lifecycle of a machine learning project

Episode #359, published Sun, Apr 3, 2022, recorded Tue, Mar 22, 2022

Are you working on or considering a machine learning project? On this episode, we'll meet three people from the MLOps community: Demetrios Brinkmann, Kate Kuznecova, and Vishnu Rachakonda. They are here to tell us about the lifecycle of a machine learning project. We'll talk about getting started with prototypes and choosing frameworks, the development process, and finally moving into deployment and production.


Episode Deep Dive

Guests introduction and background

  • Demetrios Brinkmann started on the sales side of an ML tool startup. The company eventually folded, but Demetrios went on to create the MLOps Community with an open and vendor-neutral mindset. Now he’s fully invested in connecting machine learning practitioners through meetups, Slack discussions, and interviews.
  • Kate Kuznecova began her career studying business and economics, moved into data analytics, and later completed a data science bootcamp. After building predictive models in areas like text analysis and commercial data, Kate now focuses on practical data solutions, frequently collaborating with data engineers and bridging business and tech.
  • Vishnu Rachakonda works at the intersection of healthcare and data science. He first learned ML while exploring ways to improve biotechnology and has experience building ML models for medical devices. Vishnu joined the MLOps Community early on and credits it as a pivotal place to discuss, learn, and solve real-world machine learning challenges.

What to Know If You’re New to Python

If you’re getting started with Python in the context of machine learning and data projects, here are key points that will help you follow along:

  • Python’s data science ecosystem (NumPy, pandas, scikit-learn) offers a strong foundation for machine learning tasks.
  • Virtual environments (e.g. venv or Conda) can help manage dependencies and keep your projects organized.
  • Jupyter notebooks or VS Code “notebook cells” let you explore ideas interactively, but remember to eventually structure your code into proper scripts or modules (see the sketch after this list).
  • Leverage version control with Git or a Git-based GUI to track your experiments and share reproducible results.
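
Here is a minimal sketch of that notebook-to-script point: the same exploration you might start in notebook cells, restructured as a small importable file that can be tested and versioned. The CSV path, cleaning steps, and function names are hypothetical placeholders.

```python
# Minimal sketch: notebook-style exploration restructured as a small script
# so it can be imported, tested, and tracked in version control.
# The CSV path and cleaning steps are hypothetical placeholders.
import pandas as pd


def load_clean_data(path: str) -> pd.DataFrame:
    """Load the raw CSV and drop rows with missing values."""
    return pd.read_csv(path).dropna()


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return basic descriptive statistics for the numeric columns."""
    return df.describe()


if __name__ == "__main__":
    data = load_clean_data("data/example.csv")
    print(summarize(data))
```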

Key points and takeaways

  1. The MLOps Community and Its Purpose: The guests highlighted the MLOps Community as a place for ML practitioners to learn, share experiences, and discuss tools in a neutral environment. It started as a way to build authentic connections during pandemic downtime and grew into a large Slack network with meetups, newsletters, and roundtables.
  2. When to Choose Machine Learning Over Traditional Software: A recurring theme was that machine learning shouldn’t be used just for the sake of it. If a simpler if-else approach suffices, start there. ML adds complexity and is best applied where clear business-value metrics can be measured and improved.
    • Links and tools:
      • "The first rule of ML is don’t use ML" blog post by Eugene Yan (referenced by guests)
  3. Importance of Business Metrics: Rather than focusing on raw model accuracy alone, make sure the project goals align with critical business metrics, such as user engagement, cost reduction, or improved diagnostics. This clarity often determines how sophisticated your solution needs to be.
  4. Prototyping and Iteration: The guests stressed quick experimentation using Jupyter notebooks, small datasets, and minimal viable models to validate ideas quickly. Tools like Papermill from Netflix or DVC (Data Version Control) can further streamline and document these experiments (a Papermill sketch follows this list).
  5. Reproducibility and Version Control: Software engineering practices such as Git, CI/CD, and testing are often overlooked by new data scientists, but incorporating them early dramatically improves collaboration. Even partial automation of data checks and environment setups leads to more consistent results (see the data-check sketch after this list).
  6. Data Engineering Collaboration: Data engineers are often the silent heroes ensuring reliable data pipelines, transformations, and storage, and their involvement can make or break ML deployments. Kate described how synergy between data scientists and data engineers lets everyone leverage their specialized strengths.
  7. From Research to Production: Productionizing machine learning often demands different tooling; frameworks like FastAPI or serverless platforms can expose models as APIs (a minimal FastAPI sketch appears after this list). The complexity you pick should match your scale: spinning up Kubernetes for a single small model is often overkill.
  8. Interactive Dashboards and Sharing Results: Tools like Streamlit make it easy to show non-technical colleagues how ML predictions or data explorations work in real time. Kate uses Streamlit to provide a “no-code” interface for text analysis or structured data, improving buy-in from stakeholders (see the Streamlit sketch after this list).
  9. Avoiding “You Are Not Google” Pitfalls: Many companies adopt big-tech solutions without big-tech scale. The guests urged caution: large-scale orchestration technologies (such as heavyweight pipeline frameworks) can be overkill for smaller or mid-sized data. Evaluate real-world needs before jumping into advanced stacks.
  10. Community, Peer Learning, and Networking: Several guests called out real-time question answering in Slack as an indispensable resource, from model debugging to conceptual discussion. ML work can be isolating at smaller companies, and actively engaging in a community helps fast-track solutions and uncover best practices.
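
To make the prototyping point (item 4) concrete, here is a minimal sketch of running a parameterized notebook with Papermill so each experiment leaves behind a fully executed copy as a record; the notebook filenames and parameter values are hypothetical. DVC plays a similar record-keeping role for the data and model files themselves.

```python
# Minimal sketch: executing a parameterized notebook with Papermill so each
# experiment run is saved as its own executed notebook.
# The notebook filenames and parameter values are hypothetical.
import papermill as pm

pm.execute_notebook(
    "train_model.ipynb",                # input notebook with a tagged "parameters" cell
    "runs/train_model_lr_0.01.ipynb",   # executed copy saved per experiment
    parameters={"learning_rate": 0.01, "n_estimators": 200},
)
```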
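
For the reproducibility point (item 5), here is a rough sketch of an automated data check. It uses plain pandas assertions rather than a dedicated tool such as Great Expectations (linked below); the column names, rules, and file path are hypothetical.

```python
# Minimal sketch of an automated data check using plain pandas assertions
# rather than a dedicated library such as Great Expectations (linked below).
# Column names, rules, and the file path are hypothetical.
import pandas as pd


def check_training_data(df: pd.DataFrame) -> None:
    """Fail fast if the incoming data violates basic expectations."""
    assert not df.empty, "training data is empty"
    assert df["price"].notna().all(), "price column contains nulls"
    assert (df["price"] >= 0).all(), "price column contains negative values"
    assert df["label"].isin([0, 1]).all(), "label column is not binary"


if __name__ == "__main__":
    check_training_data(pd.read_csv("data/training.csv"))
    print("data checks passed")
```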
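
For the production point (item 7), here is a minimal sketch of exposing a trained model as an HTTP API with FastAPI; the model file, feature names, and endpoint are hypothetical.

```python
# Minimal sketch: serving a trained model over HTTP with FastAPI.
# The model file name, feature names, and endpoint path are hypothetical.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical: a previously trained scikit-learn model


class HouseFeatures(BaseModel):
    square_meters: float
    bedrooms: int


@app.post("/predict")
def predict(features: HouseFeatures) -> dict:
    """Return the model's prediction for a single record."""
    row = [[features.square_meters, features.bedrooms]]
    return {"prediction": float(model.predict(row)[0])}
```

Saved as app.py, this could be served locally with "uvicorn app:app".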
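
And for the dashboards point (item 8), here is a small Streamlit sketch in that spirit (not Kate's actual app): a page where a colleague can upload a CSV and chart a column without writing code. The widget labels and chart choice are hypothetical.

```python
# Minimal sketch of a Streamlit page for exploring an uploaded CSV file.
# Widget labels and the chart choice are hypothetical.
import pandas as pd
import streamlit as st

st.title("Quick data explorer")

uploaded = st.file_uploader("Upload a CSV file", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.write("First rows of the data:", df.head())

    column = st.selectbox("Column to chart", df.columns)
    st.bar_chart(df[column].value_counts())
```

Saved as explore.py, it would run with "streamlit run explore.py".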

Interesting quotes and stories

  • “The first rule of machine learning is don’t use machine learning.” A playful but pointed reminder that simpler solutions can sometimes suffice.
  • “We realized no one wanted our sales calls anymore, so we built a community for them to come to us.” Demetrios on how the MLOps Community was born from necessity during the pandemic.
  • “When you properly test code, you get ‘built-in documentation.’ The tests show how your code is meant to behave.” Reflecting the guests’ emphasis on robust testing.

Key definitions and terms

  • MLOps: A practice focused on streamlining and scaling the deployment, maintenance, and monitoring of machine learning models in production.
  • Reproducibility: Ensuring that experiments (especially in ML) can be run again with identical data and code to yield consistent results.
  • Feature Store: A centralized place where curated, documented, and versioned features are stored for serving models consistently in production and training.


Overall takeaway

Machine learning development becomes significantly more powerful when approached as an end-to-end process: Start simple, focus on high-impact business metrics, apply solid testing and version control practices, and only scale to advanced tooling when necessary. The MLOps Community shows the value of connecting with peers to learn about real-world ML challenges. Ultimately, a successful ML project balances a clear sense of business needs, technical simplicity, and collaborative synergy among data engineers, scientists, and stakeholders.

Links from the show

Demetrios Brinkmann: @DPBrinkm
Kate Kuznecova: linkedin.com
Vishnu Rachakonda: linkedin.com

MLOps Community: mlops.community
Feature stores: mlops.community
Great Expectations: github.com
DVC (Data Version Control): dvc.org
Streamlit: streamlit.io
MLOps Jobs: mlops.pallet.com
Made With ML Apps: madewithml.com
Banana.dev: banana.dev
FastAPI: fastapi.tiangolo.com
MLOps without too much Ops: towardsdatascience.com
NBDev: nbdev.fast.ai
The "Works on My Machine" Certification Program: codinghorror.com
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
