
State of Data Science in 2021

Episode #333, published Fri, Sep 10, 2021, recorded Thu, Sep 9, 2021

We know that Python and data science are growing in lock-step together. But exactly what's happening in the data science space in 2021? Stan Seibert from Anaconda is here to give us a report on what they found with their latest "State of Data Science in 2021" survey.


Episode Deep Dive

Guest Introduction and Background

Stan Seibert has a background in particle physics research, where he relied heavily on Python for data analysis, making him a prime example of a scientist turned software engineer. He manages open-source teams at Anaconda working on projects like Numba, Dask, and, more recently, Pyston (an initiative to speed up Python itself). His history runs from writing BASIC on an Osborne 1 computer as a kid to working at the forefront of data science tooling at Anaconda.

What to Know If You're New to Python

If you’re brand new to Python and want to follow along more easily:

  • Understand basic Python data structures (lists, dictionaries, strings, numbers) as they’re frequently used in data science.
  • Get comfortable with installing packages (e.g., via pip or conda) since data science relies on many external libraries (e.g., NumPy, pandas).
  • Learn about “environments” so you can avoid version conflicts—Conda or virtual environments help isolate different projects’ requirements.
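To make the idea of an "environment" concrete, here is a minimal sketch that reports whether the running interpreter sits inside a virtual environment or an activated conda environment. The function name `describe_environment` is made up for this example. It relies on two standard signals: `sys.prefix` differs from `sys.base_prefix` inside a venv, and conda sets the `CONDA_DEFAULT_ENV` variable when an environment is activated.

```python
import os
import sys


def describe_environment():
    """Report which kind of isolated environment, if any, is active.

    Hypothetical helper for illustration; not part of any library.
    """
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base interpreter.
    in_venv = sys.prefix != sys.base_prefix
    # conda exports CONDA_DEFAULT_ENV (the environment name) on activation.
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    return {
        "python": sys.version.split()[0],
        "in_venv": in_venv,
        "conda_env": conda_env,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this once in a fresh terminal and once after activating an environment should show different values, which is a quick way to confirm that a project is using the interpreter you expect.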

Key Points and Takeaways

  1. State of Data Science Survey 2021: The discussion centers on Anaconda’s “State of Data Science in 2021” report, which surveyed thousands of data practitioners worldwide. The survey looked at the growing prevalence of data science across many industries, from tech and finance to automotive, and gauged how companies view investment, open-source usage, and the pandemic’s effect on data initiatives.
  2. Role of Python in Data Science Growth: Python and data science are closely intertwined. On the platform side, nearly a third of Python developers use macOS, Windows usage is significant, and Linux is what runs in production. Python’s allure lies in being both easy to start with and powerful enough for advanced numeric and AI tasks. Stan emphasized that Python’s ecosystem of libraries (NumPy, pandas, etc.) makes it an industry standard.
  3. Anaconda Distribution and Conda Packaging: Anaconda’s mission is to simplify packaging and distribution for scientific Python. Installing Fortran or C++ dependencies can be complicated, and Conda abstracts that complexity away. This approach especially benefits Windows users, historically the hardest environment to manage for science libraries.
  4. ARM and Apple Silicon (M1) for Data Science: Apple’s move to the M1 architecture is exciting but has introduced challenges for tooling. Many data science projects rely on lower-level C, C++, or Fortran code, which requires significant changes to support M1 natively. Still, Rosetta 2 emulation is surprisingly fast, only around 20% slower for many tasks, which buys developers time while the ecosystem catches up.
  5. Supply Chain Security and Open Source Governance: The survey revealed that about half of participants’ organizations use managed repositories or private mirrors for open-source packages. Others run vulnerability scanners or manual checks against public CVE databases. Data science teams now face the same supply chain security concerns as core dev teams.
  6. Distributed Teams and Centers of Excellence: Organizations use different models to embed data science across departments. Some rely on data scientists embedded within marketing, finance, or R&D; others create a data science “Center of Excellence” to define best practices and governance. Each approach balances domain expertise with consistent tooling.
  7. Impact of COVID-19 on Data Science: The pandemic influenced data science spending in diverse ways. Some companies shrank budgets due to uncertainty, while others increased spending as data-driven decisions became crucial. Whether data science was viewed as essential or experimental often dictated this difference.
  8. Challenges Moving Models to Production: Many teams struggle with recoding data science models for, or integrating them with, production stacks written in Java, .NET, or other languages. Others want to move from R to Python for performance or tooling reasons. This friction was one of the most common production blockers.
  9. Contributions to Open Source: A high percentage of organizations now encourage open source engagement, about 65% according to the Anaconda survey. Having internal policies that support contributions is a major shift from the past, when many were wary of open sourcing anything. This signals healthier ecosystems and better collaboration.
  10. Pyston and Python Performance Initiatives: In addition to specialized JITs like Numba for numeric loops, broader efforts like Pyston aim to make Python faster overall. Anaconda recently hired the Pyston team to accelerate development. These projects complement CPython’s ongoing work to optimize Python 3.11 and beyond, underscoring the importance of performance in the data world.
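As an illustration of the "specialized JIT for numeric loops" idea in point 10, the sketch below applies Numba's `njit` decorator to a plain Python loop. It was not shown in the episode. The function `sum_of_squares` is a made-up example, and the fallback decorator is an assumption added here so the snippet still runs (uncompiled) when Numba is not installed.

```python
try:
    # Numba's decorator for compiling a function in "nopython" mode.
    from numba import njit
except ImportError:
    def njit(func):
        # Assumption for this sketch: without Numba, run as ordinary Python.
        return func


@njit
def sum_of_squares(n):
    """Add up i*i for i in range(n) with an explicit numeric loop."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    # The first call triggers compilation when Numba is present;
    # later calls reuse the compiled machine code.
    print(sum_of_squares(1_000_000))
```

The loop body is deliberately simple numeric code, since that is the kind of work a numeric JIT targets. A project like Pyston instead aims to speed up general-purpose Python without decorators.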

Interesting Quotes and Stories

  • Stan on data cleaning: “No data set is perfect… You can’t ever clean the data just once, because what you’re doing is preparing it for the questions you’re going to ask.”
  • Stan on ARM performance: “Rosetta 2 on average is sort of like a 20% speed hit, which for an entire architecture switch is amazing.”
  • On open source: “Encourage your organization to look at the open source libraries they rely on most, and give back to the maintainers or projects that matter most to them.”
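The Rosetta 2 quote can be made concrete with a small sketch. The helper names are hypothetical, and reading a "20% speed hit" as a runtime roughly 1.2 times the native one is an assumption about what the figure means. `platform.system()` and `platform.machine()` are standard-library calls. On macOS, a Python built for x86_64 reports `x86_64` even when it runs through Rosetta 2 on Apple Silicon, so that case cannot be told apart from an Intel Mac by these two values alone.

```python
import platform


def describe_architecture(system=None, machine=None):
    """Summarize the platform this Python build is running as."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "macOS, native Apple Silicon build"
    if system == "Darwin" and machine == "x86_64":
        # Could be an Intel Mac, or an x86_64 build translated by Rosetta 2.
        return "macOS, x86_64 build (Intel Mac, or Apple Silicon via Rosetta 2)"
    return f"{system} on {machine}"


def estimated_runtime(native_seconds, slowdown=0.20):
    """Rough runtime under translation, assuming a fractional slowdown.

    Assumption: a 20% hit means the job takes about 1.2x as long.
    """
    return native_seconds * (1 + slowdown)


if __name__ == "__main__":
    print(describe_architecture())
    print(f"{estimated_runtime(100.0):.1f} s estimated for a 100 s native job")
```

Under that reading, a job that takes 100 seconds natively would take roughly two minutes under translation. That is why emulation was a workable stopgap while native builds of the compiled libraries were still arriving.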

Key Definitions and Terms

  • Conda / conda-forge: Conda is a package manager, and conda-forge is a community-maintained package repository for it. Together they simplify installing Python libraries with compiled dependencies across platforms.
  • Rosetta 2: Apple’s translation layer allowing x86-based apps to run on Apple Silicon (M1) devices.
  • Center of Excellence (CoE): A centralized group or department setting standards and best practices, in this context for data science.
  • CVEs: Common Vulnerabilities and Exposures, referencing publicly disclosed security issues in software.
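Checking packages against CVE databases starts with knowing what is actually installed. The sketch below builds that inventory with the standard library's `importlib.metadata`. The function name `installed_packages` is made up for this example, and the snippet does not query any vulnerability database. It only produces the name-to-version list that a scanner or a manual check would start from.

```python
from importlib import metadata


def installed_packages():
    """Return {package name: version} for the current environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip installs whose metadata is missing a name
            packages[name.lower()] = dist.version
    # Sort by name so the inventory is stable and easy to compare over time.
    return dict(sorted(packages.items()))


if __name__ == "__main__":
    for name, version in installed_packages().items():
        print(f"{name}=={version}")
```

Saving this output per environment gives a team a simple record to compare against security advisories or against the contents of a managed mirror.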

Overall Takeaway

Data science has matured in both reach and complexity, fueled in large part by Python’s robust ecosystem and approachable design. The “State of Data Science in 2021” shows not just Python’s ongoing dominance, but also the roadblocks enterprises face, from security practices and environment setup to bridging domain expertise and technical implementation. Tools like Conda, plus new performance initiatives like Pyston, help keep Python both accessible and powerful for data-driven discovery in the years to come.

Links from the show

Stan on Twitter: @seibert
State of data science survey results: know.anaconda.com

A Python Data Scientist’s Guide to the Apple Silicon Transition: anaconda.com
NumPy M1 issue: github.com
A Python Developer Explores Apple's M1 (Michael's video): youtube.com
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
