Building a data science startup (panel)
On this episode, I welcome back four prior guests who have each walked their own path to founding a company and are currently running successful Python-based data science startups.
Episode Deep Dive
Guests Introduction and Background
Ines Montani (Explosion)
- Co-founder of Explosion, creators of the popular NLP library spaCy.
- Specializes in building developer tools for machine learning, with a focus on production-grade NLP solutions.
- Emphasizes the importance of giving developers control and transparency in their ML workflows.
Jonathon Morgan (Yonder)
- Founder of Yonder (formerly focused on automated ML, now on identifying agenda-driven groups online).
- Background in web development and machine learning engineering.
- Has done extensive work mapping and analyzing how groups propagate information (or disinformation) on social media.
Matthew Rocklin (Coiled)
- Creator of Dask and founder of Coiled.
- Has extensive experience scaling Python workloads, including parallel computing and large-scale data processing.
- Focuses on enterprise solutions and infrastructure around Dask, enabling teams to seamlessly scale out their Python and data workflows.
William Stein (CoCalc)
- Founded the open-source mathematics software SageMath and then CoCalc.
- Transitioned from a tenured mathematics professor to full-time startup founder.
- CoCalc makes collaboration on Jupyter Notebooks, terminals, and other data science tools accessible in the cloud, often used in education.
What to Know If You're New to Python
Here are a few tips from the conversation to help you get the most out of the episode if you're still learning Python:
- Python’s open-source ecosystem is vast and can be a foundation for launching products and services.
- Many data science businesses build around existing libraries (e.g., pandas, spaCy, Dask) while adding their own unique twist or service.
- Community-driven packages can become the kernel of a profitable business model when built thoughtfully.
- If you want a structured way to start Python from scratch, check out Python for Absolute Beginners: A thorough introduction to core Python concepts for those just starting out.
Key Points and Takeaways
- Building a Data Science Startup around Open Source
Explosion (spaCy), Coiled (Dask), Yonder, and CoCalc have each built a commercial offering around an open-source foundation or ecosystem. This approach can attract a passionate user base and quickly provide real-world proof of concept. However, each company had to carefully balance what remains free against what becomes the commercial product (e.g., advanced tooling, platforms, or services).
- Links and Tools
- Explosion: https://explosion.ai/
- spaCy: https://spacy.io/
- Coiled: https://coiled.io/
- CoCalc: https://cocalc.com/
- Yonder: https://www.yonder-ai.com/
- Transitioning from Academia to Entrepreneurship
William Stein shared how juggling a professorship while running a growing startup (CoCalc) was unsustainable—eventually prompting him to leave the university. Academic skills in research and teaching offer insight into user pain points (especially educators), but the business world moves faster and measures success differently (e.g., revenue, user adoption).
- Links and Tools
- SageMath: https://www.sagemath.org/
- Business Models for Data Science (Consulting vs. Product)
Early on, many open-source projects or small teams do consulting to bootstrap product development. While consulting brings in revenue and fosters user feedback, it’s easy to become stretched thin. Several panelists discussed eventually productizing their offerings so they could focus on a single, scalable platform or service.
- Links and Tools
- Getting Started with NLP and spaCy (Talk Python Training)
- Finding the Right Funding Model
Some companies (Explosion, CoCalc) avoided or delayed major venture capital funding to remain independent, while Yonder and Coiled moved toward VC once they recognized a large market opportunity. Both strategies require awareness of tradeoffs, including slower growth but more control versus faster growth with investor demands and potential dilution.
- Challenges of Productizing Data Science and MLOps
The panel pointed out how turning data science prototypes into production systems is a challenge—requiring versioning, infrastructure, security, and real-time monitoring. Tools like Dask, spaCy, or Prodigy (a commercial annotation tool from Explosion) were born precisely because bridging “prototype” to “production” is often more complex than building models.
- Links and Tools
- Prodigy: https://prodi.gy/
- Dask: https://www.dask.org/
- Building a Community and Developer Experience
User adoption was faster because these founders focused on building straightforward developer experiences—helping people quickly get value from the library or platform (e.g., spaCy’s user-friendly API, or Coiled's minimal friction for scaling Dask). Developers adopting a tool or library can become a product’s champions, pushing for it within their organizations.
- Links and Tools
- spaCy docs: https://spacy.io/usage
- Coiled docs: https://docs.coiled.io/
- Fitting the Product to the Customer (Ownership vs. Cloud Services)
Ines Montani explained Explosion’s preference for a model where developers have full local control (downloadable software rather than only a hosted service). The panelists noted many businesses or universities want to “own their data” and minimize long-term subscription dependencies. Yet, fully hosted solutions (like Coiled, CoCalc) also find success by handling tricky operational details for customers.
- The Importance of Developer-Empathetic Design
Jonathon Morgan highlighted how bringing data science insights to non-technical end users (e.g., communications specialists) requires empathy and user-focused design. Similarly, each founder stressed that developer-friendly interfaces, or a straightforward UI, can differentiate a successful product from a merely interesting tech demo.
- Links and Tools
- Yonder’s website for examples of specialized data insights
- The Role of Consulting as a Springboard
Even though consulting can distract from product development, it’s an effective way to learn a customer’s real needs, refine the core library or platform, and secure early revenue. By repeatedly integrating the same open-source solutions for clients, founders discover consistent pain points, eventually leading to a product that solves them.
- Links and Tools
- Fundamentals of Dask (for scaling Python with Dask)
- Regulations, Administration, and Overhead
Whether dealing with multiple states or countries, universities, or regulated industries, founders face administrative obstacles: vendor enrollment, procurement policies, compliance forms, or legal complexities. Although tedious, the panel advised it’s less daunting once you recognize these steps are standard, and many specialized services (e.g., payroll platforms, PEOs) exist to streamline operations.
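The funding tradeoff mentioned in the key points above (more control when bootstrapping versus dilution when raising VC) can be made concrete with a little arithmetic. The following is a minimal sketch, not something discussed on the show: the function name `ownership_after_round` and all dollar figures are hypothetical, chosen only to make the numbers easy to follow, and it ignores real-world details such as option pools and convertible notes.

```python
def ownership_after_round(founder_share: float, investment: float, pre_money: float) -> float:
    """Return the founders' ownership fraction after one priced funding round.

    New investors receive investment / (pre_money + investment) of the company,
    and every existing holder is diluted by that same fraction.
    """
    post_money = pre_money + investment
    investor_share = investment / post_money
    return founder_share * (1.0 - investor_share)


# Hypothetical numbers for illustration only.
share = 1.0  # founders own 100% while bootstrapping

# Round 1: raise 2M at an 8M pre-money valuation (investors get 2M / 10M = 20%).
share = ownership_after_round(share, investment=2_000_000, pre_money=8_000_000)
print(f"After first round: {share:.0%}")

# Round 2: raise 10M at a 30M pre-money valuation (investors get 10M / 40M = 25%).
share = ownership_after_round(share, investment=10_000_000, pre_money=30_000_000)
print(f"After second round: {share:.0%}")
```

Under these assumed figures the founders go from full ownership to 80% and then 60%, while the company gains capital to grow faster. A bootstrapped company keeps the full share but has to fund growth from revenue.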
Interesting Quotes and Stories
William Stein on balancing two careers: “There were actual moments where the site was under denial of service attack, and I was about to teach a class. I literally had to choose which responsibility to handle.”
Ines Montani on shipping commercial tools in ML: “If you're a power user of our open-source, eventually you’ll want to train your own models. You can do that effectively if the tools let you own your data and truly control the process.”
Matthew Rocklin on consulting vs. product: “We started as a consulting shop—lots of direct user feedback. Eventually, it became a no-brainer to productize Dask in a service that just worked for teams.”
Key Definitions and Terms
- MLOps: A set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently.
- Bootstrapping: Starting a business with little or no external funding, using personal finances or customer revenue.
- SaaS (Software-as-a-Service): A software distribution model where customers can access software via a subscription, typically hosted in the cloud.
- VC (Venture Capital): Private equity financing provided by investors to startups and small businesses with perceived long-term growth potential.
- Open-Source Licensing: Legal frameworks governing how software can be freely used, modified, and shared.
Learning Resources
If you want to dive deeper into the topics from the conversation, here are some excellent next steps:
- Python for Absolute Beginners: A complete, step-by-step guide to learning Python from the ground up.
- Getting Started with NLP and spaCy: Perfect if you’re interested in text processing, NLP, and building out language-driven solutions.
- Fundamentals of Dask: Dive deeper into how to scale Python for large datasets and parallel computing.
- Build An Audio AI App: For exploring real-world AI applications, specifically with speech-to-text and summarization features.
Overall Takeaway
The journey from idea to viable data science startup involves blending technical expertise with practical business considerations, whether that’s balancing academic work, deciding on funding models, or learning the nuts and bolts of productizing a library. Many successful Python-based businesses, such as Explosion (spaCy), Coiled (Dask), Yonder, and CoCalc, are anchored in open source. They leverage community support and focus on developer-friendly solutions. If you identify a genuine need, especially one you’ve lived or observed firsthand, and pair it with sustainable revenue and a supportive community, you can build a thriving startup in the data science space.
And now for a song
Data Science Flow - (A fun rap-style tribute to data science)
(Chorus)
Data science in the lab, Python on the track
Stacking up solutions, code is how we interact
Talk numbers, talk clusters, gotta keep it exact
We’re takin’ data to the max, there’s no turnin’ back
Verse 1
We start off with pandas, wranglin’ rows and columns
GroupBy, pivot tables, handle data like a boss and
NumPy for the arrays, linear algebra quick
When the size gets too big, we scale out with a trick
Maybe Dask in the mix, parallel on every core
Compute that big data, HPC behind the door
We test all them models, scikit-learn doin’ right
Train a random forest, see them features take flight
(Chorus)
Data science in the lab, Python on the track
Stacking up solutions, code is how we interact
Talk numbers, talk clusters, gotta keep it exact
We’re takin’ data to the max, there’s no turnin’ back
Verse 2
We got Jupyter notebooks, speakin’ truth in each cell
Plotly or matplotlib to visualize so well
We store it all in cloud or we keep it local-side
Manipulatin’ data has us feelin’ full of pride
When the code’s on production, we push out with Docker
Or serve an API with FastAPI shocker
Models in the pipeline, automated flows
MLOps on the daily, watch them metrics grow
Bridge
Scripts flowin’ smooth, rap lines so sweet
Data science in Python can't be beat
From prototypes to real-world feats
Collaboration keep us all up on our feet
(Chorus)
Data science in the lab, Python on the track
Stacking up solutions, code is how we interact
Talk numbers, talk clusters, gotta keep it exact
We’re takin’ data to the max, there’s no turnin’ back
Outro
Raise a toast to the coders, from novices to pros
Solvin’ real problems with machine-learn flows
We jam on the data, watch the insights glow
Python plus analytics, it’s the data science show!
Links from the show
Ines Montani
Twitter: @_inesmontani
Explosion AI: explosion.ai
Matthew Rocklin
Twitter: @mrocklin
Coiled: coiled.io
Jobs @ Coiled: jobs.lever.co/coiled
Jonathon Morgan
Twitter: @jonathonmorgan
Yonder AI: yonder-ai.com
William Stein
Twitter: @wstein389
CoCalc: cocalc.com
Talk Python Live Streams: talkpython.fm/youtube
Sentry Promo Code: TALKPYTHON2021
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy