
Gene Editing with Python

Episode #335, published Fri, Sep 24, 2021, recorded Wed, Sep 15, 2021

Gene therapy holds the promise to permanently cure diseases that have been considered life-long challenges. But the complexity of rewriting DNA is truly huge and lives in its own special kind of big-data world.

On this episode, you'll meet David Born, a computational biologist who uses Python to help automate genetics research and helps move that work to production.

Episode Deep Dive

Guest Introduction and Background

David Born is a computational biologist at Beam Therapeutics, where he uses Python and large-scale computing to advance gene editing research. He came to programming from the biology side, earning a graduate degree in genetics before picking up Python to tackle complex data analysis. At Beam Therapeutics, David focuses on developing software pipelines and infrastructure to automate and scale genetics research, including analysis of CRISPR-based gene editing data. His day-to-day involves analyzing vast sequencing datasets, collaborating with lab researchers, and orchestrating large workloads in the cloud.

What to Know If You're New to Python

This episode touches on how Python is leveraged in biology and big data contexts. A basic grasp of running scripts, working in notebooks, and using packages like NumPy and pandas will be helpful. Many of David’s workflows rely on standard Python practices—such as version control, containerization with Docker, and an understanding of web frameworks like Django—to ensure reproducibility and scalability. Here’s a resource to deepen your knowledge:

  • Documentation for pandas and NumPy: Focus on data manipulation and fundamental numeric operations.

Key Points and Takeaways

  1. Precision Gene Editing and CRISPR CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is a molecular machine that can be directed to precisely edit specific DNA sequences. David discusses how Beam’s therapies target single-gene mutations—such as those involved in sickle cell disease—and use CRISPR-based approaches to correct them. The complexity of tracking, validating, and scaling these gene edits leads to massive data demands, making Python’s tooling invaluable.
  2. Managing Big Data in Genetics Gene editing experiments produce large sequencing datasets that must be processed for accuracy, reproducibility, and compliance. Python’s data science ecosystem (e.g., pandas, NumPy, Jupyter notebooks) lets researchers quickly prototype analysis steps. Once validated, these steps become more formalized pipelines for production use (see the pandas sketch after this list).
  3. AWS Cloud Infrastructure for Scalability To handle massive sequencing runs—sometimes tens of gigabytes to terabytes of data—Beam Therapeutics uses AWS services extensively. David highlighted using S3 for data storage, AWS Lambda for event-driven tasks, and AWS Batch for highly parallel workloads. This serverless and on-demand model drastically simplifies managing thousands of compute cores.
  4. Infrastructure as Code with AWS CDK One of the pivotal components enabling reproducible and maintainable cloud infrastructure is the AWS Cloud Development Kit (CDK) in Python. Rather than hand-configuring services through the AWS console, David’s team stores infrastructure definitions in source control and programmatically deploys them (a minimal CDK sketch follows this list).
  5. Orchestrating Pipelines with Luigi and Other Workflow Managers Large bioinformatics pipelines often involve chaining together many third-party tools and custom Python scripts. David’s team leans on Luigi for Python-based workflows (see the toy Luigi pipeline after this list). Other popular options mentioned are Nextflow, Airflow, and Dagster, each providing a system to build complex directed acyclic graphs (DAGs) of tasks.
  6. Reproducibility in Computational Biology Biotechnology and pharmaceutical research must be reproducible for regulatory and scientific reasons. David highlighted how Docker containers, pinned library versions, and source-controlled data workflows help ensure consistent results. They also build pipelines so that external partners—such as the FDA or other collaborators—can rerun analyses with the same code.
  7. From Jupyter Prototypes to Production Many data experiments begin as quick prototypes or proof-of-concepts in Jupyter notebooks. Once validated, the code moves into more robust scripts with standardized libraries, eventually placed into orchestrated Docker containers for large-scale runs. This systematic process ensures a smooth transition from data exploration to production pipelines.
  8. Django and Database Choices for Genetic Data David’s team uses Django with a MySQL backend to manage large amounts of metadata and experimental records. While some worry about ORMs slowing down, their well-indexed schema and optimized queries have kept performance high. The Django REST Framework also helps expose internal services as needed (a hypothetical model sketch follows this list).
  9. Scaling Out HPC Jobs Occasionally, the team needs truly massive CPU resources—10,000+ cores for days—to simulate or analyze certain molecular events. Rather than buying a supercomputer, AWS Batch orchestrates ephemeral EC2 instances running a fleet of Docker containers. They carefully test pipeline logic on smaller subsets of data to avoid expensive mistakes on large-scale runs (see the boto3 Batch sketch after this list).
  10. Remote Development with VS Code On a day-to-day basis, many of the scientists use Visual Studio Code’s remote development extension to interact directly with cloud VMs or on-prem machines. This setup allows them to maintain a consistent Python environment, run computations on more powerful servers, and keep iteration loops tight.
  11. Team Collaboration and Cost Management Large bioinformatics computations can cost tens of thousands of dollars. David emphasized the importance of collaboration between experimental scientists and software engineers, plus frequent checks and tests for correctness before launching large runs. With good communication and versioning, they keep mistakes and wasted compute to a minimum.
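
To ground point 2, here's a minimal pandas sketch of the kind of exploratory tally described there, written the way it might look once promoted from a notebook cell into a script. The file name, column names, and outcome labels are invented for illustration; they are not from the episode.

```python
# Toy exploratory analysis: tally editing outcomes per target site from a
# per-read results table. All names here are hypothetical placeholders.
import pandas as pd


def summarize_edits(path: str) -> pd.DataFrame:
    """Count outcomes per target site and compute the intended-edit rate."""
    reads = pd.read_csv(path)  # expected columns: read_id, target_site, outcome
    summary = (
        reads.groupby(["target_site", "outcome"])
        .size()
        .unstack(fill_value=0)  # one column per outcome category
    )
    # Fraction of reads at each site that carry the intended edit
    summary["edit_rate"] = summary.get("edited", 0) / summary.sum(axis=1)
    return summary


if __name__ == "__main__":
    print(summarize_edits("sequencing_results.csv"))
```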
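
Point 4's infrastructure-as-code idea looks roughly like this with the AWS CDK (v2) in Python. This is a minimal sketch under assumed resource names (the stack, bucket, and function names are hypothetical, not Beam's actual setup): it wires an S3 bucket to a Lambda function so each uploaded sequencing file triggers processing, and the whole definition lives in source control.

```python
# Minimal CDK v2 sketch: event-driven S3 -> Lambda wiring, defined in code.
import aws_cdk as cdk
from aws_cdk import (
    Stack,
    aws_lambda as lambda_,
    aws_s3 as s3,
    aws_s3_notifications as s3n,
)
from constructs import Construct


class SequencingPipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Bucket that receives raw sequencing runs (hypothetical name)
        raw_data = s3.Bucket(self, "RawSequencingData")

        # Lambda that kicks off downstream processing for each new file
        handler = lambda_.Function(
            self, "OnNewRun",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="index.handler",
            code=lambda_.Code.from_asset("lambda"),  # directory with index.py
        )

        # Event-driven wiring: object creation in S3 invokes the Lambda
        raw_data.add_event_notification(
            s3.EventType.OBJECT_CREATED, s3n.LambdaDestination(handler)
        )


app = cdk.App()
SequencingPipelineStack(app, "SequencingPipeline")
app.synth()
```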
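
For point 5, here's a toy Luigi pipeline showing the core pattern: each task declares its output and its upstream requirements, and Luigi assembles the DAG and skips any step whose output already exists. Task names and file paths are invented; real bioinformatics steps would wrap tools like aligners rather than writing placeholder text.

```python
# Toy Luigi DAG: CallEdits depends on AlignReads; rerunning skips done steps.
import luigi


class AlignReads(luigi.Task):
    run_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"data/{self.run_id}.aligned.txt")

    def run(self):
        # Placeholder for a real alignment step
        with self.output().open("w") as f:
            f.write(f"aligned reads for {self.run_id}\n")


class CallEdits(luigi.Task):
    run_id = luigi.Parameter()

    def requires(self):
        return AlignReads(run_id=self.run_id)

    def output(self):
        return luigi.LocalTarget(f"data/{self.run_id}.edits.txt")

    def run(self):
        with self.input().open() as reads, self.output().open("w") as out:
            out.write(f"edit calls based on: {reads.read()}")


if __name__ == "__main__":
    luigi.build([CallEdits(run_id="run42")], local_scheduler=True)
```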
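
For point 8, here's a hypothetical models.py sketch of the kind of indexed metadata schema described. The tables and fields are invented for illustration and are not Beam's actual schema.

```python
# Hypothetical Django models for experiment metadata; indexes target the
# columns that common queries filter on, keeping the ORM fast at scale.
from django.db import models


class SequencingRun(models.Model):
    run_id = models.CharField(max_length=64, unique=True)
    instrument = models.CharField(max_length=64)
    started_at = models.DateTimeField(db_index=True)


class EditResult(models.Model):
    run = models.ForeignKey(
        SequencingRun, on_delete=models.CASCADE, related_name="results"
    )
    target_site = models.CharField(max_length=128)
    edit_rate = models.FloatField()

    class Meta:
        # Composite index serves the common "results for a site in a run" query
        indexes = [models.Index(fields=["run", "target_site"])]
```

With indexes matching the queries, lookups like EditResult.objects.filter(run__run_id="run42") stay fast even over large metadata tables, which matches the team's experience with a well-indexed MySQL backend.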
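
And for point 9, here's a sketch of fanning one analysis out across many containers as an AWS Batch array job via boto3. The job queue and job definition names are placeholders; each child container reads the AWS_BATCH_JOB_ARRAY_INDEX environment variable to pick its shard of the data.

```python
# Submit one Batch array job; AWS fans it out into num_shards child jobs.
import boto3

batch = boto3.client("batch")


def submit_analysis(run_id: str, num_shards: int) -> str:
    """Submit an array job whose children each process one shard of the run."""
    response = batch.submit_job(
        jobName=f"analyze-{run_id}",
        jobQueue="sequencing-queue",           # hypothetical queue name
        jobDefinition="edit-analysis:3",       # hypothetical job definition
        arrayProperties={"size": num_shards},  # one child job per shard
        containerOverrides={
            "environment": [{"name": "RUN_ID", "value": run_id}],
        },
    )
    return response["jobId"]


if __name__ == "__main__":
    # Test on a small shard count first; large runs are expensive mistakes
    print(submit_analysis("run42", num_shards=10))
```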

Interesting Quotes and Stories

  • Big-Picture Vision: David mentioned, “One day we could say, ‘Hey Alexa, how do we cure sickle cell disease?’ and it might recommend the gene editing steps to fix it.” This underscores the future-facing approach at Beam Therapeutics.
  • High-Performance Computing Scare: David told a story about how a code misconfiguration can bring down an entire cluster or fill up a disk quickly, a reminder to test carefully and version-check your pipelines before a large-scale run.
  • Crossing Biology and Software: David described how coming from a purely biological background, it took just a few weeks to become productive in Python—reinforcing how accessible it can be for domain experts in any field.

Key Definitions and Terms

  • CRISPR: A molecular tool allowing precise DNA sequence targeting and cutting for gene editing.
  • AWS Batch: A managed service to run hundreds or thousands of containerized batch computing jobs at scale on AWS.
  • Luigi: A Python-based workflow manager for building complex data processing tasks (DAGs).
  • ORM (Object Relational Mapper): A layer allowing you to interact with databases via class-based APIs, such as Django’s models, rather than raw SQL.
  • Docker Container: A lightweight virtualization method that bundles code, runtimes, and dependencies into portable units.

Overall Takeaway

Python has become central to the future of gene editing research, as seen in Beam Therapeutics’ efforts to tackle enormous genetics datasets with robust and reproducible pipelines. Through a combination of established Python libraries, containerization, and cloud-based scale-out infrastructure, teams like David Born’s can quickly iterate on new genetic insights and move proven methods into production. This episode highlights not only the cutting-edge science but also the power of Python to bring domain experts and software engineers together to solve real-world problems—potentially changing healthcare and saving lives in the process.

Links from the show

David on Twitter: @Hypostulate
Beam Therapeutics: beamtx.com

AWS Cloud Development Kit: aws.amazon.com/cdk
Jupyter: jupyter.org
$1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud: arstechnica.com
Luigi data pipelines: luigi.readthedocs.io
AWS Batch: aws.amazon.com/batch
What is CRISPR?: wikipedia.org
SUMMIT supercomputer: olcf.ornl.gov/summit
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
