Python in Medicine and Patient Care
Episode Deep Dive
Guest Background
Our guest for this episode was Dr. Somak Roy, an Associate Professor and Director of Molecular Pathology at Cincinnati Children’s Hospital. He’s a molecular pathologist who analyzes genomic data (DNA, RNA) to diagnose and manage pediatric cancer. Dr. Roy also has a clinical informatics background and uses Python extensively in his day-to-day work for building bioinformatics pipelines, clinical applications, and lab information systems.
Dr. Roy’s Path to Using Python in Medicine
- Early Interest: Started programming in VB.NET and C# for simple medical web applications but found the .NET world less friendly for specialized data science and bioinformatics tasks.
- Discovering Python: Shifted to Python due to its large and active community, extensive libraries for scientific computing, and significant bioinformatics tooling (e.g., Biopython).
What Molecular Pathologists Do
- Lab Workflow: Dr. Roy’s lab runs genomic tests that guide diagnosis and treatment. They receive tissue or blood samples from pediatric patients, extract DNA/RNA, perform next-generation sequencing, and analyze the data.
- Computational Challenges: Genomic data can be very large (gigabytes to terabytes). Accuracy and reproducibility are paramount because decisions affect patient care.
Running Python in a Hospital Environment
- Regulatory Requirements: Clinical labs must prove tools are reliable and keep strict version control of software. Changes to pipelines require validation for consistent sensitivity and specificity.
- Infrastructure: The lab uses on-premises Kubernetes clusters, containerization (Docker), CI/CD-like workflows, and versioned pipelines to ensure reproducibility and auditability.
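To make the validation requirement concrete, here is a minimal sketch of how sensitivity and specificity might be computed when a pipeline's calls are compared against a known truth set. The function name, the variant identifiers, and the site counts are all hypothetical and chosen for illustration; the episode does not describe the lab's actual validation code.

```python
def sensitivity_specificity(truth, called, all_sites):
    """Compare a pipeline's variant calls against a truth set.

    truth:     set of variant identifiers known to be present
    called:    set of variant identifiers the pipeline reported
    all_sites: set of every site assessed (positives and negatives)
    """
    tp = len(truth & called)               # real variants that were found
    fn = len(truth - called)               # real variants that were missed
    fp = len(called - truth)               # reported variants that are not real
    tn = len(all_sites - truth - called)   # negative sites correctly left uncalled
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical validation run: 4 known variants, 6 known-negative sites.
truth = {"chr1:100A>T", "chr2:200G>C", "chr3:300C>T", "chr4:400T>G"}
called = {"chr1:100A>T", "chr2:200G>C", "chr3:300C>T", "chr5:500A>G"}
negatives = {f"chr9:{pos}" for pos in range(1, 7)}

sens, spec = sensitivity_specificity(truth, called, truth | called | negatives)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Under these assumed inputs the pipeline finds 3 of 4 true variants and raises 1 false positive against 6 true negatives. A revalidation step could run the same comparison for each new pipeline version and flag any drop relative to the previously validated numbers.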
Key Tools and Packages Discussed
Below are the major Python packages and projects referenced during the conversation. Each solves specific challenges in Dr. Roy’s lab and broader molecular pathology:
- Biopython
  - A foundational set of libraries and tools for bioinformatics work in Python.
  - Offers parsers and utilities for common sequence file formats (FASTA, FASTQ, GenBank, etc.) and routine genomics tasks; alignment files such as SAM/BAM are typically read with pysam instead.
- CNVkit
  - A toolkit to detect copy number variations (CNVs) in sequencing data.
  - Helps measure how many copies of a particular gene region exist in tumor samples (crucial for understanding tumor behavior).
- hgvs
  - A Python library that parses and validates HGVS-formatted variant descriptions.
  - Used to name or “normalize” the notation of genomic variants (e.g., describing a mutation at the gene, transcript, and protein levels).
- OpenPyXL
  - A Python library to read and write Excel files.
  - Useful in the lab’s interim workflows where results or QC data still need to be generated or shared as spreadsheets.
- Hera
  - A Python SDK for defining workflows on top of the Argo Workflows engine in Kubernetes.
  - Helps orchestrate bioinformatics pipelines as DAGs (directed acyclic graphs) while keeping everything in Python.
- In-Silico Mutagenesis Tools (insiM referenced)
  - Artificially introduce rare or hard-to-obtain mutations into real sequencing data, helping the lab validate pipelines without waiting for a real sample.
  - insiM was developed at the University of Chicago.
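The idea behind in-silico mutagenesis can be illustrated with a toy sketch: take reads that cover a position and substitute an alternate base in a chosen fraction of them. This is not how insiM or Bamsurgeon is implemented; real tools work on BAM files and account for base qualities, indels, and read pairs. The function name, the `(start, sequence)` read representation, and the parameters below are assumptions made only for this example.

```python
import random


def spike_in_snv(reads, pos, alt_base, vaf, seed=0):
    """Return a copy of `reads` with `alt_base` placed at reference position
    `pos` in roughly `vaf` (variant allele fraction) of the covering reads.

    reads: list of (start, sequence) tuples; 0-based start, ungapped alignment.
    """
    rng = random.Random(seed)  # fixed seed keeps the spike-in reproducible
    covering = [
        i for i, (start, seq) in enumerate(reads)
        if start <= pos < start + len(seq)
    ]
    n_mutate = round(vaf * len(covering))
    chosen = set(rng.sample(covering, n_mutate))

    mutated = []
    for i, (start, seq) in enumerate(reads):
        if i in chosen:
            offset = pos - start
            seq = seq[:offset] + alt_base + seq[offset + 1:]
        mutated.append((start, seq))
    return mutated


# Ten identical reads cover position 3 (reference base "T"); two reads do not.
reads = [(0, "ACGTACGT")] * 10 + [(20, "TTTT")] * 2
mutated = spike_in_snv(reads, pos=3, alt_base="A", vaf=0.3)

alt_count = sum(1 for start, seq in mutated if start == 0 and seq[3] == "A")
print(f"{alt_count} of 10 covering reads now carry the variant")
```

Because the variant and its allele fraction are known in advance, the mutated reads can be fed through a pipeline to check whether the variant is reported, which is the validation use described above.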
LLMs and AI in Molecular Pathology
- Growing Use Cases: Dr. Roy highlighted that AI and large language models could help with:
  - Variant calling (e.g., Google’s DeepVariant).
  - Literature searches for novel or rare mutations.
  - Predicting the functional or clinical impact of novel mutations.
- Current Caution: While AI tools are promising, referencing accurate scientific literature remains challenging if the model hallucinates or cites invalid references. Specialized, domain-trained models are likelier to succeed in clinical genomics.
Overall Takeaway
Python’s broad ecosystem and ease of use make it an invaluable tool in a tightly regulated clinical environment. Dr. Roy’s experience underscores how Python packages, ranging from bioinformatics libraries to DevOps tooling, enable efficient, reproducible, and scalable genomic analysis. By pairing robust scientific libraries (e.g., Biopython, CNVkit) with modern deployment tooling (Docker, Kubernetes, Hera), clinical labs can innovate quickly while maintaining the reliability that patient care demands.
Links from the show
Cincinnati Children's Hospital: cincinnatichildrens.org
CNVkit: Genome-wide copy number: readthedocs.io
cnaplotr: github.com
hgvs: readthedocs.io
openpyxl: readthedocs.io
Hera is an Argo Python SDK: github.com
insiM: in silico Mutator software for bioinformatics: github.com
Bamsurgeon: github.com
pysam - An interface for reading and writing SAM files: readthedocs.io
Scientists rename human genes to stop Microsoft Excel from misreading them as dates: theverge.com
BioPython: biopython.org
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm