
#470: Python in Medicine and Patient Care Transcript

Recorded on Sunday, Jun 23, 2024.

00:00 Python is special. It's used by the big tech companies, of course, but it's also used by

00:04 those you would rarely classify as developers. On this episode, we get a look inside how Python

00:10 is being used at a children's hospital to speed and improve patient care. We have Dr. Somak Roy

00:16 here to share how he's using Python in his day to day job to help kids get well a little bit faster.

00:23 This is Talk Python to Me, episode 470, recorded June 23rd, 2024.

00:30 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:46 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:51 both on mastodon.org. Keep up with the show and listen to over seven years of past episodes at

00:57 talkpython.fm. We've started streaming most of our episodes live on YouTube. Subscribe to our

01:03 YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of

01:09 that episode. This episode is brought to you by Sentry. Don't let those errors go unnoticed.

01:14 Use Sentry like we do here at Talk Python. Sign up at talkpython.fm/sentry. And it's brought to you

01:22 by Posit Connect from the makers of Shiny. Publish, share and deploy all of your data projects that

01:27 you're creating using Python. Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports,

01:34 Dashboards and APIs. Posit Connect supports all of them. Try Posit Connect for free by going to

01:40 talkpython.fm/posit. P-O-S-I-T. So, Mac, welcome to Talk Python. Awesome to have you here.

01:47 Hey, thank you, Michael, for the introduction. I'm happy to be here. Excited.

01:52 Yeah, I'm pretty excited to be talking about medicine and all the stuff that you guys are

01:58 doing with Python. And I really like these kinds of shows because I think it's important to

02:02 highlight that Python is not just for web developers and pure data science machine learning

02:08 people, but it's used by this huge spectrum of people doing all sorts of interesting stuff and

02:12 solving real problems with it. Right. And it sounds like you fall pretty solidly in that category.

02:16 Yeah, absolutely. It was like, you know, Python has been sort of this discovery as I've gone

02:22 through my career as a physician. And it's interesting how, to begin with, when I was training

02:27 and growing up, it was hard to imagine medicine and computer science

02:33 sort of being hand in hand together. But now I think things have progressed and there's a

02:37 lot of technology that's now in medicine that allows you to do all kinds of things. And of

02:43 course, as I've discovered Python, it brings in kind of the toolkit and the ability to be able to

02:52 achieve and solve problems in a way that I think has not been envisioned before. So it's a very

02:58 exciting time. Yeah, it is a very exciting time. And I think it's just getting better and better.

03:03 Before we get too far into this, tell people a quick bit about yourself, quick introduction.

03:07 Yeah, absolutely. So I'm Somak Roy. I am a molecular pathologist. It's a type of physician

03:15 who deals with looking at the genome of either a patient or patient's tissue. And we essentially

03:26 look at all of these things in a way to be able to help manage patient's treatment. In my current

03:35 position, I am an associate professor and the director of molecular pathology at Cincinnati

03:41 Children's Hospital. My lab is a clinical lab that is under the division of pathology. And

03:48 we do a lot of work that pertains to kids in terms of helping them diagnose and manage

03:57 pediatric cancer, as well as infectious diseases that happen in this age group.

04:04 Molecular pathology, I essentially trained back in India as a physician, did my MD,

04:13 and then came here, started in Pittsburgh, did my training in pathology and lab medicine. I specialized in molecular pathology. Then I was there in

04:23 Pittsburgh. I worked for some time and then moved to Cincinnati Children's.

04:26 Excellent. Do you work directly with patients or do you get samples sent to you from other

04:33 doctors and then you process them and analyze them?

04:36 Yeah, that's a good question. So I do not work with patients directly. It's a kind of,

04:41 it's a subspecialty in medicine where my lab works with the samples that have been collected

04:48 from the patient, either from the OR or a procedure or from the radiology suite. And then we work on

04:55 that tissue or the blood sample or a bone marrow sample that comes to us. And yes, then all the

05:02 testing that we perform is off from that specimen. And then once we generate the clinical reports

05:08 back, they go back to the patient's chart and to the patients, to the clinicians who are treating

05:15 and managing them. And that way it helps them get a diagnosis and then

05:21 give the appropriate treatment and management to the patient.

05:24 Yeah, excellent. So yeah, you must see a lot of different stuff flying through the lab you have

05:30 to analyze. So how did you go from, I'm studying medicine to I'm writing Python code and running

05:38 automation and what was that process like? Well, that was an interesting journey for me.

05:45 So before medicine and biology came into my life, I started off, it was second grade, I believe,

05:54 when my dad, he got me a computer at that time, which is a 64 kilobyte small machine.

06:02 I think it was a Toshiba MSX computer where you could write GW-BASIC code and

06:09 some basic predefined hex code and you can run small applications on that. That was my starting

06:17 point. It was super exciting for me. And I think from there on, the journey went to, as I went

06:24 through high school and then college, medicine was, I would say, biology was something that

06:34 intrigued me. And at the same time, I also got interested in genetics, looking at DNA sequences.

06:42 And I had a natural liking for the fact that I could study about the cell or the genome, the DNA

06:55 and RNA. But also I realized that there was a lot of math and computation that you can use to slice

07:01 and dice data. And at that time, the place where I grew up in India was a very small place. So we didn't

07:06 have access to resources like internet. So my exposure to internet was when I actually went to

07:13 med school for the first time. Oh yeah, you can actually connect to other computers. So that's

07:19 when I started my medicine. So obviously I did my training in medicine in medical school back in

07:26 India. That's when I started to connect and talk to a lot of people. And some of my friends who

07:31 actually were already writing apps at that time using Java applets and on browser. And so I started

07:40 to make some connection in terms of images and learning how some of those things can be used in

07:46 medicine. And so radiology, during my radiology rotation, that was my first real life, that was

07:54 my realization that actually in medicine, you can use computers a lot to handle a lot of these

07:58 images, x-rays, CT scans. And I think as it went on, when I came to the US, that's where

08:04 it really started off. During my residency in pathology here, I actually connected with my

08:12 mentors, Dr. Anil Parwani and Dr. Liron Pantanowitz. They are well-known pathology informaticists.

08:22 They've spent a lot of their time sort of dwelling in the world of pathology, medicine, and

08:28 computer science. And so that is when I could actually realize that, yes, you can do a lot of

08:34 innovative stuff by developing apps, algorithms, analyzing either image data or molecular data.

08:42 And so that is when I started to get into designing an app, which was a very simple web app that was

08:51 a project I was working with one of my mentors. And so his idea was that we had a lot of these

08:57 pathology images and we wanted to create a little in-browser app that would display these images as

09:03 thumbnails, and then clicking on that could enlarge the image, show that on the display.

09:07 And so I used, at that time, it was the .NET Framework with ASP.NET. And so I created a little app using

09:16 Visual Basic. Slowly I then migrated to using C# in the same environment. And that time,

09:23 I started my advanced fellowship training in molecular pathology. That's when I started there.

09:33 That's when I realized there's a lot of genomic sequencing data where essentially you're dealing

09:37 with a lot of strings and numbers. And you have to make a lot of sense in terms of this large

09:42 volume of data that comes in. - So the kind of data that you're working with

09:47 for, say, this genetic stuff. - Yes. - When you're studying the genomics, how much

09:55 data is in say one strand of DNA? How much of that do you actually care about? Like, give me,

10:00 give us a sense of sort of how much data we're talking. - Right. So it really depends on what

10:06 has been done. And so when we look at, so when we talk about genomics, it is really designed on how

10:14 the experiment is done. So for example, if we just simply look at the entire human genome, we are

10:20 talking about 3 billion alphabets. Essentially it's the combination of four alphabets, A, T, G,

10:30 and C. So these are the four nucleotides of the DNA sequence. And the RNA has one additional one

10:36 which replaces T. But the idea is that it's a mix and match of these sequences. And so if you think

10:41 about the entire human genome as a single thread of A, T, G, and Cs in various combinations,

10:48 you're looking at 3 billion alphabets. And so what happens is when we do these sequencing

10:54 experiments where you would take the DNA molecule from a bunch of cells within a tissue, and then

11:02 either we read all the 3 billion base pairs. And typically the way the sequencing is done is you

11:08 read all of these sequences from many molecules. And so you'll have multiple copies of that when

11:16 you're translating from a molecular, like a chemical molecular structure to a DNA sequence

11:24 on say a flat file in a file system. So if you look at that large scale of data, like the entire

11:31 genome, we are talking of hundreds of gigabytes, maybe even terabyte worth of data. Then there are

11:38 other more practical approaches when we look at the genome. And especially this is something that

11:43 we use for day-to-day patient care, which is referred to as targeted sequencing. What that

11:48 means is instead of the 3 billion base pairs, we focus on those regions of the human genome

11:54 that are of most pertinent use, or that we at least as a current field of genomics, that we

12:01 understand what to do with. And so there are certain genes that at least in the space I work

12:07 with, cancer genomics, that are, I would say close to about maybe a thousand to 2,000 genes,

12:16 which are known to be cancer associated. And of that, roughly about 500 to 700 genes are where

12:23 we know that they have been studied and demonstrated that there are certain types of

12:28 abnormalities in those genes in terms of the sequence changes, that they have certain meaning

12:33 in context of tumor in order to make a diagnosis or to understand if the tumor is aggressive or

12:42 benign, or if there are certain treatments that could be applied to those tumors. And that's

12:47 specifically linked to the kind of sequence change you see in that region of the genome.

12:54 And we are talking about, practically speaking, when we talk about the targeted testing that we

13:01 do, it's a very small fraction of the large genome. Typically there's a term known as exome

13:07 sequencing, and exome sequencing refers to sequencing all those regions of the human genome

13:13 where it at least encodes for one gene or another. That is typically about 1% to 2%

13:21 of the entire genome. And so if we further narrow it down to about, say, 500 to 600 genes,

13:25 that one would typically sequence for practical cancer molecular testing. I would say that's

13:34 probably about a tenth of a percent, or maybe slightly less than that, of the genome, but it's a very high

13:39 yield from a clinical standpoint, because the chance of finding an alteration that would help

13:45 with the clinical treatment is high. So if we're going to talk about that dataset, it's complex in

13:52 a different way because just looking at the raw sequence data would be somewhere in, I would say,

14:00 one to 20 gigs from a single sequence file, but it entirely depends on how deep we go.

14:08 So for example, when we talk about sequencing, as I mentioned before, when we sequence a molecule,

14:13 we can sequence it either at certain depths, that means what level of redundancy you want

14:20 to be able to read that molecule. Sometimes we read the molecules 20 to 30 times, so that's

14:27 referred to as 30x, or sometimes we'll read that 500 times, so that will be 500x.

14:32 >> You do that because you want to make sure you don't misread the gene?

14:38 >> Yes. So, right. So what happens is the greater the depth of sequence, so typically for

14:43 such large panels that we sequence in a clinical setting, we usually target about 1500x to 2000x,

14:52 that means we're reading that 2000 times. So the more the depth it is, the possibility of identifying

14:59 a certain variation or genomic alteration that is present at a very low level. For example, say,

15:03 you have a tumor cell and within that, only 2% of the cells have this mutation, others don't.

15:10 And so when you're looking for or hunting for these needles in a haystack, you really want to

15:16 maximize the amount of depth you have to be able to pick those things up. So it really depends on

15:21 how deep we go. The more deep we go, the more data it is, and so it can scale up to almost

15:26 several hundred gigabytes. >> Sure. Yeah, I've always wondered about how you can go and read somebody's genetics and not make a mistake when you're

15:35 using chemicals to do the reading. And it's really ridiculous how much data is there.

15:41 Being off by one, a C for a G or whatever, is a bad thing, right?
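
To put rough numbers on the scale Somak just described, here is a back-of-the-envelope sketch. The panel size, depths, and bytes-per-base figure below are illustrative assumptions, not his lab's actual configuration.

```python
# Rough, illustrative arithmetic for how sequencing depth drives data volume.
# All panel sizes and depths below are assumptions for the sake of the example.

GENOME_SIZE_BP = 3_000_000_000        # ~3 billion bases in the human genome
TARGETED_PANEL_BP = 2_000_000         # hypothetical targeted panel, ~2 Mb of regions

def raw_bases(region_bp: int, depth: int) -> int:
    """Total sequenced bases needed to cover a region at a given average depth."""
    return region_bp * depth

# Whole genome at 30x vs. a deep targeted panel at 2000x.
wgs_bases = raw_bases(GENOME_SIZE_BP, 30)
panel_bases = raw_bases(TARGETED_PANEL_BP, 2000)

# Very roughly, FASTQ stores a base plus a quality score (plus headers),
# so call it ~2 bytes per base before compression.
BYTES_PER_BASE = 2

print(f"WGS @ 30x    : ~{wgs_bases * BYTES_PER_BASE / 1e9:,.0f} GB uncompressed")
print(f"Panel @ 2000x: ~{panel_bases * BYTES_PER_BASE / 1e9:,.1f} GB uncompressed")
```

With these assumed numbers, a whole genome at 30x lands in the hundreds of gigabytes while a deep targeted panel stays in the single-digit gigabytes, in line with the ranges mentioned above.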

15:46 >> Right. But it is, I think as the technology has matured, there's not a hundred percent in terms

15:56 of the error profile for the enzyme that has been used to work, the technology that is reading the

16:01 actual fluorescence converting that to signal. There's always statistical values and probabilities

16:07 that are associated with what is the probability that it is wrong or incorrect or correct.

16:12 But within that frame and where the current technology is, it's pretty accurate for,

16:18 if not all, many of the regions of the genome. And so it's mind-boggling how it works.

16:23 >> Yeah, it really is quite amazing. It's one of the modern marvels of science for sure.

16:29 >> It is, it is. >> This portion of Talk Python to Me is brought to you by Sentry. Code breaks, it's a fact of life. With Sentry, you can fix it

16:38 faster. As I've told you all before, we use Sentry on many of our apps and APIs here at Talk Python.

16:44 I recently used Sentry to help me track down one of the weirdest bugs I've run into in a long time.

16:50 Here's what happened. When signing up for our mailing list, it would crash under a non-common

16:55 execution path, like situations where someone was already subscribed or entered an invalid email

17:01 address or something like this. The bizarre part was that our logging of that unusual condition

17:07 itself was crashing. How is it possible for a log to crash? It's basically a glorified

17:13 print statement. Well, Sentry to the rescue. I'm looking at the crash report right now,

17:18 and I see way more information than you would expect to find in any log statement. And because

17:23 it's production, debuggers are out of the question. I see the trace back, of course,

17:28 but also the browser version, client OS, server OS, server OS version, whether it's production

17:34 or QA, the email and name of the person signing up, that's the person who actually experienced

17:39 the crash, dictionaries of data on the call stack, and so much more. What was the problem?

17:44 I initialized the logger with the string info for the level rather than the enumeration dot info,

17:51 which was an integer-based enum. So the logging statement would crash, saying that I could not

17:56 use less than or equal to between strings and ints. Crazy town. But with Sentry, I captured it,

18:04 fixed it, and I even helped the user who experienced that crash. Don't fly blind. Fix

18:09 code faster with Sentry. Create your Sentry account now at talkpython.fm/sentry. And if you

18:15 sign up with the code TALKPYTHON, all capital, no spaces, it's good for two free months of Sentry's

18:22 business plan, which will give you up to 20 times as many monthly events as well as other features.
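
To make the logging bug described in that segment concrete, here is a small hypothetical reconstruction. The actual Talk Python logging setup isn't shown in the episode, so the tiny logger class below is purely an assumption used to reproduce the same class of error.

```python
# Hypothetical reconstruction of the bug described above: a logger whose level
# should be an integer enum but was accidentally initialized with the string
# "info". The comparison then raises at log time, not at setup time.
import enum

class Level(enum.IntEnum):
    DEBUG = 10
    INFO = 20
    ERROR = 40

class TinyLogger:
    def __init__(self, level):
        # Bug: nothing validates that `level` is actually a Level member.
        self.level = level

    def log(self, level: Level, message: str) -> None:
        if self.level <= level:          # crashes if self.level is a str
            print(f"[{level.name}] {message}")

log_ok = TinyLogger(Level.INFO)
log_ok.log(Level.ERROR, "this works")

log_bad = TinyLogger("info")             # the accidental string level
log_bad.log(Level.ERROR, "boom")         # TypeError: '<=' not supported between str and Level
```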

18:26 So I think you're a little bit unusual, a little bit weird in the sense that the first

18:34 sort of programming thing you brought to the science and medicine side of things was

18:40 C# or VB.NET, rather than something like Python or R or something. So maybe talk a

18:51 bit about that experience, contrast it with Python. Why do you end up moving to Python?

18:56 Yeah. So I think the reason I started using VB.NET C# was I would say most,

19:04 it was probably influenced a lot by at the time when I was doing my med school in India. What

19:12 was available at that time, it was not something I would just go to the internet and start getting a

19:16 lot of resources as one would do now. So it was pretty much like, this is the book I have available

19:21 and that's the only thing. So you start. But the thing is when I started applying C# and it was

19:32 mostly C# and a little bit of C++ and I started to get into like with some of the non-genetic stuff

19:38 initially, well, the project I'm working on, it was not too bad because I was able to accomplish

19:44 most of the tasks. But then once I got into genomics and I came, so the way professionals

19:51 who get into genomics and molecular pathology, there are a couple of different routes. So

19:58 either the physician, people who are physician trained and they have a formal background in

20:04 medicine and they do a specialized training and then they become molecular pathologists

20:09 after getting board certified. There is the other route, which is more of a research background,

20:14 where people have spent a lot of their time in really deep research. They've learned a lot of

20:20 genomics hands-on, either from a computational background or from a more laboratory, like a

20:25 wet laboratory background. And so they've obviously done their PhDs and postdoc training and then

20:32 sort of come into the molecular pathology field. People starting there tend to have more of a

20:40 formal computational training. So they're getting, they usually get, obviously when you start with a

20:45 research lab, R, Python are sort of like the most common tools that are used for

20:49 any kind of data analysis and data visualization. Coming from more of a formal medicine background,

20:56 and typically when we get training in clinical informatics or pathology informatics,

21:03 often it is very, I would not say corporate based, but very formal application development space.

21:11 So it's a lot of Windows based, .NET, C#, C++, that kind of thing.

21:17 Standard enterprise stack. Yeah. Java or .NET is a perfect choice. Yeah. Okay.

21:23 In bioinformatics, at least in genomics bioinformatics, the ecosystem of tools available,

21:28 it's a mishmash of everything. For anything which is very computationally intensive, like when you're

21:34 trying to align sequences to the human genome, those are very intensive tasks. And typically

21:40 it's a lot of C, C++, Java, that's involved in some of these very mainstream tools that are

21:49 available. More recently, I think we are seeing Rust coming into the picture as well. There's

21:54 some Golang applications. And then of course, Python and R are the predominant, I think,

21:59 tools as the programming language that are used to solve all of these problems.

22:04 So when I started my molecular pathology fellowship and I got into, now I had to do

22:10 this project that involved manipulating all the sequence data to a point where we would be able

22:17 to develop an application that would help, it's a web-based application that could help

22:22 for other pathologists and faculty to read that sequencing data and digest it in a way that's

22:30 easy for them to look at it rather than going to the Linux terminal and opening up raw files and

22:35 things like that. So I used, that was my first project was to use C# in that context. But I

22:41 quickly realized that there was a lot of these algorithms that were natively either written in

22:45 R or Python and then having to incorporate those functionalities was not as easily possible.

22:51 So I had to rewrite a lot of those things, which were primarily in Python, in C#. It was a good learning curve,

22:57 but I think from a maintainability perspective, it was getting really difficult. And so that's

23:01 when I realized that a combination of Linux and Python was what I had to move towards.

23:08 Yeah. C# probably from the timeframe that you're thinking about, didn't really have a great package

23:12 manager story, not to the same degree that Python does. Although they do pretty good now over in

23:18 the .NET land. Right. Yeah. All right. So a good question from Chris in the audience says,

23:24 is there a reason to use Python specifically? Like, are there some special sauce packages

23:28 that make it attractive? It sounds like that's kind of what you were getting at. Like you found

23:32 more solutions to these algorithms, you know, available in Python than in C#.

23:37 Yeah. Whatever languages. Yeah.

23:39 Right. So I think, I mean, I think the simple answer is yes. I think the community and the

23:45 amount of work that has been done in this particular space with genomics, I mean, when you

23:49 are really searching for applications, it kind of falls into these three categories of, you know,

23:54 anything which is a high performance compiling program that is usually in the rust, C++, C,

24:01 a lot of those languages, a little bit of rust and Java. And then the other bin is essentially

24:07 kind of, you know, split up into Python and R. I think for me, Python was, and I think I'm sure

24:14 others have shared the same way where it's almost like, wow, this is amazing. Like coming from C#,

24:19 it was a little bit of a change because there's no more, like, you know, curly braces.

24:23 No curly braces to think about, and all those things. You don't miss your semicolons.

24:29 Kind of. Like even now, sometimes when I write like a little bit of JavaScript,

24:34 I feel like, oh, yeah, okay, this semicolon is here.

24:36 Exactly.

24:37 The curly braces. But, you know, not too bad. I think what I got onto was like the simplicity

24:43 of the language and how powerful it was when, like, if I'm thinking about, you know, it was

24:48 interesting when I had to do something like there was an algorithm where I had to parse out certain,

24:55 you know, strings in a way where it required some known workflows that we used to do, like,

25:05 variant annotations when we're cross referencing databases and putting them together. You know,

25:10 when you look for in terms of C# packages, there's really nothing there for it natively to do it. So

25:16 you have to write a lot of those things. In Python, the amount of time that is spent

25:20 in developing those things is much faster. Like the development time itself is quick because

25:26 you either get an idea of somebody who's already done the work or there's a more

25:30 formal package that you can use. So I think initially when I started off,

25:34 BioPython was a very interesting collection of packages. It was like a tool suite essentially

25:41 written to, you know, to have all these functions available for very common day to day tasks.

25:47 You know, I want to query a certain region of the BAM file or I want to parse out certain things in

25:53 the FASTQ file to look at some of the sequences or doing, you know, counting number of sequences

26:00 in a given file and, you know, getting read counts, things like that was, it's all out of

26:05 the box. And so that was sort of like the first thing to go, "Wow, this is amazing." I mean,

26:09 somebody's already done the work and we're just using it on top of it.

26:13 Yeah. Instead of creating these, I'll just like use those. Perfect.
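
As a small illustration of the kind of out-of-the-box tasks Somak mentions, the sketch below counts reads in a FASTQ with Biopython and pulls the reads overlapping a region from an indexed BAM with pysam (a companion package commonly used alongside Biopython for BAM access). The file names and coordinates are placeholders.

```python
# Minimal sketch of two routine tasks mentioned above. File paths and the
# genomic region are placeholders; Biopython handles the FASTQ parsing and
# pysam (a separate, commonly used package) handles the indexed BAM query.
from Bio import SeqIO
import pysam

# 1) Count the reads in a FASTQ file.
read_count = sum(1 for _ in SeqIO.parse("sample_R1.fastq", "fastq"))
print(f"Reads in FASTQ: {read_count}")

# 2) Fetch aligned reads overlapping a region of interest from a sorted,
#    indexed BAM file (coordinates here are arbitrary examples).
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    region_reads = [read for read in bam.fetch("chr7", 140_753_300, 140_753_400)]
print(f"Reads overlapping region: {len(region_reads)}")
```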

26:17 Right. So that was one. And the other motivation to use Python was, you know, say for example,

26:24 why not R? Why Python? Because R offers a very rich ecosystem in, you know, at least in genomics

26:29 and visualization. So I think the second thing was in terms of the idea that I was working on was

26:37 having to develop a web application and all of these bioinformatics, you know, tooling and

26:42 algorithms running sort of in the back. And so at that time it was like, okay, well, you know,

26:46 Python, I've not heard much about in terms of web application. Mostly it was, you know, again,

26:51 this big, like, you know, C#.net. That was why I started off, you know, with that. But then at that

26:57 time, you know, there was Django and then Flask was sort of coming in. It was a very minimalistic,

27:02 you know, sort of application. So I started focusing on that. It was very easy with Flask to,

27:06 you know, get up and running with very simple, you know, applications to do that. I didn't try

27:12 much into Django just because it was too bloated for me. But, you know, Flask was great. And then

27:18 what I realized was you can create a simple web application, but then at the same time,

27:22 you can use all your, you know, Biopython and all the wonderful bioinformatics packages in the

27:27 backend. So it's like a single language that lets you do both. And so this is great. It was just,

27:33 I don't have to go anywhere to learn, you know, a third or a fourth or fifth different programming

27:38 language. And this just gets the job done. Yeah. Keeping in mind that your actual

27:42 main job is medicine, not programming, right? You're not a CS person. It's a lot to ask

27:48 to go learn all the languages, right? Right, right, right. So that definitely is,

27:52 again, that's a huge, you know, I would say I'm, again, as I said, I'm in a, I'm sort of

27:59 in an unusual position where I'm, you know, a physician, but I also do a lot of these

28:03 application developments. So that certainly is an important point in terms of how much time I have

28:09 to be able to develop these prototypes. And then obviously, you know, typically the way it works is

28:13 at least right now here, you know, where I am currently working, I have an excellent and amazing

28:19 team of developers and bioinformaticians who really do a lot of the development work on the front end,

28:25 back end. And so for me to be able to take additional time out of my, you know, the clinical

28:31 and the patient care work is limited. So if I can get whatever prototype I'm thinking of,

28:36 we're developing the application fast, then, you know, that's, that's what I'm going for.
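
A minimal sketch of the kind of quick Flask prototype described a moment ago: a single-file web app wrapping a Biopython task so colleagues can use it from a browser instead of the Linux terminal. The route, sample naming convention, and file location are invented for the example.

```python
# Minimal, illustrative Flask prototype of the pattern described above:
# a tiny web endpoint wrapping a Biopython task so non-programmers can use it
# from the browser. Route names and file locations are invented for the example.
from flask import Flask, jsonify
from Bio import SeqIO

app = Flask(__name__)

@app.route("/read-count/<sample_id>")
def read_count(sample_id: str):
    # In a real lab system the sample ID would be looked up in a LIMS;
    # here we just assume a simple on-disk naming convention.
    fastq_path = f"/data/fastq/{sample_id}.fastq"
    count = sum(1 for _ in SeqIO.parse(fastq_path, "fastq"))
    return jsonify({"sample": sample_id, "reads": count})

if __name__ == "__main__":
    app.run(debug=True)
```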

28:40 And so you can hand it off to the team and let them polish it up and productize it,

28:45 make it production ready, basically. Yeah. Yeah. I was wondering how much time of your,

28:50 your job do you get to spend on these kinds of things, you know, finding new packages,

28:55 optimizing or improving the ways that you're working on stuff versus just sort of handing

29:00 it off to the folks you work with and, and keeping, you know, focus more on the medicine side.

29:05 Yeah. So it you know, that, that I think it's a good question. I think it's, it's evolved over

29:12 time as I've been, you know, being sort of, you know, when I was in training and then being a

29:16 faculty and then, you know, faculty in this new position you know, one of the things I did was

29:22 as part of my certification was to, you know, to get board certified in clinical informatics.

29:27 That's a discipline by itself that, you know, involves a lot of, you know, it's a very broad

29:32 field in terms of informatics and healthcare. And then one of the buckets there is, you know,

29:37 software development. And so I was you know, I was quite interested sort of in that field. And so

29:44 most of my time in terms of being able to devote to, you know, finding new packages or trying to,

29:52 you know, write up an application that could solve a problem or coming up with prototypes.

29:57 It was done in a way that sort of aligned with the work I was doing. And so it would be days

30:03 when I'm on clinical service where I'm mostly, you know, working on sort of with, you know, with

30:07 patient care related matters. So those weeks would be, you know, obviously very busy. I would have,

30:12 you know, I would wake up at like extremely early in the morning, spend the first two hours,

30:17 four to six a.m. just, you know, working on this. And then I get back to like, you know,

30:20 the clinical work. And then there would be weeks when I'm off clinical service. So I,

30:23 you know, I'm not responsible for any patient care related work. And those weeks would be

30:28 where I would spend time in terms of, you know, doing these, you know, investigating into sort

30:34 of some of these packages and, you know, coming up with new ideas, exploring what is all, you know,

30:39 what is available in terms of certain problems that I was solving. And, you know, that time

30:43 sort of, you know, my quote, protected time professionally was spent in that. And so that

30:48 would be, you know, maybe a week spent into like, hey, we are trying to look into this

30:53 variant annotation tool. And then we want to, you know, write wrappers around it. So it becomes

30:58 easy for, you know, our labs operation to be able to use that. And so, so kind of that,

31:04 that's how it works. So some of those either early mornings or, you know, the weeks I'm off

31:08 clinical services and how, how that works. This portion of talk Python to me is brought to you

31:15 by Posit, the makers of Shiny, formerly RStudio, and especially Shiny for Python.

31:22 Let me ask you a question. Are you building awesome things? Of course you are. You're a

31:26 developer or data scientist. That's what we do. And you should check out Posit Connect. Posit

31:31 Connect is a way for you to publish, share and deploy all the data products that you're building

31:36 using Python. People ask me the same question all the time, Michael, I have some cool data science

31:42 project or notebook that I built. How do I share it with my users, stakeholders, teammates, I need

31:47 to learn FastAPI or Flask or maybe Vue or React.js. Hold on now. Those are cool technologies,

31:54 and I'm sure you'd benefit from them, but maybe stay focused on the data project. Let Posit Connect

31:59 handle that side of things. With Posit Connect, you can rapidly and securely deploy the things

32:03 you build in Python, Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards,

32:10 and APIs. Posit Connect supports all of them. And Posit Connect comes with all the bells and

32:16 whistles to satisfy IT and other enterprise requirements. Make deployment the easiest step

32:22 in your workflow with Posit Connect. For a limited time, you can try Posit Connect for free for three

32:27 months by going to talkpython.fm/posit. That's talkpython.fm/POSIT. The link is in your podcast

32:35 player show notes. Thank you to the team at Posit for supporting Talk Python.

32:39 Is it changing fast? So I'll give you an analogy that you could tell me about your space. So

32:45 on one hand, in Python web world, you mentioned Flask and Django. Flask and Django, while they

32:51 are evolving, they're kind of the way they have been and they're pretty stable. And if you learned

32:56 Flask five years ago, you're still good to use Flask today. Or is it more like FastAPI, Pydantic,

33:04 msgspec? There's something new all the time that you got to keep learning to bring in. Are

33:10 there a ton of new packages just coming online or is there a set of really solid ones?

33:15 So I think it's both yes and no. And so it depends on what area we're working on. So right now in the

33:26 clinical lab that I'm directing here, when I came here in 2020, it was when we started off from

33:32 scratch. So essentially, the idea was to be able to bring up a pediatric cancer sequencing

33:38 infrastructure that was not available. And so it was ground up from the lab to personnel to space

33:43 to competition and such. And so we kind of have these sort of two big bubbles in that operation

33:49 from an informatics perspective. One of them is we essentially are in the process of developing

33:56 our custom lab information system. That's essentially a web app. And so we have that

34:02 space and the other space is bioinformatics. And so bioinformatics is a lot of the custom

34:08 scripting or the applications we develop is Python based. Some of them we do with Golang

34:13 when we need a little bit of performance aspect. And then the other aspect is the web app.

34:20 So from a web app perspective, when I started here, we actually started, we use FastAPI and

34:26 Python. So that's kind of, that was, so the idea was that, well, since you're starting from scratch

34:31 and I came to know about FastAPI at that point in time. The whole approach was something

34:36 I was pretty much sold on. And then, I think the whole tool

34:42 made a lot of sense. I'm like, okay, well, this is perfect. I think when

34:46 I started, FastAPI was at about five or six. And so now obviously you can see a lot of change

34:53 happening there. So yeah, that definitely is a fast pace. And so we kind of do catching

34:59 up, in a sense, but it has to be done in a careful way. The reason is because, as compared to

35:08 more traditional research lab testing where at the end really, there's a lot of discovery,

35:14 there's a lot of excitement at the end, it all translates into being sort of, the data is

35:20 presented at a conference or you publish that as a manuscript and that's the end point. So if you

35:25 move off from version one to version three of an algorithm, you have to obviously make sure that

35:31 your research, everything is reproducible, but beyond that is not a problem. But when we're

35:35 talking about the same thing in context of a clinical care for a patient, the room for error

35:40 is very, very little. You can't make mistakes. And so the entire space of clinical testing is

35:46 very regulated in that sense, because there's a lot of requirement that you have to perform that

35:51 any change that's happened in your pipeline, say you're using some version of an application,

35:56 now you upgrade to a newer version. You have to demonstrate that the analytical performance in

36:01 terms of sensitivity and specificity for that pipeline didn't change. And so a lot of work is

36:07 needed when you go do a version upgrade. So we keep those things very controlled and careful

36:13 versus some other things which are more in the R&D space. There's a little bit more room to

36:19 play around with tools. Right. Yeah. Chris was asking an audience a great question about

36:24 basically, is it more exploratory? You just move really fast and don't really worry about

36:27 tests and stuff like that. It sounds like this is more of a production type thing. Like if it,

36:33 you're going to run it over and over. And if it gives a different answer at some point for

36:37 testing for a disease or something, that's really bad. You need it to be right all the time. And so.

36:41 Yes. So the room when we do new test developments or we bring a new algorithm, obviously that

36:48 part of which we refer to, there's a formal term that we use in lab medicine called

36:52 familiarization and optimization, or the O&F phase. That's where, you know, there's a lot of flexibility,

36:58 new tools, new version, trying out different things. But once it moves from that into the

37:02 validation phase, and then once we deploy the application, once the deployment is there,

37:06 it's a production application. We don't touch it unless something really has to be tinkered with,

37:12 or there's a bug that we have to fix. Who's in charge of running those apps? Is that people

37:16 on your team and your lab or is that the hospital or how? Yeah. So the way it's set up here is,

37:23 so when I started off, I was an n of one. So I started off with the FastAPI application.

37:29 I had to build up the, we had a bioinformatics pipeline that I had initially authored. But

37:36 then when we went through the validation phase, I luckily had two people on staff who kind of

37:42 were handling the bioinformatics on the front end. And then eventually we had a third person who

37:48 joined the team. So then they were kind of helping me out with a lot of the actual groundwork of

37:55 writing the code, getting tests done, going through the validation data, summarizing that for me.

38:01 Being a lab director, it is my responsibility ultimately to sign off on all those things,

38:05 say, "Hey, okay, this is the validation and this is what is being demonstrated that

38:10 your package or your pipeline or whatever you're working on demonstrates this level

38:15 of sensitivity." Then yes, I, being the lab director, say that, yes, this is working.

38:19 And once that happens, so we then deploy those applications in production. We use GitHub and

38:25 follow the usual dev, test, prod cycle. And so that's kind of how it works.

38:30 Well, do you have your own hardware or do you have stuff like on DigitalOcean or AWS?

38:36 So with healthcare data, there is generally a little bit of angst with data sitting on the cloud

38:43 outside the institution. I would say the institution that I work at is,

38:48 in that way, quite forward thinking in being able to use modern technology.

38:56 So what we started off and since it was everything being built up from scratch,

39:03 we had taken the decision to keep things on-prem for beginning. But we also kept in mind that

39:10 at some point of time, if the institution decides that, "Oh, we're going to switch our

39:14 infrastructure to using AWS or Azure or whatever the platform is going to be," that we want it to

39:21 be ready. And so the way we had it set up, and this is due to our amazing IS team here at our

39:30 institution. So we had our own hardware that we got in terms of the actual servers. And we

39:40 collaborated with the IS team to be able to help us build our Kubernetes infrastructure.

39:45 So we have a test and a prod Kubernetes cluster, and then all our apps and the bioinformatics

39:54 pipeline. Well, the apps for now and the bioinformatics pipeline that we're looking

39:58 forward in the near future to get deployed on these things. As a matter of fact, what we've

40:03 done is our dev team, we start to, we do a lot of the development on Kubernetes as well. And then we

40:09 keep moving all these things as containerized applications.

40:12 That's excellent. So really embracing containers and Docker and Kubernetes, and that should make

40:17 it super easy to move to wherever you want to go, right? Anything that can run Kubernetes,

40:21 you just push to that and you're good to go.

40:24 Right. Right. I mean, it's a little bit difficult to start with, but I think once

40:29 we are in that stream, it is much less effort to move things around.

40:35 Yeah. Last year I rewrote all of our servers and APIs and condensed six to eight servers all into

40:42 one just Docker cluster. And it was a great decision, but to me, it was also a little

40:46 intimidating, like, well, here's one more layer I have to manage and understand. And if

40:51 something goes wrong there, then everything else still breaks. But having it set up is really nice

40:55 once you get used to it. All right. Let's talk a bit about, well, I have a question for you. I

41:00 want to talk about some of these packages that you've been saying are like a lot of the reasons

41:05 you chose Python and you use a lot, which is great. But before we get there, like I got the

41:10 biopython.org website pulled up. And the very first line is, "Biopython is a set of freely

41:16 available tools." You know, open source, freely available. How much does that matter to you guys?

41:22 On one hand, you have a ton of money being in the medical space. It's really high stakes. So

41:29 paying for commercial software or commercial libraries is probably not the biggest worry.

41:34 On the other hand, open source is really nice. Being able to look inside is really nice.

41:38 Free means you don't have to deal with getting permission. How does that fit into your world?

41:44 I know how it fits into like small startups and things like that, but for a hospital, for example,

41:49 what does free and open source mean to you guys? I think it does have a lot of impact in terms of

41:55 how we end up working and setting up these things. And obviously, whatever I'm speaking

42:04 is representing what it means from sort of an operational standpoint. When we talk about

42:11 molecular pathology, generally being able to bring up a clinical service like that is a huge investment.

42:18 And so a lot of the investment is... And this is generally applicable to any institution where

42:24 something like this has been set up for patient care or clinical use.

42:29 The investment is primarily in a lot of the instrumentation and the reagents that we use

42:40 are generally quite expensive, which is sort of the... I would say when we talk about what is the

42:47 cost of a test when it's offered, that cost factors in a lot of these operational costs

42:54 that we need to buy these expensive sequencing instruments, the reagents that are used as

43:00 consumable as we do the test over and over again, every week, every month.

43:06 So from that standpoint, traditionally, the way things have been designed is,

43:11 I would say 10 years back when we would work with our finance team to say, "Okay,

43:18 the cost of the test is going to be so and so based on all of these different inputs." And so

43:23 10 years back, computation, bioinformatics, all of these were not factored in at all.

43:30 But now as we are in that era where using GPUs on a regular basis to be able to do simple...

43:38 I would not say simple, but routine work to get from the raw sequence data to be able to identify

43:45 genomic variants, that's getting common. Using FPGAs, using large clusters to be able to perform

43:51 these tests. And so now we are starting to see those costs getting in as part of the ultimate

43:57 cost that goes to the patient for a test. And so we try to minimize those things.

44:01 One of the ways to be able to minimize those things is to be able to choose between

44:04 free open source versus something which is a commercial product. And it's always a balance

44:08 between the reliability and the service that you're able to get back saying, "Hey, something

44:13 breaks down. We know there's an SLA. There's a certain assurance that this thing is going to

44:20 have help," versus open source free would be where we feel very confident in the code base.

44:27 Sometimes what happens is when we use some of these open source tools, we end up almost

44:34 invariably having some wrapper around it to change things or being able to have some insight

44:39 into the source code. So it depends on that balance, what we choose.

44:44 How often do you fork it and use your self-maintained version versus just run what

44:51 is publicly on PyPI and then maybe wrap it to orchestrate it a bit?

44:55 So I would say for the web application part of it, we don't really do a lot of forking. We kind

45:02 of go with what it is. The only thing what we do is since we have the luxury of using a combination

45:07 of GitHub and containers and knowing the fact that the regulatory requirements require that

45:13 you tightly version control all these things with history and all those things, we tend to,

45:17 when we are developing these things, when we are validating it, before we do that, we try to stick

45:22 to a fairly stable version. So for example, things like beta or release candidates, we try to stay

45:28 away from that. Even if they have some desirable features, but unless we see a full production

45:33 version of that, we don't tend to switch to it. So we keep things like that without maintaining

45:38 or without forking or making modifications. When we get into more of the bioinformatics stuff,

45:44 where we are actually trying to use an algorithm to solve a particular piece of

45:49 part of the pipeline that is doing some data transformation, it depends on how much we want to

45:55 change or modify. That's when we sometimes fork it. Sometimes we fork where we know that,

46:02 and this is the unfortunate reality in many of the scenarios, where you have a great open source tool,

46:07 but after some time, due to whatever financial or business or other reasons, they stop

46:12 maintaining it. And so essentially we get into this freeze mode. We tend to fork that so that at least

46:17 we have that available. And then if we make any changes that we keep it to that fork.
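
One simple pattern for the freeze-and-validate discipline described here is to check the installed package versions against the validated manifest every time the pipeline starts. The package names and version numbers below are placeholders, not the lab's real manifest.

```python
# Illustrative sketch: refuse to run the clinical pipeline if any installed
# package drifts from the versions that were validated. The package list and
# versions below are placeholders.
from importlib.metadata import version

VALIDATED_VERSIONS = {
    "pysam": "0.21.0",
    "cnvkit": "0.9.10",
    "hgvs": "1.5.4",
}

def check_environment() -> None:
    """Raise if any installed package differs from its validated version."""
    mismatches = {}
    for pkg, pinned in VALIDATED_VERSIONS.items():
        installed = version(pkg)          # raises PackageNotFoundError if missing
        if installed != pinned:
            mismatches[pkg] = (pinned, installed)
    if mismatches:
        raise RuntimeError(f"Environment drifted from validated versions: {mismatches}")

if __name__ == "__main__":
    check_environment()
    print("Environment matches the validated manifest; safe to run the pipeline.")
```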

46:22 But generally I would say it's probably an 80/20 split, where 20% is where we fork it,

46:30 make some change. Most of the times we try not to do that. But yes, open source,

46:36 free tools have a big impact. A lot of their tools that we use as part of our

46:41 bioinformatics pipelines, as a matter of fact, which is kind of used in the community of

46:44 molecular pathology to build these bioinformatics pipeline. They tend to use a lot of open source

46:51 tools. And the reason for that is, for example, it's not written in Python, but we have,

46:55 there's an algorithm called BWA. It's written by Professor Heng Li. That's one of the algorithms

47:02 that is almost like a de facto, I would say, when it comes to doing sequence alignment,

47:07 that's a part of the pipeline. And so it's a tried and tested application for more than a decade now.
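
As a hedged illustration of how a non-Python tool like BWA typically gets orchestrated from a Python pipeline, here is a sketch that shells out to bwa mem and pipes the output through samtools sort. The paths, sample names, and thread count are placeholders, and it assumes bwa and samtools are installed and the reference genome is already indexed.

```python
# Illustrative wrapper around the BWA-MEM aligner from a Python pipeline.
# Reference, FASTQ, and output paths are placeholders; assumes the `bwa` and
# `samtools` binaries are installed and the reference is already indexed.
import subprocess

def align_sample(ref_fasta: str, fastq1: str, fastq2: str, out_bam: str) -> None:
    """Align paired-end reads with bwa mem and write a coordinate-sorted BAM."""
    bwa = subprocess.Popen(
        ["bwa", "mem", "-t", "8", ref_fasta, fastq1, fastq2],
        stdout=subprocess.PIPE,
    )
    try:
        # Stream the SAM output straight into samtools to sort and write a BAM.
        subprocess.run(
            ["samtools", "sort", "-o", out_bam, "-"],
            stdin=bwa.stdout,
            check=True,
        )
    finally:
        bwa.stdout.close()
    if bwa.wait() != 0:
        raise RuntimeError("bwa mem exited with a non-zero status")

align_sample("GRCh38.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample.sorted.bam")
```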

47:15 So really, this is a fairly stable algorithm and application, and we don't

47:20 tend to touch it; it's well maintained from an open source perspective. So those obviously are ones

47:24 we highly rely on. But there is this whole ecosystem of software that comes under this

47:31 rubric term of variant calling, where we're trying to identify these different variants.

47:34 There's a whole bunch of those and some are fairly well maintained. They are open source.

47:42 Sometimes depending on the context you're using, you need a license if it is used in a commercial

47:48 setting. You don't need a license if it is in an academic setting. Like for example,

47:54 when we do clinical testing in institutions such as where I am right now, that's an academic

48:00 institution. So typically it's not for profit. And so obviously we don't need licenses for that use.

48:09 But once this goes into a pure commercial space where if the lab is doing all of this testing

48:14 for profit, then there's a license requirement. So we see a combination of these things showing

48:21 up. It's actually becoming more common now with open source tools, at least in the

48:28 genomic bioinformatics space. Yeah. Oh, excellent. I think another benefit probably for you guys,

48:33 I know it's a benefit for a lot of organizations is if you use the open source tools and you need

48:39 to hire somebody new, there's a good chance that they have experience already with those tools.

48:44 Whereas if you use something private, expensive, you might have to teach them from scratch what

48:49 the thing is, right? Yes, that is correct. As a matter of fact, it's fortunate that a lot of the

48:55 people who've done a lot of good work and have contributed to the genomics bioinformatics space

49:01 have, the general tendency is whenever we are setting up any sort of pipelines or DNA sequencing,

49:10 RNA sequencing, or more from a research perspective, methylation sequencing, single cell RNA-seq,

49:20 UMI-based error-corrected variant calling, there's a lot of, there's a very thriving open

49:25 source space. And so that really helps with people who come in, even if they're not familiar with

49:32 these tools, it's easy to get familiar with because there's a lot of community backing that up.

49:37 Or as you said, when we hire people who already are coming from a different lab or they've had

49:44 some experience, but they come and say, "Oh yeah, we know how to do alignment, or I'm aware of these

49:49 applications that use that." It is much, much easier from a learning curve perspective rather

49:54 than having to now open up a manual and this will be a proprietary thing that only works here.

50:00 Yes, exactly. Cool. All right. Well, we coordinated a bit on a list of packages that you've used in

50:08 your lab or find really helpful for your work. And maybe we could touch on those just a little bit.

50:13 Yeah.

50:14 Yeah. So CNV kit, genome-wide copy number from high throughput sequencing. I don't know what

50:20 that means, but tell us about it.

50:21 Yeah, yeah, absolutely. So CNV or copy number variation, this is a type of genomic alteration

50:28 where what happens is in a simplistic way, at least when we talk about cancer, the cancer cells,

50:36 sometimes for it to be able to survive, it tries to use different ways of doing that biologically.

50:42 One of the ways to do that involves certain genes that help a cell to grow in the absence of nutrients

50:48 or with very little nutrition. If it has more copies of those genes than normal,

50:54 then, you know, it's like you have more money than expected, you can do a lot of

50:59 things. So typically what happens is in a normal human genome, any cell that we pick up, it will

51:05 only have two copies of the gene. One is coming from your mom, one is coming from your dad.

51:08 In cancer, what happens is in certain scenarios, if a gene that helps with growth of the cell or

51:15 it helps the cell to survive even without signal or nutrition, if it has more copies of that,

51:20 it'll make six, eight, 20, 50 copies, it can survive. Versus there are certain scenarios

51:26 where if there is a gene that is supposed to regulate the cells, so it doesn't go haywire,

51:31 if the cancer is able to delete one of the genes out, then you only have one gene left.

51:36 You knock out a thing and then that protective mechanism is gone. So then the cancer cell can

51:39 easily survive. So what happens is with CNV or copy number variations, the idea is that we use

51:46 the high throughput sequencing data to be able to infer how many copies of these genes do we have.

51:51 Is it more than two? Is it less than two? And so this particular package, it's a very, very

51:56 well-established, well-maintained package in the community that essentially does this thing.

52:04 You give it the sequencing data and define the regions of the genome that you're interested in.

52:10 You can also provide names for the regions, like this region is gene B-RAF or this is EGFR,

52:18 whatever you're interested in. And then what it does is it will do all the analysis to be able to

52:23 tell you that, okay, well, when we are comparing this particular tumor against this reference set

52:29 of 20 normal samples, where we know that you should only have two copies of the gene,

52:34 in this particular tumor, we are seeing there are 50 copies of the gene.

52:37 So it gives you output data, numerically, that can tell you, what it does is it does a

52:46 log2-based transformation of the ratio there. Okay. After all this computation,

52:52 when compared to the normal, this is 50 times more, or this is 20 times the

52:57 expected copies, or it is half of the amount of copies in the case of deletions.
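
A tiny worked example of the log2 copy-ratio convention just described; the copy numbers are illustrative.

```python
# Worked example of the log2 copy-ratio convention described above.
# A normal diploid region has 2 copies, so the ratio is observed_copies / 2.
import math

def log2_ratio(observed_copies: float, normal_copies: float = 2.0) -> float:
    return math.log2(observed_copies / normal_copies)

print(log2_ratio(2))               #  0.0 -> normal, two copies
print(log2_ratio(1))               # -1.0 -> single copy, e.g. a deletion (or one X chromosome in a male)
print(log2_ratio(4))               #  1.0 -> gain, four copies
print(round(log2_ratio(50), 2))    # ~4.64 -> high-level amplification, e.g. 50 copies
```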

53:03 So that's what really it does. And it's written in Python. It has Python dependencies that have

53:13 been written in either C or Python C bindings. But at the end, it gives you that data. And it

53:22 has an internal visualization tool, but I was not very happy with how it was written. So

53:29 I ended up writing a wrapper, which is called CNAplotr. It's open source. It essentially

53:34 uses the end data from CNV Kit, and then it gives you a nice visualization of the copy numbers.

53:42 I think if you go down, if you scroll down, there's my example images.

53:45 - Yeah, you have this on GitHub, so if people want to use this, it's right there, right?

53:51 - Yep, it's right there. So I think at the very bottom of the images, by the screenshots there.

53:56 - Oh, yeah.

53:57 - Yep, right here. So for example, the first image over here, you can see this,

54:01 it's a thin band of all these multicolor things, and each one of them is a single

54:07 human chromosome. So chromosome one, two, three, four, so on and so forth. And if you look at the

54:14 image, the y scale essentially is log two, which is centered at zero. And going up,

54:20 it is one, two, three, and then it's a negative scale on the lower side. So anything going above

54:25 zero means you have more copies than two, going below is less than two copies. And so if you see

54:30 here in this example, the plot here, you see the very end, which is chromosome X, is a single,

54:37 the band over here is lower at negative one. That means this is a male patient with a single X

54:44 chromosome, as compared to females who have two X chromosomes. And so when you look at this plot

54:49 below here, this is actually a plot from a cell line, a tumor cell line that is abnormal. And

54:56 here we see there are two genes which are amplified. One of them is a gene known as TERT,

55:02 and the other gene is MDM2. So these two genes are, again, one of those examples where it gives

55:08 the tumor survival advantage over other. And so you can see here there are multiple copies of

55:13 these genes as compared to the baseline over here. - I see. So that might predict something like the

55:19 how survivable the cancer is. - Yes. Right. So if it is, is it going to be localized,

55:24 say, where it happened, or it's going to like spread to other parts of the body,

55:28 or be difficult to treat, or be resistant to treatment. Yes.

55:31 - So if this is you, you want higher numbers, not lower numbers.

55:34 - It all depends. I mean, certain genes are good genes, for example, if there is a,

55:39 there are certain checkpoint genes, if those numbers, you know, if they have lower numbers,

55:44 you want to have two copies of them, because if that protective mechanism is gone,

55:48 you know, the tumor becomes very aggressive. - I see.

55:52 - So it is all in the context. So if you're looking at the good genes, you want to have

55:56 two copies of the good gene. If you're looking at some of the bad genes, you don't want to have

56:00 more than two copies of the bad genes. - One or zero is better. I got it. I got it.

56:03 Okay. Okay. HGVS. - Yes. This is, again, a wonderful package that was initially, I think it was started by a person named Vishal. He's, I think he still

56:16 maintains it, but there's a lot of like, you know, it's a very well publicly maintained

56:20 open source package. It's a lot of, you know, community involvement in that as well.

56:27 So what HGVS is, it's a nomenclature system for, you know, giving a name to all these variations.

56:35 So when you talk about, I'm not sure if Michael, if you've heard about the term mutation.

56:39 So mutation is a very commonly used term that refers to some kind of abnormality in the

56:45 genome. In this case, so what happens is there are these standards that are, that, you know,

56:52 most clinical labs follow when they're putting all of this information in the patient's report

56:57 saying, okay, you know, this particular tumor has, you know, mutation in BRAF, mutation in EGFR,

57:03 some other gene. And there's a certain way that those mutations are described in terms of what

57:08 sequence alterations happening, say at the mRNA level and what sequence alterations are happening

57:14 at the protein level. So now in your protein, you know, you're missing these amino acids or

57:18 you have excess of these amino acids or something got switched from here to there.

57:22 So there's a formal way of defining that. And the guidelines of the group that defines that is

57:28 referred to as HGVS, Human Genome Variation Society. And so it's a very complicated process

57:34 where you have to do all these translations from the, you know, the genomic scale where

57:38 the numbering system starts from one to like, and you know, whatever the length of your chromosome

57:44 is in terms of ATGCs and each chromosome has a different number. And if you have a certain

57:48 alteration that is happening, say in chromosome seven at this particular position, then you have

57:53 to translate that to the mRNA of that gene and then the protein of that gene. So it's a lot of

57:58 math, a lot of strings involved in that process. And so essentially this HGVS Python package

58:05 provides all of that functionality as a wrapper. You can create your translation. You can essentially

58:10 project the variant from your genomic to the, you know, the mRNA to a protein level or vice versa.

58:17 You can validate things. So we ended up, I actually wrote a paper about this when we,

58:24 you know, we did a validation of how well this particular package works. And so now, you know,

58:29 in the lab that I'm currently in, we implement this thing for generating those nomenclatures.

58:34 So what happens is when we put a report out for the, in the patient's chart and when our, you know,

58:39 say our oncology, oncologist was treating the patient, they want to know, okay, what is,

58:45 you know, what did you identify in this tumor genome? They will read that nomenclature saying,

58:49 oh, okay, this particular change in this BRAF gene, this is significant. I know that there are

58:54 therapies that are out there that we can use to treat this patient tumor. So that's what this

59:00 nomenclature system is about. So it's a very complicated automated system.

59:04 Yeah. And it normalizes it if there's multiple ways to represent it.
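
As a rough sketch of the kind of projection the biocommons hgvs package does, the snippet below parses a transcript-level variant and maps it to protein and genomic coordinates. It needs network access to the public UTA data provider, and the BRAF transcript and variant shown are only example values, not anything from his lab's pipeline.

```python
# Sketch of projecting a transcript-level variant to protein and genomic coordinates
# with the biocommons "hgvs" package. Requires network access to the public UTA server;
# the BRAF transcript/variant shown is only an example value.
import hgvs.parser
import hgvs.dataproviders.uta
import hgvs.assemblymapper

hp = hgvs.parser.Parser()
var_c = hp.parse_hgvs_variant("NM_004333.4:c.1799T>A")  # BRAF c.1799T>A (p.V600E)

hdp = hgvs.dataproviders.uta.connect()
am = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name="GRCh38", alt_aln_method="splign")

print(am.c_to_p(var_c))  # protein-level nomenclature, e.g. NP_...:p.(Val600Glu)
print(am.c_to_g(var_c))  # genomic-level nomenclature on the chosen assembly
```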

59:08 Very, very nice. All right. This one I'm familiar with, OpenPyXL.

59:14 Yes.

59:15 You probably have a lot of data that either goes, comes from or goes, it's shared out into Excel, right?

59:20 Yes. So what we do is we sort of are right now in our lab, we're kind of in this sort of,

59:26 you know, kind of an interim phase where we sometimes use Excel to look at some data. So

59:35 traditionally speaking before, you know, typically any lab that goes from, you know,

59:42 zero to the point where you have a web application that automates everything,

59:45 the intermediate phase is using a lot of Excel. So it's very common in many labs

59:51 to use Excel for a lot of different things, you know, for QC, for charts, for tracking. So

59:56 we use this OpenPyXL for a few things. One of them is when we have a lot of,

01:00:05 you know, the sequencing data that we have to summarize and then generate a QC to be able to

01:00:10 present that to essentially create an Excel document on the fly from the backend to provide

01:00:17 that, you know, whatever data they want to look at in terms of statistics or, you know,

01:00:21 list of variants or some form of, you know, calculation they want to do further. That's

01:00:26 where we use this package. Typically we use it as part of our bioinformatics pipeline when we

01:00:30 have to generate those things. But it's a very handy tool. We actually use something similar,

01:00:35 and I'm forgetting the name of the package that is used to generate our document. Like we use some

01:00:43 Word documents for creating reports, but we also use Python there to be able to summarize a lot of

01:00:48 these data points and then create a Word document that, you know, it starts with a template of a

01:00:52 Word document and then use Python to fill up all these, you know. Right. Here's where the graph

01:00:57 goes. Here's where the summary goes. Here's where the detected, whatever it goes. Yeah. Right. Yeah.
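
Here is a minimal sketch of that kind of on-the-fly QC workbook generation with openpyxl; the metric names, variants, and file name are made up for illustration rather than taken from the lab's actual pipeline.

```python
# Minimal sketch: write a run-QC summary and a variant list into an Excel workbook
# with openpyxl. The metrics, variants, and file name are illustrative only.
from openpyxl import Workbook

qc_metrics = {"total_reads": 41_235_000, "mean_coverage": 512.4, "pct_q30": 93.1}
variants = [
    ("chr7", 140753336, "BRAF", "c.1799T>A", "p.V600E"),
    ("chr12", 25245350, "KRAS", "c.35G>A", "p.G12D"),
]

wb = Workbook()

ws_qc = wb.active
ws_qc.title = "QC summary"
ws_qc.append(["metric", "value"])
for name, value in qc_metrics.items():
    ws_qc.append([name, value])

ws_var = wb.create_sheet("Variants")
ws_var.append(["chromosome", "position", "gene", "cDNA change", "protein change"])
for row in variants:
    ws_var.append(list(row))

wb.save("run_qc_report.xlsx")
```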

01:01:02 Cool. Are you, here's two things that overlap. Are you familiar with this thing where scientists

01:01:09 rename human genes to stop Excel from misreading? Oh yes. Yes, absolutely. Oh my gosh. This is

01:01:17 crazy. Yes. Yes. It happens. When we import a lot of this data coming from somewhere, we'll see

01:01:25 entries like September 14th or March 19th. Yeah. This is a big problem going in and out of Excel.

01:01:34 And so as much as you can do in Python or any proper programming language, rather than using

01:01:40 Excel, but there was one that was M-A-R-C-H one or March one. Yes. Or S-E-P-T one.

01:01:48 It's very funny. Some of the gene names are funny, but then Excel, you know, gets it to the next

01:01:55 level when it changes the names. This doesn't make any sense. Yeah. Yeah. It doesn't make any sense.
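
For anyone moving gene lists through spreadsheets, one hedged way to sidestep the SEPT1/MARCH1 date problem from the Python side is to keep gene symbols as explicit text both when reading and when writing; the column and file names below are assumptions.

```python
# Two small defenses against Excel turning gene symbols like SEPT1 or MARCH1 into dates.
# Column and file names are illustrative.
import pandas as pd
from openpyxl import Workbook

# 1) When reading delimited data, force the gene column to stay a plain string
#    (and don't let "NA"-like gene symbols become missing values).
df = pd.read_csv("variants.tsv", sep="\t", dtype={"gene": str}, keep_default_na=False)

# 2) When writing to Excel, mark the gene cells with Excel's text format ("@")
#    so the spreadsheet is less likely to reinterpret them as dates.
wb = Workbook()
ws = wb.active
ws.append(["gene", "log2"])
for gene, log2 in zip(df["gene"], df["log2"]):
    row = ws.max_row + 1
    cell = ws.cell(row=row, column=1, value=str(gene))
    cell.number_format = "@"          # '@' is Excel's text number format
    ws.cell(row=row, column=2, value=float(log2))
wb.save("variants.xlsx")
```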

01:02:02 Yeah. All right. On to the next one. Hera. Yes. Hera. This is very interesting. So this is where

01:02:08 I think, you know, where in our instance, we are going away from standard web applications,

01:02:16 standard bioinformatics pipeline to really touching DevOps using Python. And so one of the things

01:02:21 that typically we get to the point when we scale up our bioinformatics pipeline, where we have

01:02:27 multiple samples and multiple runs and everything needs to be orchestrated in a way where you have,

01:02:33 you know, while you're running your pipeline, you have a lot of visibility into how it works.

01:02:37 And so this is one of our projects we're working on to move our current bioinformatics pipeline,

01:02:44 the way it works, you know, kind of on a single server to be able to use the Kubernetes cluster

01:02:49 to actually deploy the long running pipelines onto that. And so there are many options. You know,

01:02:55 there are more standard sorts of, you know, workflow

01:03:02 protocols that you can use to run on either cloud or HPC environments. There is a very popular tool

01:03:09 called Nextflow that is used to be able to, you know, kind of create your data analysis pipeline.

01:03:15 We can sort of define that and then use any backend to deploy it. One of the things that we

01:03:21 kind of, when I was exploring the space, one of the things I came across was, you know, the whole

01:03:28 sort of ecosystem that Argo maintains with, you know, Argo workflow and Argo CI/CD and all those

01:03:35 things. So workflow was interesting because Argo provides that way where you can sort of, you know,

01:03:39 write your pipelines in a YAML format and then have it, you know, deployed on the Kubernetes

01:03:43 cluster. It really is very native to the Kubernetes cluster. It sounds a little bit like Ansible,

01:03:49 but specifically for bio types of projects, right?

01:03:55 Yeah. So Argo, so the interesting thing is Argo, you know, this Argo Workflows was set

01:04:00 up really with a lot of CI/CD automations in mind. So, you know, it is, yes, you can run data

01:04:06 pipelines in general, but never, it was never, at least in its description, it never describes

01:04:11 use case sort of in bioinformatics or, you know, biology pipeline analysis. And similarly,

01:04:17 you know, it was like, okay, it's a generic tool. You can use it for whatever you want.

01:04:22 So I tried it out with using, you know, like a YAML file and it was a simple four-step pipeline.

01:04:28 It was wonderful. It was magical. And the good thing was with Argo, like the Argo workflow,

01:04:35 when you install that on your Kubernetes cluster, it comes with a native web interface. So it's,

01:04:43 you know, I'm sure if you've heard about the workflow option with Airflow. So Airflow is a

01:04:49 package that also, you know, there's a nice Python SDK for that, where you have, you know,

01:04:54 you deploy it on a Kubernetes cluster. We have all these amazing visualizations to show what step

01:04:59 you're on, or if there's some error there, it'll do that. Argo does the same thing. So it has,

01:05:04 it has obviously the built-in capability to interact with it as an API, but then also there's

01:05:09 a web interface that it'll deploy and it can have visibility into every step of the process. You can

01:05:14 summarize and see the entire tree. So that was very interesting for us because we could get all

01:05:18 that thing done in a single thing, in a single go. But then our challenge was, well, our desire was

01:05:24 that if we could integrate that with our LIMS that we were working on using FastAPI,

01:05:29 so the idea was, hey, if there's any Python SDK. And so that's where Hera comes. So Hera

01:05:33 essentially is an SDK, a wrapper that's essentially talking to Argo, but you can define your pipeline

01:05:40 steps as, you know, DAGs in Python. And so that makes the process super simple, where you're now

01:05:48 natively, essentially we can integrate that as a backend to our web application. And so then it's

01:05:55 almost like, you know, it's Python again, from start to finish, you're not getting out of that.

01:05:59 And it's, again, it's a very well-maintained application. So we are currently doing a

01:06:04 validation to be able to make sure that, or demonstrate that, you know, it's equally

01:06:09 performant when we compare it to a more sort of native shell-based, you know, execution of the

01:06:15 pipeline. Okay. Yeah. This is new to me. I mean, of course I know Airflow, but not Hera. Cool.
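
To give a flavor of what defining a pipeline as a DAG in Hera looks like, here is a minimal sketch modeled on Hera's documented diamond-DAG example; the step names are placeholders, and actually submitting the workflow requires pointing Hera at an Argo Workflows server, which is omitted here.

```python
# Minimal sketch of an Argo Workflows DAG defined in Python with Hera (v5-style API).
# Step names are placeholders; submitting requires a configured WorkflowsService
# pointing at your Argo server, which is not shown here.
from hera.workflows import DAG, Workflow, script


@script()
def run_step(step: str):
    # Each step would normally shell out to an aligner, variant caller, etc.
    print(f"running pipeline step: {step}")


with Workflow(generate_name="ngs-pipeline-", entrypoint="pipeline") as w:
    with DAG(name="pipeline"):
        align = run_step(name="align", arguments={"step": "align"})
        call = run_step(name="call-variants", arguments={"step": "call-variants"})
        annotate = run_step(name="annotate", arguments={"step": "annotate"})
        report = run_step(name="report", arguments={"step": "report"})
        align >> call >> annotate >> report  # linear dependencies between the four steps

# w.create() would submit the workflow once a WorkflowsService/host is configured.
```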

01:06:20 All right. insiM, did I grab the right one here?

01:06:26 No, I think it's similar. Let me see if I can, I can send you the link.

01:06:32 Yeah. Throw it in the private chat here and I'll pull it up.

01:06:35 Yeah. Okay.

01:06:39 Here we go. insiM.

01:06:40 Yeah. Okay. Got it.

01:06:43 Yeah. So this is a, this is a very interesting space in next generation sequencing assay or for,

01:06:52 you know, high throughput sequencing assays. So what happens is, as I mentioned, that one of the

01:06:56 things that is required for a clinical lab is to be able to perform a validation on multiple samples

01:07:01 of tumors that have certain mutations. And then you can demonstrate that, you know, yes,

01:07:05 the assay works because you have tested, you know, a hundred samples that have 300 different,

01:07:12 you know, mutations or genetic alterations. And then you can demonstrate that, yes,

01:07:16 your pipeline or your assay was able to pick it up. So you can say that, you know, your assay is,

01:07:20 you know, X percentage sensitive, X percentage specific, and you know, your

01:07:24 recall rate and things like that. So what happens is when you're trying to

01:07:28 get those samples that have these very difficult or challenging variants to detect, because

01:07:35 they're just, you know, complex in how they occur biologically in the cell, it's very difficult.

01:07:40 Some of these are very rare. There may be only two samples in the entire world. Or it's just

01:07:45 not possible practically to get those samples unless we wait for like, you know, 10 years to

01:07:50 validate that. So the idea is that if it is possible to be able to use algorithms, which can

01:07:56 manipulate the existing sequence. So for example, we have a sequence data from an existing real

01:08:04 tumor sample, but we can manipulate that in a way where we introduce these mutations in silico.

01:08:10 So we can introduce, you know, SNVs or insertion, deletion mutations, and then

01:08:14 use that same file to then feed into our pipeline, bioinformatics pipeline and say,

01:08:18 okay, run through the entire data pipeline. And then let's see if we are able to identify

01:08:22 those variants that we inserted. And so that's where this in silico mutagenesis comes in.

01:08:26 It's a very hot topic. It's a very relevant topic of interest to really fill this large

01:08:32 gap in terms of availability of rare variants and their samples and how we can really improve

01:08:40 sort of some of these rare, but very clinically significant edge cases where we don't want to

01:08:44 miss those variants and actually see those in real tumor samples. And so this is a Python package

01:08:50 that was developed at the University of Chicago as part of their clinical lab. And so what really

01:08:56 it does is it will take in a list of different, you know, mutations, for example, in this plot,

01:09:02 I think they give examples of, you know, insertions and deletions, insertions where you have extra

01:09:07 sequence or deletion where you have certain signals which are missing or SNVs or single

01:09:11 nucleotide variants where you have one nucleotide that got switched with another one. And so these

01:09:16 are typically that we practically see in like real samples or real tumor samples, but this is a way

01:09:22 to mimic that, you know, in a sample that does not have it. And so this Python package is able to,

01:09:29 you know, take that list from you say, okay, I have a list of these 20 important mutations that

01:09:35 I know from the public databases have been reported, but I want them to be inserted into

01:09:40 my dataset that was created from say, a set of three or four real tumors, and then use that to

01:09:46 challenge the pipeline to say that, hey, can you still pick it up? And so I see.

01:09:50 I see. Simulate these rare changes and then test or exercise your setup.

01:09:56 Yeah, right.
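
insiM itself edits real sequencing data (BAM/FASTQ) so the spiked-in variants flow through the whole pipeline, but as a toy illustration of the idea, and not insiM's actual interface, here is a tiny sketch of introducing an SNV, an insertion, or a deletion into a reference sequence string.

```python
# Toy illustration of in silico mutagenesis on a plain sequence string.
# This is NOT insiM's interface; real tools edit reads in BAM/FASTQ files so the
# spiked-in variant flows through alignment and variant calling like a real one.

def apply_variant(seq: str, pos: int, ref: str, alt: str) -> str:
    """Replace `ref` with `alt` at 0-based position `pos` (SNV, insertion, or deletion)."""
    assert seq[pos:pos + len(ref)] == ref, "reference allele does not match the sequence"
    return seq[:pos] + alt + seq[pos + len(ref):]

reference = "ACGTACGTACGT"
print(apply_variant(reference, 4, "A", "G"))     # SNV:       ACGTGCGTACGT
print(apply_variant(reference, 4, "A", "ATTT"))  # insertion: ACGTATTTCGTACGT
print(apply_variant(reference, 4, "ACG", "A"))   # deletion:  ACGTATACGT
```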

01:09:57 We've got a few more to cover, but I think we're getting a little bit short on time. So let me just

01:10:01 close this out with a final question for you, because I know this is the topic du jour.

01:10:06 What does AI and LLMs look like for you guys? Does it matter? Is it really powerful? Is it

01:10:14 super important? I mean, genetics is kind of text data in a sense. And so,

01:10:19 yes, sort of in the space of how it could apply.

01:10:22 Right. It is text data and, you know, when you talk about like a search

01:10:30 space, a lot of the search space is very text-based. You know, there is some numerical

01:10:34 base, but there's a lot of text-based search as well. And I think across the entire spectrum from

01:10:40 where we start with very raw sequencing data to the point that we are trying to,

01:10:44 you know, ask the question that, okay, I found this rare or novel mutation in this particular

01:10:51 gene. What does it mean? What tumor has been described? What disease does it relate to?

01:10:57 One of the things that we do as molecular pathologists, and this is sort of where a lot

01:11:02 of the medical work comes in, is where we really go through a lot of the medical literature,

01:11:06 what we have learned before, new publications, papers out there that, you know, that have a lot

01:11:11 of data in terms of, you know, studies that are done on this particular gene. And they've described

01:11:16 like, okay, these alterations actually activate the gene or is bad for the tumor or, you know,

01:11:20 makes it treatment resistant. So you can see the natural, there's a lot of text as it happens.

01:11:26 And so in that space, we are seeing in the, I would say in the past, you know, three to four

01:11:31 years, there's been a lot of application of AI tools that have come out, you know, particularly

01:11:36 in the space of variant calling, where we have this genomic sequence data and we're trying to

01:11:41 identify variants. You know, one of the examples that's been talked about a lot is the

01:11:46 DeepVariant caller. It's called DeepVariant, from the team at Google who developed that.

01:11:52 That uses a lot of the AI techniques to be able to pick those things up.

01:11:56 There are some genomic databases that we use for in silico prediction. For example,

01:12:02 if you have a variant we have no idea about, there's a database called dbscSNV that

01:12:09 uses random forest techniques. And I think it uses another algorithm to predict if a certain

01:12:14 site where there's a mutation can enhance an abnormal version of a mechanism called splicing versus not.

01:12:21 Similarly, there's a lot of tools that are coming in and the LLMs, I think are,

01:12:26 I would say not mainstream, but I think there's a lot of interesting research that is coming

01:12:29 around there, but people are trying to use LLMs for doing these broader extractions, saying that,

01:12:36 hey, you know, I have these, you know, I don't know, a thousand articles, and I want to find

01:12:41 these particular combination of words that, you know, you know, it's a combination of a disease

01:12:49 and a mutation and what do I get back on that? I personally tried, you know, the ChatGPT with

01:12:54 different, you know, like phrases and questions about it. What I've seen so far is, and this is

01:13:01 purely my personal experience. I think a lot of it reads very real, but when you start to look

01:13:08 into the references as to what it references, then you quickly figure it out. This is not the

01:13:14 real deal. And so I think, I think it's, you know, I'm not a very pessimistic person. I would say,

01:13:21 oh no, this is all garbage, but I think there is opportunity there. It's just how do you train it.

01:13:26 Maybe there's a space or an opportunity, and it probably already has been,

01:13:31 people are pursuing this as training a smaller model, but really deeply in genetics,

01:13:37 whereas trying, not trying to use a model that tries to understand everything.

01:13:41 Right. Right. Yeah. It's more, yeah. More in the medical literature or the genomic literature

01:13:46 to be able to like, meaning is enhancement. Yes. So I think there's active work going on there,

01:13:51 but it's, yeah, it's, I think it's making, you know, a lot of, a lot of interesting

01:13:55 research, a lot of potential impact on how, you know, we do things. And obviously the tool sets

01:14:01 that we currently use, we might expect in the next 10 years to change.

01:14:04 Yeah, for sure. All right. Final thought here, people are listening. They're maybe doing similar

01:14:10 work to you. How do they get started? What would you tell them? Get going with Python and some of

01:14:15 these packages? Yeah. I mean, I would, you know, my reflecting on my own experience sort of, you

01:14:21 know, in a very long-winded way that I ended up here, I think, you know, I feel programming

01:14:30 in general, and I think Python particularly as a programming language, has a very low, you know,

01:14:36 sort of, you know, entry point in terms of being able to really quickly get things done, like learn

01:14:41 it easily and get things done. I think it should be to me, anybody, anybody who's trying to pursue

01:14:47 something in biology or computational biology or bioinformatics, I think this is the first thing.

01:14:52 It's something easy to do to be, you know, I would say relatively easy to do, to be able to get in

01:14:58 that. For anybody with, you know, a desire to learn this and who has analytical thinking, I mean, I think

01:15:07 investing into Python is probably the best bet because you can pretty much do anything you want.

01:15:13 That's what I tell, you know, when I train people in my lab or I talk to other students, is that

01:15:18 if you want to spend your time, you have very little time because you're busy, you know, with

01:15:22 your other things. I think the one thing that can get some of the job done and be still aligned with

01:15:27 what you're doing is Python. And after that, I think it's, you know, it's a lot of self-driven

01:15:34 learning where, you know, you kind of, you know, look into things. But the whole thing is,

01:15:38 I think the Python community is wonderful. It's almost like I sit down and I think about, oh,

01:15:42 I have to solve this problem. Probably there are 50 other people who are thinking about that and

01:15:47 maybe two people have already worked on it. So it's-

01:15:49 Right, and you've already published it to PyPI and you're good to go.

01:15:53 Absolutely. I totally agree with that. And, you know, people should take the couple of weeks,

01:15:58 get good at it, and it'll amplify. It'll save you time, definitely, in the long run.

01:16:02 It is, yeah. Oh, absolutely. Yes. The only thing that I, that I, the only thing I would say,

01:16:08 like an added thing is if somebody is learning Python and then they do have an intention,

01:16:15 you know, to take it to the point where they will be involved in more serious,

01:16:20 like, you know, application development or maintaining an open source package or,

01:16:24 you know, however they contribute to that. I think learning a little bit more,

01:16:27 like learning Python in its real sense in terms of how to do it right. You know, there are five

01:16:33 ways of doing something correctly. But I think there's one way that is consistent so that it's,

01:16:38 again, you know, easily shared, it's easily maintainable, others can easily understand.

01:16:43 I think that would be my second advice. It is, it takes a little bit of time, but I think it's

01:16:47 well worth the effort to spend the time writing, you know, idiomatic Python code. So it's,

01:16:53 it's portable. Absolutely. All right. So, Mac, thank you for being on the show. It's been

01:16:58 great to get this look inside of what you all are doing with Python.

01:17:01 Yeah. Thank you for having me on the show. I appreciate that.

01:17:03 Yep. Bye.

01:17:04 Okay. Bye, bye.

01:17:05 This has been another episode of Talk Python to Me. Thank you to our sponsors. Be sure to check

01:17:11 out what they're offering. It really helps support the show. Take some stress out of your life.

01:17:16 Get notified immediately about errors and performance issues in your web or mobile

01:17:20 applications with Sentry. Just visit talkpython.fm/sentry and get started for free. And

01:17:26 be sure to use the promo code talkpython, all one word. This episode is sponsored by Posit Connect

01:17:33 from the makers of Shiny. Publish, share, and deploy all of your data projects that you're

01:17:37 creating using Python. Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards,

01:17:44 and APIs. Posit Connect supports all of them. Try Posit Connect for free by going to

01:17:49 talkpython.fm/posit. P-O-S-I-T.

01:17:53 Want to level up your Python? We have one of the largest catalogs of Python video courses over at

01:17:58 Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async.

01:18:04 And best of all, there's not a subscription in sight. Check it out for yourself at

01:18:08 training.talkpython.fm. Be sure to subscribe to the show, open your favorite podcast app,

01:18:13 and search for Python. We should be right at the top. You can also find the iTunes feed at

01:18:18 /iTunes, the Google Play feed at /play, and the Direct RSS feed at /rss on talkpython.fm.

01:18:25 We're live streaming most of our recordings these days. If you want to be part of the show and have

01:18:30 your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:18:35 This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it.

01:18:40 Now get out there and write some Python code.
