#470: Python in Medicine and Patient Care Transcript
00:00 Python is special. It's used by the big tech companies, of course, but it's also used by those you would rarely classify as developers.
00:07 On this episode, we get a look inside how Python is being used at a children's hospital to speed and improve patient care.
00:15 We have Dr. Somak Roy here to share how he's using Python in his day-to-day job to help kids get well a little bit faster.
00:22 This is Talk Python to Me, episode 470, recorded June 23rd, 2024.
00:29 Are you ready for your host?
00:30 There he is.
00:31 You're listening to Michael Kennedy on Talk Python to Me.
00:34 Live from Portland, Oregon, and this segment was made with Python.
00:38 Welcome to Talk Python to Me, a weekly podcast on Python.
00:44 This is your host, Michael Kennedy.
00:45 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both on fosstodon.org.
00:53 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.
00:58 We've started streaming most of our episodes live on YouTube.
01:02 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.
01:10 This episode is brought to you by Sentry.
01:12 Don't let those errors go unnoticed.
01:14 Use Sentry like we do here at Talk Python.
01:17 Sign up at talkpython.fm/sentry.
01:21 And it's brought to you by Posit Connect from the makers of Shiny.
01:25 Publish, share, and deploy all of your data projects that you're creating using Python.
01:29 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.
01:36 Posit Connect supports all of them.
01:38 Try Posit Connect for free by going to talkpython.fm/Posit.
01:43 P-O-S-I-T.
01:45 Somak, welcome to Talk Python to Me.
01:46 Awesome to have you here.
01:47 Hey, thank you, Michael, for the introduction.
01:50 I'm here.
01:52 Excited.
01:53 Yeah, I'm pretty excited to be talking about medicine and all the stuff that you guys are
01:58 doing with Python.
01:59 And I really like these kinds of shows because I think it's important to highlight that Python
02:04 is not just for web developers and pure data science machine learning people, but it's used
02:09 by this huge spectrum of people doing all sorts of interesting stuff and solving real problems
02:13 with it, right?
02:14 And it sounds like you fall pretty solidly in that category.
02:16 Yeah, absolutely.
02:17 It was like Python has been sort of this discovery as I've gone through my career as a physician.
02:23 And it's interesting how to begin with when computer science initially when I was training
02:29 and growing up, it was hard to imagine medicine and computer science sort of being hand in hand
02:34 together.
02:35 But now I think things have progressed and there's a lot of technology that's now in medicine that
02:41 allows you to do all kinds of things.
02:43 And of course, as I've discovered Python, it brings in kind of the toolkit and the ability
02:51 to be able to achieve and solve problems in a way that I think it's not been envisioned before.
02:57 So it's a very exciting time.
02:59 Yeah, it is a very exciting time.
03:01 And I think it's just getting better and better.
03:03 Before we get too far into this, tell people a quick bit about yourself, quick introduction.
03:07 Yeah, absolutely.
03:07 So I'm Somak Roy.
03:09 I am a molecular pathologist.
03:13 It's a type of physician who deals with looking at the genome of either a patient or a patient's tissue.
03:23 And we essentially look at all of these things in a way to be able to help manage a patient's treatment.
03:34 In my current position, I am an associate professor and the director of molecular pathology at Cincinnati Children's Hospital.
03:42 My lab is a clinical lab that is under the division of pathology.
03:49 We do a lot of work that pertains to kids in terms of helping them diagnose and manage pediatric cancer,
03:58 as well as infectious diseases that happen at this age.
04:04 I mean, molecular pathology, I essentially trained back in India as a physician, did my MD.
04:12 Then I came here, started in Pittsburgh.
04:15 I did my training in pathology and lab medicine.
04:19 I specialized in molecular pathology.
04:21 Then I was there in Pittsburgh.
04:23 I worked for some time and then ended up at Cincinnati Children's.
04:27 Excellent.
04:28 Do you work directly with patients or do you get samples sent to you from other doctors and then you process them and analyze them?
04:36 Yeah, that's a good question.
04:37 So I do not work with patients directly.
04:40 It's a kind of, it's a subspecialty in medicine where my lab works with the samples that have been collected in the patient,
04:48 either from the OR or a procedure or from the radiology suite.
04:54 And then we work on that tissue or the blood sample or a bone marrow sample that comes to us.
04:59 And yes, then all the testing that we perform is off from that specimen.
05:05 And then once we generate the clinical reports back, they go back to the patient's chart and to the patients,
05:12 to the clinicians who are treating and managing them.
05:16 And then that way it helps how, you know, how they're able to then, you know, get a diagnosis and then give the appropriate treatment and the management to the patient.
05:24 Yeah, excellent.
05:25 So, yeah, you must see a lot of, a lot of different stuff flying, flying through the lab you have to analyze.
05:30 So how did you go from I'm studying medicine to I'm writing Python code and running automation?
05:38 And what was that process like?
05:41 Well, that was a, that was an interesting journey for me.
05:45 So before medicine and biology came into my life, I started off, it was second grade, I believe, when, you know, my dad, he got me a computer at that time,
05:59 which is a, you know, 64 kilobytes small machine.
06:03 And it was, I think it was a Toshiba MSX computer where you could write like GW-BASIC code and, you know, some basic, you know, predefined hex code.
06:14 And you can, you know, run small applications on that.
06:16 That was my starting point.
06:18 It was, it was super exciting for me.
06:20 And I think from there on the journey went into, as I went through, you know, school, you know, high school and then college.
06:29 You know, medicine was, I would say, you know, biology was something that intrigued me.
06:35 And at the same time, I also got interested in genetics, looking at DNA sequences.
06:42 And, you know, just, I had a natural sort of, you know, liking for the fact that I could, you know, study about the cell or the genome, the DNA and RNA.
06:55 But also I realized that there was a lot of math and computation that you can use to sort of slice and dice data.
07:02 And at that time, you know, the place that I grew in India was a very small place.
07:06 So we didn't have access to resources like internet.
07:10 So I, when I, my exposure to internet was when I actually went to med school for the first time.
07:15 Oh yeah, you can actually connect to like other computers.
07:18 So that's when I started my medicine.
07:21 So obviously I did, I did my training in medicine in medical school back in India.
07:26 That's when I started to, you know, connect and talk to a lot of people.
07:30 And some of my friends who actually were already, you know, writing apps at that time using like, you know, Java applets on browser.
07:38 And so I started to make some connection in terms of, you know, images and how, you know, learning how some of those things can be used in medicine.
07:47 And so radiology, during my radiology rotation, that was my first real life.
07:53 It was my realization that actually in medicine, you can use computers a lot to handle a lot of these images, x-rays, CT scans.
08:00 And I think as it went on, when I came to the U.S., that's where, you know, it really started off.
08:06 During my residency in pathology here, I actually connected with my mentors, Dr. Anil Parwani and Dr. Liron Pantanowitz.
08:18 They are well-known pathology informaticists.
08:22 They've spent a lot of their time sort of dwelling in the world of pathology, medicine, and, you know, computer science.
08:30 And so that is when I could actually realize that, yes, you can, you know, do a lot of innovative stuff by, you know, developing apps, algorithms, analyzing, you know, either image data or molecular data.
08:42 And so that is when I started to go into designing an app, which was a very simple web app that was a project.
08:52 I was working with one of my mentors.
08:54 And so the idea was that, you know, we had a lot of these pathology images and we wanted to create a little browser app that would display these images as thumbnails.
09:04 And then clicking that could enlarge the image and show it on the display.
09:07 And so I used, at that time, it was the .NET Framework, ASP.NET.
09:13 And so, you know, created a little app using Visual Basic, slowly I then migrated using C# in the same environment.
09:22 And that time I started my advanced fellowship training in molecular pathology.
09:29 That's when I, you know, that's when I started there.
09:33 That's when I realized there's a lot of genomic sequencing data where essentially you're dealing with a lot of strings and numbers.
09:40 And, you know, you have to make a lot of sense in terms of, you know, this large volume of data that comes in.
09:44 If we're working with, so the kind of data that you're working with for, say, this genetic stuff.
09:49 Yes.
09:50 For us, for when you're studying the genomics, so how much data is in, say, one strand of DNA?
09:58 How much of that do you actually care about?
09:59 Like, give me, give us a sense of sort of how much data we're talking.
10:03 Right.
10:04 So it really depends on what is being done.
10:07 And so when we look at, so when we talk about genomics, it is really designed on how the experiment is done.
10:15 So, for example, if we just simply look at the entire human genome, we are talking about three billion alphabets.
10:23 Essentially, it's, you know, the combination of four alphabets, A, T, G, and C.
10:30 So these are the four nucleotides of the DNA sequence, and the RNA has one additional one, U, which replaces T.
10:37 But the idea is that it's a mix and match of these sequence.
10:40 And so if you think about the entire human genome as a single thread of A, T, G, and Cs in various combination, you're looking at three billion alphabets.
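For a rough sense of scale on those three billion letters: because the alphabet has only four symbols, each base needs just 2 bits in principle. The sketch below is an illustration only, not how any particular sequencing tool stores data; the `pack_2bit` and `unpack_2bit` helpers are hypothetical, and the 3-billion figure is the round number quoted in the conversation.

```python
# Illustrative only: a four-letter alphabet fits in 2 bits per base.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_2bit(sequence: str) -> int:
    """Pack a DNA string into one integer, 2 bits per base (hypothetical helper)."""
    packed = 0
    for base in sequence.upper():
        packed = (packed << 2) | BASE_TO_BITS[base]
    return packed


def unpack_2bit(packed: int, length: int) -> str:
    """Recover a DNA string of the given length from its packed form."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[packed & 0b11])
        packed >>= 2
    return "".join(reversed(bases))


GENOME_BASES = 3_000_000_000  # round figure quoted in the conversation

# One byte per letter in a plain text file vs. 2 bits per letter when packed.
plain_text_gb = GENOME_BASES / 1e9
packed_gb = GENOME_BASES * 2 / 8 / 1e9

print(f"plain text: {plain_text_gb:.2f} GB, 2-bit packed: {packed_gb:.2f} GB")
```

A single copy of the sequence is therefore small; the much larger file sizes mentioned later come from reading the same regions many times over and storing quality information alongside each base.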
10:51 And so what happens is when we do these sequencing experiments where you would take, you know, the DNA molecule from a bunch of cells within a tissue, and then either we read all the three billion base pairs.
11:06 And typically, the way the sequencing is done is you read all of these sequences and, you know, from many molecules.
11:13 And so you'll have multiple copies of that when you're, you know, translating from a molecular, like a chemical molecular structure to a DNA sequence on, say, a flat file on a, you know, in a file system.
11:28 And so if you look at that large scale of data, like the entire genome, we are talking of, you know, hundreds of gigabytes, maybe even terabyte worth of data.
11:37 Then there are other more practical approaches when we look at the genome, when we, and especially this is something that we use for day-to-day patient care, which is referred to as targeted sequencing.
11:47 What that means is instead of the three billion base pairs, we focus on those regions of the human genome that are of most pertinent use, or, you know, that we, at least as the current field of genomics, that we understand what to do with it.
12:03 And so there are certain genes that, at least in the space I work with, cancer genomics, that are, I would say, close to about maybe 2,000 to 2,000 genes, which are known to be cancer-associated.
12:18 And of that, roughly about 500 to 700 genes, where we know that, you know, they have, they've been studied and demonstrated that there are certain types of abnormalities in those genes in terms of the sequence changes, that they have certain meaning from, you know, in context of tumor in order to make a diagnosis, or to understand if the tumor is aggressive or benign, or if there are certain treatments that could be applied to those tumors.
12:46 And that's specifically linked to the kind of sequence change you see in those, in that region of the genome.
12:54 And that, you know, we are talking about, practically speaking, and we talk about, you know, the targeted testing that we do, it's a very small fraction of the large genome.
13:04 Typically, there's a term known as exome sequencing, and exome sequencing refers to sequencing all those regions of the human genome where it at least codes, you know, encodes for one or the other annotated gene.
13:18 That is typically about 1 to 2% of the entire genome.
13:22 And so if we further narrow that down to about say 500, 600 genes, that one would typically sequence for practical cancer molecular testing.
13:31 That's, I would say that's probably about 10th, maybe slightly less than that of the genome, but it's a very high yield from a clinical standpoint.
13:40 Sure.
13:41 Because the chance that an alteration you find would help with the clinical treatment is high.
13:46 So, if you're going to talk about that data set, that is, you know, it's complex in a different way, because just looking at the raw sequence data would be, you know, somewhere in, you know, I would say 1 to 20 gigs from a single, you know, sequence file, but it entirely depends on how deep we go.
14:08 So, for example, when we talk about sequencing, as I mentioned before, when we sequence a molecule, we can sequence it either at certain depths.
14:17 That means what level of redundancy you want to be able to read that molecule.
14:22 Sometimes we read the molecules, you know, 20 to 30 times.
14:26 So that's referred to as 30x, or sometimes we'll read that, you know, 500 times.
14:31 So that would be 500x.
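To put numbers on depth: the total number of bases read is roughly the size of the region being sequenced times the depth. The sketch below is a back-of-the-envelope estimate only. The 2 bytes per base (one character for the base call and one for its quality in an uncompressed text file) and the 1.5 million base panel size are assumptions made for illustration, not figures from the conversation.

```python
def sequenced_bases(region_bases: int, depth: int) -> int:
    """Approximate number of bases read: region size times depth of coverage."""
    return region_bases * depth


def rough_size_gb(region_bases: int, depth: int, bytes_per_base: float = 2.0) -> float:
    """Very rough uncompressed size in gigabytes (assumed 2 bytes per base)."""
    return sequenced_bases(region_bases, depth) * bytes_per_base / 1e9


WHOLE_GENOME = 3_000_000_000  # ~3 billion bases, the round figure quoted earlier
PANEL = 1_500_000             # hypothetical targeted panel size

print("whole genome at 30x:", rough_size_gb(WHOLE_GENOME, 30), "GB")
print("targeted panel at 500x:", rough_size_gb(PANEL, 500), "GB")
```

Under these assumptions a whole genome at 30x lands in the hundreds of gigabytes, while a small targeted panel stays in the low gigabytes even at much higher depth.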
14:33 Do you do that because you want to make sure you don't misread the gene?
14:38 Yes.
14:38 So, right.
14:40 So what happens is the greater the depth of sequence, so typically for, you know, such large panels that we sequence in a clinical setting, we usually target about 1500x to 2000x.
14:52 That means we're reading that 2000 times.
14:54 So the greater the depth, the greater the possibility of identifying a certain variation or genomic alteration that is present at a very low level.
15:03 For example, say, you know, you have a tumor cell and within that, only 2% of the cells have this mutation.
15:10 Others don't.
15:10 And so when you're looking for or hunting for these, you know, needles in a haystack, you really want to maximize the amount of depth you have to be able to pick those things up.
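The needle-in-a-haystack point can be illustrated with a simple binomial model. This is a simplified sketch: it assumes each read independently carries the variant with a fixed probability (2% here), ignores sequencing error entirely, and uses a hypothetical threshold of 5 supporting reads before a variant would be reported. Real variant callers are considerably more involved.

```python
from math import comb


def prob_at_least(k: int, depth: int, fraction: float) -> float:
    """P(at least k variant reads) when each of `depth` reads is variant with prob `fraction`."""
    p_fewer = sum(
        comb(depth, i) * fraction**i * (1 - fraction) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


VARIANT_FRACTION = 0.02  # assumed fraction of reads carrying the variant
MIN_READS = 5            # hypothetical minimum number of supporting reads

for depth in (30, 500, 2000):
    expected = depth * VARIANT_FRACTION
    p = prob_at_least(MIN_READS, depth, VARIANT_FRACTION)
    print(f"{depth}x: expect {expected:.1f} variant reads, P(>= {MIN_READS}) = {p:.4f}")
```

At low depth the expected number of variant reads is below one, so the variant is almost never seen often enough; at the depths quoted for clinical panels it is seen dozens of times.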
15:19 So it really depends on how deep we go.
15:21 The more deep we go, the more data it is.
15:23 And so it can scale up to, you know, almost several hundred gigabytes.
15:28 Sure.
15:28 Yeah, I've always wondered about how you can go and read somebody's genetics and then not make a mistake when you're, you know, reading, using chemicals to read.
15:38 So, but it's really ridiculous how much data is there.
15:42 Off by one, a C for a G or whatever is a bad thing, right?
15:47 Right.
15:47 But it is, you know, I think as the technology has matured, there's, you know, there's always, there's nothing, there's nothing 100%, you know, in terms of the error profile for the enzyme that has been used to work, the technology that is reading the actual fluorescence, converting that to, you know, signal.
16:04 There's always statistical values and probabilities that are associated with, you know, what is the probability that it is wrong or, you know, incorrect or correct.
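Those per-base probabilities are commonly expressed as Phred quality scores, where Q = -10 * log10(p) and p is the estimated chance that the base call is wrong. The sketch below assumes the widely used Phred+33 ASCII encoding found in FASTQ files; the helper names and the sample quality string are made up for illustration.

```python
import math


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse mapping: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality]


# Made-up quality string: one character per sequenced base.
scores = decode_quality_string("I5+!")
for q in scores:
    print(f"Q{q}: estimated error probability {phred_to_error_prob(q):.4f}")
```

A score of 30, for example, corresponds to an estimated one-in-a-thousand chance that the base was called incorrectly.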
16:12 But within that, you know, frame and where the current technology is, it's pretty accurate for, if not all, you know, many of the regions of the genome.
16:21 And so it's mind-baffling how it works.
16:23 Yeah, it's, it really is quite amazing.
16:26 It's one of the modern marvels of science for sure.
16:29 It is, it is.
16:30 This portion of Talk Python to Me is brought to you by Sentry.
16:34 Code breaks.
16:35 It's a fact of life.
16:36 With Sentry, you can fix it faster.
16:39 As I've told you all before, we use Sentry on many of our apps and APIs here at Talk Python.
16:44 I recently used Sentry to help me track down one of the weirdest bugs I've run into in a long time.
16:50 Here's what happened.
16:51 When signing up for our mailing list, it would crash under a non-common execution path, like situations where someone was already subscribed or entered an invalid email address or something like this.
17:03 The bizarre part was that our logging of that unusual condition itself was crashing.
17:08 How is it possible for our log to crash?
17:12 It's basically a glorified print statement.
17:14 Well, Sentry to the rescue.
17:16 I'm looking at the crash report right now, and I see way more information than you'd expect to find in any log statement.
17:22 And because it's production, debuggers are out of the question.
17:25 I see the traceback, of course, but also the browser version, client OS, server OS, server OS version, whether it's production or QA, the email and name of the person signing up.
17:38 That's the person who actually experienced the crash, dictionaries of data on the call stack and so much more.
17:42 What was the problem?
17:43 I initialized the logger with the string info for the level rather than the enumeration dot info, which was an integer based enum.
17:53 So the logging statement would crash saying that I could not use less than or equal to between strings and ints.
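The failure mode described here can be reproduced in a few lines. This is a sketch with a made-up `TinyLogger` class and `Level` enum, not the logging library used on the show; it only demonstrates why comparing a string level against an integer-based enum raises a `TypeError`.

```python
from enum import IntEnum


class Level(IntEnum):
    DEBUG = 10
    INFO = 20
    ERROR = 40


class TinyLogger:
    """Hypothetical logger that filters messages by comparing levels."""

    def __init__(self, level):
        self.level = level  # intended to be a Level member

    def log(self, message_level: Level, message: str) -> bool:
        # With level="info" this compares str <= IntEnum, which Python rejects.
        if self.level <= message_level:
            print(f"[{message_level.name}] {message}")
            return True
        return False


good = TinyLogger(Level.INFO)
good.log(Level.ERROR, "already subscribed")

bad = TinyLogger("info")  # the bug: a string instead of Level.INFO
try:
    bad.log(Level.ERROR, "already subscribed")
except TypeError as exc:
    print("logging itself crashed:", exc)
```

Because the comparison only runs when a message is actually logged, a mistake like this can stay hidden until a rarely used code path is hit.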
17:59 Crazy town.
18:01 But with Sentry, I captured it, fixed it, and I even helped the user who experienced that crash.
18:07 Don't fly blind.
18:08 Fix code faster with Sentry.
18:10 Create your Sentry account now at talkpython.fm/sentry.
18:14 And if you sign up with the code TALKPYTHON, all capital, no spaces, it's good for two free months of Sentry's business plan, which will give you up to 20 times as many monthly events as well as other features.
18:28 So I think you're a little bit unusual, a little bit weird in the sense that the first programming language you brought to the science and medicine side of things was C#, or VB.NET, rather than something like Python or R.
18:49 So maybe talk a bit about that experience, contrast it with Python.
18:53 Like, why do you end up moving to Python?
18:55 Yeah.
18:56 So, you know, I think the reason I started using VB.NET and C# was, I would say, most, it was probably influenced a lot by, at the time when I was doing, you know, my med school in India as to, you know, what was available at that time.
19:13 It was not something I would just go to the internet and like start getting a lot of resources as, you know, one would do now.
19:18 So it was pretty much like, you know, this is the book I have available and that's about the documentation.
19:23 Okay.
19:23 That's the only thing.
19:24 So you start.
19:24 Yeah.
19:24 That's what I use.
19:26 Right.
19:26 And then, but the thing is when I started applying C# and it was mostly C# and a little bit of C++ that I, you know, started to get into like with some of the non-genetic stuff initially, with the project I was working on.
19:40 It was not too bad because I was able to accomplish most of the task, but then once I got into genomics and I came, so the way, you know, professionals who get into genomics and molecular pathology, there are a couple different routes.
19:58 So either the physician, you know, people who are physician trained and they kind of have a formal background in medicine and they kind of, you know, do a specialized training and then they, you know, they become molecular pathologists after getting board certified.
20:11 There is the other route, which is more of a research background where sort of, you know, people have spent a lot of their time, you know, in really deep research.
20:18 They've, you know, they've learned a lot of genomics hands-on either from a computational background or from a more laboratory, like a wet laboratory background.
20:27 And so they've spent, you know, they've obviously done their PhDs and postdoc training and then sort of, you know, come into the molecular pathology field.
20:34 You know, people starting there tend to have more of a, you know, a formal computational training.
20:42 So they're getting, you know, they usually get, obviously when you start with a research lab, R, Python are sort of like the most common tools that are used for any kind of data analysis and data set visualization.
20:52 Coming from more of a formal medicine background, you know, typically in real,
20:57 you know, when we, you know, get training in, you know, clinical informatics or pathology informatics, often it is very, you know, kind of a, I would not say corporate-based, a very, you know, formal application development space.
21:11 So it's a lot of, you know, Windows-based, .NET, C#, that kind of thing.
21:17 So standard enterprise stack.
21:19 Yeah.
21:20 Java or .NET is a choice.
21:21 Right.
21:21 Yeah.
21:22 Right.
21:22 Right.
21:22 Yeah.
21:22 Okay.
21:23 Yeah.
21:23 In bioinformatics, I mean, at least in genomics, bioinformatics, you know, kind of the ecosystem of tools available, it's, it's a, it's a mishmash of everything.
21:31 It's, there's, for anything which is very computation intensive, like when you're trying to align sequences to the human genome, those are very intensive tasks.
21:39 And typically it's a lot of, you know, C, C++, Java that's involved in some of these very mainstream tools that are available.
21:50 More recently, I think we are seeing Rust coming into the picture as well.
21:54 There's some applications.
21:55 And then of course, you know, Python and R, you know, like the predominant, I think, tool sets, the programming language that are used to solve all of these problems.
22:04 So when I sort of started my molecular pathology fellowship and I got into, now I had to do this project that involved, you know, manipulating all the sequence data to a point where we will be able to develop an application that would help sort of, you know, it's a web-based application that could help, you know, for other pathologists and faculty to read that sequencing data and, you know, digest in a very way that's easy for them to look at it.
22:32 Rather than going to the Linux, you know, terminal and opening up like, you know, raw files and things like that.
22:36 So I used, that was my first project was to use C# in that context.
22:41 But I quickly realized that there was a lot of these algorithms that were natively either written in R or Python and then having to, you know, incorporate those functionalities was not as easily possible.
22:52 So I had to rewrite a lot of those things in C# primarily.
22:55 You know, it was a good learning curve, but I think from a maintenance perspective it was getting really difficult.
23:00 And so that's when the realization was that I think the combination of Linux and Python was, you know, I had to move towards that.
23:08 Yeah.
23:09 C# probably from the timeframe that you're thinking about, didn't really have a great package manager story, not to the same degree that the Python does.
23:16 Although they do pretty, they do pretty good now over in .NET land.
23:19 Right.
23:19 Right.
23:20 Yeah.
23:20 All right.
23:21 So a good question from Chris in the audience says, is there a reason to use Python specifically?
23:26 Like, are there some special sauce packages that make it attractive?
23:29 It sounds like that's kind of what you were getting at.
23:31 You found more solutions to these algorithms than, you know, available in Python than in C#?
23:37 Yeah.
23:38 Or whatever languages.
23:39 Yeah.
23:40 Right.
23:41 So I think, I mean, I think the simple answer is yes.
23:43 I think the community and the amount of work that has been done in this particular space with genomics.
23:49 I mean, when you are really searching for applications, it kind of falls into these three categories of, you know, anything which is a high-performance component, you know, program that is usually in C++, C, a lot of, you know, of those languages, a little bit of Rust and Java.
24:05 And then the other bin is essentially kind of, you know, split up into Python and R. I think for me, Python was, and I think I'm sure others have shared the same way where it's almost like, wow, this is amazing.
24:18 Like coming from C#, it was a little bit of a change because there's no more, like, you know, curly braces.
24:23 Think about the whole thing.
24:26 Did you miss your semicolons?
24:28 Kind of.
24:29 Like even now, sometimes when I write like a little bit of JavaScript, I feel like, oh, yeah, okay.
24:35 This is my term.
24:36 Exactly.
24:37 But, you know, not too bad.
24:39 I think what I got onto was like the simplicity of the language and how powerful it was.
24:48 It was interesting when I had to do something like, there was an algorithm where I had to parse out certain, you know, strings in a way where it required some known workflows that we use to do, like, variant annotations when we are cross-referencing databases and putting them together.
25:09 You know, when you look for, like, you know, when you look for C# packages, I mean, there's really nothing there for native use.
25:16 So you have to write a lot of those things.
25:18 In Python, it's the amount of time that is spent in developing those things is much faster.
25:23 And the development time itself is quick because you either get an idea of somebody who's already done the work or there's a more formal package that you can use.
25:32 So I think initially when I started off, Biopython was a very interesting collection of packages.
25:38 I think it was, like, you know, a tool suite essentially written to, you know, to have all these functions available for very common day-to-day tasks.
25:47 You know, I want to query a certain region of the, you know, BAM file or I want to parse out certain things in the FASTQ file to look at, you know, some of the sequences or doing, you know,
25:57 counting number of sequences in a given file and, you know, getting read counts, things like that was, it's all out of the box.
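As a rough illustration of the kind of task mentioned here, counting reads in a FASTQ file, the sketch below uses only the standard library so it runs without extra installs. It assumes the simple four-lines-per-record FASTQ layout, and the sample records are invented. Biopython covers this kind of parsing out of the box through `Bio.SeqIO.parse(handle, "fastq")`, so the hand-written `read_fastq` helper is only a stand-in.

```python
import io
from typing import Iterator, TextIO


def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of input (a blank line also stops this simple parser)
        sequence = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality


# Two made-up reads, for illustration only.
SAMPLE = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nCCGGA\n+\nIIIII\n"

records = list(read_fastq(io.StringIO(SAMPLE)))
read_count = len(records)
total_bases = sum(len(seq) for _, seq, _ in records)
print(f"{read_count} reads, {total_bases} bases")
```

For a real file the same function works on an open file handle in place of `io.StringIO`.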
26:05 And so that was sort of like the first thing to go about, this is amazing.
26:09 I mean, somebody's already done the work and you're just building on top of it.
26:13 So instead of creating these, I'll just like use those.
26:17 Perfect.
26:18 Right.
26:19 So that was one.
26:21 And the other motivation to use Python was, you know, say, for example, why not R?
26:25 You know, why Python?
26:27 Because R is a very rich ecosystem in, you know, at least in genomics and visualization.
26:30 So I think the second thing was in terms of the idea that I was working on was having to develop a web application and all of these bioinformatics, you know,
26:42 toolings and algorithms running sort of in the back end.
26:44 And so at that time, it was like, okay, well, you know, Python, I've not heard much about in terms of web application.
26:49 Mostly it was, you know, again, this big, like, you know, C#, .NET.
26:53 That was why I started off, you know, with that.
26:56 But then at that time, you know, there was Django and then Flask was sort of coming in.
27:00 It was a very minimalistic, you know, sort of application.
27:03 So I started focusing on that.
27:04 It was very easy with Flask to, you know, get up and running with very simple, you know, applications to do that.
27:11 I didn't try much into Django just because it was too bloated for me.
27:15 But, you know, Flask was great.
27:17 And then what I realized was you can create a simple web application, but then at the same time, you can use all your, you know,
27:24 Biopython and all the wonderful bioinformatics packages in the back end.
27:28 So it's like a single language that lets you do both.
27:31 And so I was like, this is great.
27:33 It was just, I don't have to go anywhere to learn, you know, a third or a fourth or a fifth different programming language.
27:38 And this just gets the job done.
27:40 Yeah.
27:41 Keeping in mind that you are actually main, your main job is medicine, not programming, right?
27:45 It's not like you're a CS person who's just all after, out to learn all the languages, right?
27:50 Right, right, right.
27:51 So that, that definitely is, and again, that's a, that's a huge, you know, I would say I'm, again, as you said, I'm in a, I'm sort of in an unusual position where I'm, you know, a physician, but I also do a lot of these application developments.
28:04 So that certainly is an important point in terms of how much time I have to be able to develop these prototypes.
28:11 And then obviously, you know, typically the way it works is at least right now here, you know, where I am currently working, I have an excellent and amazing team of developers and bioinformaticians who really do a lot of the development work on the front and back end.
28:26 And so for me to be able to take additional time out of my, you know, the clinical and the patient care work is limited.
28:33 So if I can get whatever prototype I'm thinking of or developing the application fast, then, you know, that's, that's what I'm going for.
28:40 And so you can hand it off to the team and let them polish it up and product, make it production ready, basically.
28:46 Yeah.
28:47 Yeah.
28:48 Yeah.
28:48 I was wondering how much time of your, your job do you get to spend on these kinds of things, you know, finding new packages, optimizing or improving the ways that you're working on stuff versus just sort of handing it off to the folks you work with and, and keeping, you know, focus more on the medicine side.
29:05 Yeah.
29:06 So, it, you know, that, that, I think it's a good question.
29:10 I think it's, it's evolved over time as I have, you know, being sort of, you know, when I was in training and then being a faculty and then, you know, faculty in this new position.
29:19 You know, one of the things I did as part of my certification was to, you know, get board certified in clinical informatics.
29:27 That's a discipline by itself that, you know, involves a lot of, you know, it's a very broad field in terms of, you know, informatics and healthcare.
29:35 And then one of the buckets there is, you know, software development.
29:38 And so I was, you know, I, I was quite interested sort of in, in that field.
29:43 And so most of my time in terms of being able to, you know, devote to, you know, finding new packages or trying to, you know, write up an application that could solve a problem or coming up with prototypes,
29:57 it was done in a way that sort of aligned with the work I was doing.
30:01 And so there would be days when I'm on clinical service where I'm mostly, you know, working on, you know, patient care related matters.
30:09 So those weeks would be, you know, obviously very busy.
30:12 I would have, you know, I would wake up at like extremely early in the morning, spend the first two hours, four to six AM just, you know, working on this.
30:19 And then I get back to like, you know, the clinical world.
30:21 And then there will be weeks when I'm off clinical service.
30:23 So I, you know, I'm not responsible for any patient care related work.
30:27 And those weeks would be where I would spend time in terms of, you know, doing these, you know, investigating into sort of some of these packages and, you know, coming up with new ideas, exploring what is all, you know, what is available in terms of certain problems that I was solving.
30:42 And, you know, that time sort of, you know, my quote protected time professionally was spent in that.
30:48 And so that would be, you know, maybe a week spent into like, Hey, we are trying to look into this variant annotation tool.
30:54 And then we'd want to, you know, write wrappers around it.
30:57 So it becomes easy for, you know, our lab's operation to be able to use that.
31:02 And so, so kind of that, that's how it works.
31:04 So either those early mornings or, you know, the weeks I'm off clinical service is how that works.
31:11 Yeah.
31:12 So, let's get started.
31:14 This portion of Talk Python to Me is brought to you by Posit, the makers of Shiny, formerly RStudio, and especially Shiny for Python.
31:21 Let me ask you a question.
31:23 Are you building awesome things?
31:25 Of course you are.
31:26 You're a developer or data scientist.
31:27 That's what we do.
31:28 And you should check out Posit Connect.
31:30 Posit Connect is a way for you to publish, share, and deploy all the data products that you're building using Python.
31:37 People ask me the same question all the time.
31:40 Michael, I have some cool data science project or notebook that I built.
31:44 How do I share it with my users, stakeholders, teammates?
31:47 Do I need to learn FastAPI or Flask or maybe Vue or ReactJS?
31:52 Hold on now.
31:53 Now, those are cool technologies and I'm sure you'd benefit from them, but maybe stay focused on the data project.
31:58 Let Posit Connect handle that side of things.
32:00 With Posit Connect, you can rapidly and securely deploy the things you build in Python.
32:05 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.
32:12 Posit Connect supports all of them.
32:14 And Posit Connect comes with all the bells and whistles to satisfy IT and other enterprise requirements.
32:20 Make deployment the easiest step in your workflow with Posit Connect.
32:24 For a limited time, you can try Posit Connect for free for three months by going to talkpython.fm/posit.
32:30 That's talkpython.fm/posit.
32:33 The link is in your podcast player show notes.
32:36 Thank you to the team at Posit for supporting Talk Python.
32:39 Is it changing fast?
32:41 So I'll give you an analogy that you could tell me about your space.
32:45 So on one hand, in Python web world, you mentioned Flask and Django.
32:49 You know, Flask and Django, while they are evolving, are kind of, they're kind of the way they have been and they're pretty stable.
32:56 And if you learn Flask five years ago, you're still good to use Flask today.
32:59 Yeah.
33:00 Or is it more like FastAPI, Pydantic, msgspec, just there's something new all the time that you got to keep learning to bring in.
33:10 Are there a ton of new packages just coming online or is there a set of really solid ones?
33:15 So I think it's both yes and no.
33:20 And so it depends on what, you know, what area we're working on.
33:23 So right now in the clinical lab that I'm directing here, you know, when I came here in 2020, it was, you know, when we started off from scratch.
33:32 So essentially the idea was to be able to bring up a pediatric cancer sequencing infrastructure that was not available.
33:39 And so it was ground up from, you know, the lab to personnel, to space, to computation, and everything.
33:45 And so we kind of have this sort of, you know, two big bubbles in that operation from an informatics perspective.
33:51 One of them is the, you know, we essentially are in the process of developing our custom lab information system.
33:59 That's essentially a web app.
34:01 And so we have that space and the other space is bioinformatics.
34:04 And so bioinformatics is a lot of the custom scripting or the applications we develop is Python based.
34:11 Some of them we do with the Golang, you know, when we need a little bit of performance aspect.
34:17 And then the other aspect is the web app.
34:20 So from a web app perspective, when we, when I started here, we actually started, we use FastAPI.
34:26 So that's kind of, you know, the idea was that, well, since you're starting from scratch, and I came to know about FastAPI at that point of time, the whole thing was about, you know, async and await.
34:37 I was, I was pretty much sold on that aspect.
34:40 And then, you know, I think the whole tool made a lot of sense.
34:43 I'm like, okay, well, this is, this is perfect time to be able to, I think when I started FastAPI was, you know, 0.5 or 0.6.
34:51 And so now obviously, you can see a lot of change happening there.
34:54 So, yeah, that definitely is a lot of, you know, fast pace.
34:57 And so we kind of do catching up in a sense where it has to be done in a, you know, in a careful way.
35:04 The reason is because from, you know, as compared to more traditional research lab testing where, you know, at the end, really, you know, there's a lot of discovery, there's a lot of excitement.
35:15 And at the end, it all translates into being sort of, you know, the data is presented at a conference or you publish that as a manuscript.
35:23 And that's the end point.
35:24 So if you move off from version one to version three of an algorithm, you know, you have to obviously make sure that your research, everything is reproducible, but beyond that is not a problem.
35:34 But when we're talking about the same thing in context of a clinical care for a patient, the room for error is very, very little.
35:41 You can't make mistakes.
35:42 And so the entire space of clinical testing is very regulated in that sense, because there are a lot of requirements you have to meet for any change that happens in your pipeline.
35:54 Say you're using, you know, some version of an application.
35:57 Now you upgrade to a newer version.
35:58 You have to demonstrate that the analytical performance in terms of sensitivity and specificity for that pipeline didn't change.
36:05 And so a lot of work is needed when you go, you know, do like a version upgrade.
36:09 So we keep those things very controlled and, you know, and careful versus some other things which are more in the R&D space.
36:17 There's a little bit more room to play around with tools.
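As an illustration of the check described above (showing that sensitivity and specificity did not drop after a version upgrade), here is a minimal Python sketch. The site identifiers, the `score` and `upgrade_is_acceptable` helpers, and the zero-tolerance default are all hypothetical; this is not the lab's actual validation procedure, only one way such a comparison could be expressed.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # truth variants that the pipeline called
    fp: int  # calls with no matching truth variant
    tn: int  # assessed sites with no variant and no call
    fn: int  # truth variants that the pipeline missed

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


def score(calls: set[str], truth: set[str], sites: set[str]) -> Confusion:
    """Compare pipeline calls against a truth set over all assessed sites."""
    return Confusion(
        tp=len(calls & truth),
        fp=len(calls - truth),
        tn=len(sites - truth - calls),
        fn=len(truth - calls),
    )


def upgrade_is_acceptable(old: Confusion, new: Confusion,
                          tolerance: float = 0.0) -> bool:
    """True if neither metric dropped by more than the tolerance."""
    return (new.sensitivity >= old.sensitivity - tolerance
            and new.specificity >= old.specificity - tolerance)


# Hypothetical data: ten assessed sites, four of which carry a real variant.
sites = {f"s{i}" for i in range(1, 11)}
truth = {"s1", "s2", "s3", "s4"}
old = score({"s1", "s2", "s3", "s5"}, truth, sites)
new = score({"s1", "s2", "s3", "s4", "s5"}, truth, sites)
print(old.sensitivity, new.sensitivity, upgrade_is_acceptable(old, new))
```

In a real validation the truth set would come from well-characterized reference samples, and the acceptance criteria would be defined up front by the lab director.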
36:21 Right. Yeah.
36:22 Chris in the audience was asking a great question about, basically, is it more exploratory and you just move really fast and don't really worry about tests and stuff like that.
36:29 It sounds like this is more of a production type thing.
36:32 Like, if you're going to run it over and over and it gives a different answer at some point for testing for a disease or something, that's really bad.
36:39 You need it to be right all the time.
36:41 And so.
36:42 Yes.
36:43 So the rule is when we do, when we do new test developments or we bring a new algorithm, obviously that part of which we refer to, there's a formal term that we use in lab medicine.
36:52 It's something called familiarization and optimization, or the O&F phase.
36:56 That's where, you know, there's a lot of flexibility in new tools, new version, trying out different things.
37:01 But once it moves from that into the validation phase and then once we deploy the application, once the deployment is there, it's a production application.
37:08 We don't touch it unless something really has to be tinkered with or there's a, you know, bug that we have to fix.
37:14 Who's in charge of running those apps?
37:16 Is that people on your team and your lab or is that the hospital or how?
37:21 Yeah.
37:22 So the way it is set up here is, when I started off, I was the n of one.
37:27 So I started off with the FastAPI application, and I had to build up the, you know, bioinformatics pipeline that I had initially authored.
37:36 But then when we went through the validation phase, I luckily had, you know, two people on staff who kind of were handling the bioinformatics on the front end.
37:45 And then eventually we had a third person who joined the team.
37:48 So then they were kind of, you know, helping me out with a lot of the actual groundwork of, you know, writing the code, getting tests done.
37:57 You know, going through the validation data, summarizing that for me. Being a lab director, it is my responsibility ultimately to sign off on all those things.
38:06 So we're going to say, hey, okay, this is the validation and this is what is being demonstrated that, you know, your package or your pipeline or whatever you're working on demonstrates this level of sensitivity.
38:16 Then yes, I being a lab director say that, yes, this is working.
38:19 And once that happens, so we then deploy those applications in production.
38:23 You know, we use GitHub and the usual dev test prod cycle.
38:28 And so that's kind of how it works.
38:30 Well, do you have your own hardware, or do you run stuff on, like, DigitalOcean or AWS?
38:35 So with healthcare data, there is generally a little bit of angst with, you know, data sitting on the cloud outside of the institution.
38:45 I would say the institution that I work at is, you know, quite forward thinking in being able to use modern technology.
38:56 So what we started off with, since everything was being built up from scratch, was the decision to keep things on-prem for the beginning.
39:07 But we also kept in mind that at some point of time, you know, if the institution decides that, oh, we're going to switch our infrastructure to using, you know, AWS or Azure or whatever the platform is going to be, that we want it to be ready.
39:21 And so the way we had it set up and this is due to our amazing IS team here in, you know, at our institution.
39:32 So we had our own hardware that we got in terms of the actual servers.
39:38 And I had, we collaborated with the IS team to be able to help us build our Kubernetes infrastructure.
39:45 So we have a test and a prod Kubernetes cluster, and then all our apps and the bioinformatics pipeline.
39:55 Well, the apps for now, and the bioinformatics pipeline that we're looking forward, you know, in the near future to get deployed on these things.
40:02 As a matter of fact, our dev team does a lot of the development on Kubernetes as well.
40:08 And then we keep moving all these things as containerized applications.
40:12 That's excellent.
40:13 So really embracing containers and Docker and Kubernetes and that should make it super easy to move to wherever you want to go.
40:19 Right?
40:20 Anything that can run Kubernetes, you just push, push to that and you're good to go.
40:23 Right.
40:24 Right.
40:25 I mean, that's, it's, you know, it's, it's a little bit difficult to start with, but I think once we are in that, you know, stream, it is, it is much less effort to move things around.
40:34 Yeah. Last year I rewrote all of our servers and APIs and condensed six to eight servers all into one, just Docker cluster.
40:43 And it was a great decision.
40:44 But to me, it was also a little intimidating, like,
40:46 well, here's one more layer I have to manage and understand.
40:50 And if something goes wrong there, then everything else still breaks.
40:53 And, but having it set up is really nice once you get used to it.
40:56 Yes.
40:57 Right.
40:58 All right. Let's talk a bit about, well, I have a question for you.
41:00 I want to talk about some of these packages.
41:02 Yeah.
41:03 That you've been saying are, like, a lot of the reasons you chose Python and use it a lot.
41:07 That's just great.
41:08 But before we get there, like, I got the biopython.org website pulled up.
41:12 And the very first line is: Biopython.
41:15 It's a set of freely available tools.
41:17 You know, open source, freely available.
41:20 Right.
41:21 So what does that matter to you guys?
41:22 On one hand, you have a ton of money being in the medical space.
41:27 It's really high stakes.
41:28 So paying for commercial software or commercial libraries is probably not the biggest worry.
41:34 On the other hand, open source is really nice.
41:37 And being able to look inside is really nice.
41:39 Right.
41:40 So it means you don't have to deal with getting permission.
41:43 Right.
41:44 How does that fit into your world?
41:45 I know how it fits into like small startups and things like that.
41:47 But for a hospital, for example, what is free and open source mean to you guys?
41:52 I think it does have a lot of impact in terms of how we end up working and setting up these
42:01 things.
42:02 And obviously, I would, whatever I'm speaking is representing, you know, what it means from
42:08 sort of an operational standpoint.
42:09 You know, when we talk about molecular pathology, generally being able to bring up a clinical service
42:15 like that is a huge investment.
42:19 And so a lot of the investment is, and this is generally applicable to, you know, any institution
42:24 where something like this has been set up for patient care or clinical use.
42:31 The investment is primarily in a lot of the instrumentation and the reagents that we use are generally quite
42:42 expensive, which is sort of, you know, the, I would say when we talk about, you know, what is the
42:48 cost of a test when it is offered.
42:51 That cost factors in a lot of these, you know, operational costs that we need to, you know, buy
42:56 these expensive sequencing instruments, you know, the reagents that are used as consumables as
43:01 we do the tests over and over again, every, you know, every week, every month.
43:05 So from that standpoint, traditionally, the way things have been designed is, you know, I would say 10 years back when we were, you know, work with our finance team to say, okay, the cost of the test is going to be so and so based on all of these different inputs.
43:23 And so, you know, 10 years back, computation, bioinformatics, you know, all of these were not factored in.
43:30 But now we are in that era where, you know, using GPUs on a regular basis to do, I would not say simple, but routine work to get from the raw sequence data to identifying genomic variants,
43:46 that's getting common, you know, using FPGAs, using large clusters to perform these tasks.
43:52 And so now we are starting to see those costs getting in as part of the ultimate, you know, cost that goes to the patient for a test.
43:59 And so we try to minimize those things.
44:01 But one of the ways to minimize those things is to choose between free open source versus something which is a commercial product.
44:07 And it's always a balance between the reliability and the service that you're able to get back saying, hey, you know, something breaks down.
44:14 We know, you know, there's a, there's an SLA, there's a certain, you know, assurance that this thing is going to have help versus open source free would be where we feel very confident in the code base.
44:27 We sort of, you know, sometimes what happens is when we, you know, when we kind of use some of these open source tools, we end up almost invariably having some wrapper around it to change things or being able to have some insight into the source code.
44:41 So it depends on that balance, you know, what we choose.
44:44 Yeah. How often do you fork it and use your self-maintained version versus just run what is publicly on PyPI and then maybe wrap it to orchestrate it a bit?
44:55 So I would say for the web application part of it, we don't really do a lot of forking. We kind of go with what it is. The only thing what we do is since we have the luxury of using a combination of GitHub and containers, and knowing the fact that the regulatory requirements require that you tightly version control all these things with, you know, history and all those things. We tend to, when we are developing these things and we are validating it before we do that, we try to stick to a fairly stable version.
45:24 So for example, things like, you know, beta or release candidates, we try to stay away from that, even if they have some desirable features, but unless we see like a full production, you know, version of that, we don't tend to switch to it. So we keep things like that without maintaining or without forking or kind of, you know, making modifications.
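The policy described above (avoid betas and release candidates, pin only full releases) can be sketched as a small check over pinned versions. This is a deliberately conservative illustration, not a full PEP 440 parser: anything that is not plain dotted numbers (optionally with a `.postN` suffix) is treated as unstable. The package names and versions in the example are hypothetical.

```python
import re

# Plain dotted release numbers, optionally followed by a post-release suffix.
_FINAL_RELEASE = re.compile(r"^\d+(\.\d+)*(\.post\d+)?$")


def is_final_release(version: str) -> bool:
    """True for versions like 1.4.2; False for 1.4.2rc1, 2.0b3, 1.0.dev5."""
    return _FINAL_RELEASE.match(version.strip()) is not None


def unstable_pins(pins: dict[str, str]) -> list[str]:
    """Return the names of packages pinned to something other than a final release."""
    return sorted(name for name, ver in pins.items() if not is_final_release(ver))


# Hypothetical pins for a validated container image.
pins = {"webframework": "0.110.0", "aligner-wrapper": "2.0.0rc1", "plots": "1.3b2"}
print(unstable_pins(pins))
```

A check like this could run in CI before a container image is promoted from dev to test, so that a pre-release never reaches the validation phase by accident.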
45:42 When we get into more of the bioinformatic stuff where we are actually trying to use an algorithm to solve a particular piece of part of the pipeline that is doing some data transformation, it depends on how much we want to change or modify.
45:58 That's when we sometimes fork it. Sometimes we fork where we know that, and this is the unfortunate reality in many of the scenarios where you have great open source tools, but after some time, due to whatever financial or business or other reasons that they stop maintaining.
46:13 And so essentially, we kind of get into this freeze mode. We tend to fork that so that at least we have that available. And then if we make any changes that we kind of keep it to that fork.
46:23 But generally, I would say it's probably in the 80-20 where 20% is where we fork it, make some change. Most of the times we try not to do that.
46:34 But yes, free open source tools have a big impact. A lot of the tools that we use as part of our bioinformatics pipelines are, as a matter of fact, commonly used in the molecular pathology community to build these pipelines.
46:48 They tend to use a lot of open source tools. And the reason for that is, for example, it's not written in Python, but we have an algorithm called BWA. It's written by Professor Heng Li.
46:59 That's one of the algorithms that is almost like a de facto, I would say, when it comes to doing sequence alignment. That's a part of the pipeline. And so it's a tried and tested application for more than a decade now.
47:14 So this is a fairly stable algorithm, or application, and it's well maintained from an open source perspective. So we highly rely on those. But, you know, there is this whole ecosystem of software that comes under this rubric term of variant calling, where we're trying to identify these different variants.
47:34 there's a whole bunch of those and some are fairly well maintained they are open
47:41 source sometimes depending on the context you're using you need a license
47:47 if it is used in a commercial setting you don't need a license if it is in an academic setting like for example
47:53 when we do clinical testing in an institution such as where I am right now
47:59 that's an academic institution so typically it's not for profit and so obviously we
48:05 don't need licenses for that use but once it goes into a pure commercial space
48:11 where if the lab is doing all of this testing for profit then there's a license requirement
48:16 so we see a combination of these things showing up it's actually becoming
48:23 more common now with open source tools, at least in the genomics bioinformatics space.
48:29 Yeah, excellent. I think another benefit, probably for you guys, I know it's a benefit for a lot of organizations,
48:36 is if you use the open source tools and you need to hire somebody new there's a good chance that they have experience
48:43 already with those tools whereas if you use something private expensive you know you might have to teach them from scratch
48:49 what the thing is right yes that is that is correct as a matter of fact it's fortunate that a lot of the people
48:56 who've done a lot of good work and have contributed to the genomics bioinformatics space
49:01 have you know the general tendency is whenever we are setting up any sort of
49:08 you know pipelines for you know DNA sequencing RNA sequencing or you know more
49:13 from a research perspective you know methylation sequencing single cell RNA-seq
49:17 you know UMI based error corrected there's a lot of there's a very thriving
49:25 open source space and so that really helps with people who come in even if they're
49:31 you know not familiar with these tools it's easy to get familiar with because there's a lot of
49:35 you know community backing that up or as you said when we have when we hire people
49:41 who already are coming from you know from a different lab or they've had some experience
49:45 but you know they come and say oh yeah we know how to do you know alignment
49:48 or I'm aware of these applications that use that it's it is much much easier
49:52 from a learning curve perspective rather than having to you know now open up a manual
49:56 and this is the you know proprietary thing that only works here and yes exactly
50:02 exactly cool all right well we coordinated a bit on a list of packages that
50:07 you've used in your lab or find really helpful for your work and we could touch on those
50:12 just a little bit. Yeah. Yeah. So, CNVkit: genome-wide copy number from high-throughput sequencing.
50:19 I don't know what that means, but tell us about it. Yeah, absolutely. So,
50:22 CNV, or copy number variation, this is a type of genomic aberration
50:28 where what happens is in a simplistic way at least when when we talk about
50:33 cancer the cancer cells sometimes for it to be able to survive it tries to use
50:40 different ways of doing that biologically one of the way to do that is certain genes
50:44 that help a cell to grow in the absence of nutrients or with very little
50:49 nutrition is certain genes if it has more copies of those genes than normal
50:54 then it will oh okay you know it's like you have more money than expected
50:58 you can do a lot of things so typically what happens is in a normal human genome
51:03 any cell that we pick up it will only have two copies of the gene one is coming from
51:07 your mom, one is coming from your dad. In cancer, what happens is, in certain
51:10 scenarios if a gene that helps with growth of the cell or it helps the cell
51:16 to survive even without signal or nutrition if it has more copies of that
51:20 it will make 6, 8, 20, 50 copies it can survive versus there are certain
51:26 scenarios where if there is a gene that is supposed to regulate the cell
51:29 so it doesn't go haywire if the cancer is able to delete one of the cells
51:33 one of the genes then you only have one gene left you knock out the thing
51:36 and then that protective mechanism is gone so then the cancer cell can easily survive
51:40 so what happens is with CNV or copy number variations the idea is that we use the
51:46 high throughput sequencing data to be able to infer how many copies of these genes
51:51 do we have is it more than two is it less than two and so this particular
51:55 package it's a very very well established well maintained package in the community
52:02 that essentially does this thing is you give it the sequencing data and define
52:08 the regions of the genome that you're interested in you can also provide
52:13 like names for the regions like you know this region is you know gene BRAF
52:18 or this is EGFR, or whatever you're interested in, and then what it'll do is it'll do all the analysis
52:23 to be able to tell you that okay well when we are comparing this particular tumor
52:27 against this reference set of you know 20 normal samples where we know that
52:32 you should only have two copies of the gene in this particular tumor we are seeing
52:36 there are 50 copies of the gene so it gives you kind of an output data that
52:41 numerically can tell you that. You know, what it does is it does a log2-based
52:48 you know, transformation of the ratio, saying that, okay, after all this computation,
52:52 when I compare to the normal this is 50 times more this is 20 times more
52:57 the expected copy or it is you know half of the amount of copy we need in terms of deletions
53:02 so that's what really it does and it's written in Python it uses a lot of
53:09 you know Python it has Python dependencies that use that have been written
53:13 in sort of you know in either C or like you know Python C bindings but at the end
53:20 it gives you that data and it has an internal visualization tool but I was not very happy
53:27 with you know how it was written so I ended up writing a wrapper which is called
53:32 CNA plotter. It's open source. It essentially uses the end data from CNVkit
53:39 and then it gives you a nice visualization of the copy number so I think if you go
53:43 down, if you scroll down, there are example images. Yeah, you have this on GitHub,
53:47 so people can if people want to use this it's right there right yep it's right there
53:52 yeah so I think at the very bottom of the images the screenshots there oh yeah
53:57 yeah right here yeah so for example the first image over here you can see this
54:02 you know it's a thin band of all these multicolor things and each one of them
54:06 is a single human chromosome so chromosome 1 2, 3, 4 so on and so forth and it
54:12 you know, if you look in the image, the Y scale essentially is log2,
54:18 which is 0, and going up it is 1, 2, 3, and then it's a negative scale on the lower side,
54:23 so anything going above 0 means you have more copies than 2 going below is less
54:29 you know less than 2 copies and so if you see here in this example the plot here
54:33 you see the very end which says chromosome X is a single you know the band over here
54:39 is lower at negative 1 that means this is a male patient with a single X chromosome
54:44 as compared to females who have 2 you know 2 X chromosomes and so when we look at this
54:49 plot below here this is actually a plot from a you know a cell line a tumor cell line
54:55 that is abnormal and here we see there are 2 genes which are amplified one of them
55:00 is a gene known as TERT and the other gene is MDM2 so these 2 genes are again
55:07 one of those examples where it gives the tumor survival advantage over there
55:11 and so you can see here there are multiple copies of these genes as compared to
55:14 you know the baseline over here I see so that might predict something like
55:18 how survivable the cancer is, right? So is it going to be localized, say, where it happened,
55:25 or it's going to like spread to other parts of the body or be difficult to treat
55:29 or be resistant to treatment yes so if this is you you want higher numbers
55:34 not lower numbers it all depends right I mean certain genes are good genes
55:38 for example if there are certain checkpoint genes if those numbers you know
55:43 if they have lower numbers you want to have two copies of them because if that
55:46 protective mechanism is gone you know the tumor becomes very aggressive again depending on
55:51 the tumor so it is all into context so if you're looking at the good genes
55:55 you want to have two copies of the good gene if you're looking at some of the bad genes
55:59 you don't want to have more than two copies of the bad genes one or zero is better
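The log2 scale described for the copy number plot can be made concrete with a little arithmetic. The sketch below assumes a normal baseline of two copies, as stated in the conversation; the gain/loss thresholds and function names are hypothetical and are not CNVkit's API or defaults. Real tumor samples also contain normal cells, so observed ratios are usually less extreme than this idealized conversion suggests.

```python
import math

NORMAL_COPIES = 2  # one copy from each parent


def log2_ratio(copies: float, normal: float = NORMAL_COPIES) -> float:
    """Log2 of observed copies relative to the normal baseline."""
    return math.log2(copies / normal)


def copies_from_log2(ratio: float, normal: float = NORMAL_COPIES) -> float:
    """Invert the transformation: estimated copies for a given log2 ratio."""
    return normal * 2 ** ratio


def classify(ratio: float, gain: float = 0.3, loss: float = -0.3) -> str:
    """Label a region using illustrative (hypothetical) thresholds."""
    if ratio >= gain:
        return "gain"
    if ratio <= loss:
        return "loss"
    return "neutral"


print(copies_from_log2(0.0))    # baseline: two copies
print(copies_from_log2(-1.0))   # a single copy, e.g. chromosome X in a male patient
print(round(log2_ratio(50), 2), classify(log2_ratio(50)))  # a 50-copy amplification
```

Note that 50 copies is 25 times the normal two, so it sits at about 4.64 on the log2 axis rather than at 50.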
56:02 I got it I got it okay okay HGVS yes this is again a wonderful package that was initially
56:11 I think it was started by a person named Reece Hart. I think he still
56:16 maintains it but there's a lot of like you know it's a very well publicly maintained
56:20 open source package there's a lot of you know community involvement in that as well
56:27 so what HGVS is it's a nomenclature system for you know giving a name to all these variations
56:35 so when you talk about I'm not sure Michael if you've heard about the term mutation
56:39 so mutation is a very commonly used term that refers to some kind of abnormality
56:44 in the genome in this case so what happens is there are these standards that are
56:51 that you know most clinical labs follow when they're putting all of this
56:55 information in the patient reports saying okay you know this particular tumor has
56:59 you know mutation in BRAF a mutation in EGFR some other gene and there is a certain
57:05 way that those mutations are described in terms of what sequence alterations
57:09 happening say at the mRNA level and what sequence alterations are happening at the
57:14 protein level so now in your protein you know you're missing these amino acids
57:18 or you have excess of these amino acids or something got switched from here to there
57:22 so there's a formal way of defining that and the guidelines of the group
57:27 that defines that is referred to as HGVS, the Human Genome Variation Society. And so it's a
57:32 very complicated process where you have to do all these translations from
57:36 the you know the genomic scale where the numbering system starts from one to like
57:41 and you know whatever the length of your chromosome is in terms of ATGCs
57:45 and each chromosome has a different number and if you have a certain alteration that
57:49 is happening say in chromosome 7 at this particular position then you have
57:54 to translate that to the mRNA of that gene and then the protein of that gene
57:57 so it's a lot of math a lot of strings involved in that process and so essentially
58:02 this HGVS Python package provides all of that functionality as a wrapper. You can
58:07 create your translation you can essentially project the variant from your
58:13 genomic to the you know the mRNA to a protein level or vice versa you can
58:18 validate things so we ended up I actually wrote a paper about this when we
58:24 did a validation of how well this particular package works and so now in the lab
58:30 that I'm currently in, we implement this thing for generating those nomenclatures.
58:34 so what happens is when we put a report out in the patient's chart and when
58:39 say, our oncologist who's treating the patient, they want to know what did you
58:46 identify in this tumor genome they would read that nomenclature saying oh
58:49 okay, this particular change in this BRAF gene, this is significant. I know that there
58:54 are therapies that are out there that we can use to treat this patient tumor
58:58 so that's what this nomenclature system is about so it's a very automated
59:03 system yeah and it normalizes it if there's multiple ways to represent it
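To give a feel for what an HGVS expression looks like, here is a minimal Python sketch that parses only the simplest case: a single-base substitution on a versioned reference sequence. It is not the API of the hgvs Python package discussed above, which handles far more (projection between genomic, mRNA, and protein coordinates, normalization, validation). The `SimpleVariant` type and helper names are hypothetical. The example string is the commonly cited way of writing the BRAF V600E substitution at the coding DNA level.

```python
import re
from typing import NamedTuple


class SimpleVariant(NamedTuple):
    accession: str  # versioned reference sequence, e.g. NM_004333.4
    kind: str       # "c" coding DNA, "g" genomic, "n" non-coding, "m" mitochondrial
    position: int
    ref: str
    alt: str


# Only matches single-base substitutions such as NM_004333.4:c.1799T>A.
_SUBSTITUTION = re.compile(
    r"^(?P<acc>[A-Z]{2}_\d+\.\d+):(?P<kind>[cgmn])\."
    r"(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(expr: str) -> SimpleVariant:
    """Parse a simple substitution; raise ValueError for anything else."""
    m = _SUBSTITUTION.match(expr)
    if m is None:
        raise ValueError(f"not a simple HGVS substitution: {expr!r}")
    return SimpleVariant(m["acc"], m["kind"], int(m["pos"]), m["ref"], m["alt"])


def format_substitution(v: SimpleVariant) -> str:
    """Write the variant back out in the same notation."""
    return f"{v.accession}:{v.kind}.{v.position}{v.ref}>{v.alt}"


variant = parse_substitution("NM_004333.4:c.1799T>A")
print(variant)
print(format_substitution(variant))
```

Insertions, deletions, duplications, and protein-level names such as `p.Val600Glu` need a real grammar, which is the job the dedicated package takes on.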
59:08 Very nice. All right, this one I'm familiar with: openpyxl. Yeah, I guess
59:15 you probably have a lot of data that either comes from or goes gets shared
59:19 out into excel right yes so what we do is we sort of are right now in our lab
59:24 we're kind of in this sort of you know kind of an interim phase where we
59:31 sometimes use excel to look at some data so traditionally speaking before
59:38 you know typically any lab that goes from you know zero to the point where you
59:44 have a web application that automates everything the intermediate phase is
59:47 using a lot of excel so it's very common in many labs to use excel for a lot
59:53 of things: for QC, for charts, for tracking. So we use openpyxl for a few
01:00:03 things one of them is when we have a lot of you know the sequencing data that we
01:00:07 have to summarize and then generate a qc to be able to present that to essentially create
01:00:14 an excel document on the fly from the backend to provide that you know whatever data
01:00:18 they want to look at in terms of statistics or you know list of variants or
01:00:22 some form of you know calculation they want to do further that's where we use
01:00:26 this package. Typically we use it as part of our bioinformatics pipeline when
01:00:31 we have to generate those things but it's a very handy tool we actually use something
01:00:35 similar and I'm forgetting the name of the package that is used to generate
01:00:41 our document like we use some word documents for creating reports but we also use
01:00:46 python there to be able to summarize a lot of these data points and then create
01:00:49 a word document that you know it starts with a template of a word document and
01:00:53 then use python to fill up all these you know right here's where the graph
01:00:58 goes here's where the summary goes here's where the detected whatever it
01:01:01 goes yeah right yeah cool are you here's two things that overlap are you
01:01:06 familiar with this thing where scientists rename human genes to stop
01:01:11 excel from misreading oh yes yes absolutely oh my gosh this is crazy yes yes
01:01:22 it happens when we import a lot of this data coming from somewhere we'll
01:01:25 see entries like september 14th or you know march 19th these are not yeah this is a
01:01:33 big problem going in and out of excel and so as much as you can do in python or
01:01:37 any proper programming language rather than using excel but there was one
01:01:41 that was m-a-r-c-h-1 or march one yes or s-e-p-t-1 it's it's very funny some well
01:01:51 some of the gene names are funny but then excel you know gets it to the next level when it
01:01:56 changes the names this doesn't make any sense yeah yeah it doesn't make any sense
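A minimal sketch of one way to guard against the problem just described: before gene lists go into or come out of a spreadsheet, flag the symbols that look like dates. The function name `flag_date_like_symbols` and the regular expression are assumptions made for illustration; they are not from any tool mentioned in the episode, and the pattern only approximates what a spreadsheet may actually convert.

```python
import re

# Assumed, approximate pattern: a month name or abbreviation followed by
# one or two digits (for example MARCH1 or SEPT1). Real spreadsheet
# behaviour may be broader or narrower than this.
DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)\d{1,2}$",
    re.IGNORECASE,
)


def flag_date_like_symbols(symbols):
    """Return the gene symbols that a spreadsheet could mistake for dates."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


if __name__ == "__main__":
    genes = ["BRAF", "MARCH1", "SEPT1", "TP53", "DEC1"]
    # Review or quote the flagged symbols before exporting the list.
    print(flag_date_like_symbols(genes))
```

Running a check like this on the Python side, before the hand-off to Excel, keeps the gene names in a form where they can still be reviewed as text.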
01:02:02 yeah all right on to the next one hera yes hera this is very interesting
01:02:07 so this is where i think um you know where in our instance we are going away from
01:02:15 standard web applications standard bioinformatics pipeline to really touching devops
01:02:19 using python and so one of the things that um typically uh we get to the point
01:02:25 when we scale up our bioinformatics pipeline where we have multiple samples and
01:02:28 multiple runs and everything needs to be orchestrated in a way where you
01:02:32 have uh you know while you're running your pipeline you have a lot of visibility into
01:02:36 how it works and so this is uh one of our um uh projects we are working on
01:02:41 to move our current bioinformatics pipeline the way it works you know kind of on a single
01:02:46 server to be able to use um the kubernetes cluster to actually deploy the uh
01:02:51 the long running pipelines onto that and so there are many options you know there are
01:02:56 a more standard sort of you know uh WDL-based uh you know kind of you know
01:03:02 protocols that you can use uh to run on either cloud or hpc environments there
01:03:07 is a there's a very popular tool called nextflow that is used to be able to you know
01:03:13 kind of create your data analysis pipe flow you can sort of define that and then use any
01:03:18 backend to deploy it um one of the things that we kind of when i was exploring this space
01:03:23 one thing that one of the things i came across was um uh you know the the the
01:03:28 the whole sort of ecosystem that argo maintains with uh you know argo workflow and argo
01:03:34 cicd and all those things so workflow was interesting because argo provides that
01:03:38 way where you can sort of you know write your pipelines in a yaml format
01:03:41 and then have it you know deployed on the kubernetes cluster it really is
01:03:45 very native to the kubernetes cluster interesting it sounds a little bit like
01:03:49 ansible but for specifically for um bio type of projects right yeah so argo so
01:03:57 interesting thing is you know argo workflows was set up
01:04:01 really with a lot of cicd automation in mind so you know yes you can run
01:04:06 data pipelines in general but never it was never at least in its description it
01:04:10 never describes use case sort of in bioinformatics or you know biology pipeline
01:04:15 analysis and similarly you know it it was like okay it's a generic tool you can
01:04:21 use it for whatever you want so i tried it out with using you know like a yaml
01:04:26 file and it was a simple four-step pipeline it was wonderful it was magical and the good
01:04:31 thing was uh with argo like the argo workflow when you install that on your
01:04:37 kubernetes cluster it comes with a native um web interface so you know i'm not
01:04:44 sure if you've heard about the workflow option with uh airflow so airflow
01:04:49 is a package that also you know there's a nice python SDK for that where you have
01:04:54 you know you deploy it on a kubernetes cluster we have all these amazing you know
01:04:58 visualizations to show what step you run or if there's some error there it'll do that argo does
01:05:03 the same thing so it has it has obviously the built-in capability to interact with it as an
01:05:07 api but then also there's a web interface that it'll deploy and it can have
01:05:11 visibility into every step of the process you can summarize and see the entire tree so that
01:05:16 was very interesting for us because we could get all that thing done in a
01:05:19 single thing in a single go but then uh our challenge was or well our desire was that if we
01:05:25 could integrate that with our LIMS system that we were working on using fast
01:05:29 api so the question was hey is there any python SDK and so that's where hera comes in so hera
01:05:34 essentially is an SDK or a wrapper that's essentially talking to argo but you
01:05:39 can define your pipeline steps as uh you know DAGs in python and so that makes
01:05:45 the process so super simple where you're now natively essentially we can
01:05:50 integrate that as a back end to our uh web application and so then it's almost like you know
01:05:56 it's python again from start to finish you're not you know getting out of
01:05:59 that and it's it's uh again it's a very well-maintained application so so we are
01:06:03 currently doing a validation to be able to make sure that or demonstrate that you
01:06:08 know it's equally performant when we compare to a more sort of uh native
01:06:13 shell based uh you know execution on the pipeline okay yeah this is new to
01:06:17 me i mean of course i know airflow but not hera cool all right insiM did i
01:06:22 grab the right one here uh package no i think it's it's similar uh let me uh see if
01:06:30 i can i can send you the link yeah throw it in the private chat here and i'll pull it up
01:06:35 yeah uh yeah okay there we go insiM yeah okay got it yeah so this is a this is a
01:06:46 very interesting space in uh next generation sequencing assay or for uh you know
01:06:52 high throughput sequencing assays so what happens is uh as i mentioned that one of
01:06:56 the things that's required for a clinical lab is to be able to perform a validation on
01:07:00 multiple samples of tumors that have certain mutations and then you can demonstrate that
01:07:05 you know yes the assay works because you have tested you know 100 samples that have
01:07:11 300 different you know mutations or genetic alterations and then you can demonstrate that yes your
01:07:16 pipeline or your assay was able to pick it up so you can say that you know your assay is you know
01:07:21 x percentage sensitive x percentage specific and you know your recall rate and things like
01:07:26 that so what happens is when you're trying to um get those samples that have these very difficult or
01:07:33 challenging variants to detect because they're just you know complex in how they occur
01:07:38 biologically in the cell is very difficult some of these are very rare there may be only two
01:07:43 samples in the entire world or so it's just not possible practically to you know get those samples
01:07:48 unless we wait for like you know 10 years to you know validate that so the idea is that if it is
01:07:53 possible to be able to use algorithms which can manipulate the existing sequence so for example
01:08:00 we have a sequence data from an existing real tumor sample but we can manipulate that in a way where
01:08:07 we introduce these mutations in silico so we can introduce you know SNVs or insertion
01:08:13 deletion mutations and then use that same file to then feed into our bioinformatics pipeline and
01:08:18 say okay run through the entire data pipeline and at the end let's see if we are able to identify
01:08:22 those variants that we inserted and so that's where this in silico mutagenesis comes in it's a
01:08:27 very hot topic it's a very relevant topic of interest to really fill this large gap in terms of
01:08:34 availability of rare variants and rare samples and how we can really improve sort of some of these rare
01:08:42 but very clinically significant edge cases where we don't want to miss those variants and actually see
01:08:46 those in real tumor samples and so this is a python package that was developed at the university of
01:08:53 chicago as part of their clinical lab and so what really it does is it will take in a list of different
01:09:00 you know mutations for example in this plot i think they give examples of
01:09:04 you know insertion deletion insertions where you have extra sequence or deletion where you have certain
01:09:09 segments which are missing or SNVs or single nucleotide variants where you have
01:09:14 one nucleotide that got switched with another one and so these are typically
01:09:17 what we see in practice in like real samples or real tumor samples but this is a
01:09:22 way to mimic that you know in a sample that does not have it and so this python
01:09:28 package is able to you know take that list from you say okay i have a list of
01:09:32 these 20 important mutations that i know from the public databases have been
01:09:37 reported but i want them to be inserted into my data set that was created from
01:09:42 say a set of three or four real tumors and then use that to challenge the pipeline
01:09:47 to say that hey can you still pick it up and so i see simulate these rare changes and then
01:09:53 yes test or exercise your setup right we've got a few more to cover but i
01:09:59 think we're getting a little bit short on time so let me just close this out with a
01:10:02 final question for you because i know this is the topic du jour what does uh what do ai and llms look like
01:10:11 for you guys is it does it matter is it really powerful is it super important
01:10:15 uh i mean genetics is kind of text data in a sense and so yes sort of in the space of
01:10:21 how it could apply right right it is it is uh it is a it is a text data and it's a lot of um
01:10:27 you know there's a when you talk about like a search space a lot of the search space is
01:10:32 very text based you know there is some numerical base but a lot of text based
01:10:36 search as well and i think uh across the entire spectrum from where we start with
01:10:41 very raw sequencing data to the point that we are trying to uh you know uh ask the
01:10:47 question that okay i found this rare or novel mutation in this particular gene what
01:10:52 does it mean you know what human has it been described what disease does it relate
01:10:56 to uh one of the things that we do as um molecular pathologists and this is sort of
01:11:02 where a lot of the medical work comes in is where we really go through a lot of the
01:11:06 medical literature what we have learned before new publications papers out there
01:11:10 that you know that have a lot of data in terms of you know studies that have been done on
01:11:14 this particular gene and they've described like okay these alterations actually activate
01:11:18 the gene or is bad for the tumor or you know makes it treatment resistant um so you can
01:11:23 see that naturally a lot of text search happens and so in that space uh we are seeing
01:11:28 in the i would say in the past you know three to four years there's been a lot of uh
01:11:32 applications of ai tools that have come out um you know particularly in the space of variant
01:11:38 calling where we have this genomic sequence data and we're trying to identify variants uh you know
01:11:43 one of the examples uh that's been talked about a lot is the uh deep variant caller it's called DeepVariant
01:11:49 from uh from the team at google who developed that uh that uses a lot of the ai techniques to be able to
01:11:55 you know pick those things up uh there are some genomic databases that um we use for in silico
01:12:01 prediction for example if you have a variant we have no idea about it it uses there's a database called
01:12:07 dbscsnv that uses random forest um techniques and i think it used another algorithm to predict if a
01:12:14 certain site where there's a mutation can enhance uh abnormal mechanism called splicing versus not
01:12:21 uh similarly there's a lot of tools that are coming in and the llms i think are i would say not mainstream but i think
01:12:28 there's a lot of interesting research that is coming around there where people are trying to use um
01:12:32 llms for doing these more broader text search saying that hey you know i have these you know i don't know thousand articles
01:12:40 and i want to find these particular combination of uh words that you know you know uh there's a combination of a disease and a mutation
01:12:50 and what do i get back on that um i personally tried you know chat gpt with
01:12:54 different you know like uh phrases and uh questions about it what i've seen so far is and this is purely my personal experience i think
01:13:03 a lot of it reads very real but when you start to look into the references as to what it references then you quickly figure out
01:13:11 right it's just making that up yeah this is not the real deal and so i think um i think it's uh you know i'm not a very pessimistic person i wouldn't say
01:13:22 oh no this is all garbage but i think uh there is opportunity there it's just how
01:13:26 do you train it uh maybe there's a a space or an opportunity and it probably
01:13:31 already has been people are pursuing this is training a smaller model but
01:13:35 it's really deep in genetics rather than trying to use a model that tries to understand
01:13:40 everything right right yeah it's more yeah more in the medical literature or the genomic literature
01:13:46 to be able to like meaning is enhanced in that yes so i think there's active work going on there
01:13:51 but it's uh yeah it's i think it's making you know a lot of a lot of interesting research a lot of
01:13:56 potential impact on how you know we do things and obviously the tool sets that we currently use
01:14:02 we might expect in the next 10 years to change yeah for sure all right final thought here people
01:14:07 are listening they're maybe doing similar work to you how do they get started what would you tell them
01:14:14 get going with python and some of these packages yeah i mean i would uh you know my uh reflecting on my
01:14:20 own experience sort of you know in a very uh winded way that i ended up here i think um you know
01:14:27 python is i i i feel programming in general and i think python particularly as uh programming language
01:14:34 is a very low you know sort of you know entry point in terms of being able to really quickly get things
01:14:40 done like learn it easily and get things done i think it should be to me anybody anybody who's trying to pursue
01:14:46 something in biology or computational biology or bioinformatics i think this is the first thing it's it's something easy to do
01:14:55 um to be you know i would say relatively easy to do to be able to get from that anybody with you know
01:15:00 um a desire to learn this has analytical thinking uh i mean i i think investing into python is probably
01:15:10 the best bet because you can pretty much do anything you want uh that's what i tell you know when i tell
01:15:15 train people you know in my lab or i talk to other students is that if you want to spend your time you have
01:15:20 very little time because you're busy you know with your other things uh i think the one thing that can get some
01:15:25 of the job done and be still aligned with what you're doing is python um and after that i think it's you know it's a
01:15:31 it's a lot of self-driven learning where you know you're kind of you know looking into things but the good thing is i think the the
01:15:38 the python community is wonderful it's almost like i sit down and i think about oh i have to solve this problem
01:15:44 um probably there are 50 other people who think about that and maybe two people have already worked
01:15:49 on it so it's right they've already published it to PyPI and you're good to go absolutely i i
01:15:54 totally agree with that and you know people should take the the couple of weeks get good at it and it'll
01:15:59 amplify it'll save you time definitely in the long run oh absolutely yes the only thing that i uh that i the only thing i would
01:16:08 say like an added thing is if somebody is learning python and then they do have an intention you know to
01:16:16 take it to the point where they would be involved in more serious um like you know application development
01:16:22 or maintaining an open source package or you know however they contribute to that i think learning a
01:16:26 little bit more like learning python in its real sense in terms of how to do it right you know there are
01:16:33 five ways of doing something correctly but i think uh there's one way that is consistent so that it's
01:16:39 again you know easily shared it's easily maintainable others can easily understand i think that would be
01:16:44 my second advice it is it takes a little bit time but i think it's well worth the effort to spend the time
01:16:49 writing you know idiomatic python code so it's um it's portable absolutely all right so mac thank you for
01:16:57 being on the show it's been great to get this look inside of what you all are doing with python
01:17:01 yeah thank you for having me on the show i appreciate that yep bye okay bye bye
01:17:05 this has been another episode of talk python to me thank you to our sponsors be sure to check out
01:17:11 what they're offering it really helps support the show take some stress out of your life get notified
01:17:16 immediately about errors and performance issues in your web or mobile applications with sentry
01:17:22 just visit talkpython.fm/sentry and get started for free and be sure to use the promo code
01:17:28 talkpython all one word this episode is sponsored by posit connect from the makers of shiny publish
01:17:35 share and deploy all of your data projects that you're creating using python streamlit dash shiny
01:17:40 bokeh fast api flask quarto reports dashboards and apis posit connect supports all of them try
01:17:48 posit connect for free by going to talkpython.fm/posit p-o-s-i-t want to level up your python
01:17:55 we have one of the largest catalogs of python video courses over at talk python our content ranges from
01:18:00 true beginners to deeply advanced topics like memory and async and best of all there's not a subscription
01:18:06 in sight check it out for yourself at training.talkpython.fm be sure to subscribe to the show open your
01:18:12 favorite podcast app and search for python we should be right at the top you can also find the itunes feed
01:18:18 at /itunes the google play feed at /play and the direct rss feed at /rss on talkpython.fm
01:18:25 we're live streaming most of our recordings these days if you want to be part of the show and have your
01:18:30 comments featured on the air be sure to subscribe to our youtube channel at talkpython.fm/youtube
01:18:36 this is your host michael kennedy thanks so much for listening i really appreciate it now get out there
01:18:41 and write some python code