
#470: Python in Medicine and Patient Care Transcript

Recorded on Sunday, Jun 23, 2024.

00:00 Python is special. It's used by the big tech companies, of course, but it's also used by

00:04 those you would rarely classify as developers. On this episode, we get a look inside how Python

00:10 is being used at a children's hospital to speed and improve patient care. We have Dr. Somak Roy

00:16 here to share how he's using Python in his day to day job to help kids get well a little bit faster.

00:23 This is Talk Python to Me, episode 470, recorded June 23rd, 2024.

00:30 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:46 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:51 both on mastodon.org. Keep up with the show and listen to over seven years of past episodes at

00:57 talkpython.fm. We've started streaming most of our episodes live on YouTube. Subscribe to our

01:03 YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of

01:09 that episode. This episode is brought to you by Sentry. Don't let those errors go unnoticed.

01:14 Use Sentry like we do here at Talk Python. Sign up at talkpython.fm/sentry. And it's brought to you

01:22 by Posit Connect from the makers of Shiny. Publish, share and deploy all of your data projects that

01:27 you're creating using Python. Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports,

01:34 Dashboards and APIs. Posit Connect supports all of them. Try Posit Connect for free by going to

01:40 talkpython.fm/posit. P-O-S-I-T. So, Mac, welcome to Talk Python. Awesome to have you here.

01:47 Hey, thank you, Michael, for the introduction. I'm happy to be here. Excited.

01:52 Yeah, I'm pretty excited to be talking about medicine and all the stuff that you guys are

01:58 doing with Python. And I really like these kinds of shows because I think it's important to

02:02 highlight that Python is not just for web developers and pure data science machine learning

02:08 people, but it's used by this huge spectrum of people doing all sorts of interesting stuff and

02:12 solving real problems with it. Right. And it sounds like you fall pretty solidly in that category.

02:16 Yeah, absolutely. It was like, you know, Python has been sort of this discovery as I've gone

02:22 through my career as a physician. And it's interesting how, to begin with, when I was training

02:27 and growing up, it was hard to imagine medicine and computer science

02:33 sort of being hand in hand together. But now I think things have progressed and there's a

02:37 lot of technology that's now in medicine that allows you to do all kinds of things. And of

02:43 course, as I've discovered Python, it brings in kind of the toolkit and the ability to be able to

02:52 achieve and solve problems in a way that I think has not been envisioned before. So it's a very

02:58 exciting time. Yeah, it is a very exciting time. And I think it's just getting better and better.

03:03 Before we get too far into this, tell people a quick bit about yourself, quick introduction.

03:07 Yeah, absolutely. So I'm Somak Roy. I am a molecular pathologist. It's a type of physician

03:15 who deals with looking at the genome of either a patient or patient's tissue. And we essentially

03:26 look at all of these things in a way to be able to help manage patient's treatment. In my current

03:35 position, I am an associate professor and the director of molecular pathology at Cincinnati

03:41 Children's Hospital. My lab is a clinical lab that is under the division of pathology. And

03:48 we do a lot of work that pertains to kids in terms of helping them diagnose and manage

03:57 pediatric cancer, as well as infectious diseases that happen in this age group.

04:04 Molecular pathology, I essentially trained back in India as a physician, did my MD,

04:13 and then came here, started in Pittsburgh, did my training in pathology and lab medicine. I specialized in molecular pathology. Then I was there in

04:23 Pittsburgh. I worked for some time and then moved to Cincinnati Children's.

04:26 Excellent. Do you work directly with patients or do you get samples sent to you from other

04:33 doctors and then you process them and analyze them?

04:36 Yeah, that's a good question. So I do not work with patients directly. It's a kind of,

04:41 it's a subspecialty in medicine where my lab works with the samples that have been collected

04:48 from the patient, either from the OR or a procedure or from the radiology suite. And then we work on

04:55 that tissue or the blood sample or a bone marrow sample that comes to us. And yes, then all the

05:02 testing that we perform is off from that specimen. And then once we generate the clinical reports

05:08 back, they go back to the patient's chart and to the patients, to the clinicians who are treating

05:15 and managing them. And that way it helps them get a diagnosis and then

05:21 give the appropriate treatment and management to the patient.

05:24 Yeah, excellent. So yeah, you must see a lot of different stuff flying through the lab you have

05:30 to analyze. So how did you go from, I'm studying medicine to I'm writing Python code and running

05:38 automation and what was that process like? Well, that was an interesting journey for me.

05:45 So before medicine and biology came into my life, I started off, it was second grade, I believe,

05:54 when my dad, he got me a computer at that time, which is a 64 kilobyte small machine.

06:02 I think it was a Toshiba MSX computer where you could write GW-BASIC code and

06:09 some basic predefined hex code and you can run small applications on that. That was my starting

06:17 point. It was super exciting for me. And I think from there on, the journey went to, as I went

06:24 through high school and then college, medicine was, I would say, biology was something that

06:34 intrigued me. And at the same time, I also got interested in genetics, looking at DNA sequences.

06:42 And I had a natural liking for the fact that I could study about the cell or the genome, the DNA

06:55 and RNA. But also I realized that there was a lot of math and computation that you can use to slice

07:01 and dice data. And at that time, the place where I grew up in India was a very small place. So we didn't

07:06 have access to resources like internet. So my exposure to internet was when I actually went to

07:13 med school for the first time. Oh yeah, you can actually connect to other computers. So that's

07:19 when I started my medicine. So obviously I did my training in medicine in medical school back in

07:26 India. That's when I started to connect and talk to a lot of people. And some of my friends who

07:31 actually were already writing apps at that time using Java applets and on browser. And so I started

07:40 to make some connection in terms of images and learning how some of those things can be used in

07:46 medicine. And so radiology, during my radiology rotation, that was my first real life, that was

07:54 my realization that actually in medicine, you can use computers a lot to handle a lot of these

07:58 images, x-rays, CT scans. And I think as it went on, when I came to the US, that's where

08:04 it really started off. During my residency in pathology here, I actually connected with my

08:12 mentors, Dr. Anil Parwani and Dr. Liron Pantanowitz. They are well-known pathology informaticists.

08:22 They've spent a lot of their time sort of dwelling in the world of pathology, medicine, and

08:28 computer science. And so that is when I could actually realize that, yes, you can do a lot of

08:34 innovative stuff by developing apps, algorithms, analyzing either image data or molecular data.

08:42 And so that is when I started to get into designing an app, which was a very simple web app that was

08:51 a project I was working with one of my mentors. And so his idea was that we had a lot of these

08:57 pathology images and we wanted to create a little in-browser app that would display these images as

09:03 thumbnails, and then clicking on that could enlarge the image, show that on the display.

09:07 And so I used, at that time, it was the .NET Framework with ASP.NET. And so I created a little app using

09:16 Visual Basic. Slowly I then migrated to using C# in the same environment. And that time,

09:23 I started my advanced fellowship training in molecular pathology. That's when I started there.

09:33 That's when I realized there's a lot of genomic sequencing data where essentially you're dealing

09:37 with a lot of strings and numbers. And you have to make a lot of sense in terms of this large

09:42 volume of data that comes in. - So the kind of data that you're working with

09:47 for, say, this genetic stuff. - Yes. - When you're studying the genomics, how much

09:55 data is in say one strand of DNA? How much of that do you actually care about? Like, give me,

10:00 give us a sense of sort of how much data we're talking. - Right. So it really depends on what

10:06 has been done. And so when we look at, so when we talk about genomics, it is really designed on how

10:14 the experiment is done. So for example, if we just simply look at the entire human genome, we are

10:20 talking about 3 billion alphabets. Essentially it's the combination of four alphabets, A, T, G,

10:30 and C. So these are the four nucleotides of the DNA sequence. And the RNA has one additional one

10:36 which replaces T. But the idea is that it's a mix and match of these sequences. And so if you think

10:41 about the entire human genome as a single thread of A, T, G, and Cs in various combinations,

10:48 you're looking at 3 billion alphabets. And so what happens is when we do these sequencing

10:54 experiments where you would take the DNA molecule from a bunch of cells within a tissue, and then

11:02 either we read all the 3 billion base pairs. And typically the way the sequencing is done is you

11:08 read all of these sequences from many molecules. And so you'll have multiple copies of that when

11:16 you're translating from a molecular, like a chemical molecular structure to a DNA sequence

11:24 on say a flat file in a file system. So if you look at that large scale of data, like the entire

11:31 genome, we are talking of hundreds of gigabytes, maybe even terabyte worth of data. Then there are

11:38 other more practical approaches when we look at the genome. And especially this is something that

11:43 we use for day-to-day patient care, which is referred to as targeted sequencing. What that

11:48 means is instead of the 3 billion base pairs, we focus on those regions of the human genome

11:54 that are of most pertinent use, or that we at least as a current field of genomics, that we

12:01 understand what to do with. And so there are certain genes that at least in the space I work

12:07 with, cancer genomics, that are, I would say close to about maybe a thousand to 2,000 genes,

12:16 which are known to be cancer associated. And of that, roughly about 500 to 700 genes are where

12:23 we know that they have been studied and demonstrated that there are certain types of

12:28 abnormalities in those genes in terms of the sequence changes, that they have certain meaning

12:33 in context of tumor in order to make a diagnosis or to understand if the tumor is aggressive or

12:42 benign, or if there are certain treatments that could be applied to those tumors. And that's

12:47 specifically linked to the kind of sequence change you see in that region of the genome.

12:54 And we are talking about, practically speaking, when we talk about the targeted testing that we

13:01 do, it's a very small fraction of the large genome. Typically there's a term known as exome

13:07 sequencing, and exome sequencing refers to sequencing all those regions of the human genome

13:13 where it at least encodes for one gene or another. That is typically about 1% to 2%

13:21 of the entire genome. And so if we further narrow it down to about, say, 500 to 600 genes,

13:25 that one would typically sequence for practical cancer molecular testing. I would say that's

13:34 probably about a tenth of a percent, or maybe slightly less than that, of the genome, but it's a very high

13:39 yield from a clinical standpoint, because the chance of finding an alteration that would help

13:45 with the clinical treatment is high. So if we're going to talk about that dataset, it's complex in

13:52 a different way because just looking at the raw sequence data would be somewhere in, I would say,

14:00 one to 20 gigs from a single sequence file, but it entirely depends on how deep we go.

14:08 So for example, when we talk about sequencing, as I mentioned before, when we sequence a molecule,

14:13 we can sequence it either at certain depths, that means what level of redundancy you want

14:20 to be able to read that molecule. Sometimes we read the molecules 20 to 30 times, so that's

14:27 referred to as 30x, or sometimes we'll read that 500 times, so that will be 500x.

14:32 >> You do that because you want to make sure you don't misread the gene?

14:38 >> Yes. So, right. So what happens is the greater the depth of sequence, so typically for

14:43 such large panels that we sequence in a clinical setting, we usually target about 1500x to 2000x,

14:52 that means we're reading that 2000 times. So the more the depth it is, the possibility of identifying

14:59 a certain variation or genomic alteration that is present at a very low level. For example, say,

15:03 you have a tumor cell and within that, only 2% of the cells have this mutation, others don't.

15:10 And so when you're looking for or hunting for these needles in a haystack, you really want to

15:16 maximize the amount of depth you have to be able to pick those things up. So it really depends on

15:21 how deep we go. The more deep we go, the more data it is, and so it can scale up to almost

15:26 several hundred gigabytes. >> Sure. Yeah, I've always wondered about how you can go and read somebody's genetics and not make a mistake when you're

15:35 using chemicals to do the reading. And it's really ridiculous how much data is there.

15:41 Being off by one, a C for a G or whatever, is a bad thing, right?
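
To put rough numbers on the scale Somak just described, here is a back-of-the-envelope sketch. The panel size, depths, and bytes-per-base figure below are illustrative assumptions, not his lab's actual configuration.

```python
# Rough, illustrative arithmetic for how sequencing depth drives data volume.
# All panel sizes and depths below are assumptions for the sake of the example.

GENOME_SIZE_BP = 3_000_000_000        # ~3 billion bases in the human genome
TARGETED_PANEL_BP = 2_000_000         # hypothetical targeted panel, ~2 Mb of regions

def raw_bases(region_bp: int, depth: int) -> int:
    """Total sequenced bases needed to cover a region at a given average depth."""
    return region_bp * depth

# Whole genome at 30x vs. a deep targeted panel at 2000x.
wgs_bases = raw_bases(GENOME_SIZE_BP, 30)
panel_bases = raw_bases(TARGETED_PANEL_BP, 2000)

# Very roughly, FASTQ stores a base plus a quality score (plus headers),
# so call it ~2 bytes per base before compression.
BYTES_PER_BASE = 2

print(f"WGS @ 30x    : ~{wgs_bases * BYTES_PER_BASE / 1e9:,.0f} GB uncompressed")
print(f"Panel @ 2000x: ~{panel_bases * BYTES_PER_BASE / 1e9:,.1f} GB uncompressed")
```

With these assumed numbers, a whole genome at 30x lands in the hundreds of gigabytes while a deep targeted panel stays in the single-digit gigabytes, in line with the ranges mentioned above.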

15:46 >> Right. But it is, I think as the technology has matured, there's not a hundred percent in terms

15:56 of the error profile for the enzyme that has been used to work, the technology that is reading the

16:01 actual fluorescence converting that to signal. There's always statistical values and probabilities

16:07 that are associated with what is the probability that it is wrong or incorrect or correct.

16:12 But within that frame and where the current technology is, it's pretty accurate for,

16:18 if not all, many of the regions of the genome. And so it's mind-boggling how it works.

16:23 >> Yeah, it really is quite amazing. It's one of the modern marvels of science for sure.

16:29 >> It is, it is. >> This portion of Talk Python to Me is brought to you by Sentry. Code breaks, it's a fact of life. With Sentry, you can fix it

16:38 faster. As I've told you all before, we use Sentry on many of our apps and APIs here at Talk Python.

16:44 I recently used Sentry to help me track down one of the weirdest bugs I've run into in a long time.

16:50 Here's what happened. When signing up for our mailing list, it would crash under a non-common

16:55 execution path, like situations where someone was already subscribed or entered an invalid email

17:01 address or something like this. The bizarre part was that our logging of that unusual condition

17:07 itself was crashing. How is it possible for a log to crash? It's basically a glorified

17:13 print statement. Well, Sentry to the rescue. I'm looking at the crash report right now,

17:18 and I see way more information than you would expect to find in any log statement. And because

17:23 it's production, debuggers are out of the question. I see the trace back, of course,

17:28 but also the browser version, client OS, server OS, server OS version, whether it's production

17:34 or QA, the email and name of the person signing up, that's the person who actually experienced

17:39 the crash, dictionaries of data on the call stack, and so much more. What was the problem?

17:44 I initialized the logger with the string info for the level rather than the enumeration dot info,

17:51 which was an integer-based enum. So the logging statement would crash, saying that I could not

17:56 use less than or equal to between strings and ints. Crazy town. But with Sentry, I captured it,

18:04 fixed it, and I even helped the user who experienced that crash. Don't fly blind. Fix

18:09 code faster with Sentry. Create your Sentry account now at talkpython.fm/sentry. And if you

18:15 sign up with the code TALKPYTHON, all capital, no spaces, it's good for two free months of Sentry's

18:22 business plan, which will give you up to 20 times as many monthly events as well as other features.
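
To make the logging bug described in that segment concrete, here is a small hypothetical reconstruction. The actual Talk Python logging setup isn't shown in the episode, so the tiny logger class below is purely an assumption used to reproduce the same class of error.

```python
# Hypothetical reconstruction of the bug described above: a logger whose level
# should be an integer enum but was accidentally initialized with the string
# "info". The comparison then raises at log time, not at setup time.
import enum

class Level(enum.IntEnum):
    DEBUG = 10
    INFO = 20
    ERROR = 40

class TinyLogger:
    def __init__(self, level):
        # Bug: nothing validates that `level` is actually a Level member.
        self.level = level

    def log(self, level: Level, message: str) -> None:
        if self.level <= level:          # crashes if self.level is a str
            print(f"[{level.name}] {message}")

log_ok = TinyLogger(Level.INFO)
log_ok.log(Level.ERROR, "this works")

log_bad = TinyLogger("info")             # the accidental string level
log_bad.log(Level.ERROR, "boom")         # TypeError: '<=' not supported between str and Level
```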

18:26 So I think you're a little bit unusual, a little bit weird in the sense that the first

18:34 sort of programming thing you brought to the science and medicine side of things was

18:40 C# or VB.NET, rather than something like Python or R or something. So maybe talk a

18:51 bit about that experience, contrast it with Python. Why do you end up moving to Python?

18:56 Yeah. So I think the reason I started using VB.NET C# was I would say most,

19:04 it was probably influenced a lot by at the time when I was doing my med school in India. What

19:12 was available at that time, it was not something I would just go to the internet and start getting a

19:16 lot of resources as one would do now. So it was pretty much like, this is the book I have available

19:21 and that's the only thing. So you start. But the thing is when I started applying C# and it was

19:32 mostly C# and a little bit of C++ and I started to get into like with some of the non-genetic stuff

19:38 initially, well, the project I'm working on, it was not too bad because I was able to accomplish

19:44 most of the tasks. But then once I got into genomics and I came, so the way professionals

19:51 who get into genomics and molecular pathology, there are a couple of different routes. So

19:58 either the physician, people who are physician trained and they have a formal background in

20:04 medicine and they do a specialized training and then they become molecular pathologists

20:09 after getting board certified. There is the other route, which is more of a research background,

20:14 where people have spent a lot of their time in really deep research. They've learned a lot of

20:20 genomics hands-on, either from a computational background or from a more laboratory, like a

20:25 wet laboratory background. And so they've obviously done their PhDs and postdoc training and then

20:32 sort of come into the molecular pathology field. People starting there tend to have more of a

20:40 formal computational training. So they're getting, they usually get, obviously when you start with a

20:45 research lab, R, Python are sort of like the most common tools that are used for

20:49 any kind of data analysis and data visualization. Coming from more of a formal medicine background,

20:56 and typically when we get training in clinical informatics or pathology informatics,

21:03 often it is very, I would not say corporate based, but very formal application development space.

21:11 So it's a lot of Windows based, .NET, C#, C++, that kind of thing.

21:17 Standard enterprise stack. Yeah. Java or .NET is a perfect choice. Yeah. Okay.

21:23 In bioinformatics, at least in genomics bioinformatics, the ecosystem of tools available,

21:28 it's a mishmash of everything. For anything which is very computationally intensive, like when you're

21:34 trying to align sequences to the human genome, those are very intensive tasks. And typically

21:40 it's a lot of C, C++, Java, that's involved in some of these very mainstream tools that are

21:49 available. More recently, I think we are seeing Rust coming into the picture as well. There's

21:54 some Golang applications. And then of course, Python and R are the predominant, I think,

21:59 tools as the programming language that are used to solve all of these problems.

22:04 So when I started my molecular pathology fellowship and I got into, now I had to do

22:10 this project that involved manipulating all the sequence data to a point where we would be able

22:17 to develop an application that would help, it's a web-based application that could help

22:22 for other pathologists and faculty to read that sequencing data and digest it in a way that's

22:30 easy for them to look at it rather than going to the Linux terminal and opening up raw files and

22:35 things like that. So I used, that was my first project was to use C# in that context. But I

22:41 quickly realized that there was a lot of these algorithms that were natively either written in

22:45 R or Python and then having to incorporate those functionalities was not as easily possible.

22:51 So I had to rewrite a lot of those things, which were primarily in Python, in C#. It was a good learning curve,

22:57 but I think from a maintainability perspective, it was getting really difficult. And so that's

23:01 when I realized that a combination of Linux and Python was what I had to move towards.

23:08 Yeah. C# probably from the timeframe that you're thinking about, didn't really have a great package

23:12 manager story, not to the same degree that Python does. Although they do pretty good now over in

23:18 the .NET land. Right. Yeah. All right. So a good question from Chris in the audience says,

23:24 is there a reason to use Python specifically? Like, are there some special sauce packages

23:28 that make it attractive? It sounds like that's kind of what you were getting at. Like you found

23:32 more solutions to these algorithms, you know, available in Python than in C#.

23:37 Yeah. Whatever languages. Yeah.

23:39 Right. So I think, I mean, I think the simple answer is yes. I think the community and the

23:45 amount of work that has been done in this particular space with genomics, I mean, when you

23:49 are really searching for applications, it kind of falls into these three categories of, you know,

23:54 anything which is a high performance compiling program that is usually in the rust, C++, C,

24:01 a lot of those languages, a little bit of rust and Java. And then the other bin is essentially

24:07 kind of, you know, split up into Python and R. I think for me, Python was, and I think I'm sure

24:14 others have shared the same way where it's almost like, wow, this is amazing. Like coming from C#,

24:19 it was a little bit of a change because there's no more, like, you know, curly braces.

24:23 No curly braces to think about, and all those things. You don't miss your semicolons.

24:29 Kind of. Like even now, sometimes when I write like a little bit of JavaScript,

24:34 I feel like, oh, yeah, okay, this semicolon is here.

24:36 Exactly.

24:37 The curly braces. But, you know, not too bad. I think what I got onto was like the simplicity

24:43 of the language and how powerful it was when, like, if I'm thinking about, you know, it was

24:48 interesting when I had to do something like there was an algorithm where I had to parse out certain,

24:55 you know, strings in a way where it required some known workflows that we used to do, like,

25:05 variant annotations when we're cross referencing databases and putting them together. You know,

25:10 when you look for in terms of C# packages, there's really nothing there for it natively to do it. So

25:16 you have to write a lot of those things. In Python, the amount of time that is spent

25:20 in developing those things is much faster. Like the development time itself is quick because

25:26 you either get an idea of somebody who's already done the work or there's a more

25:30 formal package that you can use. So I think initially when I started off,

25:34 BioPython was a very interesting collection of packages. It was like a tool suite essentially

25:41 written to, you know, to have all these functions available for very common day to day tasks.

25:47 You know, I want to query a certain region of the BAM file or I want to parse out certain things in

25:53 the FASTQ file to look at some of the sequences or doing, you know, counting number of sequences

26:00 in a given file and, you know, getting read counts, things like that was, it's all out of

26:05 the box. And so that was sort of like the first thing to go, "Wow, this is amazing." I mean,

26:09 somebody's already done the work and we're just using it on top of it.

26:13 Yeah. Instead of creating these, I'll just like use those. Perfect.
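
As a small illustration of the kind of out-of-the-box tasks Somak mentions, the sketch below counts reads in a FASTQ with Biopython and pulls the reads overlapping a region from an indexed BAM with pysam (a companion package commonly used alongside Biopython for BAM access). The file names and coordinates are placeholders.

```python
# Minimal sketch of two routine tasks mentioned above. File paths and the
# genomic region are placeholders; Biopython handles the FASTQ parsing and
# pysam (a separate, commonly used package) handles the indexed BAM query.
from Bio import SeqIO
import pysam

# 1) Count the reads in a FASTQ file.
read_count = sum(1 for _ in SeqIO.parse("sample_R1.fastq", "fastq"))
print(f"Reads in FASTQ: {read_count}")

# 2) Fetch aligned reads overlapping a region of interest from a sorted,
#    indexed BAM file (coordinates here are arbitrary examples).
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    region_reads = [read for read in bam.fetch("chr7", 140_753_300, 140_753_400)]
print(f"Reads overlapping region: {len(region_reads)}")
```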

26:17 Right. So that was one. And the other motivation to use Python was, you know, say for example,

26:24 why not R? Why Python? Because R offers a very rich ecosystem in, you know, at least in genomics

26:29 and visualization. So I think the second thing was in terms of the idea that I was working on was

26:37 having to develop a web application and all of these bioinformatics, you know, tooling and

26:42 algorithms running sort of in the back. And so at that time it was like, okay, well, you know,

26:46 Python, I've not heard much about in terms of web application. Mostly it was, you know, again,

26:51 this big, like, you know, C#.net. That was why I started off, you know, with that. But then at that

26:57 time, you know, there was Django and then Flask was sort of coming in. It was a very minimalistic,

27:02 you know, sort of application. So I started focusing on that. It was very easy with Flask to,

27:06 you know, get up and running with very simple, you know, applications to do that. I didn't try

27:12 much into Django just because it was too bloated for me. But, you know, Flask was great. And then

27:18 what I realized was you can create a simple web application, but then at the same time,

27:22 you can use all your, you know, Biopython and all the wonderful bioinformatics packages in the

27:27 backend. So it's like a single language that lets you do both. And so this is great. It was just,

27:33 I don't have to go anywhere to learn, you know, a third or a fourth or fifth different programming

27:38 language. And this just gets the job done. Yeah. Keeping in mind that your actual

27:42 main job is medicine, not programming, right? You're not a CS person. It's a lot to ask

27:48 to go learn all the languages, right? Right, right, right. So that definitely is,

27:52 again, that's a huge, you know, I would say I'm, again, as I said, I'm in a, I'm sort of

27:59 in an unusual position where I'm, you know, a physician, but I also do a lot of these

28:03 application developments. So that certainly is an important point in terms of how much time I have

28:09 to be able to develop these prototypes. And then obviously, you know, typically the way it works is

28:13 at least right now here, you know, where I am currently working, I have an excellent and amazing

28:19 team of developers and bioinformaticians who really do a lot of the development work on the front end,

28:25 back end. And so for me to be able to take additional time out of my, you know, the clinical

28:31 and the patient care work is limited. So if I can get whatever prototype I'm thinking of,

28:36 we're developing the application fast, then, you know, that's, that's what I'm going for.
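
A minimal sketch of the kind of quick Flask prototype described a moment ago: a single-file web app wrapping a Biopython task so colleagues can use it from a browser instead of the Linux terminal. The route, sample naming convention, and file location are invented for the example.

```python
# Minimal, illustrative Flask prototype of the pattern described above:
# a tiny web endpoint wrapping a Biopython task so non-programmers can use it
# from the browser. Route names and file locations are invented for the example.
from flask import Flask, jsonify
from Bio import SeqIO

app = Flask(__name__)

@app.route("/read-count/<sample_id>")
def read_count(sample_id: str):
    # In a real lab system the sample ID would be looked up in a LIMS;
    # here we just assume a simple on-disk naming convention.
    fastq_path = f"/data/fastq/{sample_id}.fastq"
    count = sum(1 for _ in SeqIO.parse(fastq_path, "fastq"))
    return jsonify({"sample": sample_id, "reads": count})

if __name__ == "__main__":
    app.run(debug=True)
```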

28:40 And so you can hand it off to the team and let them polish it up and productize it,

28:45 make it production ready, basically. Yeah. Yeah. I was wondering how much time of your,

28:50 your job do you get to spend on these kinds of things, you know, finding new packages,

28:55 optimizing or improving the ways that you're working on stuff versus just sort of handing

29:00 it off to the folks you work with and, and keeping, you know, focus more on the medicine side.

29:05 Yeah. So it you know, that, that I think it's a good question. I think it's, it's evolved over

29:12 time as I've been, you know, being sort of, you know, when I was in training and then being a

29:16 faculty and then, you know, faculty in this new position you know, one of the things I did was

29:22 as part of my certification was to, you know, to get board certified in clinical informatics.

29:27 That's a discipline by itself that, you know, involves a lot of, you know, it's a very broad

29:32 field in terms of informatics and healthcare. And then one of the buckets there is, you know,

29:37 software development. And so I was you know, I was quite interested sort of in that field. And so

29:44 most of my time in terms of being able to devote to, you know, finding new packages or trying to,

29:52 you know, write up an application that could solve a problem or coming up with prototypes.

29:57 It was done in a way that sort of aligned with the work I was doing. And so it would be days

30:03 when I'm on clinical service where I'm mostly, you know, working on sort of with, you know, with

30:07 patient care related matters. So those weeks would be, you know, obviously very busy. I would have,

30:12 you know, I would wake up at like extremely early in the morning, spend the first two hours,

30:17 four to six a.m. just, you know, working on this. And then I get back to like, you know,

30:20 the clinical work. And then there would be weeks when I'm off clinical service. So I,

30:23 you know, I'm not responsible for any patient care related work. And those weeks would be

30:28 where I would spend time in terms of, you know, doing these, you know, investigating into sort

30:34 of some of these packages and, you know, coming up with new ideas, exploring what is all, you know,

30:39 what is available in terms of certain problems that I was solving. And, you know, that time

30:43 sort of, you know, my quote, protected time professionally was spent in that. And so that

30:48 would be, you know, maybe a week spent into like, hey, we are trying to look into this

30:53 variant annotation tool. And then we want to, you know, write wrappers around it. So it becomes

30:58 easy for, you know, our labs operation to be able to use that. And so, so kind of that,

31:04 that's how it works. So some of those either early mornings or, you know, the weeks I'm off

31:08 clinical services and how, how that works. This portion of talk Python to me is brought to you

31:15 by Posit, the makers of Shiny, formerly RStudio, and especially Shiny for Python.

31:22 Let me ask you a question. Are you building awesome things? Of course you are. You're a

31:26 developer or data scientist. That's what we do. And you should check out Posit Connect. Posit

31:31 Connect is a way for you to publish, share and deploy all the data products that you're building

31:36 using Python. People ask me the same question all the time, Michael, I have some cool data science

31:42 project or notebook that I built. How do I share it with my users, stakeholders, teammates, I need

31:47 to learn FastAPI or Flask or maybe Vue or React.js. Hold on now. Those are cool technologies,

31:54 and I'm sure you'd benefit from them, but maybe stay focused on the data project. Let Posit Connect

31:59 handle that side of things. With Posit Connect, you can rapidly and securely deploy the things

32:03 you build in Python, Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards,

32:10 and APIs. Posit Connect supports all of them. And Posit Connect comes with all the bells and

32:16 whistles to satisfy IT and other enterprise requirements. Make deployment the easiest step

32:22 in your workflow with Posit Connect. For a limited time, you can try Posit Connect for free for three

32:27 months by going to talkpython.fm/posit. That's talkpython.fm/POSIT. The link is in your podcast

32:35 player show notes. Thank you to the team at Posit for supporting Talk Python.

32:39 Is it changing fast? So I'll give you an analogy that you could tell me about your space. So

32:45 on one hand, in Python web world, you mentioned Flask and Django. Flask and Django, while they

32:51 are evolving, they're kind of the way they have been and they're pretty stable. And if you learned

32:56 Flask five years ago, you're still good to use Flask today. Or is it more like FastAPI, Pydantic,

33:04 msgspec? There's something new all the time that you got to keep learning to bring in. Are

33:10 there a ton of new packages just coming online or is there a set of really solid ones?

33:15 So I think it's both yes and no. And so it depends on what area we're working on. So right now in the

33:26 clinical lab that I'm directing here, when I came here in 2020, it was when we started off from

33:32 scratch. So essentially, the idea was to be able to bring up a pediatric cancer sequencing

33:38 infrastructure that was not available. And so it was ground up from the lab to personnel to space

33:43 to competition and such. And so we kind of have these sort of two big bubbles in that operation

33:49 from an informatics perspective. One of them is we essentially are in the process of developing

33:56 our custom lab information system. That's essentially a web app. And so we have that

34:02 space and the other space is bioinformatics. And so bioinformatics is a lot of the custom

34:08 scripting or the applications we develop is Python based. Some of them we do with Golang

34:13 when we need a little bit of performance aspect. And then the other aspect is the web app.

34:20 So from a web app perspective, when I started here, we actually started, we use FastAPI and

34:26 Python. So that's kind of, that was, so the idea was that, well, since you're starting from scratch

34:31 and I came to know about FastAPI at that point in time. The whole approach was something

34:36 I was pretty much sold on. And then, I think the whole tool

34:42 made a lot of sense. I'm like, okay, well, this is perfect. I think when

34:46 I started, FastAPI was at about five or six. And so now obviously you can see a lot of change

34:53 happening there. So yeah, that definitely is a fast pace. And so we kind of do catching

34:59 up, in a sense, but it has to be done in a careful way. The reason is because, as compared to

35:08 more traditional research lab testing where at the end really, there's a lot of discovery,

35:14 there's a lot of excitement at the end, it all translates into being sort of, the data is

35:20 presented at a conference or you publish that as a manuscript and that's the end point. So if you

35:25 move off from version one to version three of an algorithm, you have to obviously make sure that

35:31 your research, everything is reproducible, but beyond that is not a problem. But when we're

35:35 talking about the same thing in context of a clinical care for a patient, the room for error

35:40 is very, very little. You can't make mistakes. And so the entire space of clinical testing is

35:46 very regulated in that sense, because there's a lot of requirement that you have to perform that

35:51 any change that's happened in your pipeline, say you're using some version of an application,

35:56 now you upgrade to a newer version. You have to demonstrate that the analytical performance in

36:01 terms of sensitivity and specificity for that pipeline didn't change. And so a lot of work is

36:07 needed when you go do a version upgrade. So we keep those things very controlled and careful

36:13 versus some other things which are more in the R&D space. There's a little bit more room to

36:19 play around with tools. Right. Yeah. Chris was asking an audience a great question about

36:24 basically, is it more exploratory? You just move really fast and don't really worry about

36:27 tests and stuff like that. It sounds like this is more of a production type thing. Like if it,

36:33 you're going to run it over and over. And if it gives a different answer at some point for

36:37 testing for a disease or something, that's really bad. You need it to be right all the time. And so.

36:41 Yes. So the room when we do new test developments or we bring a new algorithm, obviously that

36:48 part of which we refer to, there's a formal term that we use in lab medicine called

36:52 familiarization and optimization, or the O&F phase. That's where, you know, there's a lot of flexibility,

36:58 new tools, new version, trying out different things. But once it moves from that into the

37:02 validation phase, and then once we deploy the application, once the deployment is there,

37:06 it's a production application. We don't touch it unless something really has to be tinkered with,

37:12 or there's a bug that we have to fix. Who's in charge of running those apps? Is that people

37:16 on your team and your lab or is that the hospital or how? Yeah. So the way it's set up here is,

37:23 so when I started off, I was an n of one. So I started off with the FastAPI application.

37:29 I had to build up the, we had a bioinformatics pipeline that I had initially authored. But

37:36 then when we went through the validation phase, I luckily had two people on staff who kind of

37:42 were handling the bioinformatics on the front end. And then eventually we had a third person who

37:48 joined the team. So then they were kind of helping me out with a lot of the actual groundwork of

37:55 writing the code, getting tests done, going through the validation data, summarizing that for me.

38:01 Being a lab director, it is my responsibility ultimately to sign off on all those things,

38:05 say, "Hey, okay, this is the validation and this is what is being demonstrated that

38:10 your package or your pipeline or whatever you're working on demonstrates this level

38:15 of sensitivity." Then yes, I, being the lab director, say that, yes, this is working.

38:19 And once that happens, so we then deploy those applications in production. We use GitHub and

38:25 follow the usual dev, test, prod cycle. And so that's kind of how it works.

38:30 Well, do you have your own hardware or do you have stuff like on DigitalOcean or AWS?

38:36 So with healthcare data, there is generally a little bit of angst with data sitting on the cloud

38:43 outside the institution. I would say the institution that I work at is,

38:48 in that way, quite forward thinking in being able to use modern technology.

38:56 So what we started off and since it was everything being built up from scratch,

39:03 we had taken the decision to keep things on-prem for beginning. But we also kept in mind that

39:10 at some point of time, if the institution decides that, "Oh, we're going to switch our

39:14 infrastructure to using AWS or Azure or whatever the platform is going to be," that we want it to

39:21 be ready. And so the way we had it set up, and this is due to our amazing IS team here at our

39:30 institution. So we had our own hardware that we got in terms of the actual servers. And we

39:40 collaborated with the IS team to be able to help us build our Kubernetes infrastructure.

39:45 So we have a test and a prod Kubernetes cluster, and then all our apps and the bioinformatics

39:54 pipeline. Well, the apps for now and the bioinformatics pipeline that we're looking

39:58 forward in the near future to get deployed on these things. As a matter of fact, what we've

40:03 done is our dev team, we start to, we do a lot of the development on Kubernetes as well. And then we

40:09 keep moving all these things as containerized applications.

40:12 That's excellent. So really embracing containers and Docker and Kubernetes, and that should make

40:17 it super easy to move to wherever you want to go, right? Anything that can run Kubernetes,

40:21 you just push to that and you're good to go.

40:24 Right. Right. I mean, it's a little bit difficult to start with, but I think once

40:29 we are in that stream, it is much less effort to move things around.

40:35 Yeah. Last year I rewrote all of our servers and APIs and condensed six to eight servers all into

40:42 one just Docker cluster. And it was a great decision, but to me, it was also a little

40:46 intimidating, like, well, here's one more layer I have to manage and understand. And if

40:51 something goes wrong there, then everything else still breaks. But having it set up is really nice

40:55 once you get used to it. All right. Let's talk a bit about, well, I have a question for you. I

41:00 want to talk about some of these packages that you've been saying are like a lot of the reasons

41:05 you chose Python and you use a lot, which is great. But before we get there, like I got the

41:10 biopython.org website pulled up. And the very first line is, "Biopython is a set of freely

41:16 available tools." You know, open source, freely available. How much does that matter to you guys?

41:22 On one hand, you have a ton of money being in the medical space. It's really high stakes. So

41:29 paying for commercial software or commercial libraries is probably not the biggest worry.

41:34 On the other hand, open source is really nice. Being able to look inside is really nice.

41:38 Free means you don't have to deal with getting permission. How does that fit into your world?

41:44 I know how it fits into like small startups and things like that, but for a hospital, for example,

41:49 what does free and open source mean to you guys? I think it does have a lot of impact in terms of

41:55 how we end up working and setting up these things. And obviously, whatever I'm speaking

42:04 is representing what it means from sort of an operational standpoint. When we talk about

42:11 molecular pathology, generally being able to bring up a clinical service like that is a huge investment.

42:18 And so a lot of the investment is... And this is generally applicable to any institution where

42:24 something like this has been set up for patient care or clinical use.

42:29 The investment is primarily in a lot of the instrumentation and the reagents that we use

42:40 are generally quite expensive, which is sort of the... I would say when we talk about what is the

42:47 cost of a test when it's offered, that cost factors in a lot of these operational costs

42:54 that we need to buy these expensive sequencing instruments, the reagents that are used as

43:00 consumable as we do the test over and over again, every week, every month.

43:06 So from that standpoint, traditionally, the way things have been designed is,

43:11 I would say 10 years back when we would work with our finance team to say, "Okay,

43:18 the cost of the test is going to be so and so based on all of these different inputs." And so

43:23 10 years back, computation, bioinformatics, all of these were not factored in at all.

43:30 But now as we are in that era where using GPUs on a regular basis to be able to do simple...

43:38 I would not say simple, but routine work to get from the raw sequence data to be able to identify

43:45 genomic variants, that's getting common. Using FPGAs, using large clusters to be able to perform

43:51 these tests. And so now we are starting to see those costs getting in as part of the ultimate

43:57 cost that goes to the patient for a test. And so we try to minimize those things.

44:01 One of the ways to be able to minimize those things is to be able to choose between

44:04 free open source versus something which is a commercial product. And it's always a balance

44:08 between the reliability and the service that you're able to get back saying, "Hey, something

44:13 breaks down. We know there's an SLA. There's a certain assurance that this thing is going to

44:20 have help," versus open source free would be where we feel very confident in the code base.

44:27 Sometimes what happens is when we use some of these open source tools, we end up almost

44:34 invariably having some wrapper around it to change things or being able to have some insight

44:39 into the source code. So it depends on that balance, what we choose.

44:44 How often do you fork it and use your self-maintained version versus just run what

44:51 is publicly on PyPI and then maybe wrap it to orchestrate it a bit?

44:55 So I would say for the web application part of it, we don't really do a lot of forking. We kind

45:02 of go with what it is. The only thing what we do is since we have the luxury of using a combination

45:07 of GitHub and containers and knowing the fact that the regulatory requirements require that

45:13 you tightly version control all these things with history and all those things, we tend to,

45:17 when we are developing these things, when we are validating it, before we do that, we try to stick

45:22 to a fairly stable version. So for example, things like beta or release candidates, we try to stay

45:28 away from that. Even if they have some desirable features, but unless we see a full production

45:33 version of that, we don't tend to switch to it. So we keep things like that without maintaining

45:38 or without forking or making modifications. When we get into more of the bioinformatics stuff,

45:44 where we are actually trying to use an algorithm to solve a particular piece of

45:49 part of the pipeline that is doing some data transformation, it depends on how much we want to

45:55 change or modify. That's when we sometimes fork it. Sometimes we fork where we know that,

46:02 and this is the unfortunate reality in many of the scenarios, where you have a great open source tool,

46:07 but after some time, due to whatever financial or business or other reasons, they stop

46:12 maintaining it. And so essentially we get into this freeze mode. We tend to fork that so that at least

46:17 we have that available. And then if we make any changes that we keep it to that fork.
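
One simple pattern for the freeze-and-validate discipline described here is to check the installed package versions against the validated manifest every time the pipeline starts. The package names and version numbers below are placeholders, not the lab's real manifest.

```python
# Illustrative sketch: refuse to run the clinical pipeline if any installed
# package drifts from the versions that were validated. The package list and
# versions below are placeholders.
from importlib.metadata import version

VALIDATED_VERSIONS = {
    "pysam": "0.21.0",
    "cnvkit": "0.9.10",
    "hgvs": "1.5.4",
}

def check_environment() -> None:
    """Raise if any installed package differs from its validated version."""
    mismatches = {}
    for pkg, pinned in VALIDATED_VERSIONS.items():
        installed = version(pkg)          # raises PackageNotFoundError if missing
        if installed != pinned:
            mismatches[pkg] = (pinned, installed)
    if mismatches:
        raise RuntimeError(f"Environment drifted from validated versions: {mismatches}")

if __name__ == "__main__":
    check_environment()
    print("Environment matches the validated manifest; safe to run the pipeline.")
```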

46:22 But generally I would say it's probably an 80/20 split, where 20% is where we fork it,

46:30 make some change. Most of the times we try not to do that. But yes, open source,

46:36 free tools have a big impact. A lot of their tools that we use as part of our

46:41 bioinformatics pipelines, as a matter of fact, which is kind of used in the community of

46:44 molecular pathology to build these bioinformatics pipeline. They tend to use a lot of open source

46:51 tools. And the reason for that is, for example, it's not written in Python, but we have,

46:55 there's an algorithm called BWA. It's written by Professor Heng Li. That's one of the algorithms

47:02 that is almost like a de facto, I would say, when it comes to doing sequence alignment,

47:07 that's a part of the pipeline. And so it's a tried and tested application for more than a decade now.
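
As a hedged illustration of how a non-Python tool like BWA typically gets orchestrated from a Python pipeline, here is a sketch that shells out to bwa mem and pipes the output through samtools sort. The paths, sample names, and thread count are placeholders, and it assumes bwa and samtools are installed and the reference genome is already indexed.

```python
# Illustrative wrapper around the BWA-MEM aligner from a Python pipeline.
# Reference, FASTQ, and output paths are placeholders; assumes the `bwa` and
# `samtools` binaries are installed and the reference is already indexed.
import subprocess

def align_sample(ref_fasta: str, fastq1: str, fastq2: str, out_bam: str) -> None:
    """Align paired-end reads with bwa mem and write a coordinate-sorted BAM."""
    bwa = subprocess.Popen(
        ["bwa", "mem", "-t", "8", ref_fasta, fastq1, fastq2],
        stdout=subprocess.PIPE,
    )
    try:
        # Stream the SAM output straight into samtools to sort and write a BAM.
        subprocess.run(
            ["samtools", "sort", "-o", out_bam, "-"],
            stdin=bwa.stdout,
            check=True,
        )
    finally:
        bwa.stdout.close()
    if bwa.wait() != 0:
        raise RuntimeError("bwa mem exited with a non-zero status")

align_sample("GRCh38.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample.sorted.bam")
```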

47:15 So really, this is a fairly stable algorithm and application, and we don't

47:20 tend to touch it; it's well maintained from an open source perspective. So those obviously are ones

47:24 we highly rely on. But there is this whole ecosystem of software that comes under this

47:31 rubric term of variant calling, where we're trying to identify these different variants.

47:34 There's a whole bunch of those and some are fairly well maintained. They are open source.

47:42 Sometimes depending on the context you're using, you need a license if it is used in a commercial

47:48 setting. You don't need a license if it is in an academic setting. Like for example,

47:54 when we do clinical testing in institutions such as where I am right now, that's an academic

48:00 institution. So typically it's not for profit. And so obviously we don't need licenses for that use.

48:09 But once this goes into a pure commercial space where if the lab is doing all of this testing

48:14 for profit, then there's a license requirement. So we see a combination of these things showing

48:21 up. It's actually becoming more common now with open source tools, at least in the

48:28 genomic bioinformatics space. Yeah. Oh, excellent. I think another benefit probably for you guys,

48:33 I know it's a benefit for a lot of organizations is if you use the open source tools and you need

48:39 to hire somebody new, there's a good chance that they have experience already with those tools.

48:44 Whereas if you use something private, expensive, you might have to teach them from scratch what

48:49 the thing is, right? Yes, that is correct. As a matter of fact, it's fortunate that a lot of the

48:55 people who've done a lot of good work and have contributed to the genomics bioinformatics space

49:01 have, the general tendency is whenever we are setting up any sort of pipelines or DNA sequencing,

49:10 RNA sequencing, or more from a research perspective, methylation sequencing, single cell RNA-seq,

49:20 UMI-based error-corrected variant calling, there's a lot of, there's a very thriving open

49:25 source space. And so that really helps with people who come in, even if they're not familiar with

49:32 these tools, it's easy to get familiar with because there's a lot of community backing that up.

49:37 Or as you said, when we hire people who already are coming from a different lab or they've had

49:44 some experience, but they come and say, "Oh yeah, we know how to do alignment, or I'm aware of these

49:49 applications that use that." It is much, much easier from a learning curve perspective rather

49:54 than having to now open up a manual and this will be a proprietary thing that only works here.

50:00 Yes, exactly. Cool. All right. Well, we coordinated a bit on a list of packages that you've used in

50:08 your lab or find really helpful for your work. And maybe we could touch on those just a little bit.

50:13 Yeah.

50:14 Yeah. So CNV kit, genome-wide copy number from high throughput sequencing. I don't know what

50:20 that means, but tell us about it.

50:21 Yeah, yeah, absolutely. So CNV or copy number variation, this is a type of genomic alteration

50:28 where what happens is in a simplistic way, at least when we talk about cancer, the cancer cells,

50:36 sometimes for it to be able to survive, it tries to use different ways of doing that biologically.

50:42 One of the ways to do that involves certain genes that help a cell to grow in the absence of nutrients

50:48 or with very little nutrition. If it has more copies of those genes than normal,

50:54 then, you know, it's like you have more money than expected, you can do a lot of

50:59 things. So typically what happens is in a normal human genome, any cell that we pick up, it will

51:05 only have two copies of the gene. One is coming from your mom, one is coming from your dad.

51:08 In cancer, what happens is in certain scenarios, if a gene that helps with growth of the cell or

51:15 it helps the cell to survive even without signal or nutrition, if it has more copies of that,

51:20 it'll make six, eight, 20, 50 copies, it can survive. Versus there are certain scenarios

51:26 where if there is a gene that is supposed to regulate the cells, so it doesn't go haywire,

51:31 if the cancer is able to delete one of the genes out, then you only have one gene left.

51:36 You knock out a thing and then that protective mechanism is gone. So then the cancer cell can

51:39 easily survive. So what happens is with CNV or copy number variations, the idea is that we use

51:46 the high throughput sequencing data to be able to infer how many copies of these genes do we have.

51:51 Is it more than two? Is it less than two? And so this particular package, it's a very, very

51:56 well-established, well-maintained package in the community that essentially does this thing.

52:04 You give it the sequencing data and define the regions of the genome that you're interested in.

52:10 You can also provide names for the regions, like this region is gene B-RAF or this is EGFR,

52:18 whatever you're interested in. And then what it does is it will do all the analysis to be able to

52:23 tell you that, okay, well, when we are comparing this particular tumor against this reference set

52:29 of 20 normal samples, where we know that you should only have two copies of the gene,

52:34 in this particular tumor, we are seeing there are 50 copies of the gene.

52:37 So it gives you output data, numerically, that can tell you, what it does is it does a

52:46 log2-based transformation of the ratio there. Okay. After all this computation,

52:52 when compared to the normal, this is 50 times more, or this is 20 times the

52:57 expected copies, or it is half of the amount of copies in the case of deletions.
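
A tiny worked example of the log2 copy-ratio convention just described; the copy numbers are illustrative.

```python
# Worked example of the log2 copy-ratio convention described above.
# A normal diploid region has 2 copies, so the ratio is observed_copies / 2.
import math

def log2_ratio(observed_copies: float, normal_copies: float = 2.0) -> float:
    return math.log2(observed_copies / normal_copies)

print(log2_ratio(2))               #  0.0 -> normal, two copies
print(log2_ratio(1))               # -1.0 -> single copy, e.g. a deletion (or one X chromosome in a male)
print(log2_ratio(4))               #  1.0 -> gain, four copies
print(round(log2_ratio(50), 2))    # ~4.64 -> high-level amplification, e.g. 50 copies
```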

53:03 So that's what really it does. And it's written in Python. It has Python dependencies that have

53:13 been written in either C or Python C bindings. But at the end, it gives you that data. And it

53:22 has an internal visualization tool, but I was not very happy with how it was written. So

53:29 I ended up writing a wrapper, which is called CNAplotr. It's open source. It essentially

53:34 uses the end data from CNV Kit, and then it gives you a nice visualization of the copy numbers.

53:42 I think if you go down, if you scroll down, there's my example images.

53:45 - Yeah, you have this on GitHub, so if people want to use this, it's right there, right?

53:51 - Yep, it's right there. So I think at the very bottom of the images, by the screenshots there.

53:56 - Oh, yeah.

53:57 - Yep, right here. So for example, the first image over here, you can see this,

54:01 it's a thin band of all these multicolor things, and each one of them is a single

54:07 human chromosome. So chromosome one, two, three, four, so on and so forth. And if you look at the

54:14 image, the y scale essentially is log two, which is centered at zero. And going up,

54:20 it is one, two, three, and then it's a negative scale on the lower side. So anything going above

54:25 zero means you have more copies than two, going below is less than two copies. And so if you see

54:30 here in this example, the plot here, you see the very end, which is chromosome X, is a single,

54:37 the band over here is lower at negative one. That means this is a male patient with a single X

54:44 chromosome, as compared to females who have two X chromosomes. And so when you look at this plot

54:49 below here, this is actually a plot from a cell line, a tumor cell line that is abnormal. And

54:56 here we see there are two genes which are amplified. One of them is a gene known as TERT,

55:02 and the other gene is MDM2. So these two genes are, again, one of those examples where it gives

55:08 the tumor survival advantage over other. And so you can see here there are multiple copies of

55:13 these genes as compared to the baseline over here. - I see. So that might predict something like the

55:19 how survivable the cancer is. - Yes. Right. So if it is, is it going to be localized,

55:24 say, where it happened, or it's going to like spread to other parts of the body,

55:28 or be difficult to treat, or be resistant to treatment. Yes.

55:31 - So if this is you, you want higher numbers, not lower numbers.

55:34 - It all depends. I mean, certain genes are good genes, for example, if there is a,

55:39 there are certain checkpoint genes, if those numbers, you know, if they have lower numbers,

55:44 you want to have two copies of them, because if that protective mechanism is gone,

55:48 you know, the tumor becomes very aggressive. - I see.

55:52 - So it is all in the context. So if you're looking at the good genes, you want to have

55:56 two copies of the good gene. If you're looking at some of the bad genes, you don't want to have

56:00 more than two copies of the bad genes. - One or zero is better. I got it. I got it.

56:03 Okay. Okay. HGVS. - Yes. This is, again, a wonderful package that was initially, I think it was started by a person named Vishal. He's, I think he still

56:16 maintains it, but there's a lot of like, you know, it's a very well publicly maintained

56:20 open source package. It's a lot of, you know, community involvement in that as well.

56:27 So what HGVS is, it's a nomenclature system for, you know, giving a name to all these variations.

56:35 So when you talk about, I'm not sure if Michael, if you've heard about the term mutation.

56:39 So mutation is a very commonly used term that refers to some kind of abnormality in the

56:45 genome. In this case, so what happens is there are these standards that are, that, you know,

56:52 most clinical labs follow when they're putting all of this information in the patient's report

56:57 saying, okay, you know, this particular tumor has, you know, mutation in BRAF, mutation in EGFR,

57:03 some other gene. And there's a certain way that those mutations are described in terms of what

57:08 sequence alterations happening, say at the mRNA level and what sequence alterations are happening

57:14 at the protein level. So now in your protein, you know, you're missing these amino acids or

57:18 you have excess of these amino acids or something got switched from here to there.

57:22 So there's a formal way of defining that. And the guidelines of the group that defines that is

57:28 referred to as HGVS, Human Genome Variation Society. And so it's a very complicated process

57:34 where you have to do all these translations from the, you know, the genomic scale where

57:38 the numbering system starts from one to like, and you know, whatever the length of your chromosome

57:44 is in terms of ATGCs and each chromosome has a different number. And if you have a certain

57:48 alteration that is happening, say in chromosome seven at this particular position, then you have

57:53 to translate that to the mRNA of that gene and then the protein of that gene. So it's a lot of

57:58 math, a lot of strings involved in that process. And so essentially this HGVS Python package

58:05 provides all of that functionality as a wrapper. You can create your translation. You can essentially

58:10 project the variant from your genomic to the, you know, the mRNA to a protein level or vice versa.

58:17 You can validate things. So we ended up, I actually wrote a paper about this when we,

58:24 you know, we did a validation of how well this particular package works. And so now, you know,

58:29 in the lab that I'm currently in, we implement this thing for generating those nomenclatures.

58:34 So what happens is when we put a report out for the, in the patient's chart and when our, you know,

58:39 say our oncology, oncologist was treating the patient, they want to know, okay, what is,

58:45 you know, what did you identify in this tumor genome? They will read that nomenclature saying,

58:49 oh, okay, this particular change in this BRAF gene, this is significant. I know that there are

58:54 therapies that are out there that we can use to treat this patient tumor. So that's what this

59:00 nomenclature system is about. So it's a very complicated automated system.

59:04 Yeah. And it normalizes it if there's multiple ways to represent it.
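
As a rough sketch of the kind of projection the biocommons hgvs package does, the snippet below parses a transcript-level variant and maps it to protein and genomic coordinates. It needs network access to the public UTA data provider, and the BRAF transcript and variant shown are only example values, not anything from his lab's pipeline.

```python
# Sketch of projecting a transcript-level variant to protein and genomic coordinates
# with the biocommons "hgvs" package. Requires network access to the public UTA server;
# the BRAF transcript/variant shown is only an example value.
import hgvs.parser
import hgvs.dataproviders.uta
import hgvs.assemblymapper

hp = hgvs.parser.Parser()
var_c = hp.parse_hgvs_variant("NM_004333.4:c.1799T>A")  # BRAF c.1799T>A (p.V600E)

hdp = hgvs.dataproviders.uta.connect()
am = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name="GRCh38", alt_aln_method="splign")

print(am.c_to_p(var_c))  # protein-level nomenclature, e.g. NP_...:p.(Val600Glu)
print(am.c_to_g(var_c))  # genomic-level nomenclature on the chosen assembly
```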

59:08 Very, very nice. All right. This one I'm familiar with, OpenPyXL.

59:14 Yes.

59:15 You probably have a lot of data that either goes, comes from or goes, it's shared out into Excel, right?

59:20 Yes. So what we do is we sort of are right now in our lab, we're kind of in this sort of,

59:26 you know, kind of an interim phase where we sometimes use Excel to look at some data. So

59:35 traditionally speaking before, you know, typically any lab that goes from, you know,

59:42 zero to the point where you have a web application that automates everything,

59:45 the intermediate phase is using a lot of Excel. So it's very common in many labs

59:51 to use Excel for a lot of different things, you know, for QC, for charts, for tracking. So

59:56 we use this OpenPyXL for a few things. One of them is when we have a lot of,

01:00:05 you know, the sequencing data that we have to summarize and then generate a QC to be able to

01:00:10 present that to essentially create an Excel document on the fly from the backend to provide

01:00:17 that, you know, whatever data they want to look at in terms of statistics or, you know,

01:00:21 list of variants or some form of, you know, calculation they want to do further. That's

01:00:26 where we use this package. Typically we use it as part of our bioinformatics pipeline when we

01:00:30 have to generate those things. But it's a very handy tool. We actually use something similar,

01:00:35 and I'm forgetting the name of the package that is used to generate our document. Like we use some

01:00:43 Word documents for creating reports, but we also use Python there to be able to summarize a lot of

01:00:48 these data points and then create a Word document that, you know, it starts with a template of a

01:00:52 Word document and then use Python to fill up all these, you know. Right. Here's where the graph

01:00:57 goes. Here's where the summary goes. Here's where the detected, whatever it goes. Yeah. Right. Yeah.
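
Here is a minimal sketch of that kind of on-the-fly QC workbook generation with openpyxl; the metric names, variants, and file name are made up for illustration rather than taken from the lab's actual pipeline.

```python
# Minimal sketch: write a run-QC summary and a variant list into an Excel workbook
# with openpyxl. The metrics, variants, and file name are illustrative only.
from openpyxl import Workbook

qc_metrics = {"total_reads": 41_235_000, "mean_coverage": 512.4, "pct_q30": 93.1}
variants = [
    ("chr7", 140753336, "BRAF", "c.1799T>A", "p.V600E"),
    ("chr12", 25245350, "KRAS", "c.35G>A", "p.G12D"),
]

wb = Workbook()

ws_qc = wb.active
ws_qc.title = "QC summary"
ws_qc.append(["metric", "value"])
for name, value in qc_metrics.items():
    ws_qc.append([name, value])

ws_var = wb.create_sheet("Variants")
ws_var.append(["chromosome", "position", "gene", "cDNA change", "protein change"])
for row in variants:
    ws_var.append(list(row))

wb.save("run_qc_report.xlsx")
```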

01:01:02 Cool. Are you, here's two things that overlap. Are you familiar with this thing where scientists

01:01:09 rename human genes to stop Excel from misreading? Oh yes. Yes, absolutely. Oh my gosh. This is

01:01:17 crazy. Yes. Yes. It happens. When we import a lot of this data coming from somewhere, we'll see

01:01:25 entries like September 14th or March 19th. Yeah. This is a big problem going in and out of Excel.

01:01:34 And so as much as you can do in Python or any proper programming language, rather than using

01:01:40 Excel, but there was one that was M-A-R-C-H one or March one. Yes. Or S-E-P-T one.

01:01:48 It's very funny. Some of the gene names are funny, but then Excel, you know, gets it to the next

01:01:55 level when it changes the names. This doesn't make any sense. Yeah. Yeah. It doesn't make any sense.
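
For anyone moving gene lists through spreadsheets, one hedged way to sidestep the SEPT1/MARCH1 date problem from the Python side is to keep gene symbols as explicit text both when reading and when writing; the column and file names below are assumptions.

```python
# Two small defenses against Excel turning gene symbols like SEPT1 or MARCH1 into dates.
# Column and file names are illustrative.
import pandas as pd
from openpyxl import Workbook

# 1) When reading delimited data, force the gene column to stay a plain string
#    (and don't let "NA"-like gene symbols become missing values).
df = pd.read_csv("variants.tsv", sep="\t", dtype={"gene": str}, keep_default_na=False)

# 2) When writing to Excel, mark the gene cells with Excel's text format ("@")
#    so the spreadsheet is less likely to reinterpret them as dates.
wb = Workbook()
ws = wb.active
ws.append(["gene", "log2"])
for gene, log2 in zip(df["gene"], df["log2"]):
    row = ws.max_row + 1
    cell = ws.cell(row=row, column=1, value=str(gene))
    cell.number_format = "@"          # '@' is Excel's text number format
    ws.cell(row=row, column=2, value=float(log2))
wb.save("variants.xlsx")
```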

01:02:02 Yeah. All right. On to the next one. Hera. Yes. Hera. This is very interesting. So this is where

01:02:08 I think, you know, where in our instance, we are going away from standard web applications,

01:02:16 standard bioinformatics pipeline to really touching DevOps using Python. And so one of the things

01:02:21 that typically we get to the point when we scale up our bioinformatics pipeline, where we have

01:02:27 multiple samples and multiple runs and everything needs to be orchestrated in a way where you have,

01:02:33 you know, while you're running your pipeline, you have a lot of visibility into how it works.

01:02:37 And so this is one of our projects we're working on to move our current bioinformatics pipeline,

01:02:44 the way it works, you know, kind of on a single server to be able to use the Kubernetes cluster

01:02:49 to actually deploy the long running pipelines onto that. And so there are many options. You know,

01:02:55 there are more standard sorts of, you know, workflow

01:03:02 protocols that you can use to run on either cloud or HPC environments. There is a very popular tool

01:03:09 called Nextflow that is used to be able to, you know, kind of create your data analysis pipeline.

01:03:15 We can sort of define that and then use any backend to deploy it. One of the things that we

01:03:21 kind of, when I was exploring the space, one of the things I came across was, you know, the whole

01:03:28 sort of ecosystem that Argo maintains with, you know, Argo workflow and Argo CI/CD and all those

01:03:35 things. So workflow was interesting because Argo provides that way where you can sort of, you know,

01:03:39 write your pipelines in a YAML format and then have it, you know, deployed on the Kubernetes

01:03:43 cluster. It really is very native to the Kubernetes cluster. It sounds a little bit like Ansible,

01:03:49 but specifically for bio types of projects, right?

01:03:55 Yeah. So Argo, so the interesting thing is Argo, you know, this Argo Workflows was set

01:04:00 up really with a lot of CI/CD automations in mind. So, you know, it is, yes, you can run data

01:04:06 pipelines in general, but never, it was never, at least in its description, it never describes

01:04:11 use case sort of in bioinformatics or, you know, biology pipeline analysis. And similarly,

01:04:17 you know, it was like, okay, it's a generic tool. You can use it for whatever you want.

01:04:22 So I tried it out with using, you know, like a YAML file and it was a simple four-step pipeline.

01:04:28 It was wonderful. It was magical. And the good thing was with Argo, like the Argo workflow,

01:04:35 when you install that on your Kubernetes cluster, it comes with a native web interface. So it's,

01:04:43 you know, I'm sure if you've heard about the workflow option with Airflow. So Airflow is a

01:04:49 package that also, you know, there's a nice Python SDK for that, where you have, you know,

01:04:54 you deploy it on a Kubernetes cluster. We have all these amazing visualizations to show what step

01:04:59 you're on, or if there's some error there, it'll do that. Argo does the same thing. So it has,

01:05:04 it has obviously the built-in capability to interact with it as an API, but then also there's

01:05:09 a web interface that it'll deploy and it can have visibility into every step of the process. You can

01:05:14 summarize and see the entire tree. So that was very interesting for us because we could get all

01:05:18 that thing done in a single thing, in a single go. But then our challenge was, well, our desire was

01:05:24 that if we could integrate that with our LIMS that we were working on using FastAPI,

01:05:29 so the idea was, hey, if there's any Python SDK. And so that's where Hera comes. So Hera

01:05:33 essentially is an SDK, a wrapper that's essentially talking to Argo, but you can define your pipeline

01:05:40 steps as, you know, DAGs in Python. And so that makes the process super simple, where you're now

01:05:48 natively, essentially we can integrate that as a backend to our web application. And so then it's

01:05:55 almost like, you know, it's Python again, from start to finish, you're not getting out of that.

01:05:59 And it's, again, it's a very well-maintained application. So we are currently doing a

01:06:04 validation to be able to make sure that, or demonstrate that, you know, it's equally

01:06:09 performant when we compare it to a more sort of native shell-based, you know, execution of the

01:06:15 pipeline. Okay. Yeah. This is new to me. I mean, of course I know Airflow, but not Hera. Cool.
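
To give a flavor of what defining a pipeline as a DAG in Hera looks like, here is a minimal sketch modeled on Hera's documented diamond-DAG example; the step names are placeholders, and actually submitting the workflow requires pointing Hera at an Argo Workflows server, which is omitted here.

```python
# Minimal sketch of an Argo Workflows DAG defined in Python with Hera (v5-style API).
# Step names are placeholders; submitting requires a configured WorkflowsService
# pointing at your Argo server, which is not shown here.
from hera.workflows import DAG, Workflow, script


@script()
def run_step(step: str):
    # Each step would normally shell out to an aligner, variant caller, etc.
    print(f"running pipeline step: {step}")


with Workflow(generate_name="ngs-pipeline-", entrypoint="pipeline") as w:
    with DAG(name="pipeline"):
        align = run_step(name="align", arguments={"step": "align"})
        call = run_step(name="call-variants", arguments={"step": "call-variants"})
        annotate = run_step(name="annotate", arguments={"step": "annotate"})
        report = run_step(name="report", arguments={"step": "report"})
        align >> call >> annotate >> report  # linear dependencies between the four steps

# w.create() would submit the workflow once a WorkflowsService/host is configured.
```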

01:06:20 All right. insiM, did I grab the right one here?

01:06:26 No, I think it's similar. Let me see if I can, I can send you the link.

01:06:32 Yeah. Throw it in the private chat here and I'll pull it up.

01:06:35 Yeah. Okay.

01:06:39 Here we go. insiM.

01:06:40 Yeah. Okay. Got it.

01:06:43 Yeah. So this is a, this is a very interesting space in next generation sequencing assay or for,

01:06:52 you know, high throughput sequencing assays. So what happens is, as I mentioned, that one of the

01:06:56 things that is required for a clinical lab is to be able to perform a validation on multiple samples

01:07:01 of tumors that have certain mutations. And then you can demonstrate that, you know, yes,

01:07:05 the assay works because you have tested, you know, a hundred samples that have 300 different,

01:07:12 you know, mutations or genetic alterations. And then you can demonstrate that, yes,

01:07:16 your pipeline or your assay was able to pick it up. So you can say that, you know, your assay is,

01:07:20 you know, X percentage sensitive, X percentage specific, and you know, your

01:07:24 recall rate and things like that. So what happens is when you're trying to

01:07:28 get those samples that have these very difficult or challenging variants to detect, because

01:07:35 they're just, you know, complex in how they occur biologically in the cell, it's very difficult.

01:07:40 Some of these are very rare. There may be only two samples in the entire world. Or it's just

01:07:45 not possible practically to get those samples unless we wait for like, you know, 10 years to

01:07:50 validate that. So the idea is that if it is possible to be able to use algorithms, which can

01:07:56 manipulate the existing sequence. So for example, we have a sequence data from an existing real

01:08:04 tumor sample, but we can manipulate that in a way where we introduce these mutations in silico.

01:08:10 So we can introduce, you know, SNVs or insertion, deletion mutations, and then

01:08:14 use that same file to then feed into our pipeline, bioinformatics pipeline and say,

01:08:18 okay, run through the entire data pipeline. And then let's see if we are able to identify

01:08:22 those variants that we inserted. And so that's where this in silico mutagenesis comes in.

01:08:26 It's a very hot topic. It's a very relevant topic of interest to really fill this large

01:08:32 gap in terms of availability of rare variants and their samples and how we can really improve

01:08:40 sort of some of these rare, but very clinically significant edge cases where we don't want to

01:08:44 miss those variants and actually see those in real tumor samples. And so this is a Python package

01:08:50 that was developed at the University of Chicago as part of their clinical lab. And so what really

01:08:56 it does is it will take in a list of different, you know, mutations, for example, in this plot,

01:09:02 I think they give examples of, you know, insertions and deletions, insertions where you have extra

01:09:07 sequence or deletion where you have certain signals which are missing or SNVs or single

01:09:11 nucleotide variants where you have one nucleotide that got switched with another one. And so these

01:09:16 are typically that we practically see in like real samples or real tumor samples, but this is a way

01:09:22 to mimic that, you know, in a sample that does not have it. And so this Python package is able to,

01:09:29 you know, take that list from you say, okay, I have a list of these 20 important mutations that

01:09:35 I know from the public databases have been reported, but I want them to be inserted into

01:09:40 my dataset that was created from say, a set of three or four real tumors, and then use that to

01:09:46 challenge the pipeline to say that, hey, can you still pick it up? And so I see.

01:09:50 I see. Simulate these rare changes and then test or exercise your setup.

01:09:56 Yeah, right.
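
insiM itself edits real sequencing data (BAM/FASTQ) so the spiked-in variants flow through the whole pipeline, but as a toy illustration of the idea, and not insiM's actual interface, here is a tiny sketch of introducing an SNV, an insertion, or a deletion into a reference sequence string.

```python
# Toy illustration of in silico mutagenesis on a plain sequence string.
# This is NOT insiM's interface; real tools edit reads in BAM/FASTQ files so the
# spiked-in variant flows through alignment and variant calling like a real one.

def apply_variant(seq: str, pos: int, ref: str, alt: str) -> str:
    """Replace `ref` with `alt` at 0-based position `pos` (SNV, insertion, or deletion)."""
    assert seq[pos:pos + len(ref)] == ref, "reference allele does not match the sequence"
    return seq[:pos] + alt + seq[pos + len(ref):]

reference = "ACGTACGTACGT"
print(apply_variant(reference, 4, "A", "G"))     # SNV:       ACGTGCGTACGT
print(apply_variant(reference, 4, "A", "ATTT"))  # insertion: ACGTATTTCGTACGT
print(apply_variant(reference, 4, "ACG", "A"))   # deletion:  ACGTATACGT
```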

01:09:57 We've got a few more to cover, but I think we're getting a little bit short on time. So let me just

01:10:01 close this out with a final question for you, because I know this is the topic du jour.

01:10:06 What does AI and LLMs look like for you guys? Does it matter? Is it really powerful? Is it

01:10:14 super important? I mean, genetics is kind of text data in a sense. And so,

01:10:19 yes, sort of in the space of how it could apply.

01:10:22 Right. It is text data and, you know, when you talk about like a search

01:10:30 space, a lot of the search space is very text-based. You know, there is some numerical

01:10:34 base, but there's a lot of text-based search as well. And I think across the entire spectrum from

01:10:40 where we start with very raw sequencing data to the point that we are trying to,

01:10:44 you know, ask the question that, okay, I found this rare or novel mutation in this particular

01:10:51 gene. What does it mean? What tumor has been described? What disease does it relate to?

01:10:57 One of the things that we do as molecular pathologists, and this is sort of where a lot

01:11:02 of the medical work comes in, is where we really go through a lot of the medical literature,

01:11:06 what we have learned before, new publications, papers out there that, you know, that have a lot

01:11:11 of data in terms of, you know, studies that are done on this particular gene. And they've described

01:11:16 like, okay, these alterations actually activate the gene or is bad for the tumor or, you know,

01:11:20 makes it treatment resistant. So you can see the natural, there's a lot of text as it happens.

01:11:26 And so in that space, we are seeing in the, I would say in the past, you know, three to four

01:11:31 years, there's been a lot of application of AI tools that have come out, you know, particularly

01:11:36 in the space of variant calling, where we have this genomic sequence data and we're trying to

01:11:41 identify variants. You know, one of the examples that's been talked about a lot is the

01:11:46 DeepVariant caller. It's called DeepVariant, from the team at Google who developed that.

01:11:52 That uses a lot of the AI techniques to be able to pick those things up.

01:11:56 There are some genomic databases that we use for in silico prediction. For example,

01:12:02 if you have a variant we have no idea about, there's a database called dbscSNV that

01:12:09 uses random forest techniques. And I think it uses another algorithm to predict if a certain

01:12:14 site where there's a mutation can enhance an abnormal version of a mechanism called splicing versus not.

01:12:21 Similarly, there's a lot of tools that are coming in and the LLMs, I think are,

01:12:26 I would say not mainstream, but I think there's a lot of interesting research that is coming

01:12:29 around there, but people are trying to use LLMs for doing these broader extractions, saying that,

01:12:36 hey, you know, I have these, you know, I don't know, a thousand articles, and I want to find

01:12:41 these particular combination of words that, you know, you know, it's a combination of a disease

01:12:49 and a mutation and what do I get back on that? I personally tried, you know, the ChatGPT with

01:12:54 different, you know, like phrases and questions about it. What I've seen so far is, and this is

01:13:01 purely my personal experience. I think a lot of it reads very real, but when you start to look

01:13:08 into the references as to what it references, then you quickly figure it out. This is not the

01:13:14 real deal. And so I think, I think it's, you know, I'm not a very pessimistic person. I would say,

01:13:21 oh no, this is all garbage, but I think there is opportunity there. It's just how do you train it.

01:13:26 Maybe there's a space or an opportunity, and it probably already has been,

01:13:31 people are pursuing this as training a smaller model, but really deeply in genetics,

01:13:37 whereas trying, not trying to use a model that tries to understand everything.

01:13:41 Right. Right. Yeah. It's more, yeah. More in the medical literature or the genomic literature

01:13:46 to be able to like, meaning is enhancement. Yes. So I think there's active work going on there,

01:13:51 but it's, yeah, it's, I think it's making, you know, a lot of, a lot of interesting

01:13:55 research, a lot of potential impact on how, you know, we do things. And obviously the tool sets

01:14:01 that we currently use, we might expect in the next 10 years to change.

01:14:04 Yeah, for sure. All right. Final thought here, people are listening. They're maybe doing similar

01:14:10 work to you. How do they get started? What would you tell them? Get going with Python and some of

01:14:15 these packages? Yeah. I mean, I would, you know, my reflecting on my own experience sort of, you

01:14:21 know, in a very long-winded way that I ended up here, I think, you know, I feel programming

01:14:30 in general, and I think Python particularly as a programming language, has a very low, you know,

01:14:36 sort of, you know, entry point in terms of being able to really quickly get things done, like learn

01:14:41 it easily and get things done. I think it should be to me, anybody, anybody who's trying to pursue

01:14:47 something in biology or computational biology or bioinformatics, I think this is the first thing.

01:14:52 It's something easy to do to be, you know, I would say relatively easy to do, to be able to get in

01:14:58 that. For anybody with, you know, a desire to learn this and who has analytical thinking, I mean, I think

01:15:07 investing into Python is probably the best bet because you can pretty much do anything you want.

01:15:13 That's what I tell, you know, when I train people in my lab or I talk to other students, is that

01:15:18 if you want to spend your time, you have very little time because you're busy, you know, with

01:15:22 your other things. I think the one thing that can get some of the job done and be still aligned with

01:15:27 what you're doing is Python. And after that, I think it's, you know, it's a lot of self-driven

01:15:34 learning where, you know, you kind of, you know, look into things. But the whole thing is,

01:15:38 I think the Python community is wonderful. It's almost like I sit down and I think about, oh,

01:15:42 I have to solve this problem. Probably there are 50 other people who are thinking about that and

01:15:47 maybe two people have already worked on it. So it's-

01:15:49 Right, and you've already published it to PyPI and you're good to go.

01:15:53 Absolutely. I totally agree with that. And, you know, people should take the couple of weeks,

01:15:58 get good at it, and it'll amplify. It'll save you time, definitely, in the long run.

01:16:02 It is, yeah. Oh, absolutely. Yes. The only thing that I, that I, the only thing I would say,

01:16:08 like an added thing is if somebody is learning Python and then they do have an intention,

01:16:15 you know, to take it to the point where they will be involved in more serious,

01:16:20 like, you know, application development or maintaining an open source package or,

01:16:24 you know, however they contribute to that. I think learning a little bit more,

01:16:27 like learning Python in its real sense in terms of how to do it right. You know, there are five

01:16:33 ways of doing something correctly. But I think there's one way that is consistent so that it's,

01:16:38 again, you know, easily shared, it's easily maintainable, others can easily understand.

01:16:43 I think that would be my second advice. It is, it takes a little bit of time, but I think it's

01:16:47 well worth the effort to spend the time writing, you know, idiomatic Python code. So it's,

01:16:53 it's portable. Absolutely. All right. So, Mac, thank you for being on the show. It's been

01:16:58 great to get this look inside of what you all are doing with Python.

01:17:01 Yeah. Thank you for having me on the show. I appreciate that.

01:17:03 Yep. Bye.

01:17:04 Okay. Bye, bye.

01:17:05 This has been another episode of Talk Python to Me. Thank you to our sponsors. Be sure to check

01:17:11 out what they're offering. It really helps support the show. Take some stress out of your life.

01:17:16 Get notified immediately about errors and performance issues in your web or mobile

01:17:20 applications with Sentry. Just visit talkpython.fm/sentry and get started for free. And

01:17:26 be sure to use the promo code talkpython, all one word. This episode is sponsored by Posit Connect

01:17:33 from the makers of Shiny. Publish, share, and deploy all of your data projects that you're

01:17:37 creating using Python. Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards,

01:17:44 and APIs. Posit Connect supports all of them. Try Posit Connect for free by going to

01:17:49 talkpython.fm/posit. P-O-S-I-T.

01:17:53 Want to level up your Python? We have one of the largest catalogs of Python video courses over at

01:17:58 Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async.

01:18:04 And best of all, there's not a subscription in sight. Check it out for yourself at

01:18:08 training.talkpython.fm. Be sure to subscribe to the show, open your favorite podcast app,

01:18:13 and search for Python. We should be right at the top. You can also find the iTunes feed at

01:18:18 /iTunes, the Google Play feed at /play, and the Direct RSS feed at /rss on talkpython.fm.

01:18:25 We're live streaming most of our recordings these days. If you want to be part of the show and have

01:18:30 your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:18:35 This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it.

01:18:40 Now get out there and write some Python code.
