Python in Biology and Genomics

Episode #154, published Wed, Mar 7, 2018, recorded Fri, Feb 9, 2018


Python is often used in big-data situations. One of the more personal sources of large data sets is our own genetic code. Of course, as Python grows stronger in data science, it's finding its place in biology and genetics.

In this episode, you'll meet Ian Maurer. He's working to help make cancer a thing of the past. We'll dig into how Python is part of that journey.

Episode Deep Dive

Guest introduction and background

In this episode, we meet Ian Maurer, a seasoned software developer and computer engineer who leads development at Genome Oncology. Ian started coding on a Commodore 64, later moved into Java for enterprise e-commerce, and then embraced Python to build annotation tools, APIs, and enterprise apps powering genomics research. Through his work, he helps oncologists and pathologists leverage next-generation sequencing data to improve cancer therapies. Ian’s perspective blends deep technical expertise with a passion for biology, illustrating how Python is enabling breakthroughs in modern genomics.

What to Know If You’re New to Python

Here are a few core ideas you’ll want to grasp before diving into discussions about genomics and data pipelines:

Python’s REPL: An interactive environment that lets you quickly test snippets of code. This is one of the features that drew Ian to Python.
Data Libraries: Tools like NumPy and pandas come up frequently, especially for data analysis and manipulation.
Django / Django REST Framework: Understanding basic web frameworks in Python helps when discussing annotation APIs and data exchange.
Async / Concurrency: Ian's team uses async and await features to speed up long-running tasks such as annotation jobs.

Key points and takeaways

How Python Powers Genomics and Cancer Research
Python’s data science ecosystem has largely displaced older scripting languages such as Perl in bioinformatics. Ian’s team uses Python for annotation, decision support, and building molecular pathology workflows. They focus on bridging the gap between raw genomic data and actionable insights for oncologists.
- Links and Tools:
  - Genome Oncology
  - Biopython
Why Genomic Data is “Big Data”
A single human genome contains roughly 3 billion base pairs. Sequencing a tumor adds further complexity because the sample is a mixture of normal and cancer cells. As costs plummet (from $3 billion for the Human Genome Project to under $1,000 for individual tests), the volume of data grows massively.
- Links and Tools:
  - National Human Genome Research Institute
The Next-Generation Sequencing (NGS) Pipeline
Typical stages include: (1) sequencing, which writes raw reads to FASTQ files, (2) alignment to a reference genome (SAM/BAM files), (3) variant calling to detect genetic changes (VCF files), and (4) annotation to interpret these variants for clinical decisions. Genome Oncology specializes in annotation and decision support rather than the actual alignment or variant calling steps.
- Links and Tools:
  - SAM/BAM formats
  - VCF specification
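To make the variant-calling output more concrete, here is a minimal sketch of reading one VCF data line with only the standard library. The `Variant` class, the `parse_vcf_line` helper, and the example record are illustrative assumptions, not code from Ian's team; production pipelines would normally rely on a dedicated VCF library.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int    # 1-based position on the chromosome
    ref: str    # reference allele
    alt: str    # alternate (observed) allele
    info: dict  # key/value pairs from the INFO column


def parse_vcf_line(line: str) -> Variant:
    """Parse one tab-separated VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value  # flag-style entries get an empty string
    return Variant(chrom, int(pos), ref, alt, info_dict)


# Hypothetical record; the coordinates and values are made up for the example.
example = "1\t12345\t.\tG\tA\t50\tPASS\tDP=120;AF=0.31"
variant = parse_vcf_line(example)
print(variant.chrom, variant.pos, variant.ref, variant.alt, variant.info["AF"])
```

Only the first eight fixed columns are read here; per-sample genotype columns, multi-allelic records, and header metadata are deliberately ignored.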
Annotation and Decision Support in Python
Once variants are called, Ian’s team leverages Python to annotate them with population frequencies, published research findings, and drug approvals. Their Django REST–based API helps pathologists match a patient’s mutations to the latest treatments or trials.
- Links and Tools:
  - Django REST Framework
  - My Cancer Genome (Vanderbilt)
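As a rough illustration of the annotation idea, the sketch below attaches a population frequency to each variant and flags variants that are common in the general population. In the conversation, Ian notes that a variant seen in a large share of people is unlikely to be cancer-causing. The lookup table, the `COMMON_THRESHOLD` cutoff, and the `annotate` function are all hypothetical; a real service would query curated frequency databases.

```python
# Hypothetical lookup: (chrom, pos, ref, alt) -> population allele frequency.
POPULATION_FREQUENCY = {
    ("1", 12345, "G", "A"): 0.48,
    ("2", 67890, "C", "T"): 0.0002,
}

COMMON_THRESHOLD = 0.01  # assumed cutoff, chosen only for illustration


def annotate(variant_key):
    """Return a small annotation record for one variant."""
    freq = POPULATION_FREQUENCY.get(variant_key)
    return {
        "variant": variant_key,
        "population_frequency": freq,  # None when the variant is not in the table
        "likely_common": freq is not None and freq >= COMMON_THRESHOLD,
    }


variant_keys = [
    ("1", 12345, "G", "A"),
    ("2", 67890, "C", "T"),
    ("3", 111, "A", "G"),  # not present in the lookup table
]
annotations = [annotate(key) for key in variant_keys]
for record in annotations:
    print(record)
```

Frequency is only one of the annotations mentioned in the episode; published findings and drug approvals would be added to the same record in a similar way.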
Bioinformatics Python Libraries
Projects like HGVS (from the BioCommons group) and Biopython offer specialized functionality for manipulating genomic coordinates (G-dot, C-dot, P-dot) and sequences. This enables teams to build advanced pipelines that convert raw genetic data into clinical insights.
- Links and Tools:
  - HGVS on GitHub
  - Biopython
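To show what a P-dot string encodes, here is a small standard-library sketch that parses a simple single-letter protein substitution such as the V600E example from the conversation. This is not the HGVS library's API: the regular expression and the `parse_p_dot` helper are assumptions for illustration and cover only the simplest case.

```python
import re

# Matches simple single-letter protein substitutions such as "p.V600E".
P_DOT_SUBSTITUTION = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def parse_p_dot(text):
    """Return the reference residue, position, and new residue, or None."""
    match = P_DOT_SUBSTITUTION.match(text)
    if match is None:
        return None
    ref_aa, position, alt_aa = match.groups()
    return {"ref": ref_aa, "position": int(position), "alt": alt_aa}


print(parse_p_dot("p.V600E"))    # {'ref': 'V', 'position': 600, 'alt': 'E'}
print(parse_p_dot("c.1799T>A"))  # None: coding-level notation is not handled here
```

Converting between genomic, coding, and protein coordinates requires transcript data and careful handling of edge cases, which is why dedicated libraries exist for it.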
NLP in Genomics: Using spaCy
Medical texts, such as physician notes or pathology reports, are often unstructured. Ian’s team uses spaCy to extract key terms (entity recognition) and interpret sentences indicating biomarkers like “estrogen receptor positive.” While they still keep humans in the loop for validation, spaCy significantly accelerates reading free text.
- Links and Tools:
  - spaCy
  - Getting Started with NLP and spaCy (Talk Python Course)
Concurrency and AsyncIO for High-Throughput Data
Handling large volumes of annotations can be I/O-bound. Ian’s team rewrote key pipelines using async and await (via aiohttp) to parallelize HTTP calls against their annotation APIs. The result: massive speedups, all in a single Python process.
- Links and Tools:
  - aiohttp
  - Async Techniques and Examples in Python (Talk Python Course)
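The concurrency pattern described above can be sketched with the standard library alone. In this example, `annotate_variant` is a stand-in that sleeps instead of making a real HTTP request (the team's version uses aiohttp against their own API), and `annotate_all`, `max_concurrent`, and the variant names are hypothetical.

```python
import asyncio
import time


async def annotate_variant(variant: str) -> dict:
    """Stand-in for an HTTP request to an annotation API."""
    await asyncio.sleep(0.1)  # simulates network latency
    return {"variant": variant, "annotated": True}


async def annotate_all(variants, max_concurrent=10):
    """Annotate all variants concurrently, capping in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(variant):
        async with semaphore:
            return await annotate_variant(variant)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(v) for v in variants))


variants = [f"variant-{i}" for i in range(20)]
start = time.perf_counter()
results = asyncio.run(annotate_all(variants))
elapsed = time.perf_counter() - start
print(f"{len(results)} annotations in {elapsed:.2f}s")
```

Because the waits overlap, the 20 simulated calls finish in roughly two batches rather than 20 sequential delays. The semaphore is one way to avoid overwhelming the server; the right limit depends on the API being called.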
Real-Time Collaboration with Django Channels
Some of their apps support interactive tumor boards, letting multiple clinicians review patient data in real time through synchronized web sessions. Django Channels, Redis, and WebSockets coordinate these multiuser experiences.
CLI Distribution with Click and Pex
Complex command-line interfaces for genomics can be simplified using Click. Packaging them with Pex bundles the Python dependencies into a single executable file, so end users don’t have to manage virtual environments.
- Links and Tools:
  - Click
  - Pex
Ian’s Open Source Projects: related and rigor

related: Built on top of the attrs library, making it easy to translate YAML or JSON into Python objects with validation.
rigor: A YAML-driven testing framework (inspired by Cucumber) for automating API tests with concurrency, coverage metrics, and JSON validation via JMESPath.
Links and Tools:
- attrs
- JMESPath GitHub

Tips for Attending PyCon in Cleveland
PyCon takes place downtown near many great restaurants and attractions, such as the Rock & Roll Hall of Fame or Progressive Field for an Indians game. Ian encourages folks to explore the city, join the hallway track at the conference, and remember that talks are recorded, so it’s fine to skip a session in favor of meeting people.

Interesting quotes and stories

“Cancer is really mutations within the genome causing things to break down… it’s the stuck gas pedal or the cut brake line.” -- Ian

“We’re a small company, so we focus on pragmatic solutions with Python rather than spending billions. It’s really worked out.” -- Ian

“The hallway track is where you form those deep connections, so remember you can watch the recorded talks on YouTube.” -- Michael

Key definitions and terms

NGS (Next-Generation Sequencing): High-throughput DNA/RNA sequencing technologies that drastically lower costs and increase data volume.
Alignment: Process of mapping short DNA reads to a reference genome (often resulting in BAM or SAM files).
Variant Calling: Identifying genetic differences (variants) relative to a reference genome, typically saved in VCF files.
G-dot, C-dot, P-dot: Different notations for describing genetic variants at the genomic, coding, and protein levels, respectively.
Entity Recognition: NLP technique to automatically locate and classify named entities (e.g., gene names or biomarkers) in text.

Learning resources

Python for Absolute Beginners – If you want a solid foundation in Python from the ground up.
Getting Started with NLP and spaCy – For a deeper dive into natural language processing topics covered in this conversation.
Async Techniques and Examples in Python – Learn how Python’s async and concurrency features can accelerate IO-heavy tasks.
Django: Getting Started – If you want more detail on building Django-based web apps similar to Genome Oncology’s annotation APIs.

Overall takeaway

Python’s unique blend of readability, data science libraries, and cutting-edge frameworks makes it a powerful fit for genomics and clinical research. From building annotation pipelines to real-time collaboration tools, Python accelerates both the technical and workflow aspects of cancer studies. The result is faster discovery, clearer insights, and tangible improvements in patient care, reinforcing how a flexible language plus an active community can have an outsized impact on modern healthcare.

Links from the show

Ian on Twitter: @imaurer
Genomoncology: genomoncology.com
Genomoncology on GitHub: github.com/genomoncology
My Cancer Genome: mycancergenome.org
Google's Deep Variant: github.com/google/deepvariant

Background Reading
What are computers used for in DNA sequencing?: biology.stackexchange.com
An introduction to Next-Generation Sequencing Technology: illumina.com
Difference between fasta fastq and sam file formats: bioinformatics.stackexchange.com
One Renegade Cell Book: amazon.com
BioStars, Bioinformatics explained: biostars.org

Sponsors
Codacy: codacy.com
Talk Python Training: training.talkpython.fm
Episode #154 deep-dive: talkpython.fm/154
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript


00:00 Python is often used in big data situations. One of the more personal sources of large data sets

00:06 is our very own genetic code. Of course, as Python grows stronger in data science,

00:10 it's finding its way into biology and genetics. In this episode, you'll meet Ian Maurer. He's working

00:15 to help make cancer a thing of the past. We'll dig into how Python is part of that journey.

00:20 This is Talk Python To Me, episode 154, recorded February 9th, 2018.

00:26 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem,

00:45 and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy.

00:50 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter

00:55 via @talkpython. This episode is brought to you by Codacy. Learn how they make code reviews better

01:02 by checking out what they're offering during their segment. Ian, welcome to Talk Python.

01:06 Hi, Michael. Thanks for having me on.

01:07 Yeah, I'm really glad to have you on to talk about Python and biology and genomics. These are two areas

01:13 where I've wanted to do a show on for a long time, but just haven't managed to get the right stuff all

01:19 lined up. So really excited to see how Python is playing a role here. And I think it's just another

01:26 cool example of how Python is being used in all these really varied ways.

01:32 Great. Yeah, it's really been taken off the last few years, and it's gone really well with

01:38 what we're trying to get done at our company.

01:40 Awesome. So let's dig into that. But first, let's hear your story. How'd you get into programming in

01:44 Python?

01:44 Yeah, so I started programming when I was 13. My parents got me a Commodore 64. I started learning

01:49 BASIC and trying to make my own games and things like that. I went to school for programming,

01:55 computer engineering, learned a lot of C and Pascal even at the time to date myself a little.

02:00 And after graduating, I worked at a defense contractor in their logistics department where

02:09 we did some SGML and XML based tools for documentation of these complex systems that they

02:16 have. And part of that was, you know, parsing of those files. And we actually were using this library

02:22 that was built in Python. And I kind of fell in love because of the REPL, right? So being able to kind

02:27 of open up the REPL and explore and play with the information right there was what hooked me. And ever

02:33 since then, I've kind of been doing it as a hobby, hoping that it would kind of take off in the web space.

02:37 And it did with, you know, Django and following on after Rails, but it just never worked out for me. I

02:43 was always doing Java based development for, you know, e-commerce sites and other stuff that I've done.

02:48 Java again?

02:49 No.

02:50 Yeah, always Java. So I was doing Java. I still like Java, still a good, still consider myself a good

02:55 Java developer. When I joined my current company, we did a couple of small things in Python and,

03:01 and then those kind of hooked, took off a little bit and we were able to, you know, kind of just double

03:06 down and, and add some more features. And, and really since that time with bioinformatics and other stuff,

03:12 we'll talk about, you know, Python's really taken off and actually made sense to, to really use it as, as one of our

03:18 core languages for some of our products.

03:19 Yeah, that's cool. So it's, it's finally like grown into this place where it's not just, oh, I'd like an excuse to use it, but it

03:26 really makes sense, right?

03:27 Yep. Exactly right. It actually solves the niche and it's really taken over actually for Perl in a lot of ways in,

03:33 in the bioinformatics space. And, it kind of sits along with R and has really got a lot of mind

03:39 share in the, you know, in this bioinformatics world.

03:41 Yeah. There's probably some cool infographic of Perl, I'm sorry, R and Python, like duking it out over,

03:47 you know, some sort of data science crown. I don't know what we'll see where that goes, but they're both doing

03:52 really well. And it's, it's nice to see Python growing so quickly. I really think, you know,

03:57 you look at the growth of Python, there's like this huge jump in its popularity. Like it's always been

04:03 growing, which is kind of amazing, but it has this sort of inflection point where it grows faster around

04:08 2012, which I feel like is where the data science stuff really started to kick in for Python.

04:13 Yeah. NumPy and scikit-learn and Jupyter, pandas, you know, some of these core, a lot,

04:18 all the machine learning.

04:19 Yeah. All the machine learning stuff, all those things really have just kind of gotten some mind

04:23 share altogether. And it's really just, we're kind of riding that wave and it's, it's really great.

04:27 And, and I, and I think you might've said this in one of your previous podcasts, but the fact that

04:31 people who are in data science want to learn something that is a general purpose language too,

04:34 that they could use to make themselves a little bit more marketable is, I think another, another kind of feather in the cap for Python over some of the other languages.

04:41 Yeah, it definitely is. Awesome. So this, you know, sounds like a really interesting

04:46 way of getting into it. So you went through the computer engineering perspective. Very nice. And

04:52 I think maybe the first place to start this discussion really is to talk about the biology

04:59 and, you know, your company and kind of the problem space that you guys are working in. So then we talk

05:05 about all the tools and the way Python solving the problems people know. So maybe tell us a bit about

05:09 what you do day to day. Yep. So I lead development for a company called Genome Oncology out in Cleveland.

05:15 We'll talk more about Cleveland later, but so I lead our, you know, software design development,

05:19 testing, and deployments. We were founded in 2012. And really that timing is important because really

05:25 around 2011, some of the big NGS platforms, next generation sequencing platforms were, came out around

05:32 there. So these are, these include things like the Illumina MiSeq and Ion Torrent. And why those are

05:37 important is because the human genome project, which you might've heard of kind of wrapped up between

05:41 2000, 2003. When did that start? Like late nineties, mid nineties, late nineties. It took, it took a few

05:47 years for sure. And, you know, it took about $3 billion to complete it. And that was basically just

05:51 mapping a first draft kind of, of the, of the human genome. And, you know, that's really, it basically says,

05:58 these are all the variants that, you know, quote unquote, a typical human is made up of. And that

06:03 took about $3 billion to do. And now we're talking about, you know, less than a thousand dollars.

06:07 And as that, you know, as the, you know, Moore's law applies to computer chips, right? There's kind

06:13 of a Moore's law effect, but even I've read some, some analyses where it's like, it's even greater

06:18 exponentially than Moore's law with these costs of genomics. It really is just driving the price down,

06:23 which makes it a lot, allows us to apply these technologies for lots of different reasons.

06:27 And the, you know, my favorite part obviously is the work that we're doing around helping people

06:32 with cancer and helping use genomics to help people find clinical trials, find therapies,

06:37 and hopefully improve their, their odds at, at fighting that disease.

06:40 It's definitely one of the great challenges of our time. You know, it's, we've sort of solved

06:46 the problems that, that were really bad for humanity to a large degree that, and now cancer is like,

06:53 one of the major, major things that people have to, to deal with, right? It used to be,

06:58 you might be hungry, you might be getting eaten by a wolf. Now, you know, you live, you live a long,

07:03 healthy life until something, you know, you get some kind of bad news, right? And so how much is cancer

07:08 a genetic problem versus other types of problems, right? You guys are building genetic tools. How's

07:15 this all fit together?

07:16 Caveat this by saying, I'm not, you know, a molecular pathologist. I'm not, you know, a bioinformatics person,

07:21 but cancer is a, is a disease of the genome, right? So you have your genome, you know, 23 chromosomes,

07:27 23 pairs of chromosomes, you know, you're talking about, you know, chromosome one is got 2000 genes,

07:33 you know, 250 million base pairs, right? That's the kind of the scope of the data that we have.

07:38 And cancer is really mutations within that genome causing things to break down in a certain way,

07:45 right? So, you know, one of the analogies, and there's a book called One Renegade Cell that we kind of make

07:49 all of our new employees read, really walks you through, you know, the gas pedal, the sticky gas pedal and the cut

07:57 brake line. And basically what ends up happening is your cell, you know, if you were to cut, have a little cut on your

08:04 finger, the cells around that cut would know to kind of multiply and grow and then cover over that cut.

08:10 And then they know how to stop, which is really an amazing feat.

08:14 It's actually unbelievable that the machine that is humans works or any form of animal, really. It's

08:20 incredible.

08:20 And it's all these individual cells and there's different signals throughout the cell and those

08:25 signals are called pathways. And what ends up happening is those pathways stop working in some

08:30 fundamental way. And the way that they're stop working is through mutations. And those mutations can

08:35 occur due to, you know, some environmental factor like smoking or UV light or some other mechanism that

08:42 causes that mutation to happen. And then from there, it ends up that one cell ends up growing and taking

08:49 over the space of the other cells. So a lot of these drugs and therapies that are out there are looking,

08:56 you know, some of these targeted personalized therapies are targeting those individual cells

09:02 that are kind of going off, going rogue and bringing them back and, you know, getting rid of them so that

09:07 the healthy cells can, can do their thing. And so, you know, our, our software, our company basically is in

09:14 the business of helping people, helping oncologists, helping pathologists and other folks in the healthcare

09:20 industry identify what these mutations are, figure out what they mean, and then help their patients get them on a

09:26 clinical trial or prescribe them a therapy.

09:28 Right. If you understand the actual genetics that's causing the problem, maybe there's a better,

09:33 more focused sort of treatment, right?

09:36 Exactly right.

09:37 Yeah. So if you look at chromosomes, like we talk about big data all the time, right?

09:42 But I mean, there's only 23 pairs, so that's no big deal, but they're actually made up of a lot of stuff,

09:48 right? So maybe like take us through the sort of big data store, just sort of the scale of the data,

09:53 I guess a better way to put it around chromosomes and genetics.

09:56 Yep. So there's, you know, 23 pairs of chromosomes and a quote unquote normal human being, right? And

10:02 you have about 3 billion base pairs across all those chromosomes. So they, and they get labeled one to 22

10:08 and then X and Y for the sex chromosomes. But the 3 billion base pairs in the human genome, there's about 21,000,

10:16 24,000 of those are what we call genes. And genes are what are the actual thing that code to proteins. And proteins are the thing that actually make the whole system work.

10:25 So the genes, the actual DNA part of it is the base pairs. And there's three base pairs, As, Cs, Gs, and Ts. Those go in pairs of three, you know, sets of three.

10:34 If you remember from biology, those build out to become amino acids. And then the average person has about 30 million or so variants, 10%. So in one of the tricks that we do, obviously, in the space is we don't actually record all three, 3 billion base pairs.

10:49 We just record, we just record the delta, just to make it, you know, a lot less data. And then the other part of making it a lot less data is, you know, focusing on specific genes, right? In cancer, you know, there's, depending on the disease type, there only might be three or four genes that matter, or maybe there's only 50 genes that matter.

11:07 But in this pan cancer, across all the different types of cancers, there might be about 800 or 900 genes that matter. So, you know, our types of tests and sequencing that we do really focus in on those, those smaller regions to just kind of manage the data in a faster way.

11:22 That collection of, you know, 3 billion base pairs, As, Cs, Gs, and Ts, those are, those are what are called the reference genome. And the reference genome is what everybody gets compared against, right?

11:33 So when you're, do your, when sequencing is done on a tumor or on a, you know, a normal cell, the deltas, the variants are what it's actually captured and recorded. And we're actually recording it also in the context of what genes are there. So genes actually don't make up a huge amount of the genome. It's a very small portion of the genome that actually codes the proteins.

11:53 Interesting. So is there like a bunch of basically instructions that are just off? They just don't go, they don't do anything?

12:01 They call it junk DNA. Now they don't necessarily, it doesn't necessarily mean it is junk. It's just not necessarily known at this time, or it doesn't, it doesn't code to protein, but maybe it does other things.

12:10 Like there's things called methylation and these other factors that affect the coding parts of it. And there's lots of theories of, of how that happens. Some of it's through evolution and, and pieces just kind of fall out and don't actually matter anymore in the human species.

12:25 But there's other, other theories that maybe some of it isn't junk as well. And then even within the exon, even within the genes themselves, there's exons and introns.

12:34 So the exons are these, these strings within the gene that actually get, you know, sliced out and turned into the RNA. And then that goes and codes the protein.

12:44 And then the other part's called introns. The introns are the parts in between each of the exons. So understanding how the whole map works, understanding how to sequence the data, get the data off the sequencer and, and then keeping track of all that data is, is interesting.

12:58 And, and one of the things that might be interesting to your, to your listeners, just because of the whole Python two, Python three thing is these reference transcripts get released over time.

13:07 Right. So, and the one that's currently, you know, the main one that people use in the clinical setting is called GRCh37.

13:15 And that was released in February of 2009 and lots of tools and things were built off of this version of the reference genome.

13:22 Well, over time they learn new things, they apply new regions. They, you know, it's, it's a very dynamic map.

13:27 And then in 2013, right, eight, five years ago now, almost they released GRCh38.

13:33 The whole industry hasn't moved over to this new, this new version of the reference genome.

13:37 So it's just the, because you got to update all your tools, update all your databases.

13:41 And it's a, and it's a, it's a tricky thing to do.

13:44 This major incompatibility. How interesting.

13:46 So you talk about this reference genome and there's about 3 billion base pairs and make up a person.

13:53 How much of that is consistent across every single person and how much of there's difference?

14:00 Because, you know, I feel like we look at people, we all look quite varied, but then you also hear things like, well, your DNA is 1.5% different than say a chimpanzee or something like that.

14:10 Right. So give me a sense for, you know, when you say I'm going to save the delta, what does that look like?

14:15 Usually an average person has, you know, about 10% variation from the 3 billion.

14:20 So about 30 million base pairs will be different across different, different people.

14:25 And, you know, it's obviously the numbers go up and down and there's prevalency frequencies, right?

14:30 So a lot of these databases that are out there and available for, for people to consume as part of their process, they actually say, you know, we sampled a thousand people.

14:38 And, you know, 20% of folks had a G in this spot and, you know, another percent had an A in that spot.

14:44 So that's, that's a big part of just understanding what some of these variants are.

14:49 And one of the things we do in our tools for cancer is that we'll, doctors are interested in that, that, that prevalency, that's that allele frequency.

14:56 Because if the frequency is 50%, well, there's no way that that's actually a cancer causing variant because people would be born with cancer.

15:04 And that just, it just doesn't really work that way.

15:06 It wouldn't be a viable situation.

15:08 So one of the data points they like to look at is how often does this variant actually happen in the, in the wild and, and actually in the human population.

15:16 So it's a very interesting stat.

15:18 Yeah, for sure.

15:19 Okay.

15:20 Interesting.

15:20 So maybe let's talk about how do you actually do the sequencing at a high level that won't get into the tools and the Python code that you actually make, how that's working in there.

15:31 But give us the sort of overall pipe.

15:34 Yeah.

15:34 Give us the general pipeline.

15:36 Like how do you go from, you know, a swab on the cheek or whatever it is to, here's your printout.

15:43 You ACCGTAC is, is you.

15:46 There's been older technologies that, you know, work in smaller regions and, and can do things like that.

15:51 There's a thing called Sanger sequencing.

15:53 But as I said earlier, one of the major changes is in 2011, they did this next generation sequencing.

15:59 That basically takes raw data right from, from a blood sample or a tumor sample.

16:05 They put it in this, this machine called the sequencer.

16:07 And then through the, you know, either chemicals or, or, or lights of, of that actual machine.

16:12 And once again, this isn't my area of expertise.

16:15 They're able to analyze it and basically do what are called reads.

16:18 So they're doing, you know, 65 base pairs across or, or what have you and read out As, Cs, Gs and Ts and write that to a file.

16:25 And that's written to a file called, you know, either a FASTA file or a FASTQ file, which has quality associated with it.

16:32 So all these raw reads are happening and it's basically like little snippets of a book.

16:36 And, but it's like a book that they, someone's cut up into little fragments and then kind of thrown it up in the air and then try to figure out how to reassemble it.

16:44 So that process, yeah.

16:46 So that process is not something we do at my company, but that process is what we call alignment.

16:52 And we take that book and try to basically tape it together.

16:55 And the way they do that is by trying to compare regions against the reference genome itself and through math and algorithms and, and some machine learning.

17:04 Now they're able to kind of align the whole readout of the, of the reference genome.

17:09 And those get stored into a file called a SAM file.

17:12 And really it's just a, just a listing of all these different variants, but in a line format.

17:16 And then those files can get compressed into what's called a BAM file.

17:20 And then we, you know, there are tools that are open source and, and tools like ours that actually allow you to do visualization of that alignment and really get a good understanding of do the reads line up?

17:30 Do the variants look right?

17:32 Is the quality there?

17:33 And do you believe the actual calls that are being done?

17:38 And then the next step after aligning it is actually what's called variant calling.

17:41 So the, you know, some additional software, once again, stuff we don't actually do.

17:46 It goes through the, these alignment files and makes decision and say, yep, I've, I've read through this, this BAM file or SAM file.

17:53 And, you know, I believe at this position on this chromosome that it's an A and not a T.

17:59 And obviously with two pairs of chromosomes, you might have, you know, half of them being A's and half of them being T's and things like that.

18:06 And cancer is a little bit different, right?

18:08 Because you then have a mixture of, of tumor cells that are kind of commingled with normal cells.

18:14 So you might actually get allele frequencies, what we call variant allele frequencies or VAFs that are not 0.5 or 1, but something in between.

18:22 Because it could be that actual mutation that is causing the cancer.

18:26 So like half of them have some values there, others have another, right?

18:30 The original normal cell and then you have these clones of tumor cells that actually, the actual cancer causing cell that is now growing and spreading in that region.

18:39 Right.

18:39 So that gives you more or less, here's what we think the genetics is.

18:43 And then you have to analyze it, right?

18:44 Right.

18:45 And this is really where we come into play, right?

18:46 So our company started in 2012 just because of this NGS data, you know, was starting to overwhelm pathologists and physicians with lots of genomic and molecular data.

18:57 And the belief of our company is that, you know, all medicine is going to be molecular in the future.

19:00 And really understanding how that, those, what those variants mean in the context of cancer, especially, is where we really focus our energies.

19:09 And that includes things like annotating the variants and trying to help people understand, you know, how often do they happen in the population?

19:16 Has there been papers out there that said this variant's pathogenic or benign?

19:21 There are some prediction models that people have written that say, you know, this variant will cause the protein to degrade in some known way, right?

19:28 In the stuck gas pedal or the broken brake line analogy.

19:34 And then from there, we're able to do decision support, right?

19:38 So there are FDA drugs that are available.

19:41 There are clinical trials that are available.

19:43 These things have very complicated eligibility criteria.

19:45 And our software helps you, helps doctors, you know, make sense of all this disparate data, bring it all together and say, oh, yeah, for this patient, given these mutations and maybe some other tests and some other data about the person themselves, we can say that, you know, this clinical trial is best for you.

20:02 Or, you know, this therapy would work for you.

20:04 The FDA has approved it for you.

20:05 And one of the interesting things that's happening is to prove the whole idea of cancer is a disease of genetics and not a cancer of, you know, something else, is that these drugs that are getting approved for, you know, lung cancer with a specific variant, well, that drug might work for, you know, a melanoma patient with a specific variant or vice versa.

20:24 I might be getting the analogy wrong, but you get the point.

20:27 Basically, it's the specific mutation that matters.

20:30 The fact that you have a V600E on BRAF is the most important part, not the fact that it was on your skin.

20:36 That's pretty interesting.

20:37 Understanding at this level is really powerful.

20:39 So let's talk about the software stack, maybe at a high level first, then we can dig into some of the tools.

20:45 Like, what kind of software are you guys writing to solve these problems and where's Python fit in?

20:50 We started off with, you know, a research application that we used to, you know, get the company started.

20:54 And then we built our first clinical app for pathologists.

20:57 And that was all built using Java as the language and GWT.

21:01 So Google Web Toolkit is a Java-based JavaScript tool, right?

21:05 So we don't really have any JavaScript wizards in-house.

21:08 And we've always been, you know, Java-based.

21:10 And so while that was getting built, we actually partnered up with a team at Vanderbilt University called My Cancer Genome.

21:16 And they have a website for people that are, you know, looking for information about genetics and cancer.

21:22 While the rest of my team was kind of building this, our first couple of products, I actually built a curation tool for them.

21:28 And I built that with the Django admin tool, right?

21:30 So Django has this great admin tool.

21:32 So we started, so I was able to kind of whip together a nice content management tool for them so they could get rid of their SharePoint solution that they were running at the time.

21:40 Anything that gets rid of SharePoint, that's a good thing.

21:43 That was the thinking there.

21:44 You can hold your head high that day.

21:46 We turned off SharePoint.

21:47 Right.

21:48 So, yeah.

21:48 And having a, you know, quick user interface for that.

21:51 And then we've since evolved that tool.

21:52 And now that tool is managing not just, you know, some basic content management stuff for the My Cancer Genome site.

21:57 But it's basically managing all of our knowledge and what we call our knowledge management system.

22:02 And then what we did was built on top of that Django REST framework API.

22:07 So using, you know, Tom Christie's tool to build out an API using, you know, REST.

22:11 And now you can, you know, hit the API and get back specific information running a thing called Match in our software.

22:19 So you can, you know, given a patient's information, their demographic and whatever biomarkers they might have, you hit our API and we'll give you back, you know, hey, this is a good clinical trial for you.

22:30 You know, within, you know, 50 miles of the patient, here's a good trial for you to maybe put them on or here's a therapy that's approved by the FDA.

22:37 That sounds really, really powerful.

22:39 And some cool tools that are involved in there.

22:41 You talked a little bit about user interfaces.

22:44 Is that all Java or are you doing some UI in Python?

22:47 I've heard your recent stuff about UIs.

22:50 All our UIs are in Google Web Toolkit right now.

22:53 We are doing the new version of the My Cancer Genome website using React.

22:56 So that's one piece of JavaScript that we've started to use.

23:00 But for the most part, you know, we're building out strong APIs with Python and then our UIs and things are still with Java and GWT.

23:09 Yeah, that sounds good.

23:10 I've heard a lot of good things about React, but I haven't done anything with React, so I can't speak too much to it.

23:15 Yeah, cool.

23:16 This portion of Talk Python is brought to you by Codacy.

23:20 If you want to improve code quality, prevent bugs and security issues from making it into production, and at the same time speed up your code review process by 20%, then you need to try Codacy.

23:31 That's C-O-D-A-C-Y.

23:34 Codacy makes it easy to track code quality and identify and fix issues by automatically analyzing your commits and pull requests with all the most widely used static analysis tools.

23:44 Codacy helps great teams build great software.

23:48 Join companies like DeliverHero, PayPal, Samsung, and more.

23:52 Try your first code review by visiting talkpython.fm/Codacy and linking your GitHub or Bitbucket account.

23:58 You can also just click on the Codacy link in the show notes.

24:01 All right, so let's talk about some of the tools that you're using.

24:04 So you talked about Django REST framework.

24:07 That's Tom Christie's tool.

24:08 I had him on, or framework.

24:10 I had him on the show a while ago as well.

24:12 So it's basically layers on REST API on top of Django, right?

24:17 So maybe tell people how you're using that, like what it's doing for you.

24:22 One of the key things that we do is annotations.

24:24 And one of the annotations people want to know is, okay, where is this variant?

24:28 And where is it in the context of the whole genome?

24:30 And that's called the G-dot.

24:32 Or where is it in the context of the coding region of a gene?

24:36 And that's called the C-dot.

24:37 Or where does it end up land, you know, once it goes from a C-dot to a P-dot, which is the protein, right?

24:43 So the actual amino acids.

24:44 So G-dot, C-dot, P-dot.

24:46 So that is a nomenclature called HGVS.
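
As a rough illustration of the G-dot, C-dot, P-dot idea, the sketch below splits an HGVS-style string into its accession, coordinate system, and change. It is a toy regular expression, not the hgvs library's parser, and it handles only these three prefixes; the helper name `describe_hgvs` is hypothetical, and the BRAF V600E strings are used purely as sample input.

```python
import re

# Toy pattern: "<accession>:<g|c|p>.<change>". Real HGVS grammar is far richer.
HGVS_PATTERN = re.compile(r"^(?P<accession>[^:]+):(?P<kind>[gcp])\.(?P<change>.+)$")

COORDINATE_SYSTEMS = {
    "g": "genomic",      # position in the whole genome
    "c": "coding DNA",   # position in the coding region of a gene
    "p": "protein",      # resulting amino acid change
}


def describe_hgvs(expression: str) -> dict:
    """Split an HGVS-style expression into accession, coordinate system, and change."""
    match = HGVS_PATTERN.match(expression)
    if match is None:
        raise ValueError(f"not a g./c./p. expression: {expression!r}")
    return {
        "accession": match["accession"],
        "coordinate_system": COORDINATE_SYSTEMS[match["kind"]],
        "change": match["change"],
    }


for text in ("NM_004333.4:c.1799T>A", "NP_004324.2:p.Val600Glu"):
    print(describe_hgvs(text))
```

Converting between the three coordinate systems needs transcript and reference data, which is the hard part that a dedicated library takes care of.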

24:49 There's actually a lot.

24:50 And so our API actually houses, you know, all of our knowledge.

24:54 But it also calculates annotations for people.

24:57 And one of the great libraries we use is the BioCommons and HGVS.

25:04 And those two libraries are open source, open on GitHub.

25:07 And they do a really good job of doing those calculations.

25:10 So if you're trying to understand, you know, how to get into genetics, I'd look at those libraries.

25:15 There's also a library called Biopython.

25:17 We don't use that, but it's also really good.

25:20 And then from a bioinformatics perspective, you know, we use that full stack.

25:24 So we have on top of our API, we've built out some user interfaces that use actually Jupyter and Bokeh and Pandas and NumPy.

25:32 So I actually take that back.

25:34 Our genome analytics platform, you know, the major part of it, the container part of it is written in GWT.

25:40 But it's actually calling in and bringing in Bokeh plots as well.

25:44 So Bokeh is being used on the back end using Pandas to calculate these great plots.

25:49 And then we're rendering them in our front end.

25:51 Yeah, that's really cool.

25:51 I've never had a chance to use anything with Bokeh.

25:54 But that's where you basically do the sciencey visualization stuff on the server in Python.

26:00 And it just transfers over to the web front end.

26:03 Is that right?

26:03 Yeah.

26:04 So it's calculating the JavaScript for you.

26:06 Because once again, we don't have the JavaScript chops in house.

26:10 But you're basically running pure Python using Pandas data frames.

26:14 And then you basically configure your Bokeh plot using this Python library.

26:19 And then it renders it.

26:20 And then it basically streams out HTML and JavaScript.

26:23 And you can just kind of embed it in an iframe or what have you in your UI.

26:27 And it works great.

26:28 And it sounds really great.

26:29 Like you don't have to be in the charting business.

26:32 Exactly right.

26:32 Those are live, right?

26:33 They're not just like PNGs or something.

26:35 You can definitely work with them dynamically right there.

26:39 You can use them to generate PNGs if that's what you need.

26:41 And some of our clients do need that to include it in their research papers if that's what they're using our tools for.

26:48 But yeah, it's got lots of different use cases.

26:50 And Python keeps coming up with great libraries for visualizations.

26:55 And there's lots of different options too.

26:57 But Bokeh has worked out well for us.

26:59 Yeah.

26:59 It's kind of becoming a paradox of choice, right?

27:01 Like there's a little – as soon as you learn something – yeah, as soon as you learn something, you're happy with it.

27:06 You're like, but that looks better.

27:07 Maybe I should do that.

27:08 And of course, it's a constant treadmill sort of thing.

27:12 So one of the tools that you're using that didn't surprise me but I think is interesting and I want to hear more about is spaCy.

27:17 So I don't even think I've mentioned spaCy on the podcast before.

27:20 Tell us about that.

27:21 What's spaCy?

27:22 Yeah.

27:22 So we've done really a proof of concept at this point using natural language processing.

27:27 So one of the major challenges in our space and IBM and a few other big companies are spending lots of money to try to tackle this problem.

27:35 But basically the problem is a lot of these EHRs, EMRs, people are recording their notes about patients in kind of free text.

27:43 And one of the challenges with that obviously is it's unstructured and it's hard to do anything with it.

27:48 We're not really in the business of major machine learning.

27:51 We're kind of in the workflow and tools business.

27:53 We help people solve problems in kind of a more pragmatic way.

27:57 We're a small company.

27:58 We can't spend billions of dollars.

27:59 But what we're doing is we're taking spaCy and using that to parse some of these free text files and basically make recommendations to people.

28:09 So doing things like what are called entity recognition.

28:13 So entity recognition means I'm reading this Wikipedia article and finding all the proper nouns in it.

28:18 Barack Obama did this in Detroit, Michigan or whatever.

28:22 Those would all be proper nouns.

28:23 And this is a great tool for extracting out named entities like that.

28:27 We've trained spaCy to find named entities based on our ontologies, our data within our KMS.

28:35 Right.

28:35 These are our important words.

28:37 Go see if they say this.

28:38 Something like that.

28:39 Exactly right.

28:40 So there's a pattern matching framework that's within spaCy that's really very easy to use.

28:44 And then the other thing we'd use it for is for classification.

28:47 So basically we've trained some models to say, OK, when you read this sentence and it says, you know, estrogen receptor strongly expressed.

28:55 Well, we want that to actually mean something.

28:58 We want that to mean ER positive in our in our use case, in our vernacular.

29:01 And that means something to our our end customers.

29:04 And what it really does is what we then do is present it to them and say, hey, we saw this sentence and we're you know, we think it says this.

29:11 Do you agree?

29:11 Yes or no.

29:12 And if they say yes, then we kind of keep that that piece of information and use it to further train our model to make it better over time.

29:20 We're not really trying to we don't really think we can get rid of the human in the loop at this point just because, you know, we're just just at the start of this thing and we want to make sure we get the right answer 100 percent of the time.

29:29 But what we want to do is make it so they don't have to read, spend a half an hour reading through a document where we can just scan it for them and say, here are the interesting parts.

29:38 Please go ahead and just confirm it.

29:40 That's pretty wild.

29:41 I feel like this whole machine learning AI business is deeply reaching into medicine and things like that.

29:49 Right.

29:49 This is just another super interesting example I hadn't even thought of.

29:52 But, you know, in terms of oncology, like the analyzing, say, scans like pictures to see, you know, have the machine say, no, that looks like cancer to me.

30:02 Like kind of doing what radiologists might do or something.

30:04 Right.

30:04 Exactly.

30:05 Yeah.

30:05 It's pretty amazing.

30:06 We like spaCy a lot.

30:07 I originally tried playing with NLTK a few years ago and actually kind of ran into some barriers.

30:11 It's an older project.

30:13 spaCy is really modern in that it, you know, kind of follows some of the best practices with Python.

30:18 I highly recommend it.

30:19 The documentation is really good.

30:20 Performs really well out of the box.

30:22 And I was able to pull together a really good demonstration in just a few weeks.

30:26 So I highly recommend it.

30:28 Looks really cool.

30:29 They definitely have it lined up so that, when you go to visit spacy.io, it really looks appealing and polished.

30:37 I was wondering why you didn't choose, what the difference or what made you choose spaCy over NLTK?

30:42 It's actually pretty obvious straight away, isn't it?

30:44 They're doing a really good job with, you know, as a small open source, you know, company.

30:48 I think there's like maybe two people working there from what I can tell.

30:51 And they've, you know, they've basically open sourced their core product and they're selling, you know, ancillary products on top of it.

30:58 And they're consulting services too.

31:00 And, you know, it seems like a great project.

31:02 Yeah, that's really cool.

31:03 And I definitely look at it more because I'm always fascinated how these people are building really interesting business models on top of some kind of successful open source thing.

31:13 So, yeah, another cool example.

31:15 So you're building some interesting CLI tools and you guys are using Click, which is pretty common.

31:23 That's from Armin Ronacher, who made Flask.

31:26 You're also using Pex.

31:27 That, I think, gets less, a little bit less awareness.

31:30 Tell us about Pex.

31:31 It's really interesting.

31:32 Click's great.

31:32 There's obviously lots of great ways of building, you know, command line tools in Python.

31:36 They've been doing that for a long time.

31:38 But Click's really, really easy to use.

31:40 And then what we find is, you know, how do we get this to our clients?

31:44 We do a lot of things with Docker.

31:46 And when we're setting up servers, using Docker to set up a server is great.

31:49 But we actually also have now command line tools that we're trying to distribute to people.

31:54 And, you know, pushing things up to PyPI and having them pull things down using pip and having them set up virtual environments.

32:00 It just sometimes gets a little bit difficult for some of our end users who, you know, might not be Python, day-to-day Python developers.

32:08 So using Pex, you're able to actually just build the whole module together with the virtual environment baked in.

32:15 And when you deliver it to them, it just kind of, it just runs.

32:18 And you can build it to different platforms.

32:20 You know, for one of my projects, we have a little Docker container that actually builds it for Linux and then builds it for macOS.

32:28 And we're able to share it out to people and, and use the tools without having to go through the whole virtual environment setup stuff.

32:33 That's really cool.

32:34 So I think Pex is the one that actually takes everything, zips it up, and then it turns out Python can execute zip files and run from there, right?

32:43 Which is pretty wild.
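
The "Python can execute zip files" point can be tried with the standard library alone. The sketch below uses `zipapp`, not Pex itself, so it shows only the underlying mechanism; the directory and file names are made up for the example.

```python
import subprocess
import sys
import tempfile
import zipapp
from pathlib import Path


def build_and_run_archive() -> str:
    """Bundle a tiny program into a .pyz archive and run it with the current interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        source_dir = Path(tmp) / "hello_app"
        source_dir.mkdir()
        # An archive is executable when it contains a top-level __main__.py.
        (source_dir / "__main__.py").write_text('print("hello from a zip archive")\n')

        archive = Path(tmp) / "hello_app.pyz"
        zipapp.create_archive(source_dir, target=archive)

        # The archive is handed to an existing Python interpreter; it does not contain one.
        result = subprocess.run(
            [sys.executable, str(archive)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


print(build_and_run_archive())
```

Note that the archive is run by an interpreter that is already installed, which bears on the question that follows about whether the base Python still has to be there.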

32:44 Do you know if that entirely eliminates the dependency on Python?

32:48 Like if I had a blank machine or is it just sort of the packaging, but they got to have the base Python there?

32:54 Someone asked me that just the other day.

32:56 I don't think, I actually, this thing, it's just the libraries because it doesn't seem that big of a file.

33:00 It's not like, it's not like when you download Eclipse and you get the whole jar with it.

33:03 Yeah, yeah, yeah.

33:04 I get the whole Java JDK with it.

33:06 I, you know, I actually don't think, I don't think so.

33:08 It's pretty cool.

33:09 Yeah, I've been playing with PyInstaller and it's pretty nice as well.

33:13 And it kind of, it'll do it so there's no dependency.

33:15 It's also more problematic because it's trying to solve the problem bigger, I think.

33:20 So I was just thinking, oh, maybe Pex is going to be nice.

33:23 Another thing that I think is really cool around this stuff, just as a shout-out, that I've been playing with a lot lately, is this thing called Gooey.

33:30 G-O-O-E-Y.

33:31 Have you heard of this?

33:32 I did see your little prototype up on GitHub, I think.

33:34 Yeah, so you can take something like this and then just throw like a little UI with dropdowns instead of command line arguments on top of it.

33:41 It's pretty cool.

33:41 Right.

33:42 So another thing that you are using is aiohttp.

33:45 Tell us, are you using the server or the client component of that?

33:49 All client.

33:50 So for us, it's high throughput annotations, right?

33:52 So one of our clients, you know, basically paid millions of dollars for this high throughput system to generate, to go through the whole alignment and variant calling situation, right?

34:01 So they're trying to do high throughput, you know, lots of thousands of cases per week or whatever they're doing.

34:07 And they're trying to keep up with that.

34:09 But they need annotations from our KMS, our knowledge management system.

34:13 And so the challenge was, okay, how do I keep up with them?

34:16 And the first version of my software had trouble, right?

34:20 So we were, you know, trying to parallelize things with multiprocessing and it worked.

34:25 But, you know, once I'd actually played with aiohttp and asyncio and really understood how to program in that paradigm, looking for the I/O bottlenecks and working around them, it made my redesign of that tool, which we call our annotator and which actually does that annotation, much easier.

34:44 Right. So now, you know, I have basically have these five stages in my little program with queues in between them, you know, where basically what an annotator does is really just reading a file, making a call to an API.

34:55 A remote API of our service, right?

34:57 Exactly right. And then injecting that data into the stream and then writing it out to disk, right?

35:02 So you got basically, let's just say, three spots where you can leverage asyncio.

35:06 So reading from the original file, making the call to the HTTP server, and then writing out to disk.

35:11 And, you know, this whole framework allows me to do all three of those things.

35:16 It kind of just magically balances itself with regards to how much it's reading from the disk, how much it's writing to the disk, and how much it's calling the API.

35:24 The only thing you have to do is make sure you don't call your API too much unless you want to take down your server.
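
The staged design described here can be sketched with the standard library. This is an assumption-heavy illustration rather than the actual annotator: the API call is faked with `asyncio.sleep`, results are collected in a list instead of written to disk, and every name (`run_pipeline`, `fake_annotation_api`, `max_concurrent_calls`) is hypothetical.

```python
import asyncio


async def fake_annotation_api(variant: str) -> str:
    """Stand-in for the remote HTTP call; sleeps instead of doing network I/O."""
    await asyncio.sleep(0.01)
    return f"{variant}\tannotated"


async def run_pipeline(variants, max_concurrent_calls: int = 5) -> list:
    # Bounded queues give backpressure: a fast stage blocks until the next one catches up.
    to_annotate = asyncio.Queue(maxsize=100)
    to_write = asyncio.Queue(maxsize=100)
    results = []

    async def reader():
        for variant in variants:
            await to_annotate.put(variant)

    async def annotate_worker():
        while True:
            variant = await to_annotate.get()
            try:
                await to_write.put(await fake_annotation_api(variant))
            finally:
                to_annotate.task_done()

    async def writer():
        while True:
            results.append(await to_write.get())
            to_write.task_done()

    # The number of workers caps how many API calls are in flight at once.
    workers = [asyncio.create_task(annotate_worker()) for _ in range(max_concurrent_calls)]
    writer_task = asyncio.create_task(writer())

    await reader()
    await to_annotate.join()  # every input has been annotated and queued for writing
    await to_write.join()     # every annotated record has been collected

    for task in (*workers, writer_task):
        task.cancel()
    await asyncio.gather(*workers, writer_task, return_exceptions=True)
    return results


annotated = asyncio.run(run_pipeline(["v1", "v2", "v3"], max_concurrent_calls=2))
print(sorted(annotated))
```

Everything runs in one thread on one event loop; the worker count is the knob that keeps the client from overwhelming the server.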

35:28 And then our server on the other end is, you know, highly parallelized through using Celery and Redis and handling.

35:35 It can scale up because we've thrown lots of hardware at that.

35:39 And so what we're able to do is we're able to keep up with that, you know, multi-million dollar hardware solution with Python 3 and asyncio.

35:47 And it's been great.

35:48 And probably like, what, one thread?

35:49 So basically, yeah, one process running and it's doing the job.

35:53 So we do process, we can then, we can then scale out that one program across multiple processes if we want.

35:58 But it's, it's really pretty high performance and, and our client's pretty happy with it.

36:02 That's really awesome.

36:02 Yeah, because so much of the time, programs that are slow, they're actually just waiting on some other part of the system.

36:08 They're waiting on the web service.

36:09 They're waiting on disk.

36:10 They're waiting on, you know, whatever, right?

36:12 And so this lets them be productively waiting, basically.

36:15 It's definitely a paradigm shift.

36:17 And you have to, you have to think through the whole, this Async method is calling this other Async method and, and really understanding how that all fits together.

36:25 And it can definitely bend your brain a little bit if you're not used to it.

36:28 But once you actually do figure it out, it's kind of a superpower and it's really great.

36:31 Yeah.

36:31 And as far as superpowers go, like the actual change in the programming model is pretty mellow, right?

36:37 There's like not, it's not that different from serial requests.

36:41 Yeah.

36:41 You just got those couple of keywords with async and await.

36:44 And once you figure that out, then it's kind of easy from there.

36:47 And then it's just really about using queues.

36:49 And then you get into the whole queuing theory and, you know, lean manufacturing and that kind of stuff and try to understand, like, how do you, how do you remove the bottlenecks from your system so that, so that things go as fast as they possibly can go.

37:00 And if you, if you kind of have that background and mentality with it, it's, it's really cool.

37:04 Yeah, that's cool.

37:04 But of course, anytime you're thinking about concurrency, it can definitely sort of bend your mind, like you said.

37:09 Yeah, exactly.

37:10 So speaking about concurrency, another thing that you guys are using that's really cool is Channels and Celery and Redis.

37:16 Channels, is that like Django Channels?

37:18 Yeah, Django Channels.

37:19 So one of our tools, there's actually a sync mode to it.

37:21 So in the oncology space, one of the big things that happens is for challenging cases, they go to what's called a tumor board.

37:29 So some of your bigger hospitals will have a tumor board where basically all of the experts at that hospital, or even they could even, you know, WebEx other people in from other hospitals to get to the experts to help people with, you know, rare cases, right?

37:42 There's a case, there's a variant, they don't know what it means.

37:45 What do they do about it?

37:47 And that's what they call a tumor board.

37:48 And we build software for that.

37:50 And one of our modes is actually a sync mode where people can kind of, so they don't actually have to have a WebEx, they can just kind of go to our app.

37:56 And everybody's in the app at the same time.

37:59 And if there's a leader, the person's moving around from one page of the app to the other, that's sync mode.

38:05 And that's actually done using WebSockets.

38:06 And so if you know anything about Django and, you know, its history, so Django started off, it was built on WSGI, and that's a synchronous protocol.

38:15 Yeah, all the popular ones are.

38:17 They still haven't found a way really around it.

38:19 Godwin, Andrew Godwin?

38:21 Yeah, Andrew Godwin, yeah.

38:22 He added this capability to Django, which is basically kind of like this little side thing to Django called Channels.

38:28 He invented another framework for interfacing in with Django from your web server, right, from Apache or Nginx, and using ASGI, I think is what he called it.

38:39 And it's an asynchronous platform.

38:40 And so that enables us to do WebSockets.

38:43 And the WebSockets is the thing that allows us to do this synchronous movement between different people on our application.

38:48 So if, you know, one person clicks a link and jumps to another page, all the other people that are on the app jump along with them.

38:54 And really, the main goal of this is to allow people to kind of dynamically work with the genomic information at their fingertips rather than having, you know, a bunch of people on their phones Googling.

39:05 What do these variants mean, right?

39:07 So they're all kind of working together on a single call.

39:10 So you guys sort of built, like, the Google Docs.

39:14 You kind of added a Google Docs equivalent type of experience to your app, right?

39:19 So everybody fires up your app and they have this local sort of guided experience.

39:23 Yeah, that's a really good analogy.

39:25 Yeah.

39:25 Yeah, I think more apps need that.

39:26 I think that's really awesome.

39:27 How hard was it to add this channels, to do the channels code and to add this stuff together?

39:32 Well, the channels part was easy.

39:34 I mean, it basically just kind of worked out of the box where, you know, we're able to send messages from one thing to the other.

39:39 But once again, you know, getting the actual communication going from one instance to the other is tricky and it's managing state.

39:46 And how do you change, you know, from one user to another and make sure that the experience is smooth?

39:51 That's always tough.

39:52 And then as you add new features, you need to make sure that the sync thing works across those new features, right?

39:57 That's right.

39:57 We've got this new visualization, but it only shows up for the leader, not for you.

40:00 Those are always fun.

40:01 But the actual channels plumbing and things like that, even though it's kind of cutting edge code for, you know, in beta or what have you, works really well.

40:09 And adding the Redis channel in between is what ends up happening when you actually set this up.

40:15 You end up having, you know, your web server, Nginx.

40:18 You have what's called an interface server, which is basically an instance of your Django app.

40:22 You have the Redis channel and then you have workers.

40:25 So the workers are basically other instances of your Django app, but they're actually doing the actual work of responding to either a plain old HTTP request or to one of these WebSocket requests.

40:36 And, you know, all that plumbing just worked great.

40:38 How cool.

40:39 Yeah, it sounds really fun.

40:40 I've had no chance to use it, but it definitely looks really cool.

40:44 Yeah.

40:44 All right.

40:45 Well, that sounds like quite the list of cool projects and technologies you're getting to put together there.

40:50 It must be fun to work on.

40:51 It's great.

40:52 And, you know, having a purpose and, you know, working for something that's not online marketing or e-commerce or whatever I was doing in my past life is great.

41:00 So it's great, you know, working on something that I think is going to make a difference.

41:03 Yeah, definitely trying to make people healthier and live more full lives is way better than trying to optimize that click rate or, you know, convert one more piece of data to try to piece together.

41:14 No, this person is actually that other person and they're in this demographic, right?

41:18 Right.

41:19 Yeah, exactly.

41:19 So, you know, some other thing that nobody needs.

41:22 Online stalking is not something I'm interested in now.

41:24 No, for sure.

41:25 Cool.

41:26 So you actually have a couple of somewhat related open source libraries.

41:30 You want to talk about those a bit?

41:31 One of the libraries that's out there is called attrs.

41:33 And it's actually, I think, the basis of the new data classes that's in Python 3.7, right?

41:38 So the new PEP that does data classes.

41:41 So there was actually an original project called attrs, which is a really great project.

41:45 And it lets you define your classes and you get a bunch of, you know, kind of boilerplate Python stuff for free for comparisons and, you know, string representations and things like that.

41:55 Right.

41:55 It implements like, say, hashing correctly and all that kind of weirdness that you can overlook.
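
For a feel of the boilerplate being described, here is a small sketch using the standard library's `dataclasses` module (Python 3.7+), chosen only to stay dependency-free; attrs offers a comparable decorator-based approach. The `Variant` class and its fields are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen makes instances immutable and therefore safely hashable
class Variant:
    gene: str
    protein_change: str


a = Variant("BRAF", "V600E")
b = Variant("BRAF", "V600E")

print(a)            # generated __repr__
print(a == b)       # generated __eq__ compares field by field
print(len({a, b}))  # generated __hash__ lets equal instances collapse in a set
```

None of the comparison, representation, or hashing methods had to be written by hand.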

42:00 Yeah.

42:00 The problem I was trying to solve at the time was I wanted an immutable way of reading a YAML file, getting a nested Python object, and not having to, like, munch dictionaries, right?

42:12 Because you start writing code to dictionaries and quickly things get kind of nasty with some nested dictionary references and things like that.

42:20 So that's what I was looking for was a way to round trip to YAML, kind of like in Java, there's a library called Jackson that'll do that.

42:27 It'll round trip to JSON or to what have you.

42:29 And Python does a good job of, obviously, round tripping from dictionaries to YAML.

42:33 So what I wanted was an actual object model, and attrs is really good for that.

42:37 But I kind of had just a different mental model, and I wanted something more like the Django ORM.

42:42 And I had a lot of use cases where I wanted to basically say, yeah, I want to call this a string field, and I want it to always have this validator and this converter.

42:50 So what attrs will let you do is when you define your fields, you can say it's got this converter and this validator.

42:55 And I kind of just wanted some templatized versions so I didn't have to keep saying the same thing over and over again.

43:01 And I also wanted this, you know, this magical transformation.

43:03 And that's what the Related project does.

43:06 Related.

43:06 It looks really cool.

43:07 And it does look like you're working either in the Django ORM or Mongo Engine or, you know, one of these types of things where you define what the object actually is.

43:19 Can you have, like, nested objects?

43:22 You basically can have, if you declared a class A, it can then relate to class B either as a, have a child object B,

43:29 or it can have a list of Bs or it could have a map of Bs, right?

43:33 So those, that object model, and it fully knows how to kind of render it to and from a dictionary.

43:40 And it does the whole serialization and deserialization for you.
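
The child, list, and map relationships can be illustrated with plain dataclasses. This is a stand-in sketch, not the Related library's API: the `Step` and `Suite` classes and the `suite_from_dict` helper are hypothetical, and the input dictionary is the kind of structure a YAML loader would hand back.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Step:
    name: str
    path: str


@dataclass(frozen=True)
class Suite:
    title: str
    setup: Step               # a single child object
    steps: List[Step]         # a list of children
    lookup: Dict[str, Step]   # a map of children


def suite_from_dict(data: dict) -> Suite:
    """Build the nested object model from plain dictionaries."""
    return Suite(
        title=data["title"],
        setup=Step(**data["setup"]),
        steps=[Step(**item) for item in data["steps"]],
        lookup={key: Step(**value) for key, value in data["lookup"].items()},
    )


raw = {
    "title": "smoke tests",
    "setup": {"name": "login", "path": "/login"},
    "steps": [{"name": "match", "path": "/match"}],
    "lookup": {"health": {"name": "health", "path": "/health"}},
}

suite = suite_from_dict(raw)
print(suite.steps[0].path)   # attribute access instead of nested dictionary lookups
print(asdict(suite) == raw)  # asdict recurses through children, lists, and maps
```

The hand-written `suite_from_dict` is exactly the repetitive code a declarative library aims to generate from the field definitions.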

43:43 That's sweet.

43:44 So, yeah, definitely people should check this out if they're working on Python and YAML.

43:49 It definitely looks like a cool project.

43:50 So the other one's called Rigor.

43:53 You know, obviously we're in a very, very, it's very important to us to have the right answers for people.

43:58 Yeah, the answers have consequences.

43:59 The most important thing about my job, I want to make sure we give people the best data, the most relevant data, the most up-to-date data.

44:06 And one of the key things we got to do is testing.

44:08 And we spend a lot of time testing, you know, by hand.

44:11 We do a lot of unit testing.

44:13 You know, we believe in the testing pyramid at my company.

44:16 But one of the things I like to make sure we have is kind of an end-to-end test or an integration test or a functional test, however you want to describe it.

44:23 And we, in our Java space, we actually use the tool called Cucumber.

44:27 And what Cucumber lets you do is basically, you know, declare your tests in a given-when-then kind of English-style DSL.

44:35 And that allowed our product team, you know, our product specialist team, which are basically non-developers, but they understand the science and they understand how to use the software and test the software, to describe how a function should work, right?

44:47 And given some state, when I do some function, then I should get some result.

44:51 But what I wanted was something like that on our API side.

44:55 You know, I didn't want to go through the whole pain of having glue, where people actually had to write code that runs behind this DSL.

45:02 And since HTTP is kind of its own language in itself, I decided to kind of shortcut it and just basically build out a simple YAML-based approach.

45:11 And that's kind of where this Related project came from.

45:14 So you write out a YAML file that actually describes your steps.

45:16 And the steps describe what you make requests to and then get the response back.

45:21 And, you know, basically it allows us to build out a suite of hundreds and thousands of tests, testing out the software to make sure it gives the same answer every time so that people know when they make changes, they're not breaking anything.

45:32 And it does it using asyncio because I wanted it to run fast.

45:36 And then we use a thing called JMESPath to actually transform the response that comes back.

45:40 So the transformation, that allows for the test to not be fragile, right?

45:45 So one of our rules for APIs is we don't let you change a field or remove a field without, you know, some major consequences.

45:53 But if you add a field, it's usually not a problem.

45:57 But it can break your tests if you have very, very specific tests that have all the fields listed.

46:03 If it doesn't match exactly like a string test, then it's going to break.

46:06 I'm just expecting this string back or this JSON document back.

46:09 Are they the same?

46:09 No, crash, right?

46:10 Like that's right.

46:11 Yeah, that's too much.

46:12 Yeah.

46:12 So with JMESPath, we're able to kind of filter it down and say, yep, I only care about these three fields.

46:17 These three fields match exactly as I expect.

46:19 And if so, it's correct.
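
The fragility point above can be shown in a few lines. This is a standard-library sketch, not Rigor's implementation: `pick` is a hypothetical helper, and the field names and values are made up. With the `jmespath` package installed, an expression along the lines of `{id: id, name: name, status: status}` passed to `jmespath.search` should give the same projection.

```python
def pick(document: dict, fields: list) -> dict:
    """Keep only the named top-level fields (hypothetical helper)."""
    return {name: document.get(name) for name in fields}


# What the test author wrote down as the expected answer.
expected = {"id": 1, "name": "example", "status": "ok"}

# What the API returns after someone adds a new field.
response = {"id": 1, "name": "example", "status": "ok", "added_later": True}

# Comparing whole documents breaks as soon as a field is added.
exact_match = response == expected

# Comparing only the fields the test cares about keeps passing.
filtered_match = pick(response, ["id", "name", "status"]) == expected
```

Here `exact_match` is false and `filtered_match` is true, which is the difference between a test that breaks on every additive API change and one that does not.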

46:20 And so I was going to open source this thing a few months ago.

46:23 And then I heard on one of your other programs, I think, that the Tavern project was released.

46:28 And it's very similar.

46:28 So people should definitely check that one out.

46:30 And both our project and that project were built based on the idea of PyRestTest, which was a nice project but seems to have been abandoned.

46:38 It just had a few things that we needed that it didn't have.

46:42 And, you know, I would say that the reason to choose our project over maybe Tavern would be this JMESPath thing.

46:48 We also have API coverage for Swagger.

46:52 So we define all of our APIs with the OpenAPI Specification, otherwise known as Swagger, and we still call it Swagger.

46:59 And so we can tell you, oh, you've got 100% coverage of all your API endpoints and their variables.
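
Endpoint coverage of this kind boils down to comparing what the API specification declares with what the test suite actually called. The sketch below assumes both have already been reduced to sets of `(method, path)` pairs. The endpoints and the `coverage` function are hypothetical and are not how Rigor computes its report.

```python
def coverage(spec: set, tested: set) -> tuple:
    """Return (fraction of spec endpoints exercised, sorted list of untested ones)."""
    if not spec:
        return 1.0, []
    covered = spec & tested
    return len(covered) / len(spec), sorted(spec - tested)


# Endpoints declared in a (made-up) OpenAPI document.
spec_endpoints = {("GET", "/items"), ("POST", "/items"), ("GET", "/items/{id}")}

# Endpoints the test suite made requests to.
exercised = {("GET", "/items"), ("POST", "/items")}

ratio, missing = coverage(spec_endpoints, exercised)
```

The `missing` list is usually the more useful output, since it names exactly which endpoints still need a test.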

47:04 And then we also included the Cucumber HTML reporting tool called Cucumber Sandwich, which brings up a nice, pretty HTML view of your tests and shows you how all your steps ran and things like that.

47:17 Yeah, the graphical output really is nice and colorful.

47:19 You can tell you can get info out of it right away.

47:22 Yep, it's great.

47:22 Very cool.

47:23 And you can see how Related fits in there perfectly.

47:25 Yes, exactly right.

47:26 Also saw you using aiohttp.

47:28 So it's all like async nice and quick.

47:30 Yeah, so I used aiohttp for this Rigor testing so I could run tests in parallel to speed up our test suite, because if you have to run them sequentially, it's going to take a lot longer than if I run them all in parallel.

47:43 So it takes three to five times less time when you turn the concurrency on with our test suite for all of our API endpoints.
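
The speedup from turning concurrency on can be illustrated with `asyncio` alone. In this sketch, `fake_api_test` stands in for a real HTTP request (which Rigor makes with aiohttp) by sleeping for a fixed delay. The function names, the delay, and the number of tests are all assumptions for illustration, and the timings it produces say nothing about the three to five times figure quoted above.

```python
import asyncio
import time


async def fake_api_test(name: str, delay: float = 0.05) -> str:
    """Pretend to call an endpoint; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{name}: passed"


async def run_sequentially(names):
    # Each test waits for the previous one to finish.
    return [await fake_api_test(n) for n in names]


async def run_concurrently(names):
    # All tests are in flight at once; gather keeps results in input order.
    return await asyncio.gather(*(fake_api_test(n) for n in names))


def timed(runner, names):
    """Run one of the runners above and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(runner(names))
    return results, time.perf_counter() - start


names = [f"test_{i}" for i in range(10)]
seq_results, seq_time = timed(run_sequentially, names)
con_results, con_time = timed(run_concurrently, names)
```

Because the simulated tests spend their time waiting rather than computing, the concurrent run finishes in roughly the time of the slowest single test, while the sequential run takes the sum of all of them.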

47:50 Very, very nice.

47:51 All right, so I think that's maybe we'll leave it there for the genomics stuff, but that was a really interesting look at how you're using Python to address these major problems.

48:01 And I got to commend you.

48:03 You've got a bunch of really cool tools and systems put together, it sounds like.

48:07 So nice work.

48:07 Thank you.

48:08 I mean, Python's got a great ecosystem, great community, so many great tools.

48:12 So it makes getting stuff done really fast and easy.

48:16 The paradox of choice is a real thing that continues to vex people building stuff like this, right?

48:22 Because you build it all out and you're like, oh, but there's some other REST calling API thing.

48:27 You know, maybe I should use API Star instead of Django REST framework because Tom Christie's now working on that, right?

48:32 But you've got to just put a stake in the ground and say we're building something productive here.

48:36 Always lots of new toys to play with and it can get distracting.

48:39 Another thing that we want to touch on is there's some kind of event going on in your city.

48:44 Is that right?

48:44 Yeah, PyCon is coming here.

48:46 Yeah, in May.

48:47 Is that May 7th, I think?

48:49 Yeah, so beginning of May.

48:51 Yeah, it'll be here, right in downtown Cleveland, which is a great city.

48:55 Been here 18 years.

48:56 It's about two blocks away from my office, so I'm just going to be able to stroll right over there at the end of the day.

49:01 And it's great.

49:02 Cleveland's awesome, so people should definitely take advantage of some of the sights when they're here.

49:06 I absolutely think so as well.

49:08 A quick correction, it's May 9th, not May 7th, but basically the same, more or less that time frame, right?

49:15 And I'm looking.

49:15 Can I still register?

49:17 I think I can.

49:18 I don't think it's sold out yet.

49:20 It's not sold out yet.

49:21 So maybe it will be by the time people hear this.

49:25 So one of the things I wanted to touch on with you, maybe two parts.

49:30 One is, what advice do you have for getting the most out of the conference itself by, like, I'm within the walls of the convention center, you know?

49:38 And then people are going to be in your town, a bunch of folks together traveling here for the conference.

49:45 Like, what would you recommend they do to get the most out of Cleveland?

49:48 I haven't been to a PyCon since 2005, I think, was when I figured it out.

49:53 So I think maybe Dallas or something like that.

49:55 I bet it's a really big difference of an experience.

49:58 I'm excited to check it out.

50:00 So it's going to be great to go.

50:01 You know, obviously, everything's online.

50:03 So if you've never been and you've never noticed PyCon on YouTube, definitely check that out.

50:08 So what that should do is give you confidence that you can miss some of the talks that maybe you're not super interested in and spend more time in the hallway track and talk and meet some folks in the community.

50:19 Because the PyCon group does a great job of getting all those videos online.

50:23 Within like a day.

50:24 So you almost could watch it while you're at the conference if you really felt like, oh, geez, I wish I saw that.

50:28 That's my recommendation there.

50:30 And as far as if you're downtown and you're staying downtown, you know, there's some great restaurants over on East 4th Street.

50:35 There's, you know, Lola by Michael Symon, the Iron Chef.

50:38 There's another one called Greenhouse Tavern.

50:39 There's the House of Blues, which might have a concert that night.

50:42 There's the Rock Hall, which has some special events sometimes.

50:45 And if you're, you know, a rock and roll fan, that's definitely a place to check out.

50:49 The Indians are in town.

50:50 I checked.

50:51 The Indians are in town that weekend.

50:52 They're playing the Royals.

50:53 So if you're a baseball fan, that's a few blocks away.

50:56 Yeah, that's really cool.

50:57 So if people are in town, they could obviously drop in and see that.

51:00 But if they're traveling from, say, outside the country, right?

51:03 I know tons of people come from all over the world.

51:05 Like, when do you get to see a professional baseball game?

51:08 Right?

51:08 Like, this might be a chance.

51:10 Take a couple hours, skip the conference, and go watch it, right?

51:12 Yeah, the Indians have been good the last few years.

51:14 So it should be a good team.

51:15 And then, you know, there's some other areas, too, to check out, right?

51:19 So on the west side, there's Ohio City and the West Side Market.

51:24 You know, lots of breweries.

51:25 You know, micro pub type of things.

51:27 Definitely check those out.

51:29 Playhouse Square, which is, you know, maybe another six or seven blocks away.

51:33 That's actually the largest performing arts center in the United States, other than New

51:38 York City.

51:38 And then University Circle, which is a few miles away.

51:41 That's not as easy to get to.

51:43 There's Lyft or Uber.

51:44 Like, you could get there.

51:45 Yeah, you could get there pretty easy, right?

51:47 Exactly right.

51:48 So, yeah, Cleveland's a pretty easy town to get in and out of, and lots of great restaurants

51:51 and lots of great things to do.

51:52 Oh, that sounds really fun.

51:53 I definitely want to second, first of all, what you said about the hallway track.

51:58 I may take that track too much when I go to conferences, but I find I skip a lot of the

52:03 talks and actually just really try to experience being with people.

52:07 Because when you go to the talk, it's great, but it's really, you sit quietly and you watch

52:11 a great presentation and you experience it there, right?

52:14 But you don't interact really with anyone near you or anyone presenting so much at all.

52:20 And so there's the hallway track, which is just hanging out, talking to people.

52:23 And if you find yourself in an interesting situation, just take advantage of that because

52:28 you can always, like you said, go watch on YouTube the thing that you would have gone

52:32 to see.

52:32 The other thing that they're doing really well there are open spaces.

52:37 So I find that open spaces are more participation and engagement than the main talks, and they're

52:44 not recorded.

52:44 So there'll be a board.

52:46 If it's like the last two years, there'll be a big board where people put up index cards

52:49 saying, in this room at this time, we're going to just meet and it's kind of undirected group

52:53 conversation about something amazing, right?

52:55 And so definitely take advantage of those as well.

52:58 That's great.

52:59 Yeah.

52:59 And if you want to connect at PyCon, just send me an email and I'll look for you there.

53:03 Yeah.

53:04 Very cool.

53:05 And do take advantage of some of these fun things that Ian pointed out.

53:09 Like the worst thing about traveling is if you just get on a taxi to a plane to another

53:16 taxi to a hotel to a conference center, and then you pop those off the stack again and you

53:23 do them in reverse, right?

53:24 Like you want to go like, I was in Cleveland and I saw this amazing thing, right?

53:28 You know, like same thing, like wherever you go, try to take advantage of that.

53:32 So that's great.

53:32 That's great.

53:33 Yep.

53:33 Yeah.

53:34 Awesome.

53:34 All right.

53:35 Well, it's down to the two questions.

53:37 So let me hit you with those.

53:38 First of all, if you're going to write some Python code, what editor do you run?

53:42 Converted to PyCharm.

53:43 It's great.

53:44 I use the Vim editor mode and it's a great environment and love using it every day.

53:50 Yeah.

53:50 Awesome.

53:50 It's definitely kind of overwhelming when you get started, right?

53:53 Yep.

53:53 A lot of great tools and the integration with pytest and the integration with the Vim and Markdown

53:58 editors.

53:58 It's a really good tool though.

54:00 Yeah, it is.

54:00 Once you get used to using the feature, it's hard to not, it's hard to imagine not using

54:04 it.

54:05 Awesome.

54:05 Okay.

54:06 And then a notable PyPI package.

54:07 All right.

54:08 I'm going to go with DeepVariant by Google.

54:09 So I haven't used this.

54:11 I probably won't ever use this, but it's just such an interesting use of AI.

54:15 They are actually, you know, taking those BAM pileups that I described and basically using

54:21 image recognition type AI to actually determine and make variant calls.

54:25 So what used to be, you know, somebody with a way bigger brain than me doing these calculations

54:31 with math and trying to figure out the right determination of what a variant is, is kind

54:37 of being superseded now by this really interesting Google project.

54:40 So DeepVariant is the name of it.

54:41 Okay.

54:42 That sounds really cool.

54:43 And just another one of those AIs creeping in to solve these tricky problems.

54:47 Exactly right.

54:48 Yeah.

54:48 Very cool.

54:49 All right.

54:50 Well, definitely interesting choices.

54:52 And thanks for sharing everything.

54:54 Any final call to action?

54:55 People want to get involved in biology, genomics, Python?

54:59 Like, how do they get started?

55:00 There's a website called Biostars.

55:02 There's lots of interesting topics up there.

55:05 It's a Stack Overflow-type clone, I would say.

55:08 And then there's Stack Overflow itself.

55:10 There's, you know, lots of conversation there.

55:11 Feel free to reach out to me if you're interested in learning more.

55:14 And, you know, Python is just a great, great ecosystem.

55:19 And there's so many cool tools to play with.

55:21 Yeah, I totally agree.

55:22 So one of the challenges I see for people getting started in this space is they're not researchers

55:27 or doctors.

55:28 Like, where do they get the data?

55:29 Do you know of any, like, good open places to get some data to work with?

55:33 Lots of the research that's out there is funded by the U.S. government or European governments.

55:37 You know, NCBI is a website.

55:39 I can't tell you what the acronym stands for right now.

55:41 They've got tools.

55:43 There's data sets out there such as TCGA, which is The Cancer Genome

55:48 Atlas.

55:49 There's a project called GENIE, which we were involved with helping them analyze their data.

55:55 And they've got lots of cancer data that's out there.

55:57 But lots of tools.

55:59 So search for keywords like VCF and BAM and SAMtools.

56:04 And there's lots of different keywords to search for.

56:07 And, you know, you'll find lots of different data sets.

56:09 It really just kind of depends on, you know, what kind of analysis are you looking to do?

56:12 And you also find a bunch of Jupyter notebooks out there, right?

56:15 People are doing their analyses in Jupyter notebooks and then posting them to the web

56:19 for people to follow along with.

56:20 And really, I've learned all this stuff in the last five years.

56:23 It's not insurmountable.

56:25 It's just a matter of, you know, having a goal and trying to reach that goal and solve a problem.

56:31 That's cool.

56:31 And it's great.

56:32 Yeah.

56:32 Solve problems one at a time and eventually have this big tool chest, right?

56:36 Exactly right.

56:37 All right.

56:38 Well, Ian, thanks for being on the show.

56:39 It was great to talk with you and learn all about this stuff.

56:41 That's great.

56:42 Thanks, Michael.

56:42 Really glad to be here.

56:44 This has been another episode of Talk Python To Me.

56:47 Today's guest was Ian Maurer, and this episode has been brought to you by Codacy.

56:51 Review less, merge faster with Codacy.

56:56 Check code style, security, duplication, complexity, and coverage on every change while tracking code

57:02 quality throughout your sprints.

57:04 Try them at talkpython.fm/codacy, C-O-D-A-C-Y.

57:09 Are you or a colleague trying to learn Python?

57:12 Have you tried books and videos that just left you bored by covering topics point by point?

57:17 Well, check out my online course, Python Jumpstart by Building 10 Apps at talkpython.fm/course

57:23 to experience a more engaging way to learn Python.

57:25 And if you're looking for something a little more advanced, try my Write Pythonic Code course

57:30 at talkpython.fm/pythonic.

57:34 Be sure to subscribe to the show.

57:35 Open your favorite podcatcher and search for Python.

57:37 We should be right at the top.

57:39 You can also find the iTunes feed at /itunes, Google Play feed at /play, and

57:44 direct RSS feed at /rss on talkpython.fm.

57:48 This is your host, Michael Kennedy.

57:50 Thanks so much for listening.

57:51 I really appreciate it.

57:52 Now get out there and write some Python code.
