00:00 Gene therapy holds the promise to permanently cure diseases that have been considered lifelong challenges.
00:05 But the complexity of rewriting DNA is truly huge and lives in its own special kind of big data world. On this episode, you'll meet David Born, a computational biologist who uses Python to help automate genetics research and helps move that work to production. This is Talk Python to Me, episode 335, recorded September 15, 2021.
00:39 Welcome to Talk Python to Me.
00:41 A weekly podcast on Python.
00:43 This is your host, Michael Kennedy.
00:45 Follow me on Twitter, where I'm @mkennedy, keep up with the show and listen to past episodes at 'talkpython.fm', and follow the show on Twitter via '@talkpython'.
00:54 We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at 'talkpython.fm/youtube' to get notified about upcoming shows and be part of that episode. This episode is brought to you by Shortcut, formerly known as 'Clubhouse.io', and us over at Talk Python Training, and the transcripts are brought to you by 'AssemblyAI'. David,
01:16 Welcome to Talk Python to me.
01:18 Thanks, Michael. It's great to be here. Yeah.
01:19 It's great to have you here. One of the things I really love to explore is the somewhat non-traditional use cases of Python, the ones that are not straight down the middle of, I'm building an API and something that talks to a database for a startup, right? Something like that. But blending it with other technologies and science and whatnot, and genetics plus Python, it's gonna be interesting.
01:42 It sure is.
01:43 When you got into this, did you start out on the biology side or the programming side of the world?
01:48 I definitely started out on the biology side.
01:51 Yeah. So I went all the way through Grad school with relatively minimal formal programming experience.
01:57 Yeah. Okay. So you studied biology and genetics and whatnot? And then how did you end up here on a Python podcast?
02:05 So I always thought that programming would be cool, but I didn't really have much of an opportunity through my undergraduate studies to do much formal programming. I took the one computer science class that my college had to offer; it was in C++. I think I wrote a Boggle program, something with some recursion in there. It was pretty fun. I didn't really get to use Python until graduate school. I was in a genetics course, and we were basically tasked with doing some data analysis on published data, reproducing some plots in a figure, and then extending it further. My partner and I decided to learn Python, teach it to ourselves, so that we could do this. We heard that it was a good way to do data analysis in biology. So we basically taught it to ourselves. We used NumPy and some basic string searching and things to redo this analysis, and it was really amazing what we could do with Python for that.
03:02 That's awesome. How'd the learning project go, coming from not having a ton of programming? What was your experience like?
03:08 It was relatively easy, I would say. I think my brain sort of fits pretty well with how programming languages work, but it was definitely a lot in a short amount of time to really dive into how to make sure your while loops don't stay open. And then someone tells you maybe you shouldn't use a while loop at all. So I was learning a lot of things not to do right away.
03:28 Yeah, of course. But you had to get all the analysis done right. So you got to solve all the problems and power through.
03:34 Yeah, we did. We got our analysis done there, we reproduced some plots, and we got some new analysis made. I really came to an appreciation of how much you can do in a short amount of time with just a little bit of coding knowledge, or essentially none.
03:49 Yeah. I think that's really an important takeaway that a lot of people,
03:53 maybe many people listening to the podcast, already kind of know. But I think, looking in from the outside, it feels like, oh, I've got to go get a degree in this to be productive or useful. And really what you need is a couple of weeks and a small problem, and you're already there.
04:10 Absolutely. Yeah. I've definitely found that just learning through doing has been the way I've worked entirely. I have essentially no formal programming training, no coursework, and I'm using Python every day. That's fantastic.
04:24 Yeah, I didn't take that much computer science in college, just enough to do the extra stuff for my math degree. Very cool. Alright. Now how about today? You're working at Beam Therapeutics doing genetic stuff? Tell us about what you do day to day.
04:37 Yeah, I'm on the computational sciences team at Beam Therapeutics. We're a gene editing company, so we develop these precision genetic medicines. We're trying to develop them to cure genetic diseases that are caused by single genetic changes in the genome. So you've got, for example, a mutation or something like that. Yes.
05:00 So if you have one of these genetic changes, you might have a disease that is lifelong and there aren't any cures for most of these diseases.
05:09 So we're trying to create these. We call them hopefully lifelong cures for patients by changing the genetic code back to what it should be.
05:18 That's incredible. It seems really out of the future. I mean, I think it's one thing to understand genetics at play, and it's even amazing to be able to read the gene sequences, but it's entirely another thing, I think to say and let's rewrite that.
05:34 Yeah. We're definitely at the cutting edge of a lot of biotechnology and science that has really come to a head in the last decade with CRISPR and technologies that use CRISPR, which allow us to precisely target genetic sequences. It's a really fascinating place to work, and a privilege.
05:54 I bet a lot of people go to work and they end up writing what you might classify as forms over data. It's like, well, I need a view into this bit of our database, or I need to be able to run a query to just see who's got the most sales this week or something like that.
06:10 That's important work, and it's useful, and there are cool design patterns and whatnot you can focus on. But also, it's not what people necessarily dream of building when they wake up. But this kind of science, like, maybe. These are the really interesting problems that both have a positive outcome, right? You're helping cure disease, not just shave another hundredth of a percent off of a transaction or something like that, like in finance. And you get to use really cool tech to do it, too, programming-wise.
06:43 One of our dreams that we joke about on the computational team is that it's conceivable one day we could say, hey Alexa, how do we cure sickle cell disease? And it'll tell you what parts of our technology we should put together to cure that disease. That's sort of the pipe dream of where we could go if we combine all our data in the right ways. And I think all the stuff we'll talk about today is really just laying out the framework for that.
07:09 Yeah. Absolutely. So you mentioned CRISPR, maybe tell people a bit about that biotechnology.
07:15 CRISPR is a molecular machine which targets a very specific place in a specific genetic sequence. So usually people are using CRISPR to target a specific place in the genome, a specific sequence. And what CRISPR does naturally is cut at that sequence. So it'll cut in a very specific place in the genome.
07:42 And as people use CRISPR, we can actually decide where it's going to cut by giving it a different targeting sequence.
07:50 This sort of directed molecular machine is the basis of a whole new field of biotechnology using CRISPR and CRISPR-derived technology.
08:00 Our technology is like, kind of a CRISPR 2.0 where we don't use CRISPR to cut. We use the localization machinery and we add on to it another protein which just changes the base instead of cutting the DNA itself.
08:16 It's a slight variation, but it's still using the same CRISPR technology.
08:20 Okay. So let me see if my not-very-knowledge-filled background understands here, if I have a decent analogy. Does it work basically like find and replace? You give it a sequence, and it says, okay, if I find ATTCAT, with enough specificity that it's the unique one, then it does, like, a cut at that point. Is that kind of how it works?
08:47 Yeah. It's pretty much just like that. You give it the sequence you want to target, and then if it finds that sequence in the genome, it will cut the genetic material, the DNA, at that position. That's for normal CRISPR.
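The targeting step really is a lot like string search. Here's a toy sketch in Python of that find-and-cut idea (the sequences are invented for illustration; real guide sequences are around 20 bases long precisely so that the match is effectively unique in a 3-billion-base genome):

```python
def find_cut_site(genome: str, target: str):
    """Return the cut position if the target occurs exactly once,
    else None. A toy stand-in for CRISPR's sequence targeting."""
    position = genome.find(target)
    if position == -1:
        return None                      # target not present
    if genome.find(target, position + 1) != -1:
        return None                      # not unique: too risky to cut
    return position

# Invented 29-base "genome"; a real one has ~3 billion bases.
genome = "GGCTAACGTATTCATGGCAAGTCCATTGA"
print(find_cut_site(genome, "ATTCAT"))   # prints 9
```

The uniqueness check is the point of the analogy: a too-short target would match in many places, which is why guide sequences need enough specificity.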
09:00 When you're coming up with these and you say we're going to rewrite the DNA to solve this problem, how do you do it on a large enough scale? I mean, how much of the body has to be changed for this to be permanent, right?
09:14 Right. It's definitely a tricky question.
09:18 We definitely leverage human biology, how the human body works, for a lot of these problems. For example, some of our leading drug candidates are for sickle cell disease. Because of the way sickle cell disease manifests in red blood cells, and because red blood cells are created by a specific type of cell, if you access that type of cell and cure sickle cell in the progenitor cells, the stem cells, then you can create all red blood cells from a cured population. So if you can target the progenitor cells, you can cure sickle cell throughout the body, essentially, because the symptoms are from red blood cells. There are a lot of diseases where, by curing a single organ,
10:08 you can cure the symptoms of the disease, because that's where it actually manifests, like diabetes or sickle cell anemia or something like that, a disease in the liver, some blindness in the eye.
10:22 By just targeting specifically where the symptoms occur, you can cure the disease.
10:26 Yeah. That's amazing. Is it a shot? Is it a blood transfusion or like, how do you deliver this stuff?
10:32 Yeah. The delivery is a huge place of research, and it definitely depends on the type of targeting we're doing. For something in the eye, it would probably be an injection. For something in the blood, it's slightly more complicated, and it's not an injection. For something in the liver, it would probably be more akin to an injection or dosing regimen.
10:54 Okay. It sounds really fascinating. Like I said, it feels like it's a little bit out of the future to be able to come up with these and just say, no. We're just going to rewrite this little bit of the genetics, and I think it'll be good. But if you can make it happen, it's pretty clear how it's obviously a benefit, right.
11:10 Those progenitor cells, they eventually have to recreate new ones, and the way they do that is they clone their current copy of the DNA, which is the fixed one, right? So if you can get enough of them going, it'll just sort of propagate from there.
11:23 And we also have the benefit that sometimes you only need to cure a small fraction to remove the symptoms. So yeah, lots of things going for us.
11:31 That's awesome.
11:33 So I'm sure there's a lot of people involved in this type of work. What exactly are you and your team working on?
11:39 Yeah. So our team, we call ourselves the computational sciences team. We really sit in the middle of the research and development arm of the organization, processing all of our sequencing data and some other data as well.
11:54 And as you can imagine, with our technology changing DNA, changing genomes, there's a lot of sequencing data, because what we're trying to do is change a genetic sequence. So we have to read out that genetic sequence and then figure out: has it changed, how many copies are changed, and things like that. The techniques of next generation sequencing, NGS, are pretty broad, and we deal with a lot of different types of these next generation sequencing assays. Our team really processes, analyzes, and collaborates with the experimental scientists on performing and developing these experiments. Cool.
12:37 So the scientists will do some work and they'll attempt to use CRISPR like technology to make changes, and then they measure the changes they've made. And you all sort of take that data and compare it and work with it.
12:50 Right? Yeah. The act of measuring the changes itself is relatively computationally intensive. So we run and support pipelines for all of these standard assays as well, which is part of our job.
13:03 How much data is in DNA? I know that human biology stores or just biology stores an insane amount of data, but also computers store an insane amount of data and processing. So I'm not sure where the sort of trade off is, but what sort of data are we talking about? How much?
13:20 Yeah. It definitely depends on the type of assay that we're doing, the scale of the data. Sometimes we're looking at all the bases in the genome; more often we're looking at defined regions of the genome where we're trying to make the change.
13:38 Right. You're like this one gene is the problem. Right. And let's look at that, right?
13:43 Yeah. We're trying to target here, so that's where we're looking. It definitely depends on the assay. But I guess in terms of data scale, in terms of file sizes, for standard things it would be on the order of a few gigabytes per experimental run. For some of our larger assays,
14:01 It's ten to 100 times that per experiment.
14:05 That's a lot of data, but not impossible to transmit sort of amounts of data or store.
14:11 Right. Every little piece of it is pretty manageable. When you start combining them together and looking at your downstream results of things, the data does get pretty large. But I wouldn't say we're at the scale of, like, big data analytics at Google or anything like that.
14:26 Yeah. The LHC, if you've ever looked at the data flow layers of the LHC, the stuff near the detectors, it's just unimaginable amounts of data, right?
14:37 I haven't looked, but I'm sure it certainly is.
14:39 Yeah. They've got a lot of stuff built in on board the detectors to filter down a bunch, and then it goes for processing. Then it gets filtered down some more, and eventually it gets to the point where we have enough space on hard drives to save it. But before that, it couldn't be saved on hard drives. It was just too much stuff. Yeah. Very interesting. But it sounds to me like some of the real big challenges for you all are the computational bits, right? Because if you take all these things, you can end up in, like, combinatorial comparison scenarios, and I'm sure that can really blow up the computing time.
15:12 Yeah, definitely. A lot of the challenge is sometimes in these combinatorics, and there's also just a lot of steps that have to go on to process a lot of this data. There's a lot of biology-specific pieces of software that we use for various things, and we have to string them together to create a complete data processing pipeline. A lot of the art of our job is how to get all these tools to behave together and act in unison to actually process the data effectively.
15:46 I can just imagine some of these pipelines are tricky. Like. Well, okay. So it starts over here where the robot gets it and it reads this data. Then we can access that off the hard drive, and then it has to be sent to this Windows app that actually doesn't have an API. So we got to somehow automate that thing and then get data out of it. And then is that what it's like?
16:06 That's some of it for sure.
16:07 That's about your life. Okay.
16:09 Yeah. There's a couple of aspects; I think we'll touch on them. For sequencing experiments, the pipelines are more defined, because we usually get the data from a source that's already in the cloud, which I'm always happy about. If we can start in the cloud, we'll stay in the cloud, and that's a nice place to be. For data that's coming directly from instruments on premises, there is another layer of art that has to do with software on the instrument, software that gets the data to the cloud and moves it around between our other database sources. And there are some fun projects in there.
16:44 Yeah. I can imagine. I've worked in some areas where it's like collecting data from all these different things and then to process it and move it along. And yeah, it's not always just send it from this Python function to that Python function to that Python function. There's a lot of chunky stuff that wasn't really meant to be part of pipelines. Possibly.
17:03 Absolutely. Yeah.
17:05 This portion of Talk Python to Me is brought to you by 'Shortcut', formerly known as 'Clubhouse.io'. Happy with your project management tool? Most tools are either too simple for a growing engineering team to manage everything, or way too complex for anyone to want to use them without constant prodding. Shortcut is different, though, because it's worse. No, wait, no, I mean it's better. Shortcut is project management built specifically for software teams. It's fast, intuitive, flexible, powerful, and many other nice positive adjectives. Key features include team-based workflows: individual teams can use default workflows or customize them to match the way they work. Org-wide goals and roadmaps: the work in these workflows is automatically tied into larger company goals. It takes one click to move from a roadmap to a team's work to individual updates and back. Tight version control integration:
17:56 Whether you use GitHub,
17:57 GitLab, or Bitbucket, Shortcut ties directly into them, so you can update progress from the command line. Keyboard-friendly interface: the rest of Shortcut is just as friendly as their power bar, allowing you to do virtually anything without touching your mouse. Throw that thing in the trash. Iteration planning: set weekly priorities and let Shortcut run the schedule for you with accompanying burndown charts and other reporting.
18:21 Give it a try over at 'talkpython.fm/shortcut'. Again, that's 'talkpython.fm/shortcut'. Choose Shortcut because you shouldn't have to project manage your project management.
18:35 So with robots, do you have to actually talk to the robots, like any of that type of automated thing?
18:42 Yeah. Lab automation is what we call the team that has the robots, as we like to call them. As you can imagine, a lot of these types of experiments can be made much more efficient if we have robots doing the actual transfer of liquids and incubation and centrifugation, these scientific techniques that sometimes you need someone in the lab to do, but oftentimes you can automate. So the lab robotics aspect is an important part of how we can efficiently generate data. A lot of the issues around that come with how to pass instructions to the instrument and how to get data back from the instrument about what it's done. And then there's a whole other art of making the instruments actually orchestrate together, which is held in a different world of software. I don't actually work on that part myself.
19:36 But yeah, that part of programming definitely seems a little bit magical getting factories or little automations of different machines to work together. It's very cool.
19:45 There's a whole lot of proprietary software involved in actually running the instruments, but in terms of getting the data to them, I think it's one of those relatively common software problems: getting information from somewhere in the cloud. We have electronic notebooks and laboratory data systems that are in the cloud, and users will be submitting information about how they want their samples processed by the robots. There's the problem of getting that to the instruments themselves. It actually sort of reminded me of a talk you guys had, an episode, I think it was 327, on small automation projects.
20:26 We end up with quite a few of those here where we have these relatively small tasks of take data from an API, put it somewhere where a robot can access it. Usually we use AWS S3, and these sort of very small data handling tasks end up being these nice little projects for Python to come into play.
20:47 Awesome. I can see that definitely happens.
20:51 If a file shows up here, grab it, upload it to that bucket, and name it whatever the active experiment is, with the data or something like that, right? That's a very small program to write, but then that becomes a building block in this flow, putting these pieces of machines together.
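A building block like that can be just a few lines. A sketch of the naming piece (the bucket name, experiment ID, and file paths here are all hypothetical, not anything from Beam's actual setup; the upload itself would be one boto3 call):

```python
import pathlib

def experiment_key(experiment_id: str, local_path: str) -> str:
    """Build the S3 key for a raw data file, namespaced by the
    active experiment so different runs don't collide."""
    name = pathlib.Path(local_path).name
    return f"raw-data/{experiment_id}/{name}"

# The upload would then be a single boto3 call, roughly:
#   boto3.client("s3").upload_file(path, "my-bucket",
#                                  experiment_key(exp_id, path))
print(experiment_key("EXP-0042", "/instrument/run1/reads.fastq.gz"))
# prints raw-data/EXP-0042/reads.fastq.gz
```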
21:06 Right? Yeah. Once it comes into the cloud, we do another set of data processing on it. We upload it to our databases, and all of that we orchestrate on AWS using, I guess they call it, the serverless design patterns.
21:23 We don't have to handle anything on our own computers.
21:26 That's really nice, and the serverless stuff probably helps you avoid running just tons of VMs in the cloud, right? It can all be on demand. The Lambda trigger is a file appearing in this S3 bucket, and then it starts down the flow, right?
21:40 Absolutely. Yeah. I really don't like maintaining a lot of infrastructure, although we do have a good amount that we have to maintain. I find that these small Python functions are the perfect use case for those event-driven Lambda functions, which run these very simple pieces of code.
21:58 When an object appears in S3, they get a small event about when the object was uploaded, and then they do their thing: a little bit of data conversion, send it to an API, and now the data is in our data store. Those things just happen, and they're super consistent. They don't require anything on my end to maintain. It's a pretty beautiful pattern.
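A minimal sketch of what such an event-driven function can look like. The bucket and key names are invented, and the processing is stubbed out; a real handler would fetch the object with boto3, convert it, and post the result to the internal API:

```python
import urllib.parse

def handler(event, context=None):
    """S3-triggered Lambda sketch: pull the bucket and object key
    out of the event record, then hand off to (stubbed) processing."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    # Real code: download with boto3, convert, POST to the data store API.
    return {"bucket": bucket, "key": key}

# A trimmed-down version of the event AWS delivers on s3:ObjectCreated:*
event = {"Records": [{"s3": {"bucket": {"name": "sequencing-uploads"},
                             "object": {"key": "run-17/reads.fastq.gz"}}}]}
print(handler(event))
```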
22:20 That's awesome. They don't need things like, oh, there's a new kernel for Linux that patches a vulnerability, so let's go and patch our functions. It's all magic. It all happens on its own, right?
22:32 It does feel like magic. So a lot of the time setting them up can be a little bit of a challenge, but once they're there, they are very consistent.
22:40 Cool. One of the things you talked about using was the AWS CDK, or Cloud Development Kit.
22:47 I've heard of this before, but I've definitely not used it personally. Tell us, how does it help you?
24:20 Yeah, it seems super neat. I have not used it. On the page, which I'll link to in the show notes, they have Werner Vogels, CTO of Amazon AWS, talking about some of the benefits and how it all fits together. And you said you can store your cloud structure definition in source control, and you can run unit tests against your infrastructure to say: if I apply all these commands to AWS, do I actually get what I was hoping to get out of it? It seems like a really neat thing for this infrastructure-as-code bit.
24:53 I think it really shines when you're developing larger pieces of infrastructure, but I would encourage people to check it out even if they have small automation-type projects. This is what I was thinking of when I was listening to episode 327 the other day. We have these things you want to run on your computer with a cron job, but you can actually run them for free on AWS; you get a bunch of free time on the free tier, and you can try it out. You don't need to make sure your systemd process or whatever is running. It's a pretty cool way to get familiar with how to do some of these things on AWS. I'm not sure if this also exists for other cloud providers. We use AWS in particular, so that's what I know, but it may also exist for things like Azure.
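To give a flavor of the CDK, here is a minimal infrastructure sketch in Python wiring the S3-to-Lambda trigger discussed earlier, using aws-cdk-lib v2. The stack, bucket, and function names are made up for illustration; the point is that this whole definition lives in source control and can be diffed and unit-tested like any other code:

```python
from aws_cdk import Stack, aws_lambda as lambda_, aws_s3 as s3
from aws_cdk.aws_s3_notifications import LambdaDestination
from constructs import Construct

class IngestStack(Stack):
    def __init__(self, scope: Construct, construct_id: str) -> None:
        super().__init__(scope, construct_id)
        # Bucket where instrument data lands.
        bucket = s3.Bucket(self, "RawDataBucket")
        # Small conversion function, deployed from a local directory.
        fn = lambda_.Function(
            self, "ConvertFn",
            runtime=lambda_.Runtime.PYTHON_3_9,
            handler="convert.handler",
            code=lambda_.Code.from_asset("lambda/"),
        )
        # Invoke the function whenever an object lands in the bucket.
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED, LambdaDestination(fn)
        )
```

Synthesizing this stack (`cdk synth` / `cdk deploy`) produces the CloudFormation that actually creates the resources, which is what makes the "unit test your infrastructure" idea possible.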
25:44 Maybe. I don't know if it does either, but it definitely seems useful for what you all are doing there. One of the things it could be useful for, and one of the challenges I suspect you run into more than a lot of places, certainly more than, like, an ecommerce site, is reproducibility, right? If you're going to say we're going to come up with a treatment that literally gets injected into people,
26:06 You've got to go through FDA, you've got to have possibly peer reviewed stuff.
26:11 There's a lot of reproducibility and data stewardship going on, I imagine, right?
26:17 Absolutely. That's definitely the case. We are very cognizant of the fact that we have to make our software as reproducible as possible. I think a lot of that lends itself to just using good development practices.
26:30 Containerizing everything that you're doing for data processing, pinning all your versions, proper source control, all of these things.
26:40 And the infrastructure is definitely a piece of that, because if we can't deploy the pipeline in the same way a year from now, we won't be able to get the same results from the data. And that's the problem.
26:53 That's a big challenge. It's tricky, right? Because things like containers and source control will absolutely get you very far. But then you're also depending on these external things that have been very stable and very likely will be, like Lambda, for example. But what if Amazon went out of business? Right. Which is kind of laughable, they, like, doubled their revenue this year or something, but theoretically it's possible that AWS could decide to shut down or something like that, right?
27:24 Yeah. It's definitely true.
27:25 These are trade offs. Right. But at the same time.
27:28 it's so enabling for you to just scale out all this workload. In terms of how we create data pipelines, we do have to be aware of creating them in a way that you can run them outside of the cloud. Sometimes we need to allow a third party to run our data analysis in a regulated way, and that requires us to have essentially the same thing we're running internally. We run it in the cloud, which is efficient for scaling,
27:57 but we also need to be able to take that same piece of software and run it in a way that may not be on AWS, may not be in the cloud at all, and that creates some interesting software challenges.
28:08 Yeah, I'm sure, because so many of the APIs are cloud native, right.
28:13 Import boto3, have boto3 do this thing, or something like that, right? How do you handle that?
28:17 I think a lot of it comes down to structuring your workflows and your data pipelines in ways that they're not really using the cloud, so much as using the cloud as the file system and a way to gain compute. They're not using any piece of the cloud to actually affect the data processing itself. So ideally, if you create the data pipeline in a way that is appropriate, you can both run it in the cloud and run it locally and achieve identical results, because it's containerized, primarily. That's the reason, I guess. But doing that is a challenge itself. The process of getting a data pipeline built right is a big part of the field; I think a lot of people spend a lot of time thinking about this.
29:06 Yeah, I'm sure they do.
29:08 Getting it right is a huge enabler. Right. So let's talk about the process of maybe coming up with a new data pipeline. What does a software project look like for you all?
29:18 So it usually begins with a collaborative meeting with some experimental scientists, where we discuss what the experimental design is going to be like, and what we're going to be looking at in the data.
29:30 And then the experimentalists will go and generate some sequencing data. At that point, we generally take the data and open up some Jupyter notebooks, or sometimes even just bash scripts, to try to use some of those standard third-party tools. These are things like making sure all the sequences are aligned to each other, so that you know when there are differences, and making sure the quality is correct, things like that. These are pretty standard bioinformatics things.
30:03 Then for sequencing assays, there are usually a couple of spots where some real experimental logic goes into it, where often we'll have to write custom code in Python to say: if there's this sequence here, it means we should keep the sequence, and if there's this sequence here, we should divide the sequence in half, or something like that. That code gets written in Python, maybe in the Jupyter notebook or another script, and we do this really slow testing. Depending on the size of the data, it might be locally on a laptop or in a small cloud-based HPC-type cluster. That's where we're doing the testing.
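That experiment-specific logic is usually small. Here's a hypothetical sketch of that kind of rule; the marker sequences and the rules themselves are invented for illustration, not Beam's actual logic:

```python
def classify_read(read: str, keep_marker: str, split_marker: str):
    """Decide what to do with a sequencing read based on which
    marker sequence it contains (markers are made up here)."""
    if keep_marker in read:
        return ("keep", read)
    i = read.find(split_marker)
    if i != -1:
        # Split the read at the marker instead of keeping it whole.
        return ("split", (read[:i], read[i + len(split_marker):]))
    return ("discard", None)

for r in ["AAGGTTCCAT", "TTGACGTAAA", "CCCCCCCCCC"]:
    print(classify_read(r, keep_marker="GGTT", split_marker="ACGT"))
```

In a real pipeline this function would be applied to millions of reads, which is why the slow, spot-checked testing on a subset comes first.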
30:42 You're not trying to process all the results. You just want to spot check and see if it's coming out right before you turn it loose, right?
30:49 Right. Or we're very patient. It's a little bit of both. Sometimes it's very difficult to take only a small fraction of the data, but we try when we can. Once we settle on something that we think is pretty locked down, we'll take it out of the Jupyter notebooks. We don't try to use Papermill or anything like that; we try to get it out of there as soon as possible into some more complete script. It might be a shell script that runs a number of other scripts in order,
31:15 or we might start using some sort of workflow manager. Workflow managers in bioinformatics are pretty common, because everyone has the same problem of stringing all these third-party tools together with custom code. Right.
31:29 There's a lot of shared tooling as well, probably, right? They all use this library, that app.
31:33 Absolutely. Yeah.
31:34 There's whatever it is, right.
31:37 There's a whole bunch of standard bio informatic tools that we run on almost everything. And so some of the workflow managers are designed to specifically work very well with those tools, and others are pretty agnostic of what you're doing with them.
31:52 But one of the things I find interesting in listening to you talk about this, it just reminds me: so often we see these problems people are solving, right? Over here, we're using CRISPR to do all this work. And then you talk about the tools you use, and it's like, yeah, we're using NumPy, Pandas, Jupyter, and these kinds of things. The thing I find really interesting is, for software development, there's so much of the stuff that's just the same for everyone, right? They're doing the same thing. And then there's ten to 20% that this field does differently. But there's, like, 80% of: yeah, we should use source control, we're using Python,
32:33 we're using notebooks, we're using Pandas, and that kind of stuff. The similarities are way more common than I think they appear from the outside.
32:44 It's a great point. I think we'd all be better off if we reminded ourselves of that more often. Just because we're doing biotechnology and things with Python doesn't mean we're not largely very similar to other software developers doing data science on business topics or finance or just standard web development.
33:05 You could be at a hedge fund and be like, this isn't that different from what I'm used to, actually.
33:09 Yeah. And if you think like that, you end up with, I think, better practices overall in software.
33:15 Yeah, I totally agree. So I suspect that this fact is also what makes the data science tools and all the tooling around Python libraries and whatnot so good, because it's not just that all the biologists have made this part of it really good. It's that most of the people are all really refining the core parts, right?
33:36 Yeah. I think that's definitely the case. And some of the biology specific tools are a little wacky. Then when you start using things like Pandas, they're really amazing.
33:47 You can tell the amount of attention or something is different.
33:51 What does moving to production look like for you? So you talked about sometimes you start the exploration and stuff in notebooks, which is exactly what they're built for, and then moving to maybe a little more composition of scripts and whatnot. And eventually, somehow you end up with Lambda, cloud databases, things like that. What's that flow?
34:10 Yeah. So the process of, as we say, productionizing a pipeline is pretty well set now, and generally how it works is we say this pipeline is about done, and we'll hand it off to myself or one of my colleagues to start the process of getting it fully cloud capable and scalable. And what that means for us is to take the software in whatever form we've gotten it from our colleagues and put it into a workflow manager. And I think every company has their own version of workflow manager that they choose. We're using Luigi, which is fully Python based. It was originally developed at Spotify to do this sort of task.
34:56 It uses, like, a GNU Make type target file DAG creation.
35:03 I don't know all the technical terms to describe how the tasks are built, but essentially you have a task at the end and you say it requires the output of this other task, and then that task requires...
35:15 As they build up, you create a graph of what tasks need to be done to get to your output, a directed acyclic graph.
35:24 And then the workflow manager can decide, like, these two things are independent, so let's scale them out separately. But now this one has to wait for these two to finish and then get their results.
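The requires chain described here can be sketched in plain Python. This is a toy illustration of the pattern, not Luigi's actual API, and the task names are invented:

```python
# Toy sketch of the requires()-style dependency pattern a workflow
# manager like Luigi resolves. Illustrative only, not Luigi's real API.

class Task:
    def requires(self):
        return []  # upstream tasks this one depends on

    def run(self):
        raise NotImplementedError


def execute(task, done=None):
    """Depth-first: run every upstream task before the task itself."""
    if done is None:
        done = set()
    for upstream in task.requires():
        execute(upstream, done)
    if type(task).__name__ not in done:
        task.run()
        done.add(type(task).__name__)
    return done


# Hypothetical pipeline stages:
class FetchReads(Task):
    def run(self):
        print("downloading raw sequencing reads")


class AlignReads(Task):
    def requires(self):
        return [FetchReads()]

    def run(self):
        print("aligning reads to reference")


class CallVariants(Task):
    def requires(self):
        return [AlignReads()]

    def run(self):
        print("calling variants")


# Asking for the final task pulls in the whole chain in dependency order.
finished = execute(CallVariants())
```

In Luigi itself, tasks additionally declare an `output()` target, which lets the scheduler skip anything already completed.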
35:33 That coordination can be really tricky.
35:35 Exactly. And there's a number of common workflow managers in bioinformatics. I think the two most common are Snakemake and Nextflow.
35:43 Luigi has also been really good for us. We like it primarily because it is fully Python based, and it uses standard Python syntax, which allows us to really get under the hood if we need to, add some customization, extend it where we need to, or fix things that we don't like about it. And that was a really important part of our decision in choosing Luigi over some of these other workflow managers.
36:09 Yeah, for sure. I had a nice conversation with the Apache Airflow folks not too long ago, and one of the things that really struck me about this is the ability for people to work on one little part of the processing, a little bit like the little Python automation tools or little Python projects that you described earlier in episode 327. In that, instead of trying to figure out all this orchestration, you just have to figure out, well, this little task is going to do a thing, and maybe that means take the file and then copy it over there. And if your job was take a file here, copy it over there, that's a really simple job. You can totally nail that. You know what I mean?
36:50 If your job is to orchestrate this graph and make sure that it runs on time, and here's the failure case, all of a sudden that becomes super, super hard. So they seem really empowering, almost like the promise of microservices. It's like you get to focus on one little part and do that.
37:05 Yeah, it definitely helps. And it helps with that idea of having these small tasks.
37:11 It really helps with how you can develop and reuse the components for each task. As we said earlier, there are these third party tools that end up being used in almost all of our pipelines, and using something like Luigi or any workflow manager, you can reuse the tasks in different contexts as need be, and you can have your perfectly optimized way of using that task everywhere. That reuse is really nice. It's something I think a lot of software developers appreciate.
37:41 Yeah, for sure. So if you look at some of the folks that are using Luigi, so Spotify, as you said, created it, but also Foursquare, Stripe, Asana, SeatGeek.
37:52 A lot of companies that people have probably heard of. Like, these places are doing awesome stuff. Let's be like them.
37:58 Yeah. A lot of places use it for, like, Hadoop and things like that.
38:03 And one of the nice things, as you mentioned, is how Airflow has the same model where you can create these contrib packages, which are different connectors for Luigi or for Airflow, where you can connect them to either different cloud providers or different data stores, things like that. And that allows you to use Luigi or any workflow manager in numerous different contexts, whether it's locally on your own computer, running things in Docker containers, or whether it's deploying out to AWS and scaling massively horizontally.
38:38 These workflow managers really support that. And that's why they're a necessary component in how we productionize our data pipelines.
38:46 Again, they just seem so empowering for allowing people to focus on just each step independently, which is excellent. Did you consider other ones? Did you consider Airflow or Dagster or any of these other ones, or did you find this fit and go with it?
39:01 We did look at some other ones. We were using Nextflow for a little bit, which is a bioinformatics flavored workflow manager. It's very focused on bioinformatics as its primary use case, although you could use it for anything. Its syntax is similar to Groovy, and it's based in Groovy, and that was one of the unattractive things for us; it was a little hard to get under the hood and use it because of that. I did briefly look at Dagster after hearing about it a few episodes ago, I think maybe on a different podcast.
39:33 Yeah, I did have Tobias Macey on to give us an overview of the whole data engineering landscape, so possibly he spoke about it then, but I'm not sure where you heard about it.
39:41 Yeah. So I heard about it on other podcasts, probably this one as well, and I did look into it, but at that time it was pretty early. It didn't have any connectors to AWS in the ways that we like to use Luigi's connectors.
39:55 That's such an important thing, because otherwise you've got to learn the API of every single thing you're talking to.
40:00 Yeah. These days, knowing how Luigi works, it actually wouldn't have been that big of a task to look under the hood. So we did choose Luigi, and particularly we like how it handles deployment to AWS, and we use it on a service called AWS Batch, which, I guess, might be similar to, like, a Kubernetes pod, although I haven't done anything with Kubernetes or anything like that, so I'm not speaking from experience. But it essentially scales up EC2 instances, these elastic compute instances on the cloud, as you need them, and it gives out jobs to the virtual computers as necessary. So it spins them up, allocates jobs that run in a Docker container, they run, and when there's no more jobs, all the instances shut off.
40:55 So you come up with an AMI, an Amazon Machine Image, that's pre-configured, set up, ready to run. And then you say, I'm going to give you a bunch of data.
41:05 Each one of these pieces of data gets passed to a machine, and it runs and then maybe even shuts down when it's done.
41:10 Yeah, there is an AMI. We keep the AMI pretty simple because it's sort of the base for all of them.
41:15 The way Batch works is you have your top level AMI that's called a compute environment, I believe.
41:23 And then inside of it, you run the actual job. The job runs inside of a Docker container.
41:29 I see. So the Docker container is pre-configured with all the Python dependencies and the settings that it needs and whatnot, right?
41:36 So we have each task in Luigi, each little piece of work as its own Docker container. And then we push those out into the cloud and they get allocated out onto these machines.
41:50 They run their task.
41:51 Data comes in from S3, goes back out to S3, nothing is left on the hard drive, and then they disappear. They're these little ephemeral compute instances. And that's all managed by a workflow manager such as Luigi or Airflow or Nextflow.
42:08 That's pretty awesome. One of the things that I remember reading and thinking, that's a pretty crazy use of the cloud, was this Ars Technica article from, look at that, the year 2011. So if you think back to 2011, the cloud was really new.
42:26 And the idea of spending a ton of money on it and getting a bunch of compute out of it was still somewhat foreign to people. So there's this article I linked to called the '$1,279 per hour, 30,000 core cluster built on Amazon EC2 cloud', which is about a pharmaceutical company that needed to do a lot of computing. And they said, instead of buying a supercomputer, basically, we're going to go to the cloud and just fire off an insane number of cores.
42:57 And I think, if I remember reading this correctly, they weren't allowed to allocate that many cores in a single data center. So they had to coordinate, like, multi data center type processing as well, because, yeah, it's just the scale of everything. This seems like the type of work that you all might be doing.
43:18 Yeah, it does look very familiar.
43:21 Okay, you got any more stories? Can you tell us anything about this?
43:25 Yeah, we do occasionally have a certain type of molecular modeling job that we can scale very wide. And I think that sort of 30,000 number looks pretty familiar. I think our largest jobs to date have been about 10,000 CPUs wide and running for a few days. So maybe like four or five days.
43:48 I think the number was like four or five days on the 10,000 cores.
43:52 Yeah, it's a lot. And I think it was like over a million CPU hours on AWS Batch. And that was just something that we could really heavily parallelize. And we needed the data fast, and it worked.
44:11 You pull it all back and aggregate it all together at the end, and it was a really useful data set. And it's pretty amazing what you can do on some of these cloud providers by going really wide.
44:23 It's crazy. There are some places where it makes sense to, like, build true supercomputers. Oak Ridge has a thing called Summit, which is this insane supercomputer that they have. But a lot of times there's the latency of getting something like that set up. There's the overhead of, guess what, now you're an admin for, like, an insane, there's-only-three-in-the-world supercomputer type system.
44:48 It's got to be empowering to be able to just go, go and then let all this happen and not worry about all those details, right?
44:55 Yeah. There's definitely still some stress involved in starting one of these jobs. It's not cheap any way you slice it. We try to do everything we can to make it as cost efficient as possible.
45:08 But yeah, there are two aspects to it that really jump out to me. One is, what if you ran the wrong version of the code, the one that still had the bug? You've got to throw away all the data, you just ran for five days, and thousands of dollars got burned up, so there's a bit of a problem there. And now it's five days later until you get the answer as well. Right. That's one. And then I'm losing track of my thought for my other one. But yeah. Anyway, it's just got to be stressful to set that up and then press go. Right.
45:40 It definitely is. I think one of the benefits is that a lot of our computation team does have an experimental background. In doing experiments in the lab, these sorts of numbers, like tens of thousands of dollars for an experiment, are not really that uncommon, and with those kinds of things, you get really careful and you check all the boxes and double check everything. And so I think a lot of us have had that experience, and so even when we're dealing with software, we will be very careful and will do the tests and quality control before we really let it rip.
46:10 I worked on some projects when I was in grad school that ran on Silicon Graphics, big mainframe type things, and obviously of much lower importance than solving diseases and stuff; it was just solving math problems. But I remember coming in to work on the project one day and none of our workstations could log in to the Silicon Graphics machine. What is wrong with this thing? It was in the other room; you could hear it roaring away in there. It clearly was loud. And what happened was, it wasn't me, it was someone else in the group who had written some code. These things would run all night, and we'd come in in the morning and check them. And what had happened was they had a bug in their code, which they knew, and they were trying to diagnose it. So they were printing out a bunch of log stuff or something like that. Well, they did that in a tight loop on a high end computer for a night, and it filled up the hard drive until it had zero bytes left. And apparently the Silicon Graphics machine couldn't operate anymore with literally zero bytes. And so it just stopped working. They couldn't get a terminal. It took days to get it back, I believe.
47:17 But it's like that kind of stuff, right? I mean, you're not going to break EC2, but you don't know until the next day that, oh look, you filled up the computer and it doesn't work anymore. Right. When you're doing that much computing, you could run out of different resources. You could run into all kinds of problems.
47:33 Absolutely. And we are not without our war stories of doing this. But I think we definitely learned a lot of lessons along the way about how to monitor your job effectively and double check things. But sometimes you run a big job and it doesn't quite turn out right. That's the cost of doing business.
47:53 Yeah. I mean, it's very computationally expensive to explore this kind of stuff, but that's also what enables it, right? Without this computing power, it just would not be a thing.
48:02 Absolutely. A lot of the data just takes a lot of time to process, and there's not really any way around it. And even when you're iterating, you have to go through all the hoops to look at the data at the end.
48:12 So you talked about APIs, you talked about data store. What are you using for a database? Is this, like, hosted RDS AWS thing or what is the story with that?
48:24 Yeah. So we have a few different places to store data. Our larger scale internal data we store behind a Django based web app, and we use the Django ORM with a MySQL database on AWS, and that has worked surprisingly effectively, actually. I've heard some people say that the Django ORM is really slow when you scale out and things, but if you design it correctly, I think it'll scale.
48:55 I think that's so true.
48:57 I hear so many things about how ORMs are slow.
49:00 Or this thing is slow in this way. But if you have the queries structured well, you do the joins ahead of time, you have indexes, and you put the work into finding all these things, it's mind blowing when I go to sites.
49:14 I won't call anyone out, and I don't know if they've been updated, but you go to a site and you're like, this site is taking four or five seconds to load. What could it possibly be doing? I mean, I know it has some data, but it doesn't have unimaginable amounts of data. Right. Surely somebody could just put an index in here or, worst case, a cache, and it would just transform it. Right. So yeah, I'm glad to hear you're having good experiences.
49:38 We definitely fairly regularly run into slow queries. They're usually not too bad to solve. I'm sure at some point we'll get to something really wacky that will be challenging, but for the most part, we've been able to solve it through better query design and better indexing.
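As a small illustration of the kind of indexing fix being described, here's a sketch using in-memory SQLite from the standard library rather than their MySQL and Django setup; the table and column names are made up:

```python
import sqlite3

# Illustration of how an index changes a query plan, using in-memory
# SQLite (not the MySQL/Django setup discussed above).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence (id INTEGER PRIMARY KEY, name TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO sequence (name, length) VALUES (?, ?)",
    [(f"seq{i}", i % 5000) for i in range(100_000)],
)

query = "SELECT * FROM sequence WHERE name = 'seq42'"

# Without an index, filtering on name is a full table scan.
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(before)  # reports a scan over the whole table

# Add an index and the same query becomes a B-tree lookup.
conn.execute("CREATE INDEX idx_sequence_name ON sequence (name)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(after)  # reports a search using idx_sequence_name
```

The same idea applies through the Django ORM, where `db_index=True` on a model field or an entry in `Meta.indexes` generates the equivalent `CREATE INDEX`.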
49:56 Yeah. Do you ever do things where you opt out of the sort of class based query syntax and go straight to SQL queries here and there to make that part work better?
50:07 We have tried it for some particular sequence based searches that we do, and I actually found that most of the time you can write it in the ORM. It's just a little more complicated, but I do expect that at some point we will be writing raw SQL queries out of necessity.
50:25 But it's not the majority. You're mainly using the ORM and then it's okay.
50:29 Yeah. And I think having the appropriate data model will help the queries along the way. Yeah.
50:35 Absolutely. The thing I found the slowest about ORMs, and ODMs if you're doing document databases, is the deserialization, actually. It's not the query time, but it's like, I've got to generate 1,000 Python objects, and that just happens to be not that fast relative to other things.
50:53 You're talking about my Monday morning cycle.
50:58 And in those cases, I think that's the place where it makes sense to maybe do some kind of projection or something. I don't know how to do it in the Django ORM, but in MongoEngine, you can say, I know what I'm going to get back is an iterable set of these objects that match the data, but only actually fill out these two fields.
51:18 Most of the data, just throw it away. Don't try to parse it and convert it. I just want these two fields, and that usually makes it dramatically faster.
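The projection idea, skipping full object construction and keeping only the fields you need, can be sketched without any particular ORM; the record layout here is invented for illustration:

```python
# Sketch of the projection idea: instead of building a rich object per
# record, pull out only the fields you actually need.
# The record layout below is hypothetical.

raw_records = [
    {
        "id": i,
        "name": f"sample{i}",
        "sequence": "ACGT" * 250,          # bulky payload we often don't need
        "metadata": {"run": i % 10, "qc": "pass"},
    }
    for i in range(1000)
]


class Sample:
    """Full deserialization: every field parsed into an object (slow path)."""

    def __init__(self, record):
        self.id = record["id"]
        self.name = record["name"]
        self.sequence = record["sequence"]
        self.metadata = dict(record["metadata"])


full = [Sample(r) for r in raw_records]

# Projection: keep just the two fields needed downstream (fast path).
projected = [(r["id"], r["name"]) for r in raw_records]
```

In Django's ORM the analogous tools are `QuerySet.values("id", "name")` or `.only("id", "name")`, which avoid fetching and hydrating the unused columns in the first place.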
51:26 Yeah. We run into a number of bottlenecks at the deserialization layer, and we have been experimenting with a variety of different ways to solve those issues. And sometimes it means putting fewer layers of objects between you and the data, and that often speeds it up, even if it makes it a little bit harder to work with during development.
51:49 Yeah, absolutely. Or just say, you know what? I just need dictionaries this time. I know it's going to be less fun, but that's what it takes.
51:56 That was the fix on Monday morning. We do try to extensively use data classes for a lot of our interoperability.
52:04 When data comes in and out of a pipeline, we like to have it in a data class that's essentially stored in a shared repository, and then our Django web app also has access to that. So it knows what the structure of the data coming in is, and it knows what to serialize it to when it's going out. And Python data classes have been a really useful tool for that. But I think you were talking about that on another podcast a few weeks ago, maybe it was Python Bytes, that data classes can be slow, and sometimes it's better to just have a dictionary, even if it is a very highly structured dictionary.
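A sketch of that shared data class idea, where the pipeline side and the web app side agree on one structure; the field names are invented for illustration:

```python
from dataclasses import asdict, dataclass, fields
import json

# Sketch of a shared dataclass acting as the contract between a
# pipeline and a web app. Field names here are hypothetical.


@dataclass
class PipelineResult:
    sample_id: str
    edit_efficiency: float
    read_count: int


# Pipeline side: serialize a result to JSON on the way out.
result = PipelineResult(sample_id="S001", edit_efficiency=0.73, read_count=120_000)
payload = json.dumps(asdict(result))

# Web app side: deserialize, checking the keys match the shared contract.
data = json.loads(payload)
assert set(data) == {f.name for f in fields(PipelineResult)}
restored = PipelineResult(**data)
```

Because both sides import the same class, a field rename or type change shows up immediately instead of surfacing as a silent mismatch at the database layer.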
52:40 The problem is maintainability and whatnot, but you know, if it's five times slower, it's like, you know what, this time it matters, so we're going to just bite the bullet and deal with it. Sad times. For just a little bit more digging into it: you talked about the Django ORM and Django REST framework, which is all great. What's the server deployment story like? How do you run that thing? Is it with Gunicorn? Is it uWSGI? What's your setup on that side of things?
53:07 Ours is a little custom, I guess, though in some ways it's pretty standard. I'm blanking on exactly how it's set up now, but there's an NGINX proxy, and the app server, I'm blanking on it, might be Gunicorn.
53:22 I feel like Gunicorn and Django go together frequently. I'm not sure why they got paired up specifically, but yeah, it's a good one.
53:30 And we ended up deploying it out to AWS Elastic Beanstalk, which is a source of some conversation in our team, because there are some things we really like about it, and there are some things that are really annoying, in that the deployment is much more complicated than we would like it to be. But we have everything wrapped up in a pretty gnarly CDK stack that does a lot of the work.
53:54 It was messy, but you've solved it with CDK. Now you just push the button and it's okay.
53:59 It's exactly like that. We have a very automated deployment process. I wouldn't like to refactor it, but it's there and it works for us. I think it's a pretty standard Django deploy on the cloud, and it works well.
54:16 Yeah. Cool. Well, I think that's probably about it for time to talk about the specifics of what you all are doing, but there's the last two questions, as always.
54:26 So let's start with the editor. If you're going to write some Python code, what editor do you use?
54:29 VS Code, and a lot of the remote development environment on that.
54:34 Yeah. You use the remote aspect of it.
54:37 Yeah. We're doing a lot of our day to day work on EC2 instances, and the way VS Code works with instances in the cloud is really amazing. So I encourage anyone to check out that extension.
54:51 You get access to the file system on the remote machine, and basically it's just your view into that server, but it's more or less standard VS Code. Right. But when you hit run, it just runs up there.
55:02 It feels exactly like you're on your own computer. Sometimes I actually get confused about whether I'm on a remote machine.
55:07 It doesn't work because I'm in Virginia. I see. Yeah.
55:10 All right.
55:10 And then notable PyPI package.
55:12 Well, I'll have to shout out some of the ones we talked about. I would encourage people to look at AWS CDK if they're on AWS; I think there are some really interesting things there. And then also Luigi as a workflow manager. If people do any of these types of data pipelines that have tasks they can reuse, these sorts of workflow managers are really cool. Luigi is a pretty accessible one, probably, for anyone that's familiar with Python.
55:35 Yeah. All right. Fantastic. I'm just learning to embrace these workflow managers, but they do seem really amazing. All right. So for all the biologists and scientists out there listening, you've got this really cool setup and all this cool computational infrastructure. What do you tell them? How do they get started, maybe in biology or whatever?
55:54 I think biology is a good place to start. We're also happy to have people come from a software background who are really interested in learning the biology. And I guess as a final plug, we do have a few open positions. So if you're interested, go to our careers page and give us an application.
56:10 Are you guys a remote place.
56:12 Remote friendly or what's the story these days?
56:14 We're remote friendly. I'm actually living in Philadelphia, and our company is based in Cambridge.
56:19 Interesting. I mean, it's a short trip to the cloud no matter where you come from, basically.
56:26 Right on.
56:27 David, thank you for being here and giving us this look inside all the gene editing Python stuff you're doing.
56:33 Thank you, Michael. It's a pleasure.
56:34 Yeah, you bet. Bye bye.
56:35 This has been another episode of Talk Python to me.
56:39 Our guest on this episode was David Born, and it's been brought to you by Shortcut and us over at Talk Python Training, and the transcripts were brought to you by Assembly AI.
56:48 Choose Shortcut, formerly Clubhouse.io, for tracking all of your projects' work, because you shouldn't have to project manage your project management. Visit 'talkpython.fm/shortcut'.
57:00 Do you need a great automatic speech-to-text API? Get human level accuracy in just a few lines of code.
57:05 Visit 'talkpython.fm/assemblyAI'. Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async. And best of all, there's not a subscription in sight. Check it out for yourself at 'training.talkpython'. Be sure to subscribe to the show. Open your favorite podcast app and search for Python. We should be right at the top.
57:31 You can also find the itunes feed at /itunes, the Google Play feed at /Play and the Direct RSS feed at /RSS on 'talkpython.fm'. We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at 'talkpython.fm/youtube'.
57:52 This is your host, Michael Kennedy.
57:53 Thanks so much for listening.
57:55 I really appreciate it.
57:56 Now get out there and write some Python code.