#237: A gut feeling about Python Transcript

Recorded on Friday, Sep 27, 2019.

00:00 Let's start this episode with a philosophical question.

00:03 Are you human? Are you sure?

00:05 We could begin to answer that question physically.

00:08 Are you made up of cells that would typically be considered as belonging to the human body?

00:13 It turns out the answer is, maybe.

00:16 We have many ecosystems within us.

00:19 Understanding them is essential to our own well-being.

00:22 In this episode, you'll meet Sebastian Proost, who is using Python to study the bacteria in our world.

00:29 This is Talk Python to Me, episode 237, recorded September 27th, 2019.

00:34 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

00:54 This is your host, Michael Kennedy.

00:56 Follow me on Twitter, where I'm @mkennedy.

00:58 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython.

01:04 This episode is brought to you by Linode and Tidelift.

01:07 Please check out what they're offering during their segments.

01:09 It really helps support the show.

01:11 Hey, folks, I want to tell you about a new course that we just launched over at Talk Python Training.

01:16 This course lets you create web apps with pure Python.

01:20 Now, you may be thinking, sure, I can use Flask and create web apps with pure Python now.

01:25 No, I mean on the browser as well as on the back end.

01:30 This course is about the Anvil platform, and the course is called Anvil Web Apps with nothing but Python.

01:35 What's really cool is, yes, you have a back end.

01:37 You can write Python code there, and that's cool.

01:39 They manage the database.

01:41 But also, they've set it up so you can write Python code and use a visual designer with events on the front end to create a single-page app, or SPA, and write it entirely in Python.

01:53 So if all the different technologies involved in building web apps are overwhelming to you, but you still need to build a web app, check out this course.

02:00 It's a five-hour course, and it's even free.

02:02 Just visit talkpython.fm/anvil-course, and it'll take you right there.

02:07 Log in, click join this course, and you got it.

02:10 Now let's get to that interview.

02:12 Sebastian, welcome to Talk Python to Me.

02:14 Hi, Michael.

02:15 Thanks for having me.

02:16 It's great to have you here.

02:17 It's an honor.

02:17 We're going to have so much fun talking about some very small things that have maybe a big impact on our world.

02:25 Exactly, yeah.

02:26 So we're going to talk about studying bacteria and communities of bacteria using Python, which is going to be fascinating.

02:33 But before we get to that, let's start with your story.

02:35 How did you get into programming in Python?

02:37 So I actually got into programming about 20-something years ago.

02:40 I was 12, and my favorite game was Quake at the time.

02:45 And apart from the actual game, I also had a disk with all kinds of modifications.

02:49 And on there was the source code for the game logic that you need to make your own modification and the QuakeC compiler.

02:58 Now, the cool thing with QuakeC is that you can actually decompile the code from other modifications to readable code.

03:08 So I was just like decompiling stuff I liked, trying to figure out how it worked, and then trying to basically come up with my own stuff.

03:16 And I never really succeeded in making, like, the decisive version of Quake, but I did get a pretty good understanding of what's a function, what's a for loop, what's an if statement.

03:27 So I really learned programming through that medium.

03:30 I used to play Quake and Doom, and those were fun games.

03:34 And I remember the mods.

03:35 I think it was not Quake, but Doom, which was by the same company.

03:42 It was the predecessor, right?

03:44 Yeah.

03:44 And you could change things up.

03:46 I remember one we had was a Barney mod.

03:49 Do you know Barney, the weird purple dinosaur?

03:54 Yeah.

03:54 It would replace all the characters with this dinosaur thing and different dinosaurs, and they made weird noises, and it was amazing.

04:02 There was another one where it had these super scary creatures, and a friend of mine was playing it, and it was dark.

04:08 And I spooked him from behind, right?

04:10 As this creature from this mod came out, he got so scared, he ripped the keyboard drawer off of my desk.

04:16 I thought, oh my gosh, I should have.

04:18 It was just totally frightening.

04:19 It was so, so funny.

04:20 I really enjoyed those mods.

04:22 Those are great.

04:23 It's a really cool way to learn programming.

04:24 Yeah.

04:25 I had like a whole bunch of them, and some were like tiny.

04:27 They were just like a secondary fire for one of the guns.

04:30 Yeah.

04:31 I was just looking for a way to combine them, because you could only load one at a time,

04:36 and I wanted to have multiple running at the same time.

04:38 And so after I played around with QuakeC for a while, I did pick up some proper programming languages through high school.

04:45 But then ultimately, I went on and studied biology.

04:49 I always was a bit in doubt whether it was the right choice or not.

04:52 At the end of my biology degree, I noticed that there's something like bioinformatics.

04:56 I was like, oh, that's great.

04:58 It allows me to kind of combine the best of both worlds.

05:01 And I started a PhD in bioinformatics.

05:03 Now, at the time, we were using like a combination of Perl with some Java applets for our web stuff.

05:11 There was also PHP in there.

05:13 So it was kind of like...

05:15 Early days.

05:16 Yeah.

05:16 It was kind of like a whole bunch of things glued together.

05:18 So a couple of years later, when I had the opportunity to start like a new platform, I thought like, okay, I want to stick with like one language for the backend, one for the frontend.

05:28 And that's how I basically discovered Python and Flask.

05:32 Okay.

05:33 And you haven't looked back ever since.

05:35 It's just such a convenient way to develop things.

05:37 Yeah.

05:38 I got to think that the relevance of Python to biology is even more significant now than it was back then with all the data science developments and the other extra packages getting developed and all.

05:50 Sure.

05:50 So in the beginning, bioinformatics was really fueled by Perl.

05:54 And just over the course of a couple of years, this has been completely taken over by Python.

06:00 Yeah.

06:00 I can imagine.

06:01 Is it a lot?

06:02 What kind of data files and stuff are you working with?

06:05 Is it, you know, I imagine like a lot of, you know, genetic sequence character strings or like what kind of stuff are you working with when you're doing bioinformatics?

06:15 Right.

06:15 So this really is dependent on what field you're in.

06:17 In my case, it's mostly sequence information.

06:19 So that can be either raw sequences that come from a machine.

06:24 So those are very, very short reads, about 20 to a couple hundred nucleotides and the quality scores for each of them, like how reliable they have been sequenced.

06:33 Or it can also be like a full genome sequence, which is just a string of several millions of characters.
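
The raw reads described above pair each nucleotide with a quality score. As a minimal sketch, assuming the reads arrive in the common four-line FASTQ layout with Phred+33 encoded qualities (the episode does not name the file format, and `parse_fastq_record` is a hypothetical helper), a single record could be decoded like this:

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into name, sequence, and quality scores.

    Assumes Phred+33 encoding: each quality character maps to ord(char) - 33.
    """
    header, sequence, _separator, quality = lines
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


# Hypothetical record: a 4-nucleotide read with decreasing quality.
record = ["@read_1", "ACGT", "+", "II5!"]
name, seq, scores = parse_fastq_record(record)
print(name, seq, scores)
```

A higher score means the sequencer was more confident in that base, which is why low-scoring reads are often trimmed or discarded before any matching is done.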

06:40 Right.

06:41 A lot of A, C, T, and G in there.

06:42 But, like, bioinformatics is a very, very broad field.

06:45 So there are people that are doing completely different things with other types of data.

06:49 If you look at metabolite data, those are like the chemical compounds that appear in a certain sample.

06:55 That's a very different type of data entirely.

06:58 Okay.

06:58 It's super interesting.

06:59 It sounds like a fun place to be working for sure.

07:02 Biology is super cool.

07:03 Now let's talk a little bit just at large about communities of bacteria.

07:08 Because as humans, a lot of people are kind of creeped out by bacteria.

07:13 The thought of like there's something on my finger or worse, if I go to like a school and I touch the handle to the door, you know, what other stuff came off of other people, you know, sniffly kids or whatever.

07:28 Now I have this bacteria on me and probably it's actually viruses you care about, but whatever, right?

07:33 People are freaked out about it.

07:34 But it's a much more common and actually important part of life for all creatures, including humans, right?

07:42 Like there's a really interesting TED talk, which I'll link to from your boss.

07:47 And he starts the TED talk with something like, so you think you're a human, huh?

07:52 Yes.

07:52 So I might have to creep those people that are afraid of bacteria out a little bit more because basically for every cell that makes up your body, there is a bacteria somewhere on or in your body.

08:05 And if you look at your genetic information, the number of genes that you have compared to the total number of genes those bacteria have, that's like almost a hundred fold higher for bacteria.

08:16 Wow.

08:17 So there's a hundred times more genetic information of bacteria on us.

08:21 And we're about 50-50 in, maybe not in mass, but in terms of number of cells.

08:27 Just half bacteria, half human cells.

08:30 Exactly.

08:31 That's going to take a moment to sink in.

08:33 Now, I kind of knew that, but there's another stat that you threw out there that just numbers, I think this is an interesting topic partly because these numbers are so big and the things are so small that it's hard to have a concept around it.

08:46 It's hard to conceptualize it.

08:48 So you said that if you get, say, a gram of soil, that a gram, I mean, that's like tiny, right?

08:56 Like a little sugar cube size bit of soil that has up to a billion bacteria in it.

09:03 Just that little square, right?

09:04 Yes.

09:05 So it's like a very, very minute amount of material.

09:09 And there's like a ton of bacteria in there.

09:11 And if you look at the complexity or the number of different species in that sample, that's stellar.

09:17 Like in that single gram of soil, you could easily have like 10,000 different kinds of bacteria.

09:22 So it's a very, very complex environment.

09:24 And that interacts really with, in case of soil, the roots of plants, right?

09:29 On our body, there are so many different bacteria on our skin, but also on the inside in our gut.

09:35 And they actually affect us.

09:37 And it could be bad, but a lot of times it could be good, right?

09:40 Like the bacteria in your gut has a lot to do with your health.

09:44 You know, you hear about people taking antibiotics and it like killing out some of that stuff in their stomach.

09:50 And they're not as healthy necessarily because of that, right?

09:53 So yeah, it's crazy.

09:55 We actually tend to think of bacteria as something that causes disease, that causes inflammation and infections, right?

10:02 But a lot of bacteria can do exactly the opposite thing.

10:05 A lot of the bacteria that are living kind of like as our symbionts, really,

10:10 they're essential for our health.

10:11 So they can really improve our quality of life.

10:14 Right.

10:14 Maybe they help you with digestion, like certain types of proteins or things can be broken down because they're there, right?

10:21 Yeah, I can imagine so.

10:22 So I'm mostly on like the computational part.

10:25 So I'm not, like, a microbiologist.

10:27 So that's a bit farther off what I'm doing.

10:30 Yeah, yeah, sure.

10:30 But yeah, so there are tons of bacteria that are actually beneficial for us as well.

10:34 So we really need them as much as they need us.

10:37 Yeah, I think at least in the US, there's been a trend against things like antibacterial soap and stuff like that that tries to completely kill off bacteria.

10:45 I think as much as it might creep people out, I think there's a realization that bacteria are important.

10:49 Yeah, I think that's true, right?

10:52 Like if you take antibiotics, you're killing off the bad guys.

10:54 But if it's a broad spectrum of antibiotics, you're also killing some of like the healthy bacteria, the ones that you do need.

11:00 And that actually has a long lasting effect on your microbiomes.

11:04 It takes a while to recover from that.

11:06 So people are really looking into like kind of like better ways to treat these infections.

11:11 Yeah, for sure.

11:12 So we had this quote or this number that there's a billion bacteria in a gram of soil and there's 10,000 different kinds of bacteria.

11:24 So how do you come up with that?

11:26 Do you like train grad students to recognize bacteria and they take notes?

11:30 And they spin around.

11:31 How do you come up with that number?

11:33 It can't be counting it, right?

11:34 Yeah.

11:35 Usually when I tell people, like we look at the amount of bacteria and which bacteria are in like different samples,

11:40 they think of me hunched over a microscope and just counting what I see.

11:44 It doesn't quite work like that.

11:46 So all bacteria in their DNA have like a small part.

11:51 And it's kind of like similar enough that we can pick it up and copy it from all those different species of bacteria.

11:57 But it's different enough that we can actually identify them to some extent.

12:02 So there are like tiny differences in that sequence.

12:05 And that allows us to kind of, like, see, okay, this read is different from that.

12:08 So what we do is we basically take the DNA, copy over that one part a bunch of times, and then sequence all the DNA in those copies.

12:18 And it allows us to kind of like look in databases, like which part matches which bacteria.

12:23 And if we get more reads from one specific bacteria, then we know, okay, there's probably more of them in the sample than from another one where we see fewer reads.

12:31 Right.

12:32 You just basically count the number of matches of that species' DNA.

12:36 And you say, well, that must be roughly representative of its ratio in the general population.

12:41 Exactly right.

12:41 But this is kind of like a fingerprint, if you want.

12:44 So it tells us that certain bacteria with that sequence is in the sample, but it doesn't tell us all that much about the sequence, about the bacteria that we're looking at.

12:54 There's also like a more modern version, if you want, where we really try to go in and sequence everything, all the genetic information in the sample.

13:01 And that gives us much more detail about what those bacteria can do, because we get the full genome information for at least some of them.

13:08 But for now, our research is mostly based on the fingerprint sequencing.
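
The counting idea described above (more reads matching one fingerprint suggests more of that bacterium in the sample) can be sketched in a few lines. This is only an illustration under stated assumptions: the taxon labels are made up, and it presumes each read has already been matched to a name by a database lookup that is not shown here.

```python
from collections import Counter


def relative_abundance(read_assignments):
    """Convert one taxon label per read into fractions that sum to 1."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical labels, one per sequenced read.
reads = ["TaxonA", "TaxonA", "TaxonB", "TaxonA", "TaxonC"]
abundance = relative_abundance(reads)
print(abundance)
```

The result is a ratio, not an absolute count, which matches the "roughly representative of its ratio in the general population" framing in the conversation.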

13:14 Sure.

13:15 So something I've always wondered about, just sequencing DNA in general, is you've got so many cells mixed in there.

13:22 And especially in this case, you've got different species mixed in there.

13:25 Like, how do you know that this part is all one species?

13:29 How do you separate that?

13:31 How does it not just look like a giant scramble of, you know, if everyone in a room took their name and put them into letters and just threw that into a bucket?

13:39 Like, how do you know which goes right back to which, right?

13:43 How does that happen?

13:44 In the case of amplicon sequencing, it's kind of built into the design; it's designed to deal with that kind of information.

13:51 If you really take like everything, then you need to sequence pretty deep.

13:56 There's like some overlapping parts in there, and it allows you to assemble different genomes.

14:00 But there are like other techniques that people use as well, looking at the number of times you see a certain copy, because that means that it needs to be from a bacteria that's equally abundant, give or take.

14:12 Right?

14:13 So that allows people to link certain parts together.

14:17 But that's not something I've actively done myself yet.

14:20 Sure.

14:21 So I might actually be giving wrong information here, so be a bit careful.

14:25 Yeah, no worries.

14:26 It just seems super interesting to me that we're able to do that with such certainty and accuracy, given like how messy that it seems, you know?

14:35 So it's, I don't know, there's a lot about science.

14:37 It's pretty amazing.

14:37 Yes, absolutely.

14:38 For sure.

14:39 Let's go with the fingerprint side of things that you've been working on most recently.

14:43 You've got some soil or some other area that you've gotten this bacteria from.

14:49 You've done the sequencing.

14:50 You have these ratios of different fingerprints.

14:52 This is probably where you start using some Python to answer some questions, right?

14:57 Exactly, right.

14:58 So at that point, let's say you have about like a thousand different samples and you get like the relative abundance of 10,000, 20,000 different bacteria across all those samples.

15:08 Then you need to start using Python or some other framework to make sense out of it.

15:13 So normally, you should collect metadata for all your samples.

15:17 So that describes how a sample was collected, right?

15:22 So for soil, you might want to get the parameters of that soil, like the pH, nutrients that are in there, the place where you sample it, how much water is in there, right?

15:33 If it's from a human, you might want to ask them some medical questions.

15:37 You might want to get some information about their age, their body mass index, stuff like that, right?

15:43 And you need to really pull all those things together to get some new insights in biology, microbiology.

15:50 And to do that, Python is a great platform.
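
Pulling abundances and metadata together, as described above, amounts to grouping samples by a metadata field and summarizing a taxon within each group. The sketch below uses plain dictionaries so it runs without third-party packages; the sample IDs, field names, and values are all hypothetical, and `mean_abundance_by` is an illustrative helper rather than anything from the guest's platform.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical relative abundances: sample -> {taxon: fraction}.
abundances = {
    "S1": {"TaxonA": 0.50, "TaxonB": 0.50},
    "S2": {"TaxonA": 0.30, "TaxonB": 0.70},
    "S3": {"TaxonA": 0.10, "TaxonB": 0.90},
}

# Hypothetical metadata describing how each sample was collected.
metadata = {
    "S1": {"ph": 5.5, "site": "forest"},
    "S2": {"ph": 6.0, "site": "forest"},
    "S3": {"ph": 7.5, "site": "field"},
}


def mean_abundance_by(field, taxon):
    """Average one taxon's abundance within each value of a metadata field."""
    groups = defaultdict(list)
    for sample, taxa in abundances.items():
        groups[metadata[sample][field]].append(taxa.get(taxon, 0.0))
    return {group: mean(values) for group, values in groups.items()}


by_site = mean_abundance_by("site", "TaxonA")
print(by_site)
```

With a real matrix of thousands of taxa, the same grouping would more naturally be done in a data frame, but the logic stays the same.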

15:53 Sure it is.

15:53 That's awesome.

15:54 So are there like packages or libraries specifically built to address those types of questions?

16:00 There are some packages specifically for microbiology.

16:04 I think one is called QIIME.

16:06 Also, Pandas, NumPy, like your traditional data science packages work really well for these kinds of questions.

16:13 So yeah, if I need to do some very specific analysis, I like to fire up Jupyter notebook and just load everything in a data frame and start working with that.

16:23 Yeah.

16:23 Yeah.

16:24 The problem is that the farther you are in your analysis, the more in-depth knowledge you need to have about microbiology or the specific environment you're looking at.

16:34 And very often the biologists that have that knowledge, they don't have the computer skills to kind of like make sense of the ginormous matrix of abundances combined with metadata.

16:45 And so what I try to do is actually develop a platform where we can upload those matrices with all the abundances.

16:53 And then there's an interface hooked up to it that allows people without, like, the bioinformatics skills to also make sense of that data.

17:01 How do they do it?

17:02 Is that visually with like graphs and pictures or like heat maps or is that like CSVs and drop down into Excel or what do you end up giving them?

17:12 So it's kind of like a combination of all of those things, right?

17:15 So on the front page, there will be like a search button, very similar to what you will find on Google, right?

17:21 So where you can just type the name of their favorite bacteria and just see how abundant it is across the different samples.

17:27 And you can group samples based on different things in the metadata.

17:31 There are also different tools that allow you to visualize bacteria that are very often found together in samples as a network.

17:40 So that's interactive: people can drag and drop things, color-code the nodes or the edges based on the properties that they want.

17:46 And this is all a combination of basically Flask and Python in the backend and then some JavaScript on the front end.

17:53 So it's all like online that they can basically use their browser as an interface to the data.
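
A network of bacteria that are often found together can be built from pairwise counts. The following is a deliberately simple sketch that counts how many samples contain both members of a pair; it is an assumption for illustration, since the episode does not say how the platform scores co-occurrence, and real analyses often use correlation measures instead. The taxon names and the `min_shared` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(samples, min_shared=2):
    """Return {(taxon_a, taxon_b): shared_sample_count} for frequent pairs."""
    pair_counts = Counter()
    for taxa in samples.values():
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(taxa), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_shared}


# Hypothetical presence data: sample -> set of taxa detected in it.
samples = {
    "S1": {"TaxonA", "TaxonB", "TaxonC"},
    "S2": {"TaxonA", "TaxonB"},
    "S3": {"TaxonB", "TaxonC"},
    "S4": {"TaxonA"},
}
edges = cooccurrence_edges(samples)
print(edges)
```

Each key is an edge between two nodes, and the count could serve as an edge weight when the network is drawn in the browser.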

18:01 This portion of Talk Python to Me is brought to you by Linode.

18:04 Are you looking for hosting that's fast, simple, and incredibly affordable?

18:08 Well, look past that bookstore and check out Linode at talkpython.fm/Linode.

18:13 That's L-I-N-O-D-E.

18:15 Plans start at just $5 a month for a dedicated server with a gig of RAM.

18:19 They have 10 data centers across the globe.

18:21 So no matter where you are or where your users are, there's a data center for you.

18:25 Whether you want to run a Python web app, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24-7 friendly support, even on holidays, and a seven-day money-back guarantee.

18:40 Need a little help with your infrastructure?

18:41 They even offer professional services to help you with architecture, migrations, and more.

18:46 Do you want a dedicated server for free for the next four months?

18:49 Just visit talkpython.fm/Linode.

18:52 Is this the thing called MetaConnect that you built?

18:57 Exactly, yes.

18:57 So it's not published yet, so it's not out there for people to look at, but I hope to get that done soon.

19:04 Nice.

19:04 So this is a website that you just described, written in Python and Flask.

19:08 Is it tied to your university work or a company you work with?

19:13 Or is this something that's open source and that's going to be coming out?

19:16 Or what's its status?

19:17 Initially, I created the website as a way to communicate results back to one of the companies for whom I'm analyzing data.

19:25 So I quickly realized, like, okay, I can process all the data, but I'm not an expert in the field that they're active in.

19:31 So I'm going to need some way to get the data to them in a way that's useful for them, right?

19:37 So that's how it started, but we have lots and lots of different projects, also academic ones in the group.

19:44 And one of them is the Flemish Gut Flora Project, where we're trying to get, like, an impression of the bacteria that are resident in the gut of humans, mostly within the Flemish population.

19:57 So Flanders is like the north part of Belgium.

20:00 Yeah. So would you say that that is the most important or the most influential bacterial community that humans interact with or affect humans?

20:12 Like, I mean, there's bacteria in our skin, it's in our hair, all over, and so on.

20:17 But our stomach probably has a bigger effect than others.

20:21 Well, there are quite a lot of bacteria in there, right?

20:23 So if you take a stool sample, there's somewhere between a billion and a trillion bacteria in there, so even more than soil.

20:30 And there are already some associations that we know of of those bacteria with our well-being.

20:37 So a colleague of mine recently published an article where she could demonstrate that there's a link between some of the bacteria in your gut and your mental well-being.

20:46 So we could see that some bacteria are less frequent in people that are actually struggling with mental problems.

20:52 And there are associations with all kinds of diseases.

20:55 Think about, like, irritable bowel syndrome and stuff like that.

20:59 Wow. So this Flemish Gut Flora Project is just collecting all of this data.

21:04 And then do you guys use a similar analysis project and similar tools to study it, or is it different?

21:11 A lot of the methods between studying something like soil samples and human samples are similar, right?

21:17 So basically, you identify what's in there through sequencing.

21:20 And then once you have sequences, the pipelines are very, very similar.

21:25 We use, so there's not really a big difference between the methods that we use.

21:29 Just the microbiology and the conclusions and all that, totally different, possibly, but not the way you get the frequency diagram or whatever it is.

21:38 The biggest differences are in how the things are sampled and how the samples are processed, right?

21:45 So once it's turned into sequences, then it kind of all becomes bytes.

21:48 And then we start using some very similar workflows for different kinds of samples, yeah.

21:53 Yeah. What's that workflow look like?

21:54 It sounds like, you know, if you have 10,000 samples, that's a lot of work.

21:59 It sounds like that needs some kind of automation.

22:01 To some extent, yes, right?

22:02 But of course, the work in the lab that precedes the analysis, that's something that has to happen to some extent manually.

22:09 That's just a lot of hard work, right?

22:11 Exactly, right.

22:11 So there's like a whole logistics part before what I actually do.

22:16 That's just a ton of work.

22:18 And fortunately, we have some very talented lab technicians that take care of this process.

22:23 Yeah, for sure.

22:24 So another thing around this Gut Flora Project that you were talking about is you have to do an online questionnaire because it's one thing to just go collect the samples from thousands of people.

22:35 And the people who eat meat more often have this outcome or people who are vegan and don't have any dairy.

22:44 They have that other outcome, right?

22:45 So you actually had to set up a way to collect all that information as well, right?

22:50 Yes.

22:51 So we try to recruit volunteers to participate in this project through different channels.

22:57 And the first thing they need to do is to kind of, like, fill in a registration form where they say, okay, I understand what's going on.

23:05 There's a privacy statement in there.

23:08 Also, they need to give informed consent that they know what they're volunteering to and that they can opt out at any point of the study.

23:14 And then there's an intake questionnaire where people give us some information about their health, their lifestyle, and their diet.

23:22 Now, this is done with commercial software, but it doesn't quite get us all the way there, right?

23:29 So it does about 90% of the things we wanted to do.

23:31 But then there are a few very specific things where we wanted to do something else.

23:35 For instance, we sent sampling kits to all the participants.

23:40 So we want to know that the address that they entered in the questionnaire actually exists.

23:45 So when they're filling in their address, it will quickly use a Python web service that I created to check this against bpost, so the Belgian post office website, to make sure that that address exists.

23:58 And in case they might have made a typo, we will tell them like, oh, there's something wrong.

24:02 Are you sure this is your address?

24:04 Maybe like there's something off with it.

24:06 Maybe you made a typo somewhere.

24:07 Can you correct it?
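
The core of such an address check can be sketched independently of any web framework. This version is an assumption-heavy illustration: it compares against a small hard-coded lookup table rather than the postal operator's site (whose interface is not described in the episode), and `KNOWN_STREETS` and `check_address` are hypothetical names. It uses `difflib.get_close_matches` from the standard library to propose a correction when the input looks like a typo.

```python
import difflib

# Hypothetical reference data: postal code -> street names.
# The service in the episode queries the post office website instead.
KNOWN_STREETS = {
    "3000": ["Herestraat", "Naamsestraat", "Bondgenotenlaan"],
}


def check_address(street, postal_code):
    """Return a JSON-style dict saying whether the street is known."""
    streets = KNOWN_STREETS.get(postal_code, [])
    if street in streets:
        return {"valid": True, "suggestions": []}
    # Offer near matches so the participant can fix a likely typo.
    suggestions = difflib.get_close_matches(street, streets, n=3, cutoff=0.8)
    return {"valid": False, "suggestions": suggestions}


print(check_address("Herestraat", "3000"))
print(check_address("Herestrat", "3000"))
```

A web endpoint would return this dictionary as JSON, and the questionnaire page could then show the "are you sure this is your address?" prompt whenever `valid` is false.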

24:08 Also, as part of this process, people need to schedule a visit with their general practitioner.

24:15 So to make it very easy for our participants, we made another web service that can kind of suggest the details for a general practitioner based on what they entered.

24:25 So it's a little bit like this type-ahead form.

24:27 Once they start typing the last name of their doctor, it will immediately suggest like, okay, is it this doctor that you mean?

24:33 Here's this address.

24:34 And if they click it, it automatically completes the form.
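
The suggest-as-you-type lookup for a doctor's last name is essentially a prefix search. Here is a minimal sketch using a sorted list and `bisect` from the standard library; the directory entries are invented placeholders, and how the actual service stores or queries practitioners is not stated in the episode.

```python
import bisect

# Hypothetical directory of (lowercase last name, display details), sorted by name.
PRACTITIONERS = sorted([
    ("janssens", "Dr. Janssens, Example Street 1"),
    ("jacobs", "Dr. Jacobs, Example Street 2"),
    ("peeters", "Dr. Peeters, Example Street 3"),
    ("janssen", "Dr. Janssen, Example Street 4"),
])
NAMES = [name for name, _details in PRACTITIONERS]


def suggest(prefix, limit=5):
    """Return details for up to `limit` practitioners whose name starts with prefix."""
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    # Jump to the first name that could match, then scan forward.
    start = bisect.bisect_left(NAMES, prefix)
    matches = []
    for name, details in PRACTITIONERS[start:]:
        if not name.startswith(prefix) or len(matches) == limit:
            break
        matches.append(details)
    return matches


print(suggest("Jans"))
```

Because the list is sorted, all names sharing a prefix sit next to each other, so the scan stops at the first non-matching entry.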

24:36 You know, I wish more sites were like that.

24:39 It would be so nice to be so friendly and say, yeah, you've given us enough.

24:45 Here's just click here.

24:47 We got you.

24:47 It's actually extremely important in this type of research because we're asking these people to basically give up some of their free time to help us out with our research.

24:56 So we really are very grateful that people actually do this.

25:01 Also, they need to take a stool sample, freeze it, and then bring it to like a drop-off point,

25:06 which is usually a local pharmacy.

25:07 But again, that's actually quite a lot of effort that you're asking of someone.

25:11 So at least the parts that we can make easy for participants, we try to make as easy as possible.

25:15 Yes.

25:16 Yeah, yeah.

25:17 This is an easy one.

25:17 But definitely, I mean, there's so many places that could adopt that philosophy and just don't.

25:23 One of the things that's interesting about this that I think is worth pointing out is you said this is a pre-packaged commercial questionnaire site company.

25:34 Yeah, they take care of it, but they let you customize the page by putting in a little bit of JavaScript, right?

25:39 Essentially, when they filled in their address and they click the next button, that's the point where we can fire a little bit of JavaScript.

25:46 So mostly using jQuery to call the web service with the data they entered and then send a response back.

25:54 Yes, this is valid or no, this is not valid.

25:56 And then have the website respond appropriately based on that.

26:00 I think that's a really interesting way to bring in the capabilities that you have using Python and Flask to something that is not really extensible, but it has this little tiny hook where it lets you just do just enough to get into the space.

26:16 And, you know, it lets you not have to say, well, we either can use this the way it is or we can from scratch build our own questionnaire site, but we can just lay on top of it.

26:25 It definitely feels a little bit like cheating, right?

26:28 But yeah, especially because we're working with medical data, it's so important to get everything right in terms of security.

26:35 You have to comply with GDPR and some other regulations as well.

26:40 So building our own platform would just be like a tremendous amount of work.

26:44 So that's why we opted to go with this commercial platform.

26:48 But indeed, we do need to use some tricks here and there to actually get it to do everything we want it to do.

26:53 It makes sense because your goal is not to build a questionnaire site.

26:56 Your goal is to get samples and information about bacteria colonies, right?

27:02 If you could avoid building that site, like as much as you can, just avoid it, right?

27:06 Absolutely, yes.

27:07 Yeah, but the other one, the first one we talked about, the MetaConnect that the microbiologists can use to study and get the data back in the way they want it.

27:16 That one sounds more core to your goals, right?

27:20 So that one, you said you built that one from scratch with Flask, huh?

27:23 It's actually something that has been going on for some time.

27:26 So previously, it was used to study gene expression in different types of tissues of plants.

27:34 But if you think of gene expression, so that's basically which genes are on or off and how much are they expressed in different tissues.

27:42 It's also an abundance matrix.

27:44 And so I kind of realized fairly early on when I started working on the microbial data that it's very, very similar.

27:50 And a lot of the techniques and tools are the same.

27:54 So I just forked the repository and started like making the necessary changes to actually support microbiome data as well.

28:01 Yeah, sure.

28:01 Get in there, change the title and just go from there.

28:04 Nice.

28:05 Yeah.

28:05 Tell us about it.

28:06 Tell us about some of the technologies that you used and stuff like that.

28:09 Biological data is actually very well suited for relational databases.

28:12 So usually you have like a species or a data set, and then there are sequences related with that.

28:18 And with a sequence, you can have functional information.

28:21 You can group sequences into clusters or gene families.

28:25 So you kind of have this like structure that fits a hierarchical or a relational database very well.
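
The species, sequence, and gene family relationships described above map naturally onto tables with foreign keys. The sketch below uses the standard library's `sqlite3` with an in-memory database purely so it is self-contained; the episode mentions MySQL, and the table and column names here are assumptions, not the platform's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: each sequence belongs to one species and,
# optionally, to one gene family (cluster).
conn.executescript("""
CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE gene_families (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sequences (
    id INTEGER PRIMARY KEY,
    species_id INTEGER NOT NULL REFERENCES species(id),
    family_id INTEGER REFERENCES gene_families(id),
    name TEXT NOT NULL,
    description TEXT
);
""")

conn.execute("INSERT INTO species (id, name) VALUES (1, 'Example species')")
conn.execute("INSERT INTO gene_families (id, name) VALUES (1, 'Family 1')")
conn.executemany(
    "INSERT INTO sequences (species_id, family_id, name, description) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, 1, "seq_001", "placeholder functional annotation A"),
        (1, 1, "seq_002", "placeholder functional annotation B"),
    ],
)

# Count sequences per species and gene family.
rows = conn.execute("""
SELECT species.name, gene_families.name, COUNT(*)
FROM sequences
JOIN species ON species.id = sequences.species_id
JOIN gene_families ON gene_families.id = sequences.family_id
GROUP BY species.name, gene_families.name
""").fetchall()
print(rows)
```

In a Flask application the same structure would usually be declared as ORM model classes rather than raw SQL, but the relationships are identical.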

28:32 So there's definitely some MySQL connection there.
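
[Editor's note: the hierarchy described here (a species or data set, sequences that belong to it, functional information per sequence, and gene families that group sequences) could be laid out as relational tables roughly as follows. This is only a sketch, not the project's actual schema: every table and column name is hypothetical, and the standard library's sqlite3 stands in for MySQL and the Flask packages so the example runs on its own.]

```python
import sqlite3

# Hypothetical schema; all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE gene_families (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE sequences (
    id INTEGER PRIMARY KEY,
    species_id INTEGER NOT NULL REFERENCES species(id),
    gene_family_id INTEGER REFERENCES gene_families(id),
    name TEXT NOT NULL
);
CREATE TABLE annotations (
    id INTEGER PRIMARY KEY,
    sequence_id INTEGER NOT NULL REFERENCES sequences(id),
    description TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO species (id, name) VALUES (1, 'example species')")
    conn.execute("INSERT INTO gene_families (id, name) VALUES (1, 'family_1')")
    conn.executemany(
        "INSERT INTO sequences (id, species_id, gene_family_id, name) "
        "VALUES (?, ?, ?, ?)",
        [(1, 1, 1, "seq_a"), (2, 1, 1, "seq_b")],
    )
    conn.execute(
        "INSERT INTO annotations (sequence_id, description) "
        "VALUES (1, 'placeholder function')"
    )
    conn.commit()
    return conn


def sequences_in_family(conn, family_name):
    """Return the names of all sequences grouped into one gene family."""
    rows = conn.execute(
        "SELECT s.name FROM sequences s "
        "JOIN gene_families f ON s.gene_family_id = f.id "
        "WHERE f.name = ? ORDER BY s.name",
        (family_name,),
    )
    return [row[0] for row in rows]
```

[Calling sequences_in_family(build_demo_db(), "family_1") walks the family-to-sequence link with a single join, which is the kind of lookup this structure makes cheap.]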

28:36 And for that, I'm using all like the Flask packages.

28:38 So just Flask-SQLAlchemy is in there when you're uploading the data through like an admin panel.

28:45 So that's just Flask-Admin that I used.

28:47 Users don't actually get to see that part.

28:49 So it's a very standard interface.

28:51 So through that interface, you can upload the data.

28:53 So basically as a bioinformatician, you processed all the reads that you got,

28:58 and you have your abundance matrix and you upload it.

29:01 So then there are some standard analyses that are just triggered on the server and just ran.

29:07 As soon as you upload the data, it's being...

29:09 Right.

29:10 So some sort of like data ingestion.

29:12 Yeah.

29:12 It loads it up, converts it into the database format, things like that.

29:16 Exactly right.
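
[Editor's note: one plausible shape for the "convert it into the database format" step is to unpivot the uploaded abundance matrix into one row per sample and taxon. The sketch below is an assumption, not the project's code: the tab-separated layout (taxa as rows, samples as columns, integer counts), the function name matrix_to_rows and the demo data are all made up, and it uses only the standard library rather than pandas.]

```python
import csv
import io


def matrix_to_rows(text):
    """Turn a tab-separated abundance matrix into (sample, taxon, count) rows.

    Assumed layout: the first column holds taxon names, the header row
    holds sample names, and every other cell is an integer count.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        taxon, counts = record[0], record[1:]
        for sample, count in zip(samples, counts):
            rows.append((sample, taxon, int(count)))
    return rows


# Made-up demo matrix: two taxa measured in two samples.
DEMO_MATRIX = "taxon\tsample_1\tsample_2\ntaxon_a\t10\t0\ntaxon_b\t3\t7\n"
```

[Each tuple returned by matrix_to_rows(DEMO_MATRIX) could then be inserted as one row of an abundance table keyed by sample and taxon.]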

29:16 And so it also does a couple of analyses immediately as you upload the data.

29:20 And so for that part, I'm just using like pandas, NumPy, and scikit-learn to do that kind of stuff.

29:26 So just a whole lot of Flask going on there on the back.

29:30 And it's cool that you're using Flask-Admin.

29:32 And of course, if people don't have to, they don't see it.

29:34 If it's just the admin see it, it could just, you know, admin panels are always like the worst looking part of the UI.

29:40 Exactly right.

29:41 So the parts that actually users get to see, that's something where I do spend a bit more time on how it actually looks.

29:47 Nice.

29:48 So this thing, the goal is going to be to open source it.

29:51 Do you think that it makes sense?

29:53 I guess probably it doesn't make sense to run it as like a service that people can generally use or does it?

30:00 I'm thinking because of the privacy.

30:02 But maybe if it's for things like plant data and stuff, like the plants don't have a GDPR.

30:06 Yeah, that's true.

30:06 Is it something that can be generally useful as a site or just as a project people could clone for their projects?

30:12 So I kind of see a couple of different use cases for this.

30:14 So one is communication, right?

30:16 Let's say you did an awesome study with some microbiome data that you can make public.

30:20 You could actually set up your own platform, upload all your data in there and use this kind of like a supplementary web page to your publication.

30:27 So it's a great way for communicating things to other people that might not have the bioinformatics expertise to do that kind of analysis themselves.

30:37 I also personally use it a lot: I just run it locally, upload the data set that I'm analyzing, and I get all those tools at the click of a button rather than making all those one-off scripts to do some kind of analysis.

30:51 It's almost like a GUI on top of the libraries that you guys are using to study the bacteria.

30:57 It really simplifies analyzing the data.

31:00 There are also now some students that are starting to use it in our group.

31:03 So they have their project, they have their samples.

31:06 They went through the learning curve to learn how to process the reads and get that abundance matrix.

31:11 It's mostly done using our scripts.

31:13 So at that point, it's kind of like nice if you can give them something that's nice and easy to finish up the rest of the analysis.

31:19 And so that's also a use case we have.

31:22 Yeah, you talked about having students maybe working on some new projects that are kicking off at the university there.

31:29 For students out there listening, what kind of skills, you know, maybe they studied biology or maybe they just studied data science.

31:36 What kind of skills do you look for for people to work on your projects?

31:41 If it would be to extend something like MetaConnect, it would be nice to have someone that has some coding skills already.

31:48 Right.

31:49 So this semester there will be one student and he'll definitely be working a little bit on the code of MetaConnect.

31:55 So then I'm looking for someone that has some coding skills.

31:58 But also having some understanding of the biology is necessary.

32:02 Right.

32:02 You don't get that far if you don't understand any of the biology at all.

32:06 So it's really like this combination.

32:08 And bioinformatics is such an interesting and broad field that the amount you need of either the informatics or the biology can differ based on different projects.

32:17 So projects that I would supervise, they would generally not include any work in a lab.

32:23 But there are also people who are doing kind of like a combination of some lab work with some bioinformatics.

32:30 That's also perfectly possible.

32:31 Yeah, I'm sure it sounds fun.

32:32 Students are always looking for research projects to be part of so that they can make those connections at universities.

32:39 But it's interesting just to think about biology students trying to get into the software side, kind of like you did with bioinformatics or the reverse, right?

32:48 Software people who are kind of interested in biology and whether they can cross over for that.

32:53 Actually, I know people did it both ways, right?

32:56 So there are plenty of bioinformaticians that have a background in biology.

33:00 And then the first couple of years really struggled with the scripting, picking up those kind of skills.

33:05 I also know people that have like a traditional CS degree and that way got into bioinformatics.

33:09 So both are possible.

33:11 Actually, it was a couple of years ago necessary to pick one part or the other because there was no formal master in bioinformatics at the university at the time.

33:20 So now we actually have that kind of course at the university.

33:23 So you can also decide to just do a bioinformatics master and then you kind of have the biology and the computer science in equal amounts.

33:31 Yeah, that's great.

33:32 One of the libraries that you said you're using on the MetaConnect is something called cytoscape.js.

33:41 Well, it's actually a JavaScript library.

33:44 So that's basically a library that allows you to visualize networks in the browser.

33:49 And so I briefly mentioned it before.

33:51 You can represent these communities of bacteria as networks.

33:54 So think of it as a social network of bacteria, right?

33:58 So bacteria that tend to appear across different samples, they're co-abundant.

34:04 And could be that they're somehow related to each other.

34:07 They're either dependent on each other or maybe they're dependent on the same other resource.

34:12 But the thing is you spot them very often in the same samples.

34:15 And so that allows you to kind of like represent the bacteria as nodes and draw an edge between them.

34:21 Allows us to kind of study the communities of bacteria because we do expect that bacteria that are often in the same samples somehow need each other to do what they need.
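
[Editor's note: the "social network of bacteria" idea can be sketched as follows. Taxa become nodes, and an edge is drawn between two taxa that show up together in enough samples. This is a deliberately simplified presence/absence version, not the analysis used in the project, which may rely on proper co-abundance statistics. The function name, the min_shared threshold and the demo data are hypothetical; the output follows the elements layout that Cytoscape.js accepts, where nodes carry data.id and edges carry data.source and data.target.]

```python
from itertools import combinations


def co_occurrence_elements(samples, min_shared=2):
    """Build Cytoscape.js-style elements from per-sample sets of taxa.

    `samples` maps a sample name to the set of taxa observed in it.
    An edge is added when two taxa share at least `min_shared` samples.
    """
    taxa = sorted(set().union(*samples.values()))
    elements = [{"data": {"id": taxon}} for taxon in taxa]
    for a, b in combinations(taxa, 2):
        shared = sum(
            1 for present in samples.values() if a in present and b in present
        )
        if shared >= min_shared:
            elements.append(
                {"data": {"id": f"{a}__{b}", "source": a, "target": b, "weight": shared}}
            )
    return elements


# Made-up demo data: which taxa were seen in which sample.
DEMO_SAMPLES = {
    "sample_1": {"taxon_a", "taxon_b"},
    "sample_2": {"taxon_a", "taxon_b", "taxon_c"},
    "sample_3": {"taxon_c"},
}
```

[With the demo data only taxon_a and taxon_b share two samples, so the result is three nodes and a single edge. Serialized with json.dumps, the list could be handed to a Cytoscape.js viewer as its elements.]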

34:34 This portion of Talk Python to Me is brought to you by Tidelift.

34:37 Tidelift is the first managed open source subscription, giving you commercial support and maintenance for the open source dependencies you use to build your applications.

34:46 And with Tidelift, you not only get more dependable software, but you pay the maintainers of the exact packages you're using, which means your software will keep getting better.

34:55 The Tidelift subscription covers millions of open source projects across Python, JavaScript, Java, PHP, Ruby, .NET, and more.

35:02 And the subscription includes security updates, licensing verification and indemnification, maintenance and code improvements, package selection and version guidance, roadmap input, and tooling and cloud integration.

35:14 The bottom line is you get the capabilities you'd expect and require from commercial software,

35:20 but now for all the key open source software you depend upon. Just visit talkpython.fm/Tidelift to get started today.

35:30 And I bring it up because it looks like such a cool library.

35:32 Obviously, it just does graphing type stuff.

35:35 But the variety of the visualizations there are super cool.

35:39 So people who are looking to draw those types of relationships, I think there's a lot of stuff to take here.

35:46 There's even one where it represents like different animals linked together and the vertices are actually pictures of animals and stuff like that.

35:54 So it's a really nice library for that.

35:56 Was it pretty easy to use?

35:57 It's very, very performant in terms of drawing the graphs.

36:00 Yeah.

36:01 But you kind of have to build the interface around it.

36:04 It just draws nodes and edges, gives you some functions to interact with them.

36:08 But if you want to have like an option to change the color of the nodes, those little things you have to add yourself.

36:15 So it's pretty straightforward once you have your data in the right format to visualize it using Cytoscape.js.

36:21 I struggled quite a bit to build the interface and all the buttons that I required for my viewer to get them to link.

36:29 To get it to behave just right.

36:31 Exactly.

36:31 Sure.

36:32 It's one thing to draw and it's another to make it interactive.

36:35 For sure.

36:36 So one of the areas that I thought would be fun to talk about would be the pressures of being in an academic space and what counts as like merit and credit in that world versus industry.

36:48 So it sounds like a lot of the work that you're doing is around academics, but you also have a little bit of industry experience going on at the same time.

36:58 And, you know, with academics, it's all about the academic paper, right?

37:03 Yes.

37:03 Is that a challenge for my goal is to build this cool library in Python to help biologists, but I can't publish that.

37:10 So what am I going to actually do along with that?

37:12 What's the tension like there?

37:14 I am analyzing some data for a company.

37:18 And I think it's a really cool project, but it's their intellectual property.

37:21 So I'm not talking too much about it.

37:23 And so basically the findings there, that's all the company's thing.

37:27 But so software that I'm designing, that's more the academic part.

37:30 It's true that like in academia, basically your work is measured by the number of publications that you have.

37:36 So I need to kind of keep a balance between them, right?

37:39 So I need to make sure that I also get some publications for my resume.

37:43 Do you think that the industry is going to, or academia, I guess, is going to change somewhat to recognize software contributions as part of research more?

37:53 I mean, this is not a problem just focused on biology, right?

37:57 This is a problem in physics and chemistry and mathematics and astronomy, right?

38:03 Whatever, right?

38:03 Anywhere where there's people building these great libraries, that's taking away from their time to do research, maybe.

38:09 If you're really building like a library for other people to use, for sure, that's going to be an issue.

38:14 I've mainly been developing applications.

38:17 And so then the way we did it was we kind of released the application along with data sets.

38:23 And then you can kind of do some research with that data set as well.

38:28 So you kind of have done a combination of novel data sets with novel software.

38:33 And people can start using it.

38:34 And that can also become like a publication on its own.

38:37 However, there it does make it challenging to keep on supporting the software.

38:41 Because once the publication is out, right?

38:44 You got what you need.

38:46 There's no more reason to keep adding features to it or whatever, right?

38:50 Yes.

38:50 So then I need to continue basically building a larger data set in combination with better software to kind of keep it going.

38:59 And so I think during my PhD, in one of the projects I was working on, we really were very successful in doing that.

39:04 It was always a combination of like, here's an online platform with like a cool tool that we developed.

39:10 But also each time, here's like a larger data set, more species.

39:14 It certainly makes sense to build tools and applications to answer your questions, right?

39:19 Because I've worked in research groups before.

39:22 And a lot of the work was, we need to answer this question.

39:27 We need to visualize this thing.

39:28 We need to transform that data.

39:29 And it was just writing software to do that.

39:32 So have you heard about the Journal of Open Source Software, JOSS?

39:36 I have.

39:36 I actually think it was through one of your podcasts that I learned about it.

39:40 Yeah.

39:40 Yeah.

39:40 We had it on the show.

39:41 Is that something relevant in your space?

39:44 So far, I don't have any experience with it.

39:46 But I do like the idea of breaking down these like big research projects into smaller components

39:53 that can be reused, right?

39:55 MetaConnect, for instance, is definitely a monolith.

39:57 Yeah.

39:58 It's like a whole package that does so many different things.

40:02 It would be cool if I could basically take some components out of there and maybe get

40:07 like a smaller publication out of it through JOSS or some other journal.

40:10 And it's something I've been trying to do recently, but so far, I haven't really been able to test

40:15 it or to put it in practice.

40:16 Yeah.

40:17 I was just thinking, okay, it seems like some of your stuff that you're building, if repackaged

40:22 correctly, it might fit in there, right?

40:23 Yes.

40:24 So for instance, we have a lot of metadata for the Flemish Gut Flora project.

40:28 Right.

40:28 So we have thousands of people and we have all this medical information, this information

40:33 about their diet and their lifestyle.

40:34 And I have a viewer for that.

40:36 Now, that's something you can basically use across different disciplines, right?

40:40 Every study where you have a ton of metadata would benefit from such a viewer.

40:45 That's, for instance, one of the parts that I would like to kind of like separate into its

40:48 own repository that could become its own project.

40:51 Cool.

40:51 Well, maybe I guess school has just started, so it's probably not the easiest time to find

40:56 free time, but maybe next summer or something or over winter break.

40:59 Yeah.

41:00 I'll have a look at it once I have some free time on my hands to work on it.

41:04 Absolutely.

41:04 So where's this research going and what are you building next?

41:09 Like what more analysis do you want to do?

41:11 What more libraries or apps are you building?

41:13 I'm actually quite happy with the microbiome studies.

41:16 It's a very challenging field.

41:19 And a lot of the things that we are doing now are association studies.

41:23 So we look at the population and see like, okay, if you have more of this bacteria, then

41:28 maybe there's a higher chance that you have certain disease.

41:32 But that's just an association, right?

41:34 It doesn't mean that having that bacteria will actually cause a disease.

41:37 Correlation is not causation.

41:38 Correlation or causation.

41:39 Yep.

41:39 Exactly.

41:40 Exactly.

41:41 And so right now we're moving also towards more like longitudinal studies, intervention

41:47 studies to kind of like really establish those causal links.

41:51 So that we can say like, yeah, it's really the bacteria that causes something or no, it's

41:55 actually disease that causes the person to eat something differently or behave differently.

42:00 And that is actually changing the microbes in the gut, for instance.

42:05 Is it a signal or is it a cause?

42:06 So that's now happening.

42:08 And then of course, if you get like time resolved data, that's going to be like a next challenge,

42:12 I think, to integrate in all of this.

42:13 So it sounds like more data might be on hand and you already have a lot of data.

42:18 Where does machine learning factor in your world?

42:22 Have you tried to use like machine learning models to understand this and make these inferences?

42:27 I know machine learning is extremely good at taking a picture and saying, oh, that means

42:32 it's like this, but what about one of these like abundancy grids or something like that?

42:38 One of the things that we noticed or that basically my colleagues noticed a couple of years ago

42:44 is that we can kind of stratify the Flemish population into four different groups based

42:50 on the bacteria that are in the gut.

42:52 What is the distinctions there, the categories?

42:54 Certain people just have like more of one kind of bacteria, right?

42:58 And that seems to be kind of like one group and they're not distinct groups.

43:01 It's more like a spectrum.

43:03 So we noticed there are these four different groups, right?

43:07 So one of the things that would be nice is if basically we could use some machine learning

43:12 to automatically classify new samples into these different enterotypes.

43:16 You could also start thinking of kind of trying to predict something based on the gut microbes.

43:22 But I think that's still some ways off.

43:24 It would be great if you could just look at the sample and say like, okay, this person is at

43:28 risk for a certain disease.

43:29 Maybe that will work.

43:31 Maybe not.

43:32 So that's still very vague right now, whether we'll be able to do that or not.

43:36 But people are probably looking at it.

43:38 Yeah, it sounds exciting.

43:39 It doesn't sound out of reach at all.

43:41 It really depends on like the data that we can generate.

43:44 And yeah, we'll need to explore that.

43:46 I guess it depends on how reliable the science is that says if it's like this, that means

43:52 that, right?

43:53 Because you've got to train the models to say, I don't know, maybe the models could make discoveries

43:58 that people don't see.

43:58 I don't know.

43:59 I haven't done enough machine learning, but there's probably some really interesting things

44:03 you could do with machine learning and that data.

44:05 For sure.

44:05 Yes.

44:06 So, but whether or not we'll be able to make any kind of predictions towards health, that's

44:11 something I'm very careful about.

44:12 Categorization one sounds likely.

44:14 That's something that will probably be possible based on what I've seen so far.

44:18 And as we're generating more and more data, it would be very nice to just do that initial

44:22 classification into the enterotype in this automatic way.

44:26 For sure.

44:26 So, well, it sounds like you're making some cool use of Python to study these microbiomes.

44:31 And they're just so fascinating because they're really out of reach for us in any way that

44:37 we can conceive about it other than a very small view with a microscope.

44:41 It's super interesting.

44:42 I think so too.

44:42 Hopefully that inspires the biologists out there to do more Python and things like that.

44:47 Super cool.

44:49 So thanks for sharing that with us.

44:51 I think probably we'll leave it there for that conversation.

44:53 But before you go, I do have the two questions for you.

44:57 So if you're working on this project, you can write some Python code.

45:01 What editor do you use?

45:02 So if I'm working on MetaConnect or any kind of like large project, that would be PyCharm.

45:06 If it's more like a very specific analysis, I'll fire up Jupyter Notebooks.

45:11 Where do you make that tradeoff?

45:12 Where do you say, okay, this is now it's too much for the Jupyter Notebook.

45:15 I'm switching to PyCharm.

45:16 Like when do you transition from one to the other?

45:19 So I have like a couple projects and they really are from the beginning decided to be

45:23 like an application I'm developing.

45:25 And for that, it's automatically something I will do in PyCharm.

45:28 If it's more like processing data, preparing it to do something else with it or something

45:33 I just have to do once, maybe twice, then I'll go for the notebook.

45:37 Also for reporting back something.

45:39 I think the notebooks are really powerful.

45:41 So you can just, you know, generate all the graphs, export it to PDF and say, here's the

45:45 thing that you asked.

45:46 Yeah, that's perfect.

45:47 Super.

45:47 And then notable PyPI package that makes your life better?

45:52 One that I discovered recently that really like saved my day is something called UltraJSON.

45:57 I think it's pip install ujson to get it.

46:00 And it's basically a very performant version of the JSON library in the standard library.

46:07 So I was developing an endpoint that had to like deserialize some JSON objects that were

46:12 stored in the database, combine them and serialize them again to something I could then send to

46:17 a viewer.

46:18 And that's just made everything a couple hundred milliseconds faster, which made the website

46:23 so much more responsive.

46:25 Oh, that's super cool.

46:26 Yeah.

46:26 UltraJSON is a JSON encoder and decoder written in pure Python.

46:31 I'm sorry, pure C with bindings for Python.

46:33 Yeah, that's great.

46:34 So you probably have a lot of data on those endpoints, right?

46:37 Like the 10,000 measurements or whatever.

46:40 Yeah, exactly.

46:41 So if like I need to kind of like deserialize one of those complex objects, do something with

46:46 it, combine it with like another one and serialize it again, very quickly takes seconds, which for

46:52 a web interface is very slow.

46:53 So everything that can just take like a couple hundred milliseconds off that time is really

46:59 great.

46:59 And is it the same API as regular JSON module?

47:02 For the most of it, yeah.

47:04 Pretty much everywhere in my code, I was able to just write import ujson as json and it worked.
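
[Editor's note: a minimal sketch of that drop-in pattern, with a fallback to the standard library so the code still runs where ujson is not installed. The helper merge_stored_json is hypothetical and only mirrors the endpoint described above (deserialize stored JSON objects, combine them, serialize again); it is not the guest's code, and any speed-up depends on the data involved.]

```python
try:
    import ujson as json  # optional C-accelerated drop-in, if installed
except ImportError:
    import json  # standard library fallback with the same loads/dumps calls


def merge_stored_json(blobs):
    """Deserialize JSON object strings, merge them, and serialize the result.

    Later objects overwrite earlier ones when keys collide.
    """
    merged = {}
    for blob in blobs:
        merged.update(json.loads(blob))
    return json.dumps(merged)
```

[Because only loads and dumps are used, the rest of the code does not need to know which library ended up behind the name json.]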

47:09 That's sweet.

47:10 All right.

47:10 Well, that's a great suggestion then because it sounds so easy to adopt and will be helpful

47:14 for a lot of folks.

47:15 Cool.

47:16 All right.

47:16 Well, final call to action.

47:18 People are out there interested in biology, or biologists are out there interested in Python.

47:22 When they want to get started, what do you say to them?

47:24 Right.

47:24 So definitely try some coding in Python.

47:28 Even if you're mostly into biology, just knowing a little bit of the basics can make your life

47:33 a lot easier.

47:33 Just being able to automate something for a couple times can prove very useful, right?

47:40 Yeah, absolutely.

47:40 You know, escape Excel, right?

47:42 Yes.

47:43 Or MATLAB.

47:43 One of the two.

47:44 Yeah.

47:44 Like we're collecting more and more data.

47:46 So I encounter it so often that people were using Excel and at one point it just hit that

47:52 wall and then, you know, Python, pandas, scikit-learn can help you break through that.

47:58 So have a look at it if you need it.

48:00 Cool.

48:00 All right.

48:00 Well, Sebastian, thanks for being on the show.

48:02 It's great to chat with you.

48:03 Yeah, it's great to chat with you as well, Michael.

48:04 Thanks.

48:06 This has been another episode of Talk Python to Me.

48:08 Our guest in this episode was Sebastian Proust, and it's been brought to you by Linode and Tidelift.

48:14 Linode is your go-to hosting for whatever you're building with Python.

48:18 Get four months free at talkpython.fm/linode.

48:21 That's L-I-N-O-D-E.

48:23 If you run an open source project, Tidelift wants to help you get paid for keeping it going

48:28 strong.

48:29 Just visit talkpython.fm/Tidelift, search for your package, and get started today.

48:34 Want to level up your Python?

48:37 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

48:42 Or if you're looking for something more advanced, check out our new async course that digs into

48:47 all the different types of async programming you can do in Python.

48:50 And of course, if you're interested in more than one of these, be sure to check out our

48:54 Everything Bundle.

48:54 It's like a subscription that never expires.

48:56 Be sure to subscribe to the show.

48:58 Open your favorite podcatcher and search for Python.

49:01 We should be right at the top.

49:02 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the

49:08 direct RSS feed at /rss on talkpython.fm.

49:11 This is your host, Michael Kennedy.

49:13 Thanks so much for listening.

49:15 I really appreciate it.

49:16 Now get out there and write some Python code.

49:17 I'll see you next time.

49:38 Thank you.
