#237: A gut feeling about Python Transcript

Recorded on Friday, Sep 27, 2019.

00:00 Michael Kennedy: Let's start this episode with a philosophical question. Are you human? Are you sure? We could begin to answer that question physically. Are you made up of cells that would typically be considered as belonging to the human body? It turns out the answer is, uh maybe. We have many ecosystems within us. Understanding them is essential to our own wellbeing. In this episode, you'll meet Sebastian Proost, who is using Python to study the bacteria in our world. This is Talk Python To Me, Episode 237, recorded September 27, 2019. Welcome to Talk Python To Me, a weekly podcast on Python. The language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @MKennedy. Keep up with the show and listen to past episodes at talkpython.fm. And follow the show on Twitter, via @talkpython. This episode is brought to you by Linode and Tidelift. Please check out what they're offering during their segments. It really helps support the show. Hey folks, I want to tell you about a new course that we just launched, over at Talk Python Training. This course lets you create web apps with pure Python. Now you may be thinking, sure I can use Flask and create web apps with pure Python now. No, I mean on the browser, as well as on the back end. This course is about the Anvil platform and the course is called Anvil Web Apps With Nothing But Python. And what's really cool is, yes you have a back end, you can write Python code there, and that's cool. They manage the database, but also, they've set it up so you can write Python code and use a visual designer with events on the front end, to create a Single Page App or a SPA, and write it entirely in Python. So if all the different technologies involved in building web apps are overwhelming to you but you still need to build a web app, check out this course. It's a five hour course and it's even free. 
Just visit talkpython.fm/anvil-course and it'll take you right there. Log in, click "Join this course", and you got it. Now let's get to that interview. Sebastian, welcome to Talk Python To Me.

02:14 Sebastian Proost: Hi Michael, thanks for having me.

02:16 Michael Kennedy: Hey, it's great to have you here. It's an honor. We're going to have so much fun talking about some very small things that have maybe a big impact on our world.

02:25 Sebastian Proost: Exactly, yeah.

02:27 Michael Kennedy: So we're going to talk about studying bacteria and communities of bacteria using Python, which is going to be fascinating. But before we get to that, let's start with your story. How'd you get into programming and Python?

02:37 Sebastian Proost: So I actually got into programming about 20 something years ago. I was 12 and my favorite game was Quake at the time. And apart from the actual game, I also had a disc with all kinds of modifications. And on there was the source code for the game logic, which you need to make your own modification, and the QuakeC compiler.

02:59 Michael Kennedy: How nice!

03:00 Sebastian Proost: Now the cool thing with QuakeC is that you can actually decompile the code from other modifications to readable code. So I was just decompiling stuff I liked, trying to figure out how it worked and then trying to basically come up with my own stuff. And I never really succeeded in making like the definitive version of Quake, but I did get a pretty good understanding of what's a function, what's a for loop, what's an if statement. So I really learned programming through that medium.

03:30 Michael Kennedy: I used to play Quake and Doom, and those were fun games. And I remember the mods! I think it was not Quake but Doom, which was by id Software, it was the predecessor, right?

03:44 Sebastian Proost: Yeah.

03:45 Michael Kennedy: And you could change things up. I remember one we had was a Barney mod, do you know?

03:50 Sebastian Proost: The dinosaur, right?

03:52 Michael Kennedy: The weird purple dinosaur, yeah. It would replace all the characters with this dinosaur thing and different dinosaurs. And they made weird noises and it was amazing. There was another one where it had these super scary creatures, and a friend of mine was playing it, and it was dark, and I like spooked him from behind right as this creature from this mod came out. And he got so scared, he ripped the keyboard drawer off of my desk. I thought, "Oh my gosh, I should..." He was just totally frightened, it was so funny. I really enjoyed those mods. It was great, that was a really cool way to learn programming.

04:25 Sebastian Proost: Yeah, I had like a whole bunch of them. And some were like tiny. They were just like secondary fire for one of the guns. And I was just looking for a way to combine them because you could only load one at a time and I wanted to have like multiple running at the same time. And so after I played around with QuakeC for a while, I did pick up some proper programming languages through high school. But then ultimately, I went on and studied biology. I always was a bit in doubt whether it was the right choice or not.

04:52 Michael Kennedy: Sure.

04:54 Sebastian Proost: At the end of my biology degree, I noticed that there's something like bioinformatics. So I'm like, "Oh, that's great! Allows me to kind of combine the best of both worlds." And I started a PhD in bioinformatics. Now at the time, we were using a combination of Perl with some Java applets for our web stuff. There was also PHP in there, so it was kind of like...

05:15 Michael Kennedy: Early days.

05:16 Sebastian Proost: Yeah, just kind of like a whole bunch of things glued together. So a couple years later, I got the opportunity to start like a new platform. I thought okay, I want to stick with one language for the back end, one for the front end, and that's how I basically discovered Python and Flask.

05:32 Michael Kennedy: Okay.

05:34 Sebastian Proost: And we haven't looked back ever since. It's just such a convenient way to develop things.

05:38 Michael Kennedy: Yeah. I got to think that the relevance of Python to biology is even more significant now than it was back then, with all the data science developments and the other extra packages getting developed and all.

05:50 Sebastian Proost: Sure. So in the beginning, bioinformatics was really fueled by Perl and just over the course of a couple years, this has been completely taken over by Python.

06:00 Michael Kennedy: Yeah, I can imagine. What kind of data files and stuff are you working with? I imagine like a lot of genetic sequence character strings or like what kind of stuff are you working with when you're doing bioinformatics?

06:15 Sebastian Proost: Right, so this really is dependent on what field you're in. In my case, it's mostly sequence information. So that can be either raw sequences that come from a machine. So those are very, very short reads, about 20 to a couple hundred nucleotides, and quality scores for each of them, like how reliably they have been sequenced. Or it can also be like a full genome sequence, which is just a string of several millions of characters.

06:41 Michael Kennedy: Right, a lot of A, C, T, and G in there?

06:42 Sebastian Proost: But bioinformatics is a very, very broad field. So there are people that are doing completely different things with other types of data. If you look at metabolite data, which are like the chemical compounds that appear in a certain sample, that's a very different type of data entirely.

06:58 Michael Kennedy: 'Kay, it's super in... It sounds like a fun place to be working, for sure. Biology's super cool. Now let's talk a little bit just, at large, about communities of bacteria. Because as humans, a lot of people are kind of creeped out by bacteria, the thought of like, "There's something on my finger." Or worse, "If I go to like a school and I touch the handle to the door, what other stuff came off of other people, sniffly kids or whatever? Now I have this bacteria on me." And probably it's actually viruses you care about, but whatever, right? People are freaked out about it. But it's a much more common and actually important part of life for all creatures, including humans, right? Like, there's a really interesting TED Talk, which I'll link to, from your boss. And he starts the TED Talk with something like, "So you think you're human, huh?"

07:52 Sebastian Proost: Yes. So I might have to creep those people who are afraid of bacteria out a little bit more. Because basically for every cell that makes up your body, there is a bacteria somewhere on or in your body. And if you look at your genetic information, the number of genes that you have compared to the total number of genes those bacteria have, that's like almost 100 fold higher for bacteria.

08:16 Michael Kennedy: Wow. So there's 100 times more genetic information of bacteria on us and we're about 50/50 in, maybe not in mass, but in terms of number of cells.

08:27 Sebastian Proost: Exactly.

08:27 Michael Kennedy: Just half bacteria, half human cells.

08:30 Sebastian Proost: Yeah, exactly.

Michael Kennedy: That's going to take a moment to sink in. Now, I kind of knew that, but there's another stat that you threw out there. I think this is an interesting topic, partly because these numbers are so big and the things are so small, that it's hard to have a concept around it. It's hard to conceptualize it. So you said that if you get, say a gram of soil. A gram I mean, that's like tiny right? Like a little sugar cube sized bit of soil that has up to a billion bacteria in it. Just that little square, right?

Sebastian Proost: Yes, so it's like a very, very minute amount of material, and there's a ton of bacteria in there. And if you look at the complexity or the number of different species in that sample, that's stellar. Like in that single gram of soil, you could easily have like 10,000 different kinds of bacteria. So it's a very, very complex environment that interacts really with the soil, the roots of plants, right? On our body there are so many different bacteria on our skin but also on the inside, in our gut. And they actually affect us.

09:38 Michael Kennedy: And they could be bad, but a lot of times it could be good, right? Like the bacteria in your gut has a lot to do with your health. You hear about people taking antibiotics and it killing out some of that stuff in their stomach and they're not as healthy necessarily, because of that, right? So yeah, it's crazy.

09:55 Sebastian Proost: We actually tend to think of bacteria as something that causes disease, that causes inflammation and infections, right? But a lot of bacteria can do exactly the opposite thing. A lot of the bacteria that are living kind of like as our symbionts really, they're essential for our health. So they can really improve our quality of life.

10:14 Michael Kennedy: Right, maybe they help you with digestion, like certain types of proteins or things can be broken down because they're there, right?

10:21 Sebastian Proost: Yeah, I can imagine. So I'm mostly on the computation part, so I'm not like a microbiologist. So that's a bit farther off of what I'm doing.

10:30 Michael Kennedy: Yeah, yeah sure.

10:31 Sebastian Proost: But yeah, so there are tons of bacteria that are actually beneficial for us as well. So we really need them as much as they need us.

10:37 Michael Kennedy: Yeah, I think at least in the U.S., there's been a trend against things like antibacterial soap and stuff like that, that tries to completely kill off bacteria. I think as much as it might creep people out, I think there's a realization that bacteria are important.

10:49 Sebastian Proost: Yeah, I think that's true, right? Like if you take antibiotics, you're killing off the bad guys. But if it's a broad spectrum antibiotic, you're also killing some of the healthy bacteria, the ones that you do need. And that actually has a long-lasting effect on your microbiome. It takes a while to recover from that. So people are really looking into better ways to treat these infections.

11:12 Michael Kennedy: Yeah, for sure. So we had this quote or this number that there's a billion bacteria in a gram of soil and there's 10,000 different kinds of bacteria. So how do you come up with that? Do you like train grad students to recognize bacteria and they take notes and they sit around counting? How do you come up with that number? You can't be counting it, right?

11:34 Sebastian Proost: Yeah, usually when I tell people that we look at the amount of bacteria, which bacteria are in like different samples, they think of me hunched over a microscope and just counting what I see. It doesn't quite work like that. So all bacteria, in their DNA, have like a small part, and it's kind of similar enough that we can pick it up and copy it from all those different species of bacteria, but it's different enough that we can actually identify them to some extent. So there are like tiny differences in that sequence and that allows us to kind of see like, okay, this read is different from that one. So what we do is we basically take the DNA, copy over that one part a bunch of times, and then sequence all those copies. And that allows us to kind of look in databases, like which part matches which bacteria. And if we get more reads from one specific bacterium, then we know, okay there's probably more of them in the sample than from another one, where it's fewer reads.

12:31 Michael Kennedy: Right, you just basically count the number of matches of that species DNA and you say well that must be roughly representative of its ratio in the general population.

12:41 Sebastian Proost: Yup, exactly right. But this is kind of like a finger print if you want. So it tells us that certain bacteria with that sequence is in the sample but it doesn't tell us all that much about the sequence or about the bacteria that we're looking at. There's also like a more modern version if you want, where we really try to go in and sequence everything, all the genetic information in the sample, and it gives us much more detail about what those bacteria can do, as we get full genome information for at least some of them. But for now, our research is mostly based on the fingerprint sequence.
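
The read-counting idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the taxon labels are made-up examples, the reads are assumed to have already been matched against a reference database, and the function name `relative_abundance` is hypothetical rather than anything from the guest's pipeline.

```python
from collections import Counter


def relative_abundance(read_assignments):
    """Turn a list of per-read taxon labels into relative abundances."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    # Each taxon's share of all assigned reads approximates its share of the sample.
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical reads, each already assigned to the taxon its fingerprint matched.
reads = ["Bacteroides", "Bacteroides", "Prevotella", "Bacteroides", "Faecalibacterium"]
abundances = relative_abundance(reads)

for taxon, fraction in sorted(abundances.items(), key=lambda kv: -kv[1]):
    print(f"{taxon}: {fraction:.2f}")
```

More reads for a taxon simply translate into a larger fraction, which is the "roughly representative of its ratio" reasoning from the conversation.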

13:15 Michael Kennedy: Sure, so something I've always wondered about. Just sequencing DNA in general is, you've got so many cells mixed in there, and especially in this case, you've got different species mixed in there. Like, how do you know that this part is all one spe... How do you separate that? How does it not just look like a giant scramble of... If everyone in a room took their name and put 'em into letters and just threw that into a bucket, like how do you even know which goes back to which, right? How does that happen?

13:44 Sebastian Proost: In the case of amplicon sequencing, it's kind of like built into the design, it's designed to deal with that kind of information. If you really take like everything, then you need to sequence pretty deep. Then there are like some overlapping parts in there, and that allows you to assemble different genomes. But there are like other techniques that people use as well. Looking at the number of times you see a certain copy, 'cause that means that it needs to be from a bacteria that's equally abundant, give or take, right? So that allows people to link certain parts together. But it's not something I've actively done myself yet so I might actually be giving wrong information here. So be careful.

14:26 Michael Kennedy: Yeah, no worries. It just seems super interesting to me that we're able to do that with such certainty and accuracy, given like how messy that it seems, you know? So there's a lot about science that's pretty amazing.

14:37 Sebastian Proost: Yes, absolutely.

14:39 Michael Kennedy: For sure. Let's go with the fingerprint side of things that you've been working on most recently. You've got some soil or some other area that you've gotten this bacteria from, you've done the sequencing, have these ratios of different fingerprints. This is probably where you start using some Python to answer some questions, right?

14:57 Sebastian Proost: Exactly right. So at that point, let's say you have about 1,000 different samples, and you get the relative abundance of 10,000, 20,000 different bacteria across all those samples, then you need to start using Python or some other framework to make sense out of it. So normally, you should collect metadata for all your samples. So that describes how a sample was collected, right? So for soil you might want to get the parameters of that soil. Like pH, nutrients that are in there, the place where you sampled it, how much water is in there, right? If it's from a human, you might want to ask them some medical questions. You might want to get some information about their age, their body mass index, stuff like that, right? And you need to really pool all those things together, get some new insights in biology, microbiology. And to do that, Python is a great platform.

15:53 Michael Kennedy: I'm sure it is. That's awesome. So are there like packages or libraries specifically built to address those types of questions?

16:00 Sebastian Proost: There are some packages specifically for microbiology. I think QIIME is one, it's called. Also pandas and NumPy, like your traditional data science packages, work really well for these kinds of questions. So yeah, if I need to do some very specific analysis, I like to fire up Jupyter Notebook and just load everything into a DataFrame and start working with that. The problem is that the farther you are in your analysis, the more and more in-depth knowledge you need to have about microbiology or the specific environment you're looking at. And very often, the biologists that have that knowledge, they don't have the computer skills to kind of make sense of the ginormous matrix of abundances combined with metadata. And so what I try to do is actually develop a platform where we can upload those matrices with all the abundances and then there's an interface hooked up to it, that allows people without bioinformatics skills to also make sense of the data.
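
As a rough illustration of combining an abundance matrix with sample metadata in pandas, here is a minimal sketch. The sample IDs, taxa, and metadata columns are all invented for the example, and the grouping step is just one plausible first question to ask of such data, not the guest's actual analysis.

```python
import pandas as pd

# Hypothetical abundance matrix: rows are samples, columns are taxa.
abundance = pd.DataFrame(
    {"Bacteroides": [0.50, 0.20, 0.40], "Prevotella": [0.10, 0.60, 0.20]},
    index=["s1", "s2", "s3"],
)

# Hypothetical metadata collected for the same samples.
metadata = pd.DataFrame(
    {"diet": ["omnivore", "vegan", "omnivore"], "age": [34, 29, 51]},
    index=["s1", "s2", "s3"],
)

# Align the two tables on the sample ID index.
combined = abundance.join(metadata)

# Average abundance of each taxon within each metadata group.
mean_by_diet = combined.groupby("diet")[["Bacteroides", "Prevotella"]].mean()
print(mean_by_diet)
```

Keeping the sample ID as the index of both tables means the join cannot silently pair an abundance row with the wrong metadata row.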

17:01 Michael Kennedy: How do they do it? Is that visually, with like graphs and pictures or like heat maps? Or is that like CSVs and they drop down into Excel? Or what do you end up giving them?

17:12 Sebastian Proost: So it's kind of like a combination of all of those things, right? So on the front page, there will be a search button, very similar to what you will find on Google, right? So where they can just type the name of their favorite bacteria, and just see how abundant it is across the different samples. And they can group samples based on different things in the metadata. There are also different tools that allow you to visualize bacteria that are very often found together in samples, as a network. So that's interactive, and they can drag and drop things, color-code the nodes or the edges, based on the properties that they want.
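
One simple way to turn "bacteria that are very often found together" into a network is to score each pair of taxa by how much their sets of samples overlap. The sketch below uses the Jaccard index for that score; this is an assumption chosen for illustration, since the episode does not say which measure the platform uses, and the function name and threshold are hypothetical.

```python
from itertools import combinations


def cooccurrence_edges(samples, min_jaccard=0.5):
    """Return (taxon_a, taxon_b, jaccard) for taxa that often share samples.

    `samples` maps a sample ID to the set of taxa detected in that sample.
    """
    # Invert the mapping: for each taxon, the set of samples containing it.
    present_in = {}
    for sample_id, taxa in samples.items():
        for taxon in taxa:
            present_in.setdefault(taxon, set()).add(sample_id)

    edges = []
    for a, b in combinations(sorted(present_in), 2):
        shared = present_in[a] & present_in[b]
        either = present_in[a] | present_in[b]
        jaccard = len(shared) / len(either)
        if jaccard >= min_jaccard:
            edges.append((a, b, jaccard))
    return edges


# Made-up presence data: taxa A and B always appear together, C rarely joins them.
samples = {
    "s1": {"A", "B"},
    "s2": {"A", "B", "C"},
    "s3": {"C"},
    "s4": {"A", "B"},
}
edges = cooccurrence_edges(samples)
print(edges)
```

The resulting edge list is the kind of structure a JavaScript front end could render as nodes and edges.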

17:46 Michael Kennedy: Okay, that sounds fun.

17:48 Sebastian Proost: And this is all a combination of basically Flask and Python on the back end, and then some JavaScript on the front end. So it's all online, so that they can basically use their browser as an interface to the data.

18:01 Michael Kennedy: This portion of Talk Python To Me is brought to you by Linode. Are you looking for hosting that's fast, simple, and incredibly affordable? Well look past that bookstore and check out Linode at talkpython.fm/linode. That's L-I-N-O-D-E. Plans start at just $5 a month, for a dedicated server with a gig of RAM. They have 10 data centers across the globe, so no matter where you are or where your users are, there's a data center for you. Whether you want to run a Python web app, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24/7 friendly support, even on holidays, and a seven day, money-back guarantee. Need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations, and more. Do you want a dedicated server for free for the next four months? Just visit talkpython.fm/linode. So is this the thing called MetaConnect that you built?

18:57 Sebastian Proost: Exactly, yes. So it's not published, so it's not out there for people to look at. But I hope to get that done soon.

19:04 Michael Kennedy: Nice, so this is the website that you just described, written in Python and Flask. Is it tied to your university work or a company you work with? Or is this something that's open-source and that's going to be coming out? What's its status?

19:17 Sebastian Proost: Initially, I created the website as a way to communicate results back to one of the companies for whom I'm analyzing data. So I quickly realized like, okay I can process all the data, but I'm not an expert in the field that they're active in. So I'm going to need some way to get the data to them in a way that's useful for them, right?

19:37 Michael Kennedy: Yeah.

19:38 Sebastian Proost: So that's how it started. But we have lots and lots of different projects. Also academic ones in the group, and one of them is the Flemish Gut Flora Project, where we're trying to get an impression of the bacteria that are resident in the gut of humans, mostly within the Flemish population. So Flanders is like the north part of Belgium.

20:00 Michael Kennedy: Yeah, so would you say that that is the most important or the most influential bacterial community that humans interact with or that affects humans? I mean, there's bacteria on our skin, it's in our hair, all over, but our stomach probably has a bigger effect than others?

20:21 Sebastian Proost: Well there are quite a lot of bacteria in there, right? So if you take a stool sample, there's somewhere between a billion and a trillion bacteria in there. So even more than soil. And there are already some associations that we know of between those bacteria and our wellbeing. So a colleague of mine recently published an article where she could demonstrate that there's a link between some of the bacteria in your gut, and your mental wellbeing. So we could see that some bacteria were less frequent in people that are actually struggling with mental problems. And there are associations with all kinds of diseases. Think about like irritable bowel syndrome and stuff like that.

20:59 Michael Kennedy: Wow, so this Flemish Gut Flora Project, it's just collecting all of this data and then do you guys use a similar analysis project and similar tools to study it, or is it different?

21:11 Sebastian Proost: A lot of the methods between studying something like soil samples and human samples, are similar. So basically you identify what's in there through sequencing and then once you have sequences, the pipelines we use are very, very similar. So there's not really a big difference between the methods that we use.

21:30 Michael Kennedy: Just the microbiology and the conclusions and all that, totally different possibly, but not the way you get the frequency diagram or whatever it is?

21:39 Sebastian Proost: The biggest differences are in how the things are sampled and how the samples are processed, right? So once it's in sequences, then it kind of all becomes bytes, and then we start using some very similar workflows for different kinds of samples, yeah.

21:53 Michael Kennedy: Yeah, what's that workflow look like? It sounds like if you have 10,000 samples, it's a lot of work. It sounds like that needs some kind of automation?

22:01 Sebastian Proost: To some extent, yes, right? But of course the work in the lab that precedes that analysis, that's something that has to happen to some extent manually.

22:09 Michael Kennedy: That's just a lot of hard work, right?

22:11 Sebastian Proost: Exactly right. So there's like a whole logistics part before what I actually do, that's just a ton of work and fortunately we have some very talented lab technicians that take care of this process.

22:23 Michael Kennedy: Yeah, for sure. So another thing around this Gut Flora Project that you were talking about, is you have to do an online questionnaire. Because it's one thing to just go collect the samples from thousands of people, and the people who eat meat more often, have this outcome. Or people who are vegan and don't have any dairy, they have that other outcome, right? So you actually had to set up a way to collect all that information as well, right?

22:50 Sebastian Proost: Yes, so we tried to recruit volunteers to participate in this project, through different channels. And the first thing they need to do is to kind of like fill in a registration form, where they say like, okay I understand what's going on. There's a privacy statement in there. Also they need to give informed consent that they know what they are volunteering for. And that they can, at any point, quit the study. And then there's an intake questionnaire where people give us some information about their health, their lifestyle, and their diet. Now this is done with commercial software, but it doesn't quite get us all the way there, right? So it does about 90% of the things we want it to do, but then there are a few very specific things where we want it to do something else. For instance, we send sampling kits to all the participants. So we want to know that the address that they entered in the questionnaire, actually exists. So when they're filling in their address, it will quickly use a Python web service that I created to check this against the bpost, or the Belgian Post Office, website, to make sure that the address exists. And in case they might have made a typo, we will tell them, "Oh, there's something wrong. Are you sure this is your address?" Maybe like there's something off with it, maybe you made a typo somewhere, can you correct it? Also, as part of this process, people need to schedule a visit with their general practitioner. So to make it very easy for our participants, we made another web service that can kind of suggest the details for a general practitioner, based on what they entered. So it's a little bit like a type-ahead form, once they start typing the last name of their doctor, it will immediately suggest like okay, is it this doctor that you mean? Here's his address. And if they click it, it automatically completes the form.
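
The doctor suggestion behaviour described above is essentially a prefix lookup. A minimal sketch, assuming the doctor records fit in memory and are searched by last name only; the class name `DoctorIndex`, the records, and the addresses are all hypothetical, and the real service presumably sits behind a web endpoint that this example leaves out.

```python
from bisect import bisect_left


class DoctorIndex:
    """Prefix lookup over doctor records sorted by lowercase last name."""

    def __init__(self, doctors):
        # Each record is a (last_name, address) tuple; both are made-up examples.
        self._records = sorted(doctors, key=lambda d: d[0].lower())
        self._keys = [d[0].lower() for d in self._records]

    def suggest(self, typed, limit=5):
        """Return up to `limit` records whose last name starts with `typed`."""
        prefix = typed.strip().lower()
        if not prefix:
            return []
        # Binary search for the first key that could start with the prefix.
        start = bisect_left(self._keys, prefix)
        matches = []
        for key, record in zip(self._keys[start:], self._records[start:]):
            if not key.startswith(prefix) or len(matches) == limit:
                break
            matches.append(record)
        return matches


index = DoctorIndex([
    ("Janssens", "Kerkstraat 1, 3000 Leuven"),
    ("Jacobs", "Stationsstraat 12, 9000 Gent"),
    ("Peeters", "Markt 5, 2000 Antwerpen"),
])
suggestions = index.suggest("Ja")
print(suggestions)
```

Because the keys are sorted, all names sharing a prefix sit next to each other, so the loop can stop at the first non-matching name.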

24:36 Michael Kennedy: You know, I wish more sites were like that. Right? It would be so nice to be so friendly and say, "Yeah, you've given us enough, just click here. We got ya."

24:47 Sebastian Proost: It's actually extremely important in this type of research. Because we're asking these people to basically give up some of their free time to help us out with our research. So we really are very grateful that people actually do this. Also they need to take a stool sample, freeze it, and then bring it to a drop-off point, which is usually a local pharmacy. But yet again, that's actually quite a lot of effort that you're asking of someone. So at least the parts that we can make easy for participants we try to make as easy as possible, yes.

25:16 Michael Kennedy: Yeah, yeah, this is an easy one. But definitely there's so many places that could adopt that philosophy and just don't. One of the things that's interesting about this that I think's worth pointing out, is you said this is a pre-packaged commercial questionnaire site/company and they take care of it, but they let you customize the page by putting in a little bit of JavaScript, right?

25:39 Sebastian Proost: Essentially, when they've filled in their address and they click the "next" button, that's the point where we can fire a little bit of JavaScript. So mostly using jQuery to call the web service with the data they entered, which then sends a response back, "Yes this is valid", or "No this is not valid". And then we have the website respond appropriately, depending on that.
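
The server side of that valid / not valid exchange could look roughly like the function below. This is a sketch under loud assumptions: it does not call the real post office service, the `KNOWN_ADDRESSES` list is a stand-in for that lookup, and the function names and the 0.8 similarity cutoff are hypothetical. In practice the returned dictionary would be serialized to JSON by a web route, which is omitted here to keep the example self-contained.

```python
from difflib import get_close_matches

# Stand-in for the postal lookup; these addresses are invented for the example.
KNOWN_ADDRESSES = [
    "kerkstraat 1, 3000 leuven",
    "stationsstraat 12, 9000 gent",
]


def normalize(address):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(address.lower().split())


def validate_address(address, known=KNOWN_ADDRESSES):
    """Return a JSON-ready dict: a valid flag plus a suggestion for likely typos."""
    candidate = normalize(address)
    if candidate in known:
        return {"valid": True, "suggestion": None}
    # Offer the closest known address so the participant can correct a typo.
    close = get_close_matches(candidate, known, n=1, cutoff=0.8)
    return {"valid": False, "suggestion": close[0] if close else None}


print(validate_address("Kerkstraat 1,  3000 Leuven"))
print(validate_address("Kerkstrat 1, 3000 Leuven"))
```

Returning a suggestion alongside the flag lets the front end ask "is this what you meant?" instead of only rejecting the input.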

26:01 Michael Kennedy: I think that it's a really interesting way to bring in the capabilities that you have, using Python and Flask, to something that is not really extensible, but it has this little tiny hook, where it lets you do just enough to get in to the space and it lets you not have to say, "Well, we either can use this the way it is, or we can, from scratch, build our own questionnaire site. But we could just lay on top of it."

26:26 Sebastian Proost: Definitely feels a little bit like cheating, right? But yeah, especially because we are working with medical data, it's so important to get everything right in terms of security. You have to comply with GDPR and some other regulations as well. So building our own platform would just be like a tremendous amount of work. So that's why we opted to go with this commercial platform. But indeed, we do need to use some tricks here and there to actually get it to do everything we want it to do.

26:53 Michael Kennedy: It makes sense. Because your goal is not to build a questionnaire site. Your goal is to get samples, information about bacteria colonies, right? If you could avoid building that site as much as you can, just avoid it, right?

27:06 Sebastian Proost: Absolutely, yes.

27:07 Michael Kennedy: But the other one, the first one we talked about, the MetaConnect that the microbiologists can use to study and get the data back in the way they want it, that one sounds more core to your goals, right? So that one, you said you built that one from scratch with Flask, huh?

27:24 Sebastian Proost: That's actually something that has been going on for some time. So previously, it was used to study gene expression in different types of tissues of plants. But if you think of gene expression, so that's basically which genes are on or off and how much are they expressed in different tissues. It's also an abundance matrix and so I kind of realized fairly early on, when I started working on the microbial data, that it's very very similar. And a lot of the techniques and tools are the same. So I just forked the repository and started like making the necessary changes to actually support microbiome data as well.

28:01 Michael Kennedy: Yeah, sure. Get in there and change the title. And just go from there. Nice, yeah, tell us about it. Tell us about some of the technologies that you used and stuff like that.

28:09 Sebastian Proost: Biological data is actually very well suited for relational databases. So usually you have like a species or a data set and there are sequences related with that. And with a sequence, you can have functional information, you can group sequences into clusters or gene families, so you kind of have this like structure that fits a theoretical or relational database very well. So there's definitely some MySQL connection there, and for that, I'm using all like the Flask packages. So just Flask-SQLAlchemy is in there. When you're uploading the data through like an admin panel, so that's Flask-Admin that I used. Users don't actually get to see that part so it's a very standard interface. So through that interface, you can upload the data. So basically the biostatistician processed all the reads that you got and you have your abundance matrix and you upload it. So then there are some standard analyses that are just triggered on the server and just run as soon as you upload the data, it's being...

29:10 Michael Kennedy: So, sort of like data ingestion. It loads it up, converts it into the database format, and things like that?

29:16 Sebastian Proost: Exactly right. And so it also does a couple of analyses immediately, as you upload the data. And so for that part, I'm just using like pandas, NumPy, and scikit-learn to do that kind of stuff.
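
The relational structure mentioned in this part of the conversation (species, sequences linked to them, sequences grouped into gene families) can be sketched as three tables. The episode refers to MySQL with the Flask SQLAlchemy integration; the example below deliberately swaps in the standard library's `sqlite3` module with an in-memory database so that it runs without a server. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE gene_families (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE sequences (
    id INTEGER PRIMARY KEY,
    species_id INTEGER NOT NULL REFERENCES species(id),
    gene_family_id INTEGER REFERENCES gene_families(id),
    name TEXT NOT NULL,
    annotation TEXT
);
""")

# Made-up rows: one species, one gene family, two sequences belonging to both.
conn.execute("INSERT INTO species (id, name) VALUES (1, 'Example species')")
conn.execute("INSERT INTO gene_families (id, name) VALUES (1, 'family_001')")
conn.executemany(
    "INSERT INTO sequences (species_id, gene_family_id, name, annotation) "
    "VALUES (?, ?, ?, ?)",
    [(1, 1, "seq_a", "hypothetical function A"), (1, 1, "seq_b", None)],
)

# How many sequences does each gene family contain, per species?
rows = conn.execute("""
    SELECT species.name, gene_families.name, COUNT(*)
    FROM sequences
    JOIN species ON species.id = sequences.species_id
    JOIN gene_families ON gene_families.id = sequences.gene_family_id
    GROUP BY species.name, gene_families.name
""").fetchall()
print(rows)
```

The foreign keys carry the "sequence belongs to a species" and "sequence belongs to a family" relationships, which is why this kind of data maps cleanly onto tables.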

29:26 Michael Kennedy: Uh-huh.

29:28 Sebastian Proost: So just a whole lot of Flask going on there on the back end.

29:30 Michael Kennedy: And it's cool that you're using Flask-Admin. And of course if people don't have to, if they don't see it, if just the admins see it, it could just you know, admin panels are always like the worst lookin' part of the UI.

29:40 Sebastian Proost: Exactly right. So the parts that actually users get to see, that's something where I do spend a bit more time on how it actually looks.

29:47 Michael Kennedy: Nice. So this thing, the goal's going to be to open source it. Do you think that it makes sense, I guess probably it doesn't make sense to run it as like a service that people can generally use, or does it? I'm thinking because of the privacy of it. But maybe if it's for things like plant data and stuff, like the plants don't have a GDPR.

30:06 Sebastian Proost: Yeah, that's true.

30:07 Michael Kennedy: Is it something that can be generally useful as a site, or just as a project people could clone for their project?

30:12 Sebastian Proost: So I kind of see a couple of different use cases for this. So one is communication, right? Let's say you did an awesome study with some microbiome data that you can make public, you could actually set up your own platform, upload all your data in there, and use this kind of like a supplementary web page to your publication. So it's a great way for communicating things to other people that might not have the bioinformatics expertise to do that kind of analysis themselves. I also personally use it a lot to just locally run it, upload the data set that I'm analyzing, and I just get like all those tools at the click of a button rather than using different scripts and making all those one-off scripts to do this kind of analysis.

30:51 Michael Kennedy: It's almost like a GUI on top of the libraries that you guys are using to study the bacteria.

30:58 Sebastian Proost: It really simplifies analyzing the data. There're also now some students that are starting to use it in our group. So they have their project, they have their samples, they went through the learning curve to learn how to process the reads and get that abundance matrix. It's mostly done using R scripts. So at that point, it's kind of like nice if you can give them something that's nice and easy to finish up the rest of the analysis. And so that's also a use case we have.

31:23 Michael Kennedy: Yeah, you talked about having students maybe working on some new projects that are kicking off at the university there. For students out there listening, what kind of skills, you know, maybe they studied biology or maybe they just studied data science in some way. What kind of skills do you look for, for people to work on your projects?

31:41 Sebastian Proost: If it would be to extend something like MetaConnect, it would be nice to have someone that has some coding skills already, right? So this semester there will be one student and he'll definitely be working a little bit on the code of MetaConnect, so then I'm looking for someone that has some coding skills. But also having some understanding of the biology is necessary, right? You don't get that far if you don't understand any of the biology at all. So it's really like this combination and bioinformatics is such an interesting and broad field, that the amount you need of bioinformatics or the biology can differ, based on different projects. So, projects that I would supervise, they would generally not include any work in a lab, but there are also people who are doing kind of like a combination of some lab work with some bioinformatics. That's also perfectly possible.

32:31 Michael Kennedy: Yeah, I'm sure it sounds fun and students are always looking for research projects to be part of, so that they can make those connections at universities, but it's interesting just to think about biology students trying to get into the software side, kind of like you did with bioinformatics. Or the reverse, right? Software people who are kind of interested in biology. And you know, whether they can cross over for that.

32:54 Sebastian Proost: Actually I know people that it's both ways, right? So there are plenty of bioinformaticians that have a background in biology and in the first couple years, really struggled with the scripting, picking up those kinds of skills. And I know people that have like a traditional CS degree and that way got into bioinformatics. So both are possible. Actually it was, a couple years ago, necessary to pick one path or the other, 'cause there was no formal Master in Bioinformatics at the university at the time. So now we actually have that kind of course at universities, so you can also decide to just do a bioinformatics master. And then you kind of have the biology and the computer science in equal amounts, give or take.

33:32 Michael Kennedy: Yeah, that's great. One of the libraries that you said you were using on the MetaConnect, is something called Cytoscape.js.

33:41 Sebastian Proost: Well that's actually a JavaScript library.

33:43 Michael Kennedy: Yeah, yeah.

33:44 Sebastian Proost: So that's basically a library that allows you to visualize networks in the browser. And so I briefly mentioned it before, you can represent these communities of bacteria as networks. So think of it as a social network of bacteria, right? So bacteria that tend to appear across different samples, they're co-abundant and could be that they're somehow related to each other. They're either dependent on each other or maybe they're dependent on the same other resource. But the thing is, you spot them very often in the same samples. And so it allows you to kind of like represent the bacteria as nodes and draw an edge between them. Allows us to kind of study the communities of bacteria because we do expect that bacteria that are often in the same samples, somehow need each other to do what they need to.
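
The idea of bacteria as nodes with an edge between co-abundant pairs can be sketched as below. The details are assumptions made for illustration: the taxon names and counts are made up, plain Pearson correlation with a 0.8 cutoff stands in for whatever co-abundance measure MetaConnect actually uses, and the helper names are hypothetical. The output follows the nodes/edges element structure that Cytoscape.js accepts.

```python
import json
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def co_abundance_elements(abundance, threshold=0.8):
    """Build Cytoscape.js-style elements from {taxon: [abundance per sample]}.

    An edge is drawn when two taxa correlate at or above the threshold
    (an assumed, simplified definition of co-abundance).
    """
    nodes = [{"data": {"id": taxon}} for taxon in abundance]
    edges = []
    for a, b in combinations(abundance, 2):
        r = pearson(abundance[a], abundance[b])
        if r >= threshold:
            edges.append(
                {"data": {"id": f"{a}--{b}", "source": a, "target": b, "weight": round(r, 3)}}
            )
    return {"nodes": nodes, "edges": edges}


# Made-up abundances of three taxa across four samples.
demo = {
    "taxon_a": [10, 20, 30, 40],
    "taxon_b": [12, 22, 29, 41],
    "taxon_c": [40, 5, 33, 2],
}
elements = co_abundance_elements(demo)
print(json.dumps(elements, indent=2))
```

In this toy data only taxon_a and taxon_b rise and fall together, so they are the only pair that ends up connected. The resulting dictionary could be serialized and handed to the browser as the elements of a Cytoscape.js graph.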

34:34 Michael Kennedy: This portion of Talk Python To Me is brought to you by Tidelift. Tidelift is the first managed open-source subscription, giving you commercial support and maintenance for the open-source dependencies you use to build your applications. And with Tidelift you not only get more dependable software, but you pay the maintainers of the exact packages you're using. Which means your software will keep getting better. The Tidelift subscription covers millions of open-source projects across Python, JavaScript, Java, PHP, Ruby, .NET, and more. And the subscription includes security updates, licensing verification and indemnification, maintenance and code improvements, package selection and version guidance, road map input and tooling, and cloud integration. The bottom line is, you get the capabilities you'd expect and require from commercial software. But now, for all the key open-source software you depend upon. Just visit talkpython.fm/tidelift to get started today. And I bring it up 'cause it looks like such a cool library. Obviously it just does graphing type stuff, but the variety of the visualizations there is super cool. So people who are looking to draw those types of relationships, I think there's a lot of stuff to take here. There's even one where it represents like different animals, linked together, and the vertices are actually pictures of animals and stuff like that. So it's a really nice library for that. Was it pretty easy to use?

35:57 Sebastian Proost: It's very, very performant in terms of drawing the graphs, but you kind of have to build the interface around it. It just draws nodes and edges, gives you some functions to interact with them, but if you want to have like an option to change the color of the nodes, those little things, you have to add yourself. So it's pretty straight-forward, once you have your data in the right format, to visualize it using Cytoscape.js. I struggled quite a bit to build the interface and all the buttons that I required for my viewer to get them to link.

36:29 Michael Kennedy: To get it to behave just right?

36:31 Sebastian Proost: Yes, exactly.

36:31 Michael Kennedy: It's one thing to draw it. It's another to make it interactive. For sure. So one of the areas that I thought would be fun to talk about, would be the pressures of being in an academic space and what counts as like merit and credit in that world, versus industry? So it sounds like a lot of the work that you're doing is around academics. But you also have a little bit of industry experience going on at the same time. And you know in academics, it's all about the academic paper, right?

37:03 Sebastian Proost: Yes.

37:04 Michael Kennedy: Is that a challenge for... My goal is to build a cool library in Python to help biologists but I can't publish that. So what am I going to actually do along with that? Or what's the tension like there?

37:14 Sebastian Proost: I am analyzing some data for a company and I think it's a really cool project, but it's their intellectual property, so you cannot talk too much about it. And so basically the findings there, that's all the company's. But so software and design, that's more the academic part of it. It's true that in academia, basically your work is measured by the number of publications that you have. So I need to kind of keep a balance between them, right? So I need to make sure that I also get some publications for my resume.

37:43 Michael Kennedy: Do you think that the industry is going to or academia I guess, is going to change somewhat to recognize software contributions as part of research more? I mean, this is not a problem related, just focused on biology, right? This is a problem in physics, in chemistry, in mathematics, and astronomy, right? Whatever, right? Anywhere where there's people building these great libraries that's taking away from their time to do research maybe.

38:09 Sebastian Proost: If you're really building a library for other people to use, for sure that's going to be an issue. I've mainly been developing applications. And so then the way we did it, was we kind of released the application along with data sets. And then you can kind of do some research with that data set as well. So you kind of have done a combination of novel data set with novel software that people can start using. And that can also become like a publication on its own. However, it does make it challenging to keep on supporting software. 'Cause once the publication is out, right, you got what you need.

38:47 Michael Kennedy: There's no more reason to keep adding features to it or whatever, right?

38:50 Sebastian Proost: Yes. So then I would need to continue basically building a larger data set in combination with better software, to kind of keep it going. And so I think during my PhD, one of the projects I was working on, we really were very successful in doing that. It was always a combination of like here's an online platform with a cool tool that we developed, but also each time here's like a larger data set, more species.

39:14 Michael Kennedy: It certainly makes sense to build tools and applications to answer your questions, right? Because I've worked in research groups before and a lot of the work was, we need to answer this question, we need to visualize this thing, we need to transform that data, and it was just writing software to do that. So have you heard about the Journal of Open Source Software, JOSS?

39:36 Sebastian Proost: I have. I actually think it was through one of your podcasts that I learned about it.

39:40 Michael Kennedy: Yeah, we had it on the show. Is that something relevant in your space?

39:44 Sebastian Proost: So far, I don't have any experience with it but I do like the idea of breaking down these like big research projects into smaller components that can be reused, right? MetaConnect for instance, is definitely a monolith. It's like a whole package based on, and it does so many different things. It would be cool if I could basically take some components out of there, and maybe get like a smaller publication out of it, through JOSS or some other journal. And it's something I've been trying to do recently, but so far haven't really been able to test it or put it in practice.

40:17 Michael Kennedy: Yeah, I was just thinking about it 'cause it seems like some of your stuff that you're building. Repackaged correctly, it might fit in there, right?

40:23 Sebastian Proost: Yes, so for instance, we have a lot of metadata for the Flemish Gut Flora Project, right? So we have thousands of people and we have all this medical information, this information about their diet and their lifestyle, and I have a viewer for that. Now that's something that you basically use across different disciplines, right? Every study where you have a ton of metadata would benefit from such a viewer. That's for instance, one of the parts that we'd like to kind of like separate into its own repository, it could become its own project.

40:51 Michael Kennedy: Cool, well that may be... I guess, school's just started, so it's probably not the easiest time to find free time, but maybe next summer or something? Or over winter break?

40:59 Sebastian Proost: Yeah! I'll have a look at it once I have some free time on my hands to work on it.

41:04 Michael Kennedy: Absolutely. So where's this research going and what're you building next? Like what more analysis do you want to do, what more libraries or apps are you building?

41:13 Sebastian Proost: I'm actually quite happy with the microbiome studies. It's a very challenging field and a lot of the things that we are doing now, are association studies. So we look at the population and see like, okay, if you have more of this bacteria then maybe there's a higher chance that you have certain disease. But that's just an association, right? It doesn't mean that having that bacteria will actually cause the disease. Correlation's not causation.

41:38 Michael Kennedy: Correlation or causation, yup exactly.

41:40 Sebastian Proost: And so right now, we're moving also towards more like longitudinal studies and intervention studies, to kind of like really establish those causal links. So that we can say like, "Yeah, that strain of bacteria causes something." Or, "No, it's actually disease that causes the person to eat something differently or behave differently, and that is actually changing the microbes in the gut," for instance.

42:05 Michael Kennedy: Is it a signal or is it a cause?

42:06 Sebastian Proost: And so that's now happening. And then of course, if you get like time-resolved data, that's going to be like the next challenge I think, to integrate all of this.

42:13 Michael Kennedy: So it sounds like more data might be on hand? And you already have a lot of data. Where does machine learning factor in your world? Have you tried to use like machine learning models to understand this and make these inferences? I know machine learning is extremely good at taking a picture and saying, "Oh that means it's this." But what about one of these abundancy grids or something like that?

42:38 Sebastian Proost: One of the things that we noticed or that basically my colleagues noticed a couple years ago, is that we can kind of stratify the Flemish population into four different groups, based on the bacteria that are in the gut.

42:52 Michael Kennedy: What are the distinctions there, the categories?

42:54 Sebastian Proost: Certain people just have like more of one kind of bacteria, right? And that seems to be kind of like one group. And they're not distinct groups. It's more like a spectrum. So we noticed there are these four different groups, right? So one of the things that would be nice, is if basically we could use some machine learning to automatically classify your samples into these different enterotypes. You could also start thinking of kind of trying to predict something based on the gut microbes. But I think that's still some ways off. It would be great if you could just look at the sample and say like, "Okay, this person is at risk for a certain disease." Maybe that will work, maybe not. So it's still very vague right now, whether we'll be able to do that or not. But people are probably looking at it.

43:38 Michael Kennedy: Yeah, it sounds exciting. It doesn't sound out of reach at all.

43:42 Sebastian Proost: It really depends on like the data that we can generate and yeah, really to explore that.

43:46 Michael Kennedy: I guess it depends on how reliable the science is, that says if it's like this, that means that, right? Because you've got to train the models to say, I don't know. Maybe the models could make discoveries that people don't see. I don't know, I haven't done enough machine learning. But there's probably some really interesting things you could do with machine learning and that data.

44:05 Sebastian Proost: For sure, yes. But whether or not we'll be able to make any kind of predictions towards health, is something I'm very careful about.

44:13 Michael Kennedy: Categorization one sounds likely.

44:14 Sebastian Proost: That's something that will probably be possible based on what I've seen so far. And as we're generating more and more data, it would be very nice to just do that initial classification into the enterotype, in this automatic way.
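
As a rough illustration of what automatic classification into groups could look like, here is a nearest-centroid sketch in plain Python. It is not the method used by Sebastian's group: the counts, the labels "type_1" and "type_2", and the function names are all made up, and a real analysis would more likely use scikit-learn and a model suited to compositional data.

```python
from math import dist


def relative_abundance(counts):
    """Convert raw counts to proportions (assumes the sample total is non-zero)."""
    total = sum(counts)
    return [c / total for c in counts]


def fit_centroids(samples, labels):
    """Average the relative-abundance profiles of the samples in each labelled group."""
    grouped = {}
    for sample, label in zip(samples, labels):
        grouped.setdefault(label, []).append(relative_abundance(sample))
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(sample, centroids):
    """Assign a sample to the group whose centroid is closest (Euclidean distance)."""
    profile = relative_abundance(sample)
    return min(centroids, key=lambda label: dist(profile, centroids[label]))


# Made-up counts for three taxa; two hypothetical groups instead of four.
training_samples = [[80, 15, 5], [70, 20, 10], [10, 75, 15], [20, 70, 10]]
training_labels = ["type_1", "type_1", "type_2", "type_2"]

centroids = fit_centroids(training_samples, training_labels)
print(classify([60, 30, 10], centroids))
print(classify([5, 90, 5], centroids))
```

A hard assignment like this ignores the point made earlier that the groups are more of a spectrum than distinct clusters, so samples near a boundary would deserve a probability or distance score rather than a single label.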

44:26 Michael Kennedy: For sure. So well, it sounds like you're making some cool use of Python to study these microbiomes. And they're just so fascinating because they're really out of reach for us, in any way that we can conceive about it, other than a very small view with a microscope. It's super interesting.

44:42 Sebastian Proost: I think so too.

44:43 Michael Kennedy: Hopefully that inspires the biologists out there to do more Python and things like that. Super cool. So thanks for sharing that with us. I think probably we'll leave it there for that conversation, but before you go, I do have the two questions for you. So if you're workin' this project, you're going to write some Python code, what editor do you use?

45:02 Sebastian Proost: So if I'm working on MetaConnect or any kind of like large project, that would be PyCharm. If it's more like very specific analysis, I'll fire up Jupyter Notebooks.

45:11 Michael Kennedy: Where do you make that trade-off? Where do you say, okay now it's too much for the Jupyter Notebook, I'm switching to PyCharm? Like when do you transition from one to the other.

45:19 Sebastian Proost: So I have like a couple projects and they really are, from the beginning, decided to be like an application I'm developing. And for that, it's automatically something I will do in PyCharm. If it's more like processing data, preparing it, do something else with it, or something I just have to do once, maybe twice, then I'll go for the Notebook. Also for reporting back something, I think Notebooks are really powerful. So you can just generate all the graphs, export it to PDF, and say here's the thing that you asked.

45:45 Michael Kennedy: Yeah, yeah. That's perfect. Super, and then notable PyPI package, that it makes your life better?

45:52 Sebastian Proost: One that I discovered recently that's really like saved my day, is something called UltraJSON. I think it's pip install ujson to get it. And it's basically a very performant version of the JSON library in the standard library. So I was developing an endpoint and had to deserialize some JSON objects that were stored in the database, combine them, and serialize them again to something I could then send to a viewer. And that just made everything a couple hundred milliseconds faster, which made the website so much more responsive.

46:25 Michael Kennedy: Oh, that's super cool, yeah. UltraJSON is a JSON encoder and decoder written in pure Python. I'm sorry, pure C, with bindings for Python. Yeah, that's great. So you probably have a lot of data on those endpoints, right? Like 10,000 measurements or whatever?

46:40 Sebastian Proost: Yeah, exactly. So if like I need to kind of like deserialize one of those complex objects, do something with it, combine it with like another one and serialize it again, very quickly takes seconds, which for a web interface, is very slow. So everything that can just take like a couple hundred milliseconds off that time, is really great.

46:59 Michael Kennedy: And is it the same API as regular JSON module?

47:03 Sebastian Proost: For the most of it, yeah. Pretty much everywhere in my code, I was able to just write import ujson as json, and it worked.
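
The drop-in pattern Sebastian mentions, together with the deserialize, combine, and serialize flow he described, might look like the sketch below. The payload shape and the combine_payloads helper are hypothetical. The fallback to the standard-library json module is an addition so the snippet still runs where ujson is not installed.

```python
try:
    import ujson as json  # optional C-accelerated package (pip install ujson)
except ImportError:
    import json  # standard-library fallback with the same loads/dumps calls


def combine_payloads(stored_a, stored_b):
    """Deserialize two stored JSON strings, merge their samples, and serialize again.

    The {"samples": [...]} shape is a made-up example, not MetaConnect's format.
    """
    a = json.loads(stored_a)
    b = json.loads(stored_b)
    merged = {"samples": a["samples"] + b["samples"]}
    return json.dumps(merged)


result = combine_payloads('{"samples": [1, 2]}', '{"samples": [3]}')
print(result)
```

As Sebastian says, the API matches "for the most of it": loads and dumps behave alike for simple cases such as this one, but the two libraries do not support exactly the same keyword arguments, and the whitespace in the serialized output can differ, so it is worth comparing parsed values rather than raw strings.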

47:10 Michael Kennedy: That's sweet. All right, well that's a great suggestion then. 'Cause it sounds so easy to adopt and will be helpful for a lot of folks. Cool, all right well, final call to action. People are out there interested in biology or biologists are out there interested in Python, they want to get started. What do you say to them?

47:24 Sebastian Proost: Right, so definitely try some coding in Python. Even if you're mostly into biology, just knowing a little bit of the basics can make your life a lot easier, just being able to automate something for a couple times can prove very useful, right?

47:40 Michael Kennedy: Yeah, absolutely. You know, escape Excel right?

47:42 Sebastian Proost: Yes.

47:43 Michael Kennedy: Or MATLAB, one of the two.

47:44 Sebastian Proost: We're collecting more and more data. I encountered so often that people were using Excel and, at one point, they just hit that wall. And then you know, Python, pandas, and scikit-learn can help you break through that. So have a look at it, if you need.

48:00 Michael Kennedy: Cool, alright well Sebastian, thanks for being on the show. It's great to chat with you.

48:03 Sebastian Proost: Yeah it was great to chat with you as well Mike, thanks.

48:05 Michael Kennedy: This has been another episode of Talk Python To Me. Our guest in this episode was Sebastian Proost, and it's been brought to you by Linode and Tidelift. Linode is your go-to hosting for whatever you're building with Python. Get four months free at talkpython.fm/linode. That's L-I-N-O-D-E. If you run an open-source project, Tidelift wants to help you get paid for keeping it going strong. Just visit talkpython.fm/tidelift, search for your package, and get started today. Want to level up your Python? If you're just getting started, try my Python Jumpstart By Building 10 Apps course. Or if you're looking for something more advanced, check out our new Async Course that digs into all the different types of Async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle, it's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python, we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the Direct RSS feed at /rss, on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code!
