
#164: Python in Brain Research at the Allen Institute Transcript

Recorded on Friday, May 4, 2018.

00:00 Michael Kennedy: The brain is truly one of the final frontiers of human exploration. Understanding how the brain works has vast consequences for human health and for computation. Imagine how computers might change if we actually understood thinking and even consciousness. On this episode, you'll meet Justin Kiggins and Corinne Teeter, who are research scientists using Python for their daily work at the Paul Allen Brain Institute. They are joined by Nicholas Cain, who is a software developer there, supporting scientists using Python as well. Now, even if you aren't interested in brain science directly, I really encourage you to listen to this entire interview, it's super fascinating. This is Talk Python To Me, Episode 164, recorded May 4th, 2018. Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @MKennedy. Keep up with the show, and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by Cox Automotive and Rollbar. That's right, Cox Automotive has joined the show as a sponsor. They're looking for new developers, so check out what they're offering during their segment or the link in the show notes. Justin, Corinne, Nick, welcome to the show.

01:29 Panelists: Thank you for having us. Hello. Hi.

01:30 Michael Kennedy: Hello, hello. It's super exciting to have you here on the podcast. I'm very, very interested in learning about how you're applying Python, and data science type things to brain science. It's going to be really, really fun.

01:44 Panelists: Great.

01:44 Michael Kennedy: For sure, but before we get into all the details, let's start with your story, I guess, Justin, go first. How did you get into programming in Python?

01:51 Panelists: I started programming mostly in kind of college, working in research labs, you know, part of engineering classes, and that was largely kind of MATLAB and LabVIEW. MATLAB is kind of the dominant language in most neuroscience research environments.

02:08 Michael Kennedy: What was your degree, what were you studying at the time?

02:11 Panelists: I was studying bioengineering, biomedical engineering, and then I went and started a PhD in neuroscience. And it was during my PhD that I decided that... There's this old C code, like raw C, that my advisor had written for some of our experiments, and I was chasing pointers, and like trying to figure out how to do memory buffers with audio, and I was like, this is brutal, I don't want to do this. And I basically cold turkey switched everything that I was doing over to Python. So, rewrote a bunch of that code, taught myself Python by kind of rewriting that, implementing it in Python, starting to use some of the scientific Python stuff for my analysis, building out a Django database to keep track of the research that I was working on. It was kind of a cold turkey switch for me about 2012 while I was working on my PhD.

03:03 Michael Kennedy: And it was a good switch, you're happy with it?

03:05 Panelists: I mean I think that, it's done me well, and the rest of the field I think is starting to catch up, and it's only become more powerful since then.

03:15 Michael Kennedy: If you look at the popularity of Python it's been going upward, but there was a major inflection point where the rate of popularity growth increased, around 2012, and I think a lot of that is due to the data science tool improvements, that whole space.

03:31 Panelists: Absolutely. I think that I really just caught the edge of that wave, so.

03:37 Michael Kennedy: You're part of that wave for sure, for sure. Nice, and Corinne, how about yourself?

03:41 Panelists: I started coding after my undergraduate degree. I had an undergraduate degree in physics and psychology. Afterwards went to Los Alamos National Lab, and so my first coding language there, I was doing very, you know, physics-y dominated stuff, so...it was FORTRAN, actually. After that I went back to grad school in computational neuroscience and there the main coding language we used was MATLAB, as Justin mentioned. And then after that, I had a couple positions at, for example, Qualcomm and Sandia National Labs, and there I was still using mostly MATLAB, so we'd have to buy licenses, and then I came to the Allen Institute...and here, Nick and I were both here during the very beginning of kind of our latest 10-year plan, and we wanted to make sure that everything we use, like one of the goals of the Allen Institute is to be able to make standardized data that the community can use, and part of that is wanting it to be open source, so a lot of us on the ground were thinking about this and we had the option at the time to use whatever coding language we wanted to for the projects that we were pursuing, but we really all got together and was like, you know what, we want everyone to be able to look at our code, we want each other to look at our code, and... we're going to go with Python, so I learned Python on the fly when I came to the Allen.

05:04 Michael Kennedy: What was the transition like coming from say, MATLAB to Python?

05:08 Panelists: It was a learning curve, I'd like to say I was kind of floundering around for probably about three months. The indexing in MATLAB and Python is different, and we do a lot of time-series data analysis, so just the indexing and things like that was a transition. But at the end of the day, I'm very glad that we chose to do that.

05:30 Michael Kennedy: That's cool, yeah it's...it's different, but it's not that different, right?

05:34 Panelists: No.

05:35 Michael Kennedy: It does still have a similar feel, at least I think, so-

05:38 Panelists: So we could have gone to, like, different... I mean we could have gone to C or some other language like that, but really it was a great transition for people on the outside. We knew a lot of them would be very MATLAB savvy, since that was kind of the main language at the time, I think people are transitioning now, but it's still a high-level language.

05:56 Michael Kennedy: Right, and I think the ecosystem for Python aligns very well with your mission of trying to have everything open source and stuff, right? Of all the different languages, Python embraces this sort of zen of open source more than average, I would say. Nick, how about yourself?

06:13 Panelists: So just like Corinne and Justin, I started programming in MATLAB as an undergraduate. When I went to graduate school at the University of Washington for my PhD, I was in the Applied Math department and my advisor encouraged me to learn FORTRAN, so I wrote my first project in FORTRAN and just like Justin was saying, I was chasing all sorts of things that I didn't really understand and having a difficult time, and decided I would try out this language that some of my colleagues were telling me about, and started rewriting all my algorithms in Python and using that as sort of my learning case. Then as I got more into computational neuroscience, there's actually a lot of packages that are written in low level languages, packages like Nest or Neuron that have developed really good Python bindings. So, I realized that I didn't have to sacrifice efficiency or engagement with these other theoretical and computational neuroscience communities, but I could still program in a language with a ton of flexibility and a ton of tools, so it was a really natural transition over. So, then when I came to the Allen Institute, brought that knowledge in, and to be honest, really haven't looked back. I've been using Python for most of my day-to-day work.

07:30 Michael Kennedy: That's really a cool story. Because of the bindings, right, because there's underlying libraries, and people can still use those libraries, it's just you happen to be able to program in a higher level language, and if they want to go write in C or FORTRAN, like that's all well and good for them, right?

07:46 Panelists: You can use the expertise of really core developers working on highly technical material in really efficient, multi-processing libraries, but then be able to define, at a high level, simulations and models in a much more user-friendly syntax, but really not sacrifice efficiency.

08:08 Michael Kennedy: That's really really cool. So I kind of want to go through projects that you're each working on and give people a sense of, what is it you do day-to-day at the Allen Institute, because it's not like, well I work at this e-commerce site, or I work at a bank, we all know what that looks like, but this, you work at a pretty special place, so I'm going to keep the same order I guess. Justin, like what kind of stuff do you do day-to-day?

08:32 Panelists: I'm a scientist in the visual behavior team. So, in general, the broader... I'm in a chunk of the Institute that is very interested in neural coding. One way of thinking about what the brain does and what neurons in the brain do is that they have some representation, some way of encoding what is out in the environment. So if I'm looking at something, there's a particular pattern of activity that that's going to elicit in the cells in my brain. In general, we're trying to understand how this happens, how these types of representations emerge, what they are, and then how other parts of the brain use those representations to make decisions, to do whatever the other parts of the brain need to do with those kind of intermediates. In general that's the kind of stuff that we do. So we have a large experimental pipeline; one of the interesting things about the Allen Institute is we kind of take an industrial approach to generating data for these types of experiments. We have these very large pipelines that generate very large data sets on standard experimental rigs.

09:45 Michael Kennedy: So these experiments, are you just, like, bringing folks in, and you have them... do you hook them up with an EEG, or-

09:52 Panelists: Yeah, no, so most of the work that we do here in my group is dealing with mice. So we can actually present images on the screen for the mouse, and then record individual neurons in the mouse's brain. It's very hard to do this in humans, but it's a little bit easier to do it in mice.

10:11 Michael Kennedy: That's wild, how do you get them to pay attention to the screen?

10:13 Panelists: This is part of what my project deals with. They are... So there's an experimental setup, so at the end of the day we basically need to fit a very large microscope over them in order to record from the individual neurons, and a very small glass window is implanted in order to be able to see into their brain. They are basically trained to be comfortable with getting their head fixed up against the microscope, and they've got a little running wheel they can run on, and then we have the screen next to them that's presenting images.

10:47 Michael Kennedy: I see, so they're kind of fixed, like looking straight at it.

10:50 Panelists: They're kind of stuck, but they've got a wheel in front of them, so we are controlling the visual environment but they're kind of free to move otherwise, so it's almost like a little virtual reality type-

11:01 Michael Kennedy: We just need a miniature Oculus Rift type thing--

11:06 Panelists: Exactly.

11:07 Michael Kennedy: I mean it's basically...

11:08 Panelists: So then... I mean that's an interesting segue, 'cause then, your question...you know, how do we actually get them to pay attention, is that we put a lick spout in front of them, and they can lick the lick spout and if they lick at the right times, then we make sure that they get a little bit of water. And so they basically, through trial and error, start realizing what we're trying to get them to pay attention to on the screen. They're basically in a little video game. I mean they basically are, you know, we're controlling what's on the screen and they have to lick when the game rewards them for licking.

11:43 Michael Kennedy: That's really wild. That's quite interesting. So then you capture all this data and sort of analyze it afterwards, huh?

11:51 Panelists: So we generate data, we... I mean it's a little bit of... So we've got some, so some of the data gets streamed and analyzed in real time to give the trainers feedback on, on what the mice are doing and their wellbeing. We have a...in order to train them and to do this at scale, we have to standardize these training procedures, the game has to go from easy to hard. So we have an entire system that Nick and I actually have coordinated on, where at the end of each training session, the data gets uploaded, some automatic analysis happens, that determines what the next stage is that the mouse is going to have the next time they come in for training. And so that requires, you know pushing data back and forth between servers, sending it off to a microservice that Nick is running, and then, you know the next day, the mouse is on that next stage. So we train them up, then when they're ready, then we put them under the microscope, and so they're in a similar situation, but now we've got a microscope that is recording the activity of individual neurons in their brains. This gets acquired over a few days, all that data, I mean we're talking very large data files that are literally movies of neurons in their brain, that all gets-
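To make the auto-progression idea concrete, here is a purely hypothetical sketch of the kind of rule such an end-of-session analysis might apply. The function name, metrics, and thresholds are all invented for illustration; the actual criteria and service aren't described in the episode.

```python
# Hypothetical sketch only: the real Allen Institute criteria are not described here.
def next_training_stage(current_stage, hit_rate, false_alarm_rate,
                        hit_threshold=0.8, fa_threshold=0.2):
    """Pick the stage the mouse will see at its next training session."""
    if hit_rate >= hit_threshold and false_alarm_rate <= fa_threshold:
        return current_stage + 1   # performance criterion met: make the task harder
    return current_stage           # otherwise, repeat the current stage

# Example: a session with good detection and few false alarms advances the mouse.
print(next_training_stage(current_stage=3, hit_rate=0.85, false_alarm_rate=0.1))  # -> 4
```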

13:08 Michael Kennedy: Wow, that was going to be one of my questions, like how big is one of these files, like how much data are we talking about? Yeah, how big is one of these files, Nick, do you even know?

13:16 Panelists: Terascale, I think, it's less than a Terabyte, but it's you know, many hundreds of gigs.

13:22 Michael Kennedy: Wow, okay.

13:23 Panelists: So this gets pushed up to the server and then we've got a whole other team that's developed algorithms for...basically extracting these signals out. So you've got a bunch of, kind of ML that has to happen in order to basically do segmentation, a bunch of image recognition stuff, you know, where are the cells in this movie, and then extracting the activity of those cells. And then basically kind of at the end of a bunch of that pipeline, of that kind of processing ML pipeline, a bunch of this data then comes back to me, where now I have signals, and I have the record of the images represented on the screen, I've got other data about when the mouse licked, when it didn't, and so basically then I take this data and try to make sense of it. So to what extent can I, you know, if I'm just looking at the activity, can I decode what was on the screen from that activity? If I'm looking at the activity, can I predict what the mouse's choices were at any given time, whether it chose to lick or whether it chose not to lick, in the context of its performance on the game? And so a lot of that is basically-

14:31 Michael Kennedy: That's just fascinating. Yeah, this is really wild, I had no idea. That you can do these kinds of things.

14:35 Panelists: At the end of the day, basically to do this, it's all largely scikit-learn and pandas, by reducing this stuff into a feature matrix where the features are the activity of any given cell. So if I've got 100 cells that we've recorded from, each cell becomes one dimension in my vector, and I've got a bunch of categorical or continuous information about what was on the screen, and then it's just a regression or a classification problem at that point. And this basically is what lets us kind of... You know, by approaching it in this way, we can build the inferences and say, "Well, this area over here did really well at decoding images, this area over here didn't. But that area was very good at predicting what the mouse's decision was." So we can kind of start to build out inferences about what different parts of the brain are doing and how they are doing that through this type of approach.
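As a rough illustration of the decoding setup described here (one feature column per recorded cell, stimulus identity as the label), here is a minimal scikit-learn sketch. The data is randomly generated stand-in data, not Allen Institute data, and logistic regression is just one reasonable model choice.

```python
# Illustrative sketch only (not the Institute's actual pipeline): decode which
# image was on the screen from the activity of the recorded cells.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trials, n_cells = 500, 100                       # one feature (column) per recorded cell
activity = rng.normal(size=(n_trials, n_cells))    # trial x cell response matrix (made up)
image_ids = rng.integers(0, 8, size=n_trials)      # which of 8 images was shown on each trial

X_train, X_test, y_train, y_test = train_test_split(
    activity, image_ids, test_size=0.25, random_state=0)

decoder = LogisticRegression(max_iter=1000)
decoder.fit(X_train, y_train)
print("decoding accuracy:", decoder.score(X_test, y_test))
```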

15:37 Michael Kennedy: This portion of Talk Python to Me is brought to you by Cox Automotive. They're leading the way in cutting edge, industry changing technology that is transforming the way the world buys and sells its cars, and they're looking for software engineers and technical leaders to help them do just that. You hate being stuck in one tech-stack? Well that's not a problem at Cox Automotive. Their developers work across multiple tech-stacks and platforms. They give you the room you need to grow your career. Bring your technical skills and coding know-how to Cox Automotive. You'll create real-world solutions to today's business problems alongside some of the best and brightest minds. Are you ready to challenge today and transform tomorrow with Cox Automotive? Go to talkpython.fm/cox, C-O-X, and check out all the exciting positions they have open right now. Couple of thoughts, one...who would have thought that a library coming out of the financial industry, pandas, would be helping us understand the brain and also who knew mice could generate so much data?

16:39 Panelists: Well I mean, the amount of data that we can generate, I mean I think it's probably obvious, and we're not even getting all the data we can generate out of these guys. We have...This is, I mean, this data that we're talking about, I mean we're literally talking zooming in on an area of the mouse's brain that is maybe a few, what like, hundreds of microns, micrometers wide, and maybe like, a really thin, thin piece. So we're talking, I mean we're talking about a couple, you know, many dozens to hundreds of neurons out of the thousands-

17:13 Michael Kennedy: The whole brain,

17:15 Panelists: ...and thousands of neurons in the mouse's brain, right? We're not, I mean...this is just the tip of the iceberg in what we could be potentially recording as technologies improve.

17:23 Michael Kennedy: Someday we probably will be, right?

17:25 Panelists: I mean there's tons of initiatives. I mean, the Allen Institute is leading on a bunch of efforts. The recording modality I just described to you is the... What we have currently released, and the kind of stuff that is currently on our website, not with all the behavior, but that recording modality. There's a new effort with Neuropixels probes that the Allen Institute has been involved in that will get us kind of up into the thousands range of simultaneously recorded cells, and there are even more forward-thinking efforts to be much more comprehensive in what we can record from in this level of detail.

18:05 Michael Kennedy: It's amazing. All right, Corinne? How about yourself?

18:08 Panelists: Justin works on kind of a higher level project where you have actual behaving mice. I'd say I work one scale downwards, with a large part of this institute that is trying to really define what the components in the brain are. So you have a bunch of neurons and theoretically those are differentiated into different cell types. So researchers in the field have been trying to figure out what sorts of types of neurons there are in the brain for probably 100 years. This is something that people haven't really solidified their ideas on.

18:46 Michael Kennedy: It's still an open question? Like people don't know all the types?

18:50 Panelists: It's an open question, so we're really devoting a lot of resources to try to get to some sort of ground truth. It's not clear that there are going to be specific types of neurons. There's probably a continuum, but how well can we define it, and after we have some definitions can we figure out what those different types are doing, what function they're performing in the brain? So...Justin didn't mention that there's a reason we use mice, and the reason we use mice is that we have a lot of genetic controls. So, we specifically breed different types of mice to fluoresce under a microscope for different types of genes that are expressed in the neurons. The mice don't fluoresce, but the individual neurons fluoresce, so when you're looking under a microscope you see a bunch of different neurons, and depending on what type of neuron we're marking, that neuron will fluoresce under the microscope. So we have a lot of genetic control over recording from neurons where we kind of know what transgenic type they are. In the group and project I work on, we are looking at electrophysiology data. So what that means is you stick an electrode into a neuron, so these mice are sacrificed and you have slices of the brain tissue, and we also do this in humans. We have a lot of agreements with the hospitals in the area where, if they're excising part of the brain during surgery, we will get that tissue and we'll record from those neurons also. So this is a nice project to kind of try to relate mice to humans.

20:31 Michael Kennedy: How similar are they?

20:32 Panelists: My first position, as I mentioned... I came from a physics background, was basically building a modeling pipeline. So you stick an electrode into a neuron, you inject current, and you record the voltage output. And then you try to come up with mathematical equations that will recreate the behavior of the neuron based on the current injection, just like a circuit. And so, we recently wrapped up this project where we were looking at how much detail in the mathematical equations is needed to reproduce the behavior of these neurons. And so this is all available on our website now, and the idea here is that when people are building larger scale networks, you want to use realistic spiking behavior of individual neurons. So now, depending on the level of abstraction someone might want to use in a network that they're building, they can choose from this range of abstraction that we have on our website. So, the first project was that, building a whole pipeline to do this all automated. So data's taken, it goes into our storage facility, then there's some QC algorithms that we built up to, you know, QC the data, and then I pull that data out, come up with algorithms, and test them in a very machine-learning type way. You basically have a test set and a training set. And then the project I work on now is trying to figure out the components. So you inject current into one neuron, and you measure the voltage that's happening on another neuron it's connected to.

22:10 Michael Kennedy: How complicated does that get? Is it kind of simple to some degree, like Newtonian mechanics? Is it crazy complex, like dynamical, chaotic systems? Like, what are you working with here?

22:23 Panelists: So it depends... Nick will talk about this a little bit, cuz he spent a lot of time building actual network models. What I do is, I would say, relatively simple mathematical equations. There's also a level of models that are made that are what we call biologically realistic, where you try to model all of the ion channels in a neuron. So you have a lot of different ion channels in a neuron: calcium, sodium, potassium, lots of them... you know, 25 to 100 different channels. You actually try to model the gates opening and closing and current flowing into the neuron, but we abstract away from that because we have found that's not necessarily needed to predict the spiking behavior of a neuron. But we also have those high level, or sorry, those very complex models, too. So it depends.
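For a feel of the simple end of that spectrum, here is a minimal leaky integrate-and-fire neuron driven by a current step. It is an illustrative textbook model, not one of the Institute's models, and all parameter values are made up.

```python
# A minimal leaky integrate-and-fire neuron driven by a current step.
# Illustrative only: parameter values are made up; this is the "simple
# equation" end of the spectrum, not one of the Institute's models.
import numpy as np

dt, T = 1e-4, 0.5                     # time step and duration, in seconds
tau_m, R, v_rest = 0.02, 1e8, -0.07   # membrane time constant (s), resistance (ohm), resting potential (V)
v_thresh, v_reset = -0.05, -0.07      # spike threshold and reset voltage (V)

t = np.arange(0, T, dt)
I = np.where((t > 0.1) & (t < 0.4), 2.5e-10, 0.0)   # 0.25 nA injected current step (A)

v = np.full_like(t, v_rest)
spike_times = []
for i in range(1, len(t)):
    dv = (-(v[i - 1] - v_rest) + R * I[i - 1]) / tau_m   # leak toward rest plus input drive
    v[i] = v[i - 1] + dv * dt
    if v[i] >= v_thresh:              # threshold crossing: record a spike and reset
        spike_times.append(t[i])
        v[i] = v_reset

print(f"{len(spike_times)} spikes during the current step")
```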

23:19 Michael Kennedy: Okay. Yeah, so you can look at it at different levels

23:21 Panelists: Exactly-

23:22 Michael Kennedy: ...depending on what questions you're trying to ask. And, yeah, Nick, how about... go for it.

23:25 Panelists: Well I just wanted to jump in there and sort of highlight one of our Python packages that the institute has been building over the past year. It's a package we call the BMTK, the Brain Modeling Toolkit. It's available on our website and it's gone through sort of a soft release, but it's a Python package, a Python wrapper around several neural simulators, like I was mentioning earlier, that allow researchers to construct and simulate neural circuits, like Corinne was saying, at a bunch of these different levels of biological realism. So, you know, Corinne was highlighting some of her work at sort of the one-differential-equation-per-neuron scale, but you can go much deeper and simulate individual compartment models that can resolve the complicated morphologies of the dendritic trees and how those interact with each other. Or even all the way down to the synapses and ion channels. That's the sort of extreme in biological realism. My first project at the institute was on the other extreme, so-called population density modeling, where we used partial differential equations to simulate entire populations as sort of one homogeneous group. There are different biological questions that you'd want to pose at different points on this continuum. I'll give you an example. Is it important to simulate the exquisitely complicated nature of the trees of these neurons to understand their input/output properties? Well, if the answer is yes, then you're going to have to use a simulation tool like Neuron... although it might be sufficient to just look at the spiking behavior down near the soma of the cell, in which case a simulation tool like Nest, which also has a Python API, would be more appropriate. If you're just interested in the sort of mean field dynamics, the population-to-population contributions to circuit dynamics, then the neural simulator that I wrote, called DiPDE, which is actually in pure Python, would be the most appropriate tool. So we have a Python package that actually wraps all these different levels of detail so you can move between each of the different scales as your work demands it.

25:37 Michael Kennedy: That sounds really interesting because you might start a research project thinking I'm going to look at one level but realize no actually, we need to try to think about it differently, but you have the same API or something like that, right?

25:48 Panelists: Exactly. You know, there's some big switching costs associated with having to learn a whole set of tool chains. The simulator Nest was originally written with its own custom language for describing the network topologies, I think it was called SLI. I know Neuron has its own language called hoc, right, Corinne? Yes. But it also has a Python interpreter now. So if you're having to switch, based on your biological question, to a different type of simulation, now you've got to learn yet another custom description or modeling language. It really is taxing on the individual scientist. So, that's why unified Python APIs, where you can just sort of learn one language but still get the power of all these simulation tools, are really helpful.

26:33 Michael Kennedy: It sounds great. So are these tools and libraries being taught in academia these days? Like are they used for-

26:41 Panelists: That's a great question-

26:41 Michael Kennedy: ...research projects?

26:41 Panelists: So there are some examples where they're taught to undergraduates, although it's a pretty specific topic. In graduate school, that's where I learned about all these tools, but you know, when you're doing your PhD you have a neuroscience or applied math question in mind, and then you go find the tool that's most appropriate. So most of the time I'd say it's learning on your own. We do have several examples of training courses that we provide. They're actually not just at the Allen Institute, but all over the world, for certain specialized computational neuroscience and also experimental neuroscience. I know that at our summer training course last year, using these Python APIs was one of the main focuses of the course, or it was a focus of the course.

27:31 Michael Kennedy: Are these courses taught online or are they taught in Seattle, or ... That's where you guys are?

27:34 Panelists: Yeah. So we're all in Seattle just down here on Lake Union, but I was referring to our Friday Harbor summer course, which is actually, I think, in its fifth year now, where there's an application process really geared towards graduate students and post-docs, maybe early faculty, and it's a two-week in-residence course up in the San Juan Islands at the Friday Harbor Labs, which is run by UW.

27:57 Michael Kennedy: That's the Friday Harbor tie-in, yeah. I've stayed up there on San Juan Island, it's wonderful up there, yeah.

28:02 Panelists: It's a combined effort between the University of Washington and the Allen Institute, and there's also funding from the Capley Institute that helps make sure that it is a success, and it's not only computational neuroscience, but actually I think it's kind of morphed into a combination of big data and experimental neuroscience, and there's also an introductory period where it's all taught in Python, so if the students are coming into the course without a strong Python background, there's sort of a Python bootcamp the first couple days to help train students to use our data APIs and some of the tools they might find useful on their projects. All of the course material from the last few years is on the Allen Institute GitHub repository. So there's Jupyter notebooks that cover a lot of the course material that those students use, and that's freely available for folks to go and download and start poking around. One thing this also offers is, I mentioned that we release a bunch of this data online, so currently we have what we call our brain observatory, which is at observatory.brain-map.org. That is the website for the web version of it, and you can go there and poke around and see a little bit of what is in this data set of, I want to say, 40,000 or so neurons from my work. Corinne's data is somewhere in parallel. It's also on brain-map.org, I don't remember exactly where it is, but the stuff she's been working on is also freely released, and Nick's team manages the API and the Python wrapper for the API to access this. So you can basically do pip install allensdk. That will install a Python package locally. You spin that up, and then with like three or four lines of Python code you'll start downloading almost all the data that we've released in this. It might take a while, but you've got access to it at your fingertips.
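To give a concrete sense of what those "three or four lines" can look like, here is a sketch using the AllenSDK's Brain Observatory interface. Method names follow the AllenSDK documentation from around the time of this episode and may have changed since, so treat it as an approximation rather than a guaranteed recipe.

```python
# Sketch of pulling Brain Observatory data with the AllenSDK; method names
# follow the docs of the time and may differ in current releases.
from allensdk.core.brain_observatory_cache import BrainObservatoryCache

# The manifest file tells the cache where to keep downloaded data locally.
boc = BrainObservatoryCache(manifest_file="brain_observatory/manifest.json")

containers = boc.get_experiment_containers()                  # available experiment containers
experiments = boc.get_ophys_experiments(
    experiment_container_ids=[containers[0]["id"]])           # imaging sessions for one container
data = boc.get_ophys_experiment_data(experiments[0]["id"])    # downloads the NWB file on first use
timestamps, dff_traces = data.get_dff_traces()                # per-cell dF/F time series
print(dff_traces.shape)
```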

30:13 Michael Kennedy: One of my thoughts around this is you talk about how much data that you're gathering for all these projects and stuff, and I know the folks at CERN... Instead of running there, downloading the data and running analysis on it, they push their analysis to where the data is, because there's so much data they've got like a cloud-computing infrastructure that... Like send your algorithm to the data and run it locally. So with yours, what is it like? Do you actually download all the data and process it or do you download little segments that you ask for or how does it work?

30:44 Panelists: This specific tool is basically downloading after a lot of the pre-processing, and we've gotten it to a point that it's condensed to the level that your average post-doc or graduate student who would want to explore this data would want to play with it. That's about the point at which we release it for download. We have our own compute internally for a lot of our own data that relies on our cluster, where we keep everything very close to the compute. It really depends on the types of questions. We have an entirely different chunk of the institute that is not represented here that is doing very dense microscopy, so trying to build out... It's going to take them months to acquire the dataset alone. This is electron microscopy, so it's every single neuron within some area in incredible detail. The size of that data is just, you know... it would be largely impossible to do the analyses that need to be done on that without staying very close to the data.

31:50 Michael Kennedy: That's really wild. What kind of questions are they trying to answer, Corinne? On that one, do you know?

31:54 Panelists: Oh I was just going to mention that, I mean we're trying to answer all sorts of questions, but we have been at Friday Harbor using Docker, too, and AWS has donated time, you know, just like they did at CERN, so last year at Friday Harbor... We used to just give out a terabyte disk to everybody that showed up, with the data that we were going to be talking about- It's the most efficient way to transmit the data sometimes.

32:18 Michael Kennedy: Well, you're joking, but I mean Amazon has their "ship us a bunch of disks" service... it's the fastest way to upload large quantities of data to S3 and stuff. It's wild- Sometimes that's what you need!

32:29 Panelists: Amazon generously donated a bunch of space and credits for us for this course, yeah, for all these students, and we... Yeah, we had a Snowball here that they sent us and we put a bunch of data on it and sent it off to Amazon. One thing that's interesting about that data: when you download data from the AllenSDK, the Python API, what you're getting is a bunch of pre-processed data that has had a lot of computational algorithms already applied to it. For example, neuropil subtraction, segmentation of regions of interest for the cells, that basically gives time series for the activity of each cell. There's an entire algorithms team at the Institute that works on the packages and the algorithms to do that. When you access that data with the Python API, you're just getting that post-processed data, not the raw imaging stacks. Those are the sort of multiple-hundred-gigabyte, or terascale, data, and that's what we needed the Snowball for, because... We actually made that data available through Amazon at Friday Harbor last year for students to poke at that raw imaging data, but it's a really, really significant volume. And also, if anybody requests it, they do send us a disk and we put the data on it and send it back to them. That's another way that we handle our large data sets.
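For readers unfamiliar with that pre-processing, here is a schematic numpy version of neuropil subtraction and dF/F for a single cell. It is the textbook form of the calculation, with made-up traces and an assumed correction factor, not the Institute's production algorithm.

```python
# Schematic neuropil subtraction and dF/F for one cell -- the textbook form
# of the calculation, not the Allen Institute's production pipeline.
import numpy as np

def dff(roi_trace, neuropil_trace, r=0.7, baseline_percentile=10):
    """Neuropil-corrected dF/F for a single fluorescence trace."""
    corrected = roi_trace - r * neuropil_trace           # remove contaminating neuropil signal
    f0 = np.percentile(corrected, baseline_percentile)   # crude global baseline; real pipelines use rolling baselines
    return (corrected - f0) / f0

# Made-up traces standing in for real two-photon imaging output.
rng = np.random.default_rng(1)
t = np.linspace(0, 60, 1800)
roi = 100 + 5 * rng.normal(size=t.size) + 30 * (np.sin(7 * t) > 0.995)
neuropil = 40 + 2 * rng.normal(size=t.size)

print(dff(roi, neuropil).max())
```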

33:53 Michael Kennedy: Oh, wow, okay! That's really interesting. Cuz downloading a terabyte...that's going to cause all kinds of problems, I mean...even just paying for that much bandwidth...that's like $90 of bandwidth at AWS.

34:04 Panelists: And it also begs the question what you're going to do with that data when you get it. Not saying that a researcher wouldn't know what to do with it but it takes a lot of time and a lot of effort to extract signal out of that data. And that's sort of...I wouldn't call it a service we provide but it's part of our institutional work, to develop the algorithms to do that so that people don't have to retread that wheel constantly.

34:29 Michael Kennedy: Right, just pay the computational cost of trying to compute that, cuz it's got to be pretty high.

34:34 Panelists: A computational cost, I also think it's the human cost. It takes a very specialized set of skills to be able to computationally extract the meaningful data in those raw imaging stacks. We have a really world-class algorithms team that does a lot of that pre-processing for you so you can jump straight into this sort of... The data that you might think of as the really relevant class of data. What was the activity of the cell, not what was seen by the microscope. Those are two different data dimensionalities.

35:09 Michael Kennedy: This portion of Talk Python to Me has been brought to you by Rollbar. One of the frustrating things about being a developer is dealing with errors, ugh! Relying on users to report errors, digging through log files trying to debug issues, or getting millions of alerts just flooding your inbox and ruining your day. With Rollbar's full stack error monitoring, you get the context, insight, and control you need to find and fix bugs faster. Adding Rollbar to your Python app is as easy as pip install rollbar. You can start tracking production errors and deployments in eight minutes or less. Are you considering self-hosting tools for security or compliance reasons? Then you should really check out Rollbar's Compliant SaaS option. Get advanced security features and meet compliance without the hassle of self-hosting, including HIPAA, ISO 27001, Privacy Shield, and more! They'd love to give you a demo. Give Rollbar a try today. Go to talkpython.fm/rollbar and check them out. How many people work there, at the Allen Brain Institute?

36:08 Panelists: I think we're pushing 400 or so in total. In brain science, which is our kind of corner of the Allen Institute as a whole, we're the largest chunk, and I think we're closer to 300... 250 or 300, somewhere in that ballpark.

36:21 Michael Kennedy: That's a lot of expertise packed in that area, right? So, Corinne talked about people shipping you disks and sharing that data in some really interesting ways and I think that leads into one of the missions at the institute which I thought was really powerful, it's...you guys are committed to the open science model within your institutes. Do you want to speak to that a little?

36:43 Panelists: Absolutely. In academia, things are generally done in smaller labs and oftentimes you have a lot of difficulty reproducing individual experiments that happen there. Brain science is really hard. It's really hard to figure out what's going on in the brain, and I believe that when these projects were started it was like, what space is really not being covered by the neuroscience community, academia, and pharmaceuticals combined? That was being able to reproduce large sets of data, making them standardized, and making everybody able to reproduce the results that you get, so that you can kind of come to an agreement on ground truth and not be trying to reproduce other people's experiments. So we are one of the only institutions in that space that has done this. And now, other institutions are also trying to kind of follow suit, cuz we've all recognized that we're really trying to solve this reproducibility problem and also deal with just the huge amounts of data in the brain.

38:02 Michael Kennedy: How many different data centers have to be set up to do basically the same processing, right? Like if you guys could do that processing, share that data, and not have every university set up their own equivalent computing structure to do the same thing, that would be good, right?

38:18 Panelists: Exactly! And, I know that the next step in the process now is that we have projects where you can apply to have your scientific study done in our platforms. We might be heading more towards that, you know? We've set up this huge infrastructure, people will apply to have what they think is interesting done here.

38:38 Michael Kennedy: It's a little bit of the computing close to the data type of thing.

38:42 Panelists: There's a lot of different dimensions to open science, right? We've been talking a lot about open data, and we've talked a little about open software, which is open source software, which is another aspect nowadays of open science. Science has really kind of come to depend upon the software that implements science. And then there's also, of course, open access, right? What we traditionally think of as the final work product of science is the paper that you publish, and there are free pre-print archives and free access journals. So my point is there's a lot of different dimensions along which you can talk about open science, but open source software is something that myself and the technology team really think about a lot. We've talked about the AllenSDK and I also mentioned the Brain Modeling Toolkit. I don't think I necessarily mentioned that both of these are open source packages, and we accept pull requests on them and we respond to GitHub issues, and there's a large backlog because there's a lot of work to do and we support a lot of different scientific projects at a very large scale. Open source software development has become something that the institute has really come to embrace. I think the community has started to recognize just how critical it is to share our algorithms, share our processing codes, share our analysis tools.

40:03 Michael Kennedy: I think that's such a great mission, and I think partly you folks have a slight advantage over, say, Stanford, Rutgers, the other universities, because you don't depend on the publish-or-perish model, right? At least that's my understanding from the outside, right, it's not like-

40:23 Panelists: No, that's exactly... It is very important for us still to publish our data and to still communicate to the broader scientific public in that realm. But yeah, our incentives are a little bit different. I mean I think a good example of this is our brain observatory. We started releasing data in the brain observatory in May 2016, and we've had two or three more releases for this. This is the data set I was talking about, 40,000-some-odd neurons. We haven't published a paper on this yet, whereas in the rest of... Even in the open science community and open data community, the debates right now are, "Okay, well, how soon after publishing the paper do we release the data?" Do we release it immediately on publication? Do we wait six months or a year to give the primary author time to write a second publication before they move forward? And these are important debates because that's the way that the incentive structures are in academia.

41:30 Michael Kennedy: Nobel Prizes are handed out on this basis. Things like that, right?

41:33 Panelists: Exactly. We've released this data and there are already 12 pre-prints that external people have written and posted. I think two of them are peer-reviewed and published on the data already, and we haven't even written our own paper analyzing this yet. It's in the process, we have not yet published our own paper dealing with this data. So yes, we have a very strong kind of data first, publish later kind of model that you just can't do in the current academic infrastructure. I mean there are communities that have sort of demonstrated that this is possible. The machine learning community is the one that I always think of as really jumping out there early with the latest developments of the algorithms and the approaches, and they are starting to make it, but I think it's a real big culture change for sort of more... I almost want to say entrenched fields, but you know, the biological sciences have been around for a long time, and the publication methods are the way they are for a reason. But there's a cultural change; the three of us have only been out of our PhDs now for, you know, less than a decade, and we are definitely sort of seeing this enormous change, and it's fun to be at a place that really sort of embraces that cultural change.

42:47 Michael Kennedy: You guys seem to be at sort of the leading edge of that-

42:50 Panelists: I'd like to say just quickly that we should acknowledge that this is all possible because of Paul Allen's generous donation to us. I mean he basically makes all of this possible and it's really the only place in the world that you can do this, so... Kudos to him, and his vision.

43:09 Michael Kennedy: Absolutely. He's got the Brain Institute and there's a couple of other ones as well, right? Now you also have the Allen Institute for Cell Science and the Allen Frontiers Group,

43:26 Panelists: And the Allen Institute for AI. They're not hosted in our building, they're in a separate building, there's an artificial intelligence group.

43:31 Michael Kennedy: Absolutely, it's great to acknowledge what he's doing cuz it's... It sounds really unique and special and it... We were just talking, it lets a lot of you work in a way that is sort of a better fit for the logical sort of career goals necessary, right?

43:49 Panelists: And just to follow back on that too is that we do want to make sure that we publish. We do have external advisory boards, and we do apply for grants because we want make sure that what we're doing in our space is relevant to the community. We don't want to be in this one-off ball where we somehow discover something that's not relevant, or you know what I mean? We want to make sure that what we're doing is valid.

44:14 Michael Kennedy: The whole peer review process is still, still pretty interesting.

44:17 Panelists: Yes, still very valid for us yeah.

44:19 Michael Kennedy: Right, exactly. So, in 2013, President Obama came out with the BRAIN Initiative, where BRAIN is actually an acronym as well. How did that affect you guys? Did it make any difference for you or in the community?

44:34 Panelists: I'm pretty sure that up in our cafe we've got a little letter signed by President Obama hanging on the wall up there.

44:43 Michael Kennedy: That's pretty cool.

44:44 Panelists: Related to these efforts. It's just great to see that more and more people are seeing how important this is, especially from a government point of view, that, you know, a president has realized that this is one of the biggest frontiers that we really have left to solve. We know so little, and so it's really great to see that the community as a whole is investing in this sort of research. From my perspective, I feel like one of the most important things was the recognition at the federal, nation-state level that brain science is maturing into the type of thing that needs large data centers, it needs large sharing and collaboration tools, it needs large investigations to really start to make a difference in people's lives. The science, the community, has matured to the point where we can really start taking that standard forward and making a big impact.

45:41 Michael Kennedy: That's awesome. And with the more open science and open source projects, it's seems like that'll just amplify as people can work better together.

45:49 Panelists: To be honest we couldn't do this without the open source software community. I was reflecting on this just the other day. I don't have the name of the tweet, but I know a researcher with the first name Jessie tweeted a picture emphasizing that matplotlib, numpy, and pandas collectively are supported by 12 full-time developers. When I think about the amount of science, both my own science, our institute's science, neuro and then, you know, the rest of science generally, it's a huge edifice of work that's so critical to our national interests and to our interests as humans, and to see that much work supported by so few people, but such a dedicated core group of people, it really struck a chord with me.

46:32 Michael Kennedy: It's amazing to think of how just these small initiatives became such a foundation for what everyone is doing. I was happy to see that the NSF recently gave like a three million dollar grant, or something like that, to the SciPy group, and NumPy. So it's starting to get a lot more support. I might have the exact details a little bit off, like the number or whatever, but there was basically a big NSF grant to those groups to keep that going stronger, cuz I think they realized exactly what you're saying, like all of these researchers, all these data scientists are day-to-day-to-day going, "Yeah, we have a terabyte of mouse video, and we're going to give it to scikit-learn," and then, you know, like, well, we need to make sure that scikit-learn keeps working.

47:18 Panelists: I remember seeing a call for some funding from Kenneth Reitz for requests, you know, the Python requests module. It was like a month ago and he's like, oh, our goal is $3,000, and I was like, $3,000 to support requests? The number one most used Python module in the entire ecosystem, man, that's like pennies. Think about the return on that, that is pennies on the dollar.

47:44 Michael Kennedy: It's unbelievable.

47:46 Panelists: It's unbelievable. I don't know the exact numbers. He had it on his website, it's probably still there. But it's like downloaded seven million times a month or something. It's not just a little bit. It's also very nice to see that funding organizations as a whole are starting to recognize, too... One of the things that we've struggled with is just how much infrastructure goes into doing large data analysis. Ten years ago this wasn't a thing. Nick was saying that we're all within a decade of our PhDs, and we're all really working really hard to figure out how to process large sets of data, how to make that work, how to, you know, transfer our research code over to more production-like code, and just how much infrastructure it takes. Whereas a lot of times people don't understand that, actually. So it's really nice to see that, you know, government funding organizations are starting to realize... Because the people on the ground doing this work are really, like, shouting, "We need the researchers for this, this takes a lot of time and this is so fundamental for the mission of this science."

48:53 Michael Kennedy: Do you think people are being taught those skills in grad school these days?

48:58 Panelists: Your question is so appropriate. I don't have my finger firmly on the pulse. I can only speak to my sort of small window into the University of Washington, right? I still maintain some contacts with my former advisor and his research group, and I know it's definitely on their minds. And I know at the graduate level, like in the University of Washington's eScience group, they talk about these types of things. Yeah, Jake Vanderplas and the group over there doing eScience, that's something special, as well. That's a really cool place. But I think UW is really forward-thinking in that respect. As far as its adoption... let me give a personal anecdote to drive it home. When I came to the Allen Institute in 2012, I had just finished my PhD, and I took a research scientist position and quickly realized the incredible amount that I could learn from some of my colleagues in the technology team. I'd never heard of a unit test, like, we're going to test code? What? Why would you test code? It works!

49:53 Michael Kennedy: I already tried it, yeah!

49:55 Panelists: And then I sort of got my eyes opened to the way that they approach their work, and then I sort of fell into the well. I joined the group eventually, but I definitely want to see the same sort of epiphany happen to scientists not only at the graduate level, but also the undergraduates, you know? Especially given what I've seen of the talent of some of these undergraduate scientists that are getting trained in disciplines that didn't exist when I was an undergraduate, and to see them start to take the tools that were developed, really, for industry... and of course, we have visibility into open source software... and take those and apply those to their research and build it into the DNA of how they work and think, that's only going to amplify how open source software contributes to science for the next generation.

50:42 Michael Kennedy: It's exciting.

50:43 Panelists: The lab that I came out of, I finished up my PhD about two years ago, and the lab signed up for a GitHub account and started doing version control shortly after I switched over to Python. It was sort of myself and another student who had done an internship at Google between undergrad and grad school, and between the two of us he kind of brought some best practices. He's like, "Well, this is the way they did things." And we started getting stuff under version control, and I still get pings on changes to that repo. We kind of laid a foundation in that lab, I mean that group as a whole. I think he's still there, but there's a whole new kind of cohort of students in that lab that I don't know, and they're doing research code development there in a very different way than when I entered that exact same lab.

51:38 Michael Kennedy: Right, and to them that's just how you do it.

51:40 Panelists: Exactly.

51:44 Michael Kennedy: Cool! All right well, I want to be cognizant of your time. I could keep talking for a long time cuz there's so many things to explore.

51:49 Panelists: Well let's do it again?

51:50 Michael Kennedy: We could do a follow up sometime, absolutely. But this is super-fascinating. I think we'll leave it there for the brain science aspect. So let me just ask you all the two questions, and since there are three of you, we'll go kind of quick. First of all, if you're going to write some code in Python, Justin, what editor do you use?

52:07 Panelists: I'm loving Atom right now. I kind of prototype in Jupyter, in JupyterLab actually now.

52:13 Michael Kennedy: That's starting to take off.

52:15 Panelists: Most of my actual packages are in Atom.

52:18 Michael Kennedy: Nice, Corinne?

52:19 Panelists: I'll either do some old school writing in Emacs and then running it on the command line, or if I'm using an editor with a debugger, I'll use Eclipse with PyDev.

52:28 Michael Kennedy: Nice, right on. Nick?

52:31 Panelists: Emacs when I'm remote and VS Code when I'm local.

52:35 Michael Kennedy: VS Code's really taken off.

52:39 Panelists: I've just been tremendously impressed, especially as it's matured as an open source project and when the updates come in they are timely and they are squashing bugs that people report. It's awesome to watch.

52:50 Michael Kennedy: They've got a lot of momentum, for sure. Notable PyPI package. Maybe there's some package that people should know about, not necessarily requests cuz it's the most popular, but something where you're like, "Oh, I saw this thing the other day. It's really amazing, you should know about it." Nick?

53:05 Panelists: Let me take a pass and come back at the end, I want to get a good one!

53:09 Michael Kennedy: All right, Corinne?

53:10 Panelists: Well, I really use very standardized packages just because I want to stay away from people having to install and use immature code, and the things I use the most are NumPy, SciPy, the stats package, stuff like that.

53:26 Michael Kennedy: Justin?

53:27 Panelists: One of the things I've been impressed with recently is Cookiecutter. Kind of speaking of onboarding, we work to bring kind of newer Python folks into good practices of testing, tooling, documentation. Helping folks who have a little less knowledge of what a full-fledged package should look like, with a nice template, has been absolutely invaluable.

53:52 Michael Kennedy: Yeah that's a great idea, that's very very helpful to just run a single command and poof you got all the structure you're supposed to have. Nick, you got one?

53:59 Panelists: I got one, yeah. Bokeh. It's the Continuum visualization package. I've been using it to build dashboards and widgets for doing analysis tooling, and I just can't say enough about it. The community that has grown up around it has just been so responsive, and the power of that tool as it matures into the 1.0 release, I'm just so excited to see where it goes 'cause I use it daily and I love it.
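For anyone who hasn't tried Bokeh, here is a minimal sketch of the kind of interactive plot those dashboards are built from. The traces are made up, and the calls shown are the basic bokeh.plotting interface from recent releases, not anything specific to the Institute's tools.

```python
# Minimal Bokeh sketch: plot a few made-up activity traces to an interactive HTML page.
import numpy as np
from bokeh.plotting import figure, output_file, show

t = np.linspace(0, 60, 1800)
traces = [np.abs(np.random.randn(t.size)).cumsum() * 0.01 for _ in range(3)]

output_file("traces.html")                     # write the plot to a standalone HTML file
p = figure(title="Example cell traces",
           x_axis_label="time (s)", y_axis_label="signal (a.u.)")
for trace in traces:
    p.line(t, trace)                           # one line glyph per trace
show(p)                                        # open the page in a browser
```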

54:23 Michael Kennedy: That's awesome. All right. Well thank you so much, those are all great choices. I guess I'll give you all, whoever wants to jump in and add something here, a chance for a final call to action. If people want to work with the Paul Allen Brain Institute or get involved with some of the tools or things you've talked about, what do they do, where do they go?

54:40 Panelists: I think definitely for your users, or not your users. Definitely for your listeners, I think that our GitHub page, so we've got github.com/alleninstitute, and we've got a bunch of different packages, everything from our production things, like the AllenSDK, to smaller packages that individual people are releasing, like neuroglia, argschema, we've got a couple of kind of things in the Python world. As well as research code and packages that are affiliated with research projects. So there's a bunch of stuff there that's a whole lot of Python.

55:17 Michael Kennedy: Nice.

55:17 Panelists: So our GitHub page will have lots of great examples of how to actually utilize the data that you can download too. Which, if you want to browse around all the data, go to our website and you can see the massive plethora of data that we have there that's available for everybody. And one particular package, because it's so close to my heart, the AllenSDK. It's really the sort of one-stop shop to get your hands dirty digging into our data. It should work, just pip install, and if it doesn't, open an issue and assign it to me and I'll tackle it as soon as I can. And for any of our research that we've got going on here, we're on Twitter, we're on Instagram, and we've got a bunch of job openings too, I think for some software developers: alleninstitute.com, and there's a button somewhere for careers. So, yeah, there's a lot of fun stuff happening.

56:04 Michael Kennedy: Awesome, yeah. It sounds super exciting. Thank you for sharing this view into what you're all up to.

56:09 Panelists: Thank you so much for having us. Thank you. Thank you for having us.

56:12 Michael Kennedy: Bye. This has been another episode of Talk Python to Me. Our guests on this episode have been Justin Kiggins, Corinne Teeter and Nicholas Cain. And this episode has been brought to you by Cox Automotive and Rollbar. Join Cox Automotive and use your technical skills to transform the way the world buys, sells and owns cars. Find an exciting, technical position that's right for you at talkpython.fm/cox. Rollbar takes the pain out of errors. They give you the context and insight you need to quickly locate and fix errors that might have gone unnoticed until your users complain, of course. As Talk Python to Me listeners, track a ridiculous number of errors for free at rollbar.com/talkpythontome. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course or our brand-new 100 Days of Code in Python. If you're interested in more than one course, be sure to check out the Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, Google Play feed at /play, and direct RSS feed at /rss on talkpython.fm. This is your host Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code.
