
Transcript for Episode #238:
Collaborative data science with Gigantum

Recorded on Thursday, Oct 17, 2019.

0:00 Michael Kennedy: Collaborative data science has a few challenges. First of all, those who you're collaborating with might not be savvy enough in the computer science techniques, for example, Git and source control or Docker and Linux. Second, seeing the work and changes others have made is a challenge, too. That's why Dean Kleissas and his co-founders created Gigantum. It's a platform that runs either locally or in the cloud. And it spins up data science environments into Docker containers seamlessly on your local computer. And it syncs collaborative updates from machine to machine. This is Talk Python To Me, Episode 238, recorded October 17th, 2019. Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm. And follow the show on Twitter via @talkpython. This episode is brought to you by Linode and Tidelift. Please check out what they're offering during their segments. It really helps support the show. Dean, welcome to Talk Python To Me.

1:14 Dean Kleissas: Hey, how's it going?

1:14 Michael Kennedy: Hey, it's going really well. It's good to have you here.

1:17 Dean Kleissas: Yeah, it's great to be here.

1:17 Michael Kennedy: Yeah, it's going to be a lot of fun to talk about repeatable data science and collaborative data science and stuff like that. And you and your co-founder have put a ton of work into turning this into some platform that folks can use. And that's really great. So we're going to talk all about that. Before we get into it, though, let's just start with your story. How'd you get into programming in Python?

1:38 Dean Kleissas: I was pretty fortunate early on. My parents got us a computer. So I've always kind of had computers around. I remember battling with my parents to get dial-up in the house, and that was a huge deal. I definitely was in the AOL era, getting home from school and jumping right on, so, like everybody.

1:57 Michael Kennedy: Yeah, of course. And the worst was when somebody would pick up the phone, right? You're in the middle of something, and they wouldn't know. They pick up the phone in some other part of the house, and it would kill your connection. You'd be like, come on, I was downloading something. It took an hour to get a megabyte, and you just killed it.

2:11 Dean Kleissas: That's why definitely overnight, you were definitely downloading stuff, you know? And we definitely were the one phone line still, like everyone. I had my GeoCities website packed with under construction gifs, all the staples. So that kind of got me going, web development, poking around, if you would call it that then. And in high school I did a bunch of random programming classes, and through college, it was more like dabbling here and there. I took, I had a Visual Basic thing, some Java. I was doing mechanical engineering at first and started doing a lot of MATLAB and then picked up electrical engineering at the University of Rochester. And that's kind of where I started getting a little bit more into programming as more of a serious thing and instead of just something I poked around at. And really, because of my initial career path, I started at Northrop Grumman for a little while before I went to Johns Hopkins, was doing a lot of MATLAB. And it kind of got me started. And then started using Python and was just like, wow. This is amazing.

3:11 Michael Kennedy: Never looked back, right? Why was I using MATLAB?

3:13 Dean Kleissas: Never looked back. And so, yeah, especially when I was at Johns Hopkins, at the applied physics laboratory, I was there for a while, did a lot of Python. Mainly, mostly Python development there.

3:23 Michael Kennedy: What kind of stuff were you studying there?

3:24 Dean Kleissas: When I was at...

3:25 Michael Kennedy: At Johns Hopkins, yeah.

3:26 Dean Kleissas: So like I said, I did my undergrad at the University of Rochester in Mechanical Engineering and Electrical and Computer Engineering. And then I did a master's at Johns Hopkins focused on robotics and control. And then when I was at the Applied Physics Laboratory, that was as a staff member, an engineer there. My time there was in this group called the Intelligent Systems Center, and it's this really wild place. They took a bunch of neuroscientists, computer scientists, a bunch of roboticists, smashed us together in a building and hoped for novel things to happen. And it was really cool, we did a bunch of really interesting things with a lot of external collaborators, so working on a lot of applied neuroscience, doing some high resolution brain mapping, some things with the hospital, doing some natural language processing and machine learning in the clinical setting for quality improvement. So I got to do a lot of really interesting stuff and almost all of that was in Python. And we would build these pretty large, production quality systems. Often we would do it in Python, and then there would be little bits that were really computationally intensive or whatever, and that little bit would be written in C with a ctypes binding. And everything else is just lots and lots of Python, so.

4:35 Michael Kennedy: Yeah, that sounds like a ton of fun. I worked in some research groups with a bunch of cognitive scientists before and there's a lot of computation for what, to me felt almost like psychology and stuff. But it turns out there's a lot of computational stuff going on in those research groups in that discipline.

4:55 Dean Kleissas: If you're collecting some sort of data, you eventually need to do something with it. And this is true across all fields. Recently I was working more and more with people in the social sciences, so people doing food science, people doing economics, obviously. But these fields that you don't traditionally think of as hard sciences, as being so data-driven and data science-driven, are just getting into it too, and they have all the same problems.

5:22 Michael Kennedy: Yeah, they are. One of my daughters is in her second year of college and she's studying psychology, and she sent me a text yesterday that said, "Dad, do you know what RStudio is?" I'm like, what? She's like, "Yeah, we're using this for my psychology class." I'm like, oh okay, that's pretty weird. Too bad it's not in Python and Jupyter. But still, it's interesting how prevalent this type of computing, that platform that you build for and that we're talking about, is across all the disciplines, right? When I think of who should become a developer and get these skills, for any language but especially Python, I often say, and maybe you've heard me say this, that Python can be a superpower for what you actually do. It doesn't have to be "I am a Python developer." It can be, "I am an economist and I'm really good at it 'cause I know Python and Jupyter," or something like that. And I see that spreading quite a bit.

6:17 Dean Kleissas: Absolutely, the ecosystem has gotten so rich and the tooling has gotten so good. Really it's opening to everybody. And we see the split between Python/Jupyter and RStudio. And it's often aligned with fields. Like some fields use RStudio more, some fields use Python more, and so it's interesting to see how that all has played out as well.

6:41 Michael Kennedy: Yeah for sure. I think there are some good exchanges between the two environments as well. So, when you're doing this research, one of the things that was a bit of a motivation for you to create this project that we're going to talk about, is that you had folks from all these different skill levels, from all these different backgrounds, these different specialties trying to work together, right? And that's got to be a problem, or challenging, rather.

7:05 Dean Kleissas: Yeah, so that experience, six or so years at Johns Hopkins doing these research projects and helping run these research projects, really was, you feel this pain that makes you want to do something about it, right? And there were a couple issues that we kept running up against that became very obvious after we spent loads of energy doing something. And it's like, there's this issue with asymmetries, right? So you have people that want to work together that have very different skill sets, and they also care about different things about a project. And so, how can you make these people work together? When it comes down to the technical asymmetries, it's like, even if you know how to use Git, or you know how to use Docker, or you know how to use some tool that makes your work better, if the person on the other side of that exchange doesn't know it as well, all the energy you spent is potentially wasted because you just can't work together. And as you start building, we're seeing that science is getting harder, as people are starting to say, discovery's getting harder. Teams are getting bigger, teams are getting more heterogeneous to solve more complicated problems, and it just compounds this issue of asymmetries. Everybody's contributions on these projects are important, that's why we're all working together, and so if we can make it easier for people to work together, to not need to learn every single complex tool to be able to see something or interact with something or contribute, it's just going to be better. And we definitely were going that route to start, of, everyone will just learn these complicated things, this is the way you do it, why aren't you...

8:39 Michael Kennedy: You could be so much better off...

8:41 Dean Kleissas: People should be building these Docker images.

8:42 Michael Kennedy: Exactly if they just learn Docker and Kubernetes, what's wrong with them?

8:46 Dean Kleissas: Yeah, I know they have a day job, that's fine. I know you're a doctor, but you need to learn how to run this Docker container so you can run this training for me, right? It took a while to learn firsthand that everybody's time is important. Everybody's busy and you want us all to move as fast as we can together. That was a big motivator for a lot of the decisions we made. In doing this, a lot of these collaborations were coming from this angle that we were pairing with biologists, with neuroscientists. So our first big thing, we started working with this lab at Harvard, run by Jeff Lichtman and Bobby Kasthuri and a bunch of people there doing awesome high resolution brain mapping stuff. So they're using electron microscopy to actually map the brain at single synapse resolution. So you can see every single neuron, and every single synapse made. And that whole field in general has exploded, they've got all this awesome stuff coming out now out of this program called IARPA MICrONS, which is what we worked on as well, where they imaged a cubic millimeter of brain tissue, which is like two and a half petabytes of image data, right?

9:50 Michael Kennedy: And it's that small, right? It's like a cubic millimeter, which is actually an incredibly small part of the brain.

9:56 Dean Kleissas: Correct, and this was done at the Allen Institute in Seattle; that's where that imaging was done. In 2011, when we started that, it was this one lab at Harvard imaging it, we'd go up there, the data's on a hard drive, and it was the beginning of, can we apply some engineering to the science and make it better? It was an interesting collaboration and we built some database systems that were optimized for this type of thing. We learned a lot. And it was that idea of, can we apply these things we know from software development and from proper engineering, and help apply them to the science? That was a very obvious thing that needs to happen in data science, right? Like software engineering has done all these awesome things, data science needs some of that help. So that was one of the founding things we wanted to do as well.

10:40 Michael Kennedy: Right, because we definitely have a lot of folks coming into the data science world not specifically from the computer science side, right? They're coming from all these other fields, as we've spoken about, and it's probably a bit of a stretch just to be doing the programming at all. Much less go, yeah, we're going to learn about refactoring and unit testing and Docker and all these other things. And so, with this big influx of folks, it seems almost like there's two paths. One path is to say, well, we're going to all work together, and so we're going to use the simplest, most lightweight, least structured world possible: we're all going to install Anaconda, we're all going to run Jupyter, and maybe we'll use version control, right? 'Cause we've got to share the files, maybe we'll use something in the cloud. And that's one option. Another one is, like what you guys went after, which is to build a platform which makes those things transparent, but also actually does a lot of that stuff, like versioning with Docker so you get exactly the same stuff over time, and things like this, you know?

11:47 Dean Kleissas: Yeah, so there's all these things that are required to do good data science that aren't actually data science. And so, it's unfair to expect these people to be required to know it. Over time you will learn. Right now, getting into data science is like a step function, right? You don't just slowly start, you have to make this leap. Chris Holdgraf had this quote on Twitter the other day from the Pangeo conference that I thought was perfect. It was, "Learning data science often feels like needing a pair of scissors to open a package of scissors." It's like, this is what we expect from people, and so what we really wanted to try to do is, can we make this more of a ramp? Can you click a button and get started, and that's fine. And then eventually you're like, oh, I see this is doing Git stuff, and I just merged something. And you can learn what that means, and you can get under the hood and you can write your own Docker snippets that get inserted in your Dockerfile if you want, but if you want to use a package manager, you can click some buttons and, with time, you become more skilled. And it lets the, let's say, asymmetry map be a little bit less crazy.

12:53 Michael Kennedy: So that's a great segue, I think, over to this project that you guys built called Gigantum. It's sort of the project to solve this problem that you've laid out for us, right?

13:02 Dean Kleissas: Yeah, so we had the opportunity to get a little bit of money and try to start this, so we were able to quit our day jobs and put 100% effort into working on this project.

13:12 Michael Kennedy: Sounds like a dream, honestly.

13:14 Dean Kleissas: It kind of was. We spent so much time working on this project at Hopkins that was just coming to its peak. I left right around this time and it was kind of hard to do that. But like you said it was just perfect to go and have an opportunity to do this thing you've been whining about for six years, that someone needs to go fix this thing. And you finally have an opportunity to maybe contribute and do it.

13:33 Michael Kennedy: You can work on one project like at Johns Hopkins and make that project great. Or you can work on something like this and you could sort of meta-solve it for the world, right?

13:42 Dean Kleissas: Right. So, fingers crossed, that's the plan.

13:45 Michael Kennedy: That's the plan, all right so tell us about it.

13:46 Dean Kleissas: Yeah, what we built was this suite of tools that helps people with different skill levels do transparent, reproducible data science way faster and easier than they could, through automation and ergonomics around, like I said, all those things that you need to do good data science but aren't data science. So, the core of it is this thing we call the Gigantum client. It's an open source, MIT-licensed, basically a web application, so you can run it anywhere, and it manages a data science project or a dataset. And when I say a project, that was another thing we felt needed to be solved first: there's no real currency in data science. There's no standards, there's no way to make an exchange with another person, because there's no standard way that you represent your environment or your data or organize things. There's no easy way to pack it up and ship it.

14:38 Michael Kennedy: Yeah, 'cause you have a Jupyter notebook, and that's potentially not enough; you also need the libraries that notebook is going to depend upon, so something like a requirements file. If you have data, right? You've got to package that data. Somehow all these things need to go together, right? Like the readme about it, and so on, yeah?

14:56 Dean Kleissas: To make a project that's reproducible but, more importantly, transparent, so somebody could understand what you did, why you did it, how you did it, you need to bundle together the code, the data, the environment, like you said, a readme, the work history, like who did what, when. And you need to bundle that together and you need to track that and make it automatic. That's what we did. So the big thing that the client does is it tightly integrates with Jupyter and RStudio, and this interface lets you upload your data, drag and drop your code, build your environment. So you configure your environment, you can drag and drop a requirements.txt file if you've got that. You can use package managers like pip, conda, apt. Or you can write custom Docker if you want to get under the hood. It's really important that it's that flexible, 'cause not every tool people need to use is sitting in some package manager. And so the ability to be able to build whatever you want was really important. And by putting it all together, the client then versions this all in unison, in lockstep. So at any point in time, you can roll back and get the same environment, the same view of your data and code. And because of these tight integrations, what's really cool with what we've done is, as you are running your code, so you're writing in your Jupyter notebook, you execute some cells, we will detect that that happened, we will automatically create a version, we will automatically generate some metadata to let you know what you were doing. If you generate a figure, we kind of extract it and compress it and save it, so that you can get this visual history of what you've done as well. So it's not just about making this thing easy to share, but it's making it a little more intelligible. You're not looking at some Git history where it was just like, ran my code, ran my code, fixed the bug, oh crap, what's going on? It's like, created this figure, changed these parameters, right? And that's going to get a lot better with time as we push on that.

16:49 Michael Kennedy: Yeah that's really cool. And it's worth pointing out I guess if it wasn't obvious to folks, everything you talked about so far runs locally on your machine, 100%, right?

16:59 Dean Kleissas: Right, so like I said, the client's designed to run wherever, it's just a web application. So you can run it on your laptop, you can run it on a server in Amazon. We have lots of users that like to do that, right? They run it on their laptop, they sync it to some GPU instance or whatever, they do some training, they sync it back to their laptop and keep on working, and only pay for a little bit of cloud time. You can also run it in our cloud. So you can click a project and play with it there, as well. It's easier, no installation. Because, like I mentioned, this does require Docker. So if you want to run it on your laptop, you have to install Docker on your laptop. That's gotten a lot easier.

17:36 Michael Kennedy: It's so much easier. It used to be really quite a "I hope I can make this work" sort of feeling, but now, on the Mac, you download the Docker Mac app, you double click it, it runs in the menu bar, and that's pretty much the extent of it, right? You don't really have to know more, so it's pretty easy.

17:53 Dean Kleissas: Yeah, and Windows is getting a lot better. There are some really awesome advances coming out of the Windows team. Windows Subsystem for Linux 2 is going to make Docker containers run essentially almost native on Windows. It's going to be great, and that's coming out.

18:06 Michael Kennedy: Maybe tell people about the Windows Subsystem for Linux, just what that is, real quick. They might not be aware of it.

18:12 Dean Kleissas: Sure, so there's a version now that you can get if you've got Windows that lets you effectively run a Linux kernel on Windows. Like, they're shipping a Linux kernel with Windows now, and you can install Ubuntu and fire it up right there in Windows. And it's not really using a VM in the sense that it was before. So if you install Docker Desktop, which is Docker's product that you should install when you're running on your laptop, it's got like a little app like you were describing. That's going to run in a virtual machine to give you that Linux environment. Windows Subsystem for Linux is pushing all of this way farther down, closer to the operating system, so that it's really almost native performance. And WSL 2, or Windows Subsystem for Linux 2, that's coming out, is just a much, much better version of that that lets you run Docker inside of it. Which is what's really exciting. And so...
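
A quick aside in code: this tiny check isn't from the episode, it's just an illustrative sketch of how a script can tell it's running inside WSL, since WSL kernels identify themselves as Microsoft builds in the kernel version string.

```python
# Illustrative sketch only (not from the episode): detect WSL by looking for
# "microsoft" in the Linux kernel version string, which WSL kernels report.
from pathlib import Path

def running_in_wsl() -> bool:
    version = Path("/proc/version")
    return version.exists() and "microsoft" in version.read_text().lower()

print(running_in_wsl())  # False on macOS/regular Linux, True inside WSL
```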

19:02 Michael Kennedy: Yeah, thanks for the side track, yeah so you were talking about it's pretty easy to run Docker on your machine now.

19:06 Dean Kleissas: Yeah, and we do, like, as we talked about with asymmetries, installing software on your computer is a thing that people don't really do as much anymore. And it's daunting, and so we do have a desktop app that helps walk you through the Docker install, configures our stuff, and just gets this all going for you as well. So you just download that, double click it, and it should hold your hand to the point where you've just got a Jupyter notebook open. And that Jupyter notebook is running in a Docker container, in a Git repository, for you.

19:33 Michael Kennedy: Yeah, when I was playing with it, that's how it worked. I downloaded the app, I ran it, it just said, "Please wait, we've got to download a bunch of data 'cause we're downloading Ubuntu," or something like that, and that was fine. I just chilled for a minute and, you're right, I was right there. The first thing it dropped me into was Gigantum, the web app, which has a little bit of a JupyterLab feel. It shows you your projects and your data and stuff, and then you can actually launch or create new projects, which then launch into Jupyter or RStudio and so on. This portion of Talk Python to Me is brought to you by Linode. Are you looking for hosting that's fast, simple, and incredibly affordable? Well, look past that bookstore and check out Linode at talkpython.fm/linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server with a gig of RAM. They have ten data centers across the globe, so no matter where you are, or where your users are, there's a data center for you. Whether you want to run a Python web app, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly-upgraded 200 gigabit network, 24/7 friendly support, even on holidays, and a seven-day money back guarantee. Need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations and more. Do you want a dedicated server for free for the next four months? Just visit talkpython.fm/linode.

20:58 Dean Kleissas: That's really the client, and like I said, we have these apps that help run the client for you and make it real easy, so you can run it on your laptop with our desktop app. You can run it on a server. We have a CLI, so if you prefer the command line, there's a little CLI version where you can just be like, "install Gigantum, run it," and that's good for when you're running on some remote resource. And then we have Gigantum.com, where there is the hub, which lets you sync and share your work and preview other people's work and play around. And that's how you can collaborate and move things around from computer to computer, find content, find examples. It's a very decentralized system in that you can copy things wherever you want. And then there's this central piece where, if you want to put your stuff in one place, that's where it goes.

21:42 Michael Kennedy: Yeah, and you can synchronize across that. So, a lot of cloud systems, you log in, you run your code there, and it just stays there, right? I'm editing there, you're in that cloud for this project or whatever. But at least the desktop client version, the way it works, you can synchronize it up to the cloud, and other people can download it locally and run it, and so on, right? Basically, behind the scenes, it seamlessly uses Git to keep everything in sync, right?

22:10 Dean Kleissas: Yeah, so we're using Git, and we're using Git LFS, which is Git Large File Storage. So when files get over a certain size, Git explodes, and this is a way to handle files that are a bit bigger. And when files get even larger, we use a different service that we've built, and that all is transparent. So you upload your data, we'll use LFS if we need to, we'll use regular Git if we don't, so that's managed for the user as well. But yeah, this idea of being able to sync versus, like you said, having your code right there in that platform, it lets you move it around wherever you want and put it on the right resource to do what you want to do.
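
For readers who haven't used Git LFS, here's a minimal sketch of what tracking large files typically looks like; the file patterns are hypothetical, and this illustrates standard Git LFS behavior rather than Gigantum's internal code.

```python
# Minimal sketch (hypothetical patterns, standard Git LFS convention): a
# command like `git lfs track "*.h5"` simply appends rules like these to a
# repo's .gitattributes, so matching files are stored as small pointers in
# Git and their contents go to an LFS server instead.
from pathlib import Path

rules = [
    "*.h5 filter=lfs diff=lfs merge=lfs -text",
    "*.npy filter=lfs diff=lfs merge=lfs -text",
]
Path(".gitattributes").write_text("\n".join(rules) + "\n")
```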

22:45 Michael Kennedy: Yeah, I was looking over at your GitHub organization for Gigantum and you have a bunch of different projects there, and it looks like the stuff that you're talking about is open source. Not necessarily the cloud, because a lot of SaaS providers, well, almost every SaaS provider, does not open source their SaaS thing, and that makes a lot of sense. But the Gigantum client, a lot of that stuff seems to be open source, is that the right reading of that?

23:09 Dean Kleissas: The client is open source, MIT licensed. Also our desktop application and our CLI are there. And we've got some other smaller packages we've built for various reasons as well. And that'll always be the case, right? The whole idea is this client, the actual workhorse of the whole thing, is free to use, lets you use it wherever you want for however long you want. And then, like you said, we do have this somewhat proprietary piece, which is our cloud infrastructure that lets you compute in our cluster and all of that sort of thing.

23:37 Michael Kennedy: Yeah okay, interesting. So let's maybe talk through creating a project, I think that'll give people a sense of what Gigantum offers and why they might use it and so on.

23:48 Dean Kleissas: Sure.

23:48 Michael Kennedy: So yeah, go ahead and take us through it.

23:49 Dean Kleissas: When you create a project, effectively you're creating this specially formatted Git repository with a bunch of extra information in it. So, you start by running the client, again, like we just talked about, you can run on your laptop, you can run it wherever, you can run in our cloud. You create a project and you choose a base. So all projects start from some base container. So we build a handful of them and have them available so you can use Python 2, Python 3, RStudio, Jupyter, JupyterLab, they're different configurations. We've got some that have CUDA support, if you want to do deep learning with GPUs, we have some that come pre-built with a whole bunch of data science packages. It's kind of like, choose what you want, there.

24:28 Michael Kennedy: Yeah, I could choose, for example, if I pick Python 3, it says, "Do you want the full-on data science workstation Docker image, or do you want the bare-bones one that just has Python 3 on it?" Why would I choose one over the other? First impressions are, maybe if I choose the full-on data science one, it's another gig of download or something, but why would you choose one over the other?

24:50 Dean Kleissas: Yeah, really it just comes down to disk space, right? The pre-baked one has a ton of packages that are this somewhat community-agreed-upon set of things that are core data science tools in Python. And so it's much larger. And then the minimal one is obviously smaller, and so, let's say, if you know what you're doing, if you're doing something where you only need a couple packages, then maybe you just install those yourself, but if you want to quick start...

25:17 Michael Kennedy: I just need requests and pandas, that's all I need, I know it.

25:21 Dean Kleissas: Yeah, then you don't need to rest of this, yeah absolutely.

25:23 Michael Kennedy: Yeah, cool. Okay, so you pick one of these base images, and technically it's a Docker image, but people don't really know that. They don't need to know that, they just need to know, I get a Linux operating system that has these capabilities, right?

25:37 Dean Kleissas: Yeah, I get a Linux operating system that has Jupyter ready to go for me, basically. And once you get the project created, you can augment its environment. So like you said, if you want to install pandas and requests, you click on an environment tab, and there's a little widget there that lets you enter the package name. If you have a specific version you want, you can give it; if not, it'll look it up for you. That's an important thing too: all packages that get installed are automatically pinned to a version. So there's none of this "just install the latest every time you run". No, it installed a specific version, and if you want to upgrade it, you can upgrade it right there in the UI, and it lets you know that there's an update available. But everything is pinned.
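
As a rough illustration of that "everything is pinned" idea, and not Gigantum's actual implementation, this sketch looks up the newest release of a package on PyPI's public JSON API and emits a pinned requirement line instead of "install latest":

```python
# Hypothetical sketch: resolve the current release of each package from
# PyPI's JSON API and pin it to an exact version.
import json
import urllib.request

def latest_version(package: str) -> str:
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["info"]["version"]

for pkg in ["requests", "pandas"]:
    print(f"{pkg}=={latest_version(pkg)}")  # e.g. "requests==2.22.0"
```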

26:13 Michael Kennedy: That definitely feeds into the reproducible science, reproducible computation side of things, right?

26:18 Dean Kleissas: Making things reproducible is kind of a side effect of what we wanted to do, right? Like, that should just be the default world we want to live in where we started talking about this as a reproducible work environment. It's not, "I did something and now I want to be able to do that exact same thing later so I'm going to do a bunch of work to make it reproducible." It's because I did this the way I did this from day one, I can, at any point in time, reproduce that. And even if you're not sharing with other people, your future self, like going back to a project you worked on six months ago is incredibly challenging. And so, where were those files? What virtual environment was I using? And just having that taken care of, it's just one less thing to worry about.

27:02 Michael Kennedy: Right, so the first time you go through this process and you pick a Docker image, obviously behind the scenes it's doing a Docker build, which might do a Docker pull and download some of the various container images. I would guess the second, third, and fourth time, those are just cached and nearly instant, so that's nice, right?

27:21 Dean Kleissas: Yeah, absolutely, so we take advantage of the Docker caching, so if you reuse the same environment, you'll be able to get going almost instantly. All that kind of stuff is really nice.

27:30 Michael Kennedy: Yeah, so it goes in and it creates the Docker image and it runs through all the startup, which is like pip install, conda install, the things that are associated with that. You can even watch it build; I watched the little installer screen 'cause I was just sitting there waiting, like, oh, let me just watch the build. Yeah, just a bunch of pip installs, just like I thought, right? So that's all pretty good. And then basically, once it's all done, it launches that and it drops you into a web app pointing back at localhost on some port, 8000, 8888, something like that. That goes back to this web view into your workspace, and that's really the view of Gigantum that people see, right? That's what they perceive it as, I would guess.
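
To make that build-then-launch flow concrete, here's a hedged sketch using the Docker SDK for Python; the image tag, build path, and Jupyter command are assumptions for illustration, not what Gigantum actually runs.

```python
# Rough sketch of the flow described above (assumed names and paths, not
# Gigantum's code): build an image from a Dockerfile in the current directory
# (cached layers make later rebuilds nearly instant), then run Jupyter in a
# container bound to a local port.
import docker

client = docker.from_env()
image, _logs = client.images.build(path=".", tag="my-project:latest")

container = client.containers.run(
    "my-project:latest",
    command="jupyter lab --ip=0.0.0.0 --port=8888 --no-browser",
    ports={"8888/tcp": 8888},  # then browse to http://localhost:8888
    detach=True,
)
print(container.short_id)
```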

28:08 Dean Kleissas: Right. So that's where you can see, if there's a readme, you can see the environment configuration, you can upload existing files, that's where you would drag and drop some notebooks or drag and drop a bunch of data. And it gives you that organization as well. We wanted to, not so much enforce this and say this is how you have to organize things, but at the end of the day, we decided it was important to start building some expectation around how things are organized. It not only makes building the software easier, but it builds this intuition with people: when you open up a Gigantum project, you know where the code is, you know where the input data is, you know where the output data is. At least you have that much. And then you're free to organize it however else you want, right? But that's how it's broken apart: you have these different bins you drop your data into, your code into. And then there's, we touched on it a little bit, this activity feed, so this is the history of everything you've done. So you'll see, if you add a package, there'll be a little entry that you changed the package. If you add some files, if you delete some files, if you run code. So everything you do, the system's constantly monitoring it, and it's automatically making Git commits for you under the hood. And that's visible in that project view as well.

29:22 Michael Kennedy: If you want to share it with people, you can push it to the cloud and give 'em a link, and they can download it, or you can invite them to it, or something like that. Does that include their history in your local version as well? What's the collaboration look like around that?

29:35 Dean Kleissas: Yeah, so if two people want to work together on a project, someone creates it, they publish it, and by default everything publishes privately. So only you will be able to access it; you then have to add a collaborator. And we've got a permission model where you can add somebody as an administrator, or they can have read/write or read-only access. And that limits what they can edit and what they can see, but if they're able to write to the project, they're able to run the code, make changes, change packages, whatever they want to do. And when they sync and you sync, we deal with that Git operation of fetch, pull, merge, push. So by clicking sync, we're doing all of that for you, and you'll see in the activity feed your changes interleaved together as well. So you'll be able to scroll and see, oh, they changed this package, or they added this dataset, or whatever.
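
Underneath, a "sync" is ordinary Git traffic. This is a minimal GitPython sketch of the fetch/merge/push sequence described above; the project path, remote name, and branch are hypothetical, and it skips the conflict handling a real sync would need.

```python
# Hedged sketch of what clicking "sync" boils down to (not Gigantum's code):
# fetch the collaborator's commits, merge them in, then push your own.
from git import Repo

repo = Repo("/path/to/my-gigantum-project")  # hypothetical local project path
origin = repo.remotes.origin

origin.fetch()                    # bring down the collaborator's changes
repo.git.merge("origin/master")   # merge them into the current branch
origin.push()                     # publish your local commits
```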

30:23 Michael Kennedy: That's interesting, so if they sync it down, locally, they jump on a plane and they're working on a project, maybe they changed some of the code, they make some comments, they edit the read me, whatever, and they get back and they push sync, it doesn't just sync like, well here's their commit message and their commit that actually goes into Git, but it actually syncs that activity back in like a richer way.

30:45 Dean Kleissas: Yeah, everything they did, you see. Which, you know, is interesting. We have more features coming around this particular activity feed around searching and filtering and changing views on that data. Because it's this rich set of information that we've never really had, to be honest.

31:01 Michael Kennedy: Yeah, like how are people expected to work with it? What questions do they have?

31:04 Dean Kleissas: Right, so one we want to be able to make real easy is, give me all versions of this figure or this cell, you know? So you can see as things change visually, instead of trying to somehow do git diff over and over or something crazy, right?

31:19 Michael Kennedy: Yeah, I like it a lot. The activity feed can get pretty busy 'cause it has executed cells and stuff like that, but yeah so maybe some filters for whatever you deem important. You guys define important like a figure was created, or code was changed or something like that.

31:34 Dean Kleissas: Yeah, that's definitely something we're working on improving is having a small lightweight model, especially around text output of saying what's important, what's not and making that very streamlined, so.

31:46 Michael Kennedy: Yeah, for sure. Well, let me think about this idea with you for a second here. One of the things that's cool is it launches into JupyterLab, right? And JupyterLab is nice, it's sort of the premier way to do Jupyter things these days, right? And it gives you more than just the notebook: it gives you GUI access to the file system, it gives you a terminal, it gives you a markdown editor, other stuff like that, right? So maybe I fire up my project and I'm like, oh, I realize I need, I don't know, some library, I need this version of OpenSSL, or I need to install, I don't know, something I've got to do on Linux. Maybe I drop into the terminal, which you can do from Jupyter, right? You can go and configure your environment and do all the things there, which is cool. What's the right flow? So maybe I've done that, I've created a project and I'm like, oh, I had to drop into the terminal and do this, now I want to share this thing with people. What is the right way to record that that happened? Should I go back and make an equivalent change in my Docker config in my project in Gigantum that would have the same effect as what I did in the terminal, and then I could push that out? Or do I create a new project and start over? What's the workflow there?

32:59 Dean Kleissas: Yeah. If you go in and edit that runtime environment, it's going to be lost, effectively. So you would have to go back to the Gigantum client and just go to the environment tab and add that package or whatever. This is a really common thing we've heard, that it's really annoying that you have to remember to do that. 'Cause that's a very common workflow, it's like, I'm in the middle of something, I don't know how to do something, I look it up and the internet tells me what to do.

33:25 Michael Kennedy: It was on Stack Overflow.

33:27 Dean Kleissas: Exactly. The internet says install this package and I need to do that right now, and that means I have to shut everything off, 'cause I need to edit this Docker container, so that's something we're very much thinking about making a better experience of being able to edit the runtime environment but then when you stop, have that edit persist back, because it's so common and it's such a common way that people drift their project, right? You start all good at day one, you write your requirements.txt file but then you just get into it and you drift, right?

33:58 Michael Kennedy: Right, and it might not even be a Python thing, it might be some apt install type of thing, right? Like, I changed this environment variable in Linux so that it would work.

34:08 Dean Kleissas: Right and so, how do we make that better? We've got a bunch of ideas that we're starting to work on. And one thing that has been really fun in getting into this project is going from being this researcher mind of like just building tools to get it done, to more like product development, so this is something we know we want to solve, but we've got ideas, but maybe our ideas are wrong. So we actually try to do some user testing, and that type of thing. And test it out before we build it, so more complicated features like this where, it's not obvious what the right answer is, takes a little bit more time 'cause you have to actually not just build what you think, 'cause you're not always right. So that's been really fun and interesting. It's definitely something we're going to be focusing on. 'Cause it's one of those things where people are like, "Man, this is really annoying." and those are the things you want to try to fix first.

34:57 Michael Kennedy: Yeah go where the pain is and solve it for people. That's almost always a good business opportunity, right? It sounds really hard to me though, like how do you capture what people randomly did as they flailed about on a command line in terminal? So I'm not really saying you must do this, I was just wondering, like if I were to correct my just bouncing around on the terminal, what would I do in that platform, right.

35:24 Dean Kleissas: Yeah, and so it would be, you'd have to go to the environment tab and edit it yourself. But it's interesting, because of these tight integrations and because of the architecture of Jupyter and JupyterLab, which is really cool in that, basically, under the hood they've got this pubsub architecture. So there are messages flying around for everything you're doing in the interface, and we can listen to those messages. And that's how we build this tight integration; that's how we know when you executed some cells and produced a figure. Under the hood in Jupyter, there is this pubsub architecture that's emitting messages containing what you're doing and containing the figure, and we're able to scoop all that up and analyze it. And that's how this auto activity is happening.
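
As a rough illustration of that pubsub idea, and not Gigantum's actual integration, this sketch uses jupyter_client to listen on a running kernel's IOPub channel, where executed code and display data such as figures are broadcast; the connection file name is hypothetical.

```python
# Illustrative sketch (hypothetical connection file; not Gigantum's code):
# every Jupyter kernel broadcasts its activity on the IOPub channel, so a
# listener can observe executed cells and any outputs or figures they produce.
from jupyter_client import BlockingKernelClient

kc = BlockingKernelClient(connection_file="kernel-1234.json")
kc.load_connection_file()
kc.start_channels()

while True:
    msg = kc.get_iopub_msg(timeout=60)       # one broadcast message
    msg_type = msg["header"]["msg_type"]
    if msg_type == "execute_input":
        print("cell ran:", msg["content"]["code"])
    elif msg_type == "display_data":
        print("output produced:", list(msg["content"]["data"].keys()))
```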

36:04 Michael Kennedy: I see. Okay so you probably could capture like an apt install such-and-such.

36:08 Dean Kleissas: Yep.

36:09 Michael Kennedy: Potentially right? Maybe editing files is tricky, but some of the command line options, yeah for sure. It seems like you could say, "Hey, it looks like you installed Nginx. Did you really do that? because if you did, we're going to need to put that into the config for Docker." 'Cause that's like a Linux-wide thing that you need to have possibly to keep working the way you are, right?

36:28 Dean Kleissas: Right.

36:28 Michael Kennedy: Nice. This portion of Talk Python to Me is brought to you by Tidelift. Tidelift is the first managed open source subscription, giving you commercial support and maintenance for the open source dependencies you use to build your applications. And with Tidelift, you not only get more dependable software, but you pay the maintainers of the exact packages you're using, which means your software will keep getting better. The Tidelift subscription covers millions of open source projects across Python, JavaScript, Java, PHP, Ruby, .NET and more. And the subscription includes security updates, licensing verification and indemnification, maintenance and code improvements, package selection and version guidance, roadmap input, and tooling and cloud integration. The bottom line is you get the capabilities you'd expect and require from commercial software, but now for all the key open source software you depend upon. Just visit talkpython.fm/tidelift to get started today. Maybe we could talk a little bit about some of the projects you have on the website. Actually, before that, let's take a step back. So at the time of this recording, basically your cloud is like a synchronization and bookkeeping location, but you're just about to roll out the ability to run the code in the cloud as well, if you don't want to install the client and run Docker. Maybe just tell us the story around that, and then I'm going to talk to you about some of the demo apps you've got out there.

37:56 Dean Kleissas: Sure, yeah, so this has been a really big effort to basically, from the ground up, completely rebuild our cloud platform for this big change. And so what we're adding is, when you sync a project at gigantum.com, it now becomes much more interactive and rich. You can preview files, you can look at the notebooks, you can see the activity, all those sorts of really useful things right there on the website. But you can also, like you said, click a button and launch that JupyterLab instance right there in the cloud. And that's going to let people with zero install play with their projects, explore other people's projects. We've really found from talking to users, there's all different reasons you do different things, and so if you're just quickly looking at somebody's project, or you see some link on Twitter, and you're like, what is this thing? And you look at the thing. You're not going to then go install Docker on your laptop just to see what the thing is, right? And so, again, this idea of asymmetries, it's what you actually care about. It really brings that barrier down, being able to click a button, play with somebody's code, or do something real quick with your code and your data. We're also going to allow some anonymous use as well, 'cause I think that's really useful: I don't need to sign up, I don't need to log in, I just want to see this thing real quick, what is this? And so that's going to be a big piece of it as well.

39:15 Michael Kennedy: Right, maybe you just want to read it, but it's got to execute it, right? You want to just see what is this thing? I just want to see the output and maybe I just need to run it real quick to get fresh data or something like that, right?

39:27 Dean Kleissas: Right it's just going to let you do that, right there in the platform. So we've been working real hard, it's really interesting new stuff. It's a mix of under the hood, Python and Go and Kubernetes is the main orchestrator for all the containers and everything, so.

39:42 Michael Kennedy: Yeah, okay, that's sounds really...

39:43 Dean Kleissas: Very excited about it.

39:44 Michael Kennedy: Yeah that sounds very exciting, that's a cool addition to the project. You know, there are some places that will give you this read only executed view of these Jupyter notebooks. And honestly I don't know how they work super reliably, but there's things like Binder, and even if you go to GitHub, like they'll run some Jupyter notebooks for you there. It sounds to me like those can't have like a complete underlying environment configured for that notebook, I would guess. Is that right?

40:14 Dean Kleissas: For GitHub, they're doing static rendering of the notebooks. So what's nice about Jupyter notebooks, well, depending on who you talk to this is nice but also annoying sometimes, is that everything is in the notebook, right? It's this big JSON document, and the last state of your notebook, if you ran your cells and you've got figures, is in there. So you can run it through an awesome tool called nbconvert, which will turn it into, like, an HTML page, for example, and then you can view it on the web. So we do that on the website, so you can preview notebooks. And you mentioned Binder. So Binder's a great tool that lets you take a Git repo and click it and execute it. So it actually gives you a working Jupyter environment, as long as you've configured your repository in a certain way and you've got all your package dependencies and all of that. And so that actually gives you an interactive version, similar to what we're doing, where you can get an interactive version you can actually edit. I would say the one difference is, because of how we build things, to put it in Gigantum means that that environment's correct for that notebook, so you'll be guaranteed to be able to run it, to render it. And because we have that environment, especially in R, to render some of the R notebooks for previews, you need that actual environment as well. So having that around makes it a little bit easier for us to be able to render these things for web view.
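
That static rendering is easy to reproduce yourself with nbconvert's Python API; the notebook filename here is hypothetical.

```python
# Static rendering as described: a notebook's saved outputs live in its JSON,
# so nbconvert can turn it into a standalone HTML page without re-executing it.
import nbformat
from nbconvert import HTMLExporter

nb = nbformat.read("analysis.ipynb", as_version=4)   # hypothetical notebook
html, _resources = HTMLExporter().from_notebook_node(nb)

with open("analysis.html", "w", encoding="utf-8") as f:
    f.write(html)
```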

41:34 Michael Kennedy: Yeah, and then you can just grab it and say, give me this project, and you literally have everything set up in a Docker container, right? Yeah, yeah, super cool. Let's talk about some of the demo projects. There's something just fun about going through nicely written notebooks and just clicking through them and watching them do their magic. So you guys have some of those up on the site. Like, you mentioned the Allen Brain Institute before, about measuring the cubic millimeter of brain. You've got some stuff like that up there that people can explore, right?

42:04 Dean Kleissas: They have an SDK, so they have like a library you can install with Python, it lets you access some of their datasets and visualize the data and play with it. And so there's an example project that shows you some examples of how to do that. We have some other examples for doing things like transfer learning, and then some NLP tasks, some plotting, so like a lot of people do. There's some great geospatial plotting libraries so, we're working to build some of these examples that are very obvious, like you said, you're a new user, you just want to get started, dip your toes in. Sometimes that's hard, just what do I even do?

42:39 Michael Kennedy: Or what can I do? I don't even know what's possible that I could do if I had the data and the skills, right?

42:45 Dean Kleissas: Right, so having something that like you said is organized nicely and simple. You can click, run, get your feel of what's going on. We have a bunch of those going up as well, yeah.

42:53 Michael Kennedy: Yeah, awesome. We can put some links in the show notes for people to check out. So one thing I did want to talk to you about while we're on these things is, I recently had the folks from Docassemble on, and you probably don't know what Docassemble is, but Docassemble is a project, it's a web app with a bunch of cool features, like a super advanced SurveyMonkey type thing. If I want to interview a bunch of people, collect the data, have a lot of flow control and only ask questions conditionally and so on. It's a really cool project that does that. It's mostly for lawyers, but it could be used for anything. And the way they do that is they have one or more Docker containers that work together, and that's how they deliver it. And some of the feedback I've gotten from multiple people was, that's really awesome, but it's super hard for me to work with Docker; this is not making it better, it might be worse for me than installing like Postgres and Redis and Python. Like, this might actually be harder than most things, trying to get multiple containers to talk to each other and stuff. So the reason I bring that up is it seems like what you all have done here is doing the same thing, but the way that I experienced it as a user was, well, I have to have my system be Docker capable, so I had to install Docker the app, and then I double clicked your app, it gave me a cool progress bar, which took a minute or two, and then it was working, right? That feels like a great way to use Docker, so I don't know, what are your thoughts on trying to deliver products to users through Docker, or shipping with Docker just in general? What's your experience there?

44:34 Dean Kleissas: Yeah, that's an awesome question, because very early on it was something we battled with for a bit and then made this decision, this is the way we want to go, right? So, you mentioned conda as a way to package all your code and data; people use that a lot as a way to manage their environment, and we were like, well, should we reinvent the wheel or should we go with something more complicated? And it just really came down to, not everything's in a package manager. And the only real way right now that we see is you need to put everything in a container. That's the only real way you'll be able to bundle everything up. And then you go and you look at the space, and Docker provides the easiest user experience for that. And so that's the decision, right? We're going to go with Docker. And when we started this, the experience wasn't perfect, but we're kind of making this gamble that we're going to meet Docker there: when they're ready, we'll be ready, right?

45:27 Michael Kennedy: Right, right.

45:28 Dean Kleissas: And so there really...

45:28 Michael Kennedy: I'm going to skate to where the puck is and I'm pretty sure it's going to be in the place we want it to be, so let's try that.

45:33 Dean Kleissas: Absolutely, that's definitely what happened, and so we're feeling pretty good. The Docker Desktop team, they've been doing an awesome job building these tools, making them more robust. They're focusing super heavily on performance fixes, 'cause there were also disk I/O problems with Docker on your laptop for a while, so they've been working super hard on making those better and making the installation process better. So we think, if a user can get Docker installed on their laptop, they don't need to know how to use Docker, how to make everything work; we can do that, right? But it's just, can we get Docker on their laptop? So we've been testing with that recently. Because we didn't actually think that was an issue at first. It's one of those things, it's an afterthought for you to install Docker, so why would people have problems doing this? And that's so flawed, right? The user might not even know what Docker is, and you're like, hey, go install this crazy thing. So what we've recently been working on is this new version of our desktop application, which is very lightweight. You double click this thing, it helps you install Docker. And the other thing too, we found a lot of users just have questions like, why do I need Docker right now? How much disk space? What is this thing doing? So even just right there, as they're walking through the Docker install process, let 'em know what's going on, help 'em do it, and then there's nothing to worry about. And so that's kind of this hand-holdy approach where we tell you what to do, and since we're on the computer already, we can wait until you've done it, until you actually install Docker; we know you haven't installed it yet. As soon as it's working, we're like, great. Now we need to adjust your memory and your CPU to make it work right. And then we can do that for you, right? Instead of having to read our previous installation process, where it's like, it's easy, it's three steps! And then you click on it and step one expands into this ballooning list, and you're like, well, that doesn't seem very easy, right?

47:25 Michael Kennedy: And that's not just three.

47:28 Dean Kleissas: Yeah, so really it's about that. I do think that delivering this application in a Docker container is an interesting way to do it, and the right way to do it right now. It also lets us deliver our own application in a container as well. But it's really just working on the ergonomics a bit around the installation process, and making it very streamlined and closed loop. That's what we've been trying to do with our desktop application. It's not, follow these instructions; it's, we're here, let's get from A to B, and once we get there, then there's nothing to worry about.

48:00 Michael Kennedy: Yeah, I think it's a pretty polished process and it worked pretty seamlessly for me. So yeah, nice work on that one. One thing I do want to talk to you about on this of course is business models, lock in, things like this right? Because what you guys are building is really cool, however when I go to the site, it's in beta mode and it doesn't have a pricing. And it doesn't have an obvious way that I'm going to get charged. So I always wonder about companies, is there going to be a paid product? Is it going to have a premium angle? Just give us a sense of where you guys are going as a business. Because not just interesting as a user, but as just observers of the open source space, how are companies doing open source plus business I think is an interesting view.

48:49 Dean Kleissas: This was, again, one of these conscious decisions, where we think we can do it this way because it'll help us keep this whole endeavor sustainable. And we're living in this world of MoviePass a bit, where free money is just given away freely, and we really wanted to build something that we hope we can make sustainable by actually trying to build a company around it, instead of just an open source project. So that was why we did it this way. In terms of how we plan to monetize: the core piece, this client, all that stuff, MIT licensed, free forever. So we're not going to ever try to charge for you working on your local system, it's yours, whatever, right? Where the monetization is going to come in...

49:35 Michael Kennedy: You have a six core laptop, so we're going to charge you $10 per core, per month it's like, no what? This is not Oracle.

49:43 Dean Kleissas: Exactly, this is not Oracle. Yeah that's like the motto right?

49:46 Michael Kennedy: Okay so the desktop experience is basically going to be free, am I hearing that right?

49:52 Dean Kleissas: Correct. And also, like I said, you can run that desktop experience, and you can also run it on your own Amazon, GCP, DigitalOcean, Azure, whatever, if you need more compute horsepower. It runs on whatever you put it on, right? So that's free forever. Our hub where you sync to, there's always going to be this free tier where we're going to allow you to store something like 5 gigs for free, and if you sign up, you get a certain amount of compute per month, like a refreshing quota every month that you can poke around with for free. And then there'll be a little bit of tiering on that. That's when you actually end up paying. So if you need to store 50 gigs, 100 gigs, whatever, you'll have to pay a little bit of money per month for that. And then there'll be an on-prem enterprise offering. This is the idea where people want to use this, but they've got restrictions, and this is moving more out of academia towards industry a bit, where you can't put your data in somebody else's cloud for whatever reason, you want to control everything.

50:52 Michael Kennedy: Sure, similar to GitHub Enterprise, that kind of model and customer base.

50:58 Dean Kleissas: Correct, very similar to that. So that's the plan for monetization of the platform. Our goal, to our core, right, is to make this an effective tool for people to use that isn't something you get locked into. And I guess that's a good place to talk about lock-in. A fundamental principle as we've built this, right, is that under the hood, if you look at what's going on, it's just files. It's just a repository, right? We don't have some complicated database that you have to export your stuff out of. We're built on top of Git, built on top of Docker, and we've built our custom extensions on top of that, which you do not need. So if you want to take your stuff and pull it out of Gigantum, you click the export button; all we're doing is zipping up the thing and giving it to you as a single archive, right? Or you can go into your file system, take your files out, take your Dockerfile out, and do whatever you want with it. So there is no real lock-in in that sense. We're keeping it all built on top of these standard tools.

52:04 Michael Kennedy: It sounds really good. And so I'm looking at my home directory on my MacBook here, and I see a Gigantum folder, then my username, then another username, probably my Gigantum username, and then labbooks and other things that are basically the projects you guys have. And it looks like you're just Git cloning that stuff into those subdirectories, even, right? And the file formats are things like Jupyter notebooks and text files and whatnot, right? So I could just take those and run with them if I wanted. You would lose the activity stream and collaboration, but you could just grab the files and go if you need to, right?

52:42 Dean Kleissas: Absolutely, yeah. And so you'll see how we've done this: the activity, the thing that's the rich layer on top, is really built on top of the Git log. So when you do a Git commit and you type a message, that message gets written into the Git log, and what we're doing is auto-generating that message. Then, if you look, there's a new line and some cryptic messages below that, some random text. Those are pointers into this Git-compliant data store we've built. And so you can lose that extra rich metadata, like the figures that were extracted and stuff like that, but at the end of the day, every Git message is still there, the history is still there. You lose a little bit of the automation, a little bit of the rich history, if you pull it out, but it's trivial to pull out, and that was intentional.
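To make that concrete, here is a minimal sketch, not part of anything Gigantum ships, of reading that plain Git history back out of a local project. The project path below is hypothetical; your ~/gigantum layout, username, and project name will differ.

```python
# Minimal sketch: read the plain Git history that backs a project's activity feed.
# The path is illustrative only; adjust it to your own ~/gigantum layout.
import subprocess
from pathlib import Path

project = Path.home() / "gigantum" / "my-user" / "my-user" / "labbooks" / "my-project"

# Print each commit's full message: the first line is the auto-generated summary,
# and anything after the blank line is a pointer into the metadata store.
result = subprocess.run(
    ["git", "-C", str(project), "log", "--pretty=format:%h%n%B%n----"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```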

53:30 Michael Kennedy: Sure, that's cool. So if people try it out and use it and decide, well, I don't necessarily want to be committed to this thing, they can take their code and run, it sounds like.

53:38 Dean Kleissas: Yep, absolutely.

53:39 Michael Kennedy: It's not like a Gigantum notebook that's not really a Jupyter notebook underneath or something.

53:45 Dean Kleissas: Right. One of these principles is that we wanted to keep it as close as possible to these core technologies. We want to get out of your way and let people just click a button, get into Jupyter, get into RStudio, stay with what they're used to, and not really try to change these flows too much.

54:01 Michael Kennedy: That's cool. As maybe a Git power user, maybe I really want my stuff in GitHub in some project there, even if that's not the primary source, but kind of a copy in Git. Would it be reasonable to drop into that local Git repository that Gigantum's creating, create a new origin, like a new remote, and push to that one? Just to get it up to...

54:21 Dean Kleissas: Yeah, you could definitely do that. The only thing to keep in mind there is that we're using Git LFS. So if you've got a lot of LFS data, it's going to use LFS to push that to GitHub, and GitHub charges you for LFS. That's the one thing to think about if you want to mirror to GitHub: they might start charging you money at some point for your usage there. I think it's one gigabyte free or something like that.
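For anyone curious what that mirroring might look like in practice, here's a hedged sketch. The project path, remote name, and GitHub URL are all made up for illustration, and, as Dean notes, LFS objects pushed to GitHub can start costing money.

```python
# Sketch: add a second remote to the local Gigantum-managed repository and push to it.
# All names and paths are illustrative; Git LFS data pushed to GitHub may incur charges.
import subprocess
from pathlib import Path

project = Path.home() / "gigantum" / "my-user" / "my-user" / "labbooks" / "my-project"
mirror_url = "git@github.com:my-user/my-project-mirror.git"  # hypothetical repository

def git(*args: str) -> None:
    """Run a git command inside the project directory."""
    subprocess.run(["git", "-C", str(project), *args], check=True)

git("remote", "add", "github", mirror_url)  # one-time setup
git("push", "github", "--all")              # periodic sync; LFS objects go along too
```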

54:46 Michael Kennedy: Okay, but in general, I could mirror it to another Git repository and periodically sync that if I wanted, right?

54:53 Dean Kleissas: Absolutely.

54:53 Michael Kennedy: Yeah, I like that. That's a pretty not-very-locked-in lock-in story; I like it.

54:58 Dean Kleissas: That's the goal. We want to make it easy to pull stuff into Gigantum, and we also want to make it easy to push stuff out, right? This idea of building walled gardens is where the data science world has gone a bit, with these massive cloud-only platforms, and there's only so much utility to that. Again, what if you want to work with somebody who doesn't have access to that thing, or doesn't want to use it, right? So being able to move things around, even if you lose some capability, is just in everybody's best interest.

55:26 Michael Kennedy: Yeah, and something that really bothers me a lot, and maybe it's that I don't go to a giant office building with a floor full of cubicles and sit down at one and work; I work more freely as I roam about through my life, right? I might decide I want to go to a coffee shop because I'm tired of being at my house totally alone, I'm going to go crazy, right? But the thing that drives me crazy about a lot of the cloud stuff, and I'm not thinking just of the notebooks but also AWS or Azure, is that if you're bound deeply into all those services, working disconnected on an airplane, at a conference, on a business trip, at a coffee shop, it's super hard to just keep working, right? But it sounds like with what you guys have here, I could go on a camping trip and do some data science, if that's how I want to disconnect, and come back and it would totally work, right?

56:20 Dean Kleissas: Yeah, you come back, connect to the internet, click sync and there you go.

56:23 Michael Kennedy: I really like that, and I think that's quite a compelling option, actually, so super, super cool. Let me ask you something really quick; we don't have a ton of time to spend on it, but you get to see a lot of folks, and maybe this is also another final question: who is this for? If I'm a professional data scientist at an insurance company, is this for me? If I'm a professor, is it for me? If I'm teaching a class, is it for my students, right? Who is this for?

56:50 Dean Kleissas: Today, I would say, because of what we have, it's maybe not for you at an insurance company yet, because you do need these very enterprise-y features we haven't built yet, right? That's where we want to head, because we think that's a huge space; data science is taking over industry like it is everywhere, right? But right now, if you're a student, if you're someone who wants to get into data science, if you do data science for research, academic research, publishing, those are the core people we've been working with to start. One of the co-founders, Randal Burns, is the chair of computer science at Johns Hopkins, and he ran a course inside Gigantum; we learned a lot from that and made some changes. Running courses certainly is reasonable, because you can deliver to your students both the notebooks and the environment, right? And they don't have to install stuff.

57:43 Michael Kennedy: It seems like it would be really good for teachers, actually, assuming there was enough to teach them in Docker.

57:49 Dean Kleissas: Yeah, that's definitely a target audience. Right now users have been very academic-focused, but we've started to get a lot more interest from smaller industry and commercial teams, because it's just a different way of doing things, and small teams working distributed is exactly what it's set up to support. So we're definitely thinking about moving there and building things for a more enterprise-y use case, but for today, it's really: if you're doing something in Jupyter, if you're doing something in RStudio, we think we can make that easier, basically.

58:25 Michael Kennedy: That's cool. All right, well, it seems like you've built a pretty cool environment, and I like it. Let me ask you really quick: what are some of the trends you see in data science and scientific computing? Because you're interacting with all these different folks from different environments, what should people maybe be paying attention to in that space?

58:43 Dean Kleissas: Yeah, so we see a lot of interesting stuff, and it's definitely changed even over the course of us starting this project. One thing that's very obvious, at least from our users and all the people we talk to, is this move towards the open source languages, right? It's Python, it's R, it's much less MATLAB, right? It's because the communities are there, the tooling is there, the libraries are there, I believe. It's been interesting to see. So that's also a conscious decision for us: that's what we're supporting right now. We don't support MATLAB out of the box, we don't support SAS and some of these other things, because things that don't run through a web app are hard, and dealing with licensing is hard. And we're just seeing people more and more say, yeah, this is hard, and I can get this great environment in Python, this great ecosystem of tools, and this great community of support. So you see increasingly more and more stuff getting published that's built on Python tooling and R tooling. That's been really, really apparent.

59:51 Michael Kennedy: Yeah, that's great.

59:52 Dean Kleissas: Another thing we're starting to see a lot more of, and it's something we were obviously predicting a bit, or not necessarily predicting but banking on, right, is this idea of hybrid architecture: yeah, you use the cloud a bit, but you also have local resources that you want to run your compute on. We're seeing people want this more and talk about it now as if it's this great idea. If anybody really starts doing legit machine learning, like deep learning where you're training a lot, you very quickly realize how expensive GPUs are to use in the cloud, if you're paying. If someone's paying the bill for you, or you've got a bunch of credits, you don't notice, but eventually you realize, wow, this is really expensive. So we're starting to see a lot of people who got started doing their machine learning in the cloud, because it was so easy, then realize it's really expensive, and so they're buying GPUs. And this is also kind of crazy: Lambda Labs is an interesting company in the US that's building GPU boxes, like workstations. But there's not much around this turnkey local GPU thing, and I think people want it, because the cloud is expensive and it changes the psychology of doing your work a bit. When you're not worried about paying for every cycle, it changes how you work, right? So I definitely think that's going to be a larger trend. The cloud is this great thing that gives us all these awesome capabilities, but I do think that, at least for the near future, we have so much compute power locally that you're going to want to run distributed. You're going to have jobs you want to run on your laptop, jobs you want to run on a server, jobs you'll run in the cloud, and we just want to make that easier. We see that being a trend.

1:01:40 Michael Kennedy: Yeah, my laptop has a Core i9 with six cores, 12 threads with hyper-threading, and 32 gigs of RAM. That probably solves most people's computational problems unless you're doing truly big data or massive machine learning, yeah.

1:01:54 Dean Kleissas: Right, especially with some of these great libraries that are coming out, Dask, RAPIDS, all these things. It's like, you have all this power, but you're using your 12 cores to run Chrome, right? If you use a cloud platform, you may only need those cores to run Chrome, but that's the value proposition people are being sold. I think right-sizing for the right job is going to be the way people go, because they're going to realize, as more and more stuff gets more computationally intensive in all these different fields and industries, that not everybody has an incredible amount of money to just, you know.
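As one small illustration of putting those local cores to work, and not anything Gigantum-specific, a minimal Dask example might look like this, assuming `pip install "dask[complete]"`:

```python
# Illustration only: use the local cores you already have with Dask.
import dask.array as da
from dask.distributed import Client

if __name__ == "__main__":
    client = Client()  # starts a local cluster, roughly one worker per core
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    print(x.mean().compute())  # the reduction runs in parallel on your machine
    client.close()
```

The same code can later point at a remote scheduler instead of the default local cluster, which is one way the "right size for the right job" idea plays out in practice.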

1:02:34 Michael Kennedy: Yeah, it totally makes sense. Now, are you guys thinking you're going to have this run-it-in-the-cloud-for-free thing? Are you thinking of having a premium offering there? Like, I really do need to run this on a GPU for an hour; can I just push a button in your platform and make that happen and pay you $10? What are you thinking there?

1:02:50 Dean Kleissas: Yes, so right now at launch, not yet. That's something we're definitely playing with and trying to learn how people want it to work. I think that's kind of why we built some of this: with some cloud platforms, the pricing model of paying the same thing every month forever is not really how people work either. They need bursts of compute because they're late for a paper deadline, or they're running something now but then not doing data science for a month, because their job is more than just that, right? So how we can provide that in a good way is something we're playing with for sure. And then making it easy to scale up without having to... I think if you know how to do your own Amazon stuff, you could do it yourself. But sometimes, yeah, can I click a button, solve this problem, and go away, is what people want, so.

1:03:39 Michael Kennedy: Yeah, and I guess if they're right there in your platform and they can click a button, you'd want to make that possible. But no, it sounds like a great platform. I like the sort of trend back towards: I can mostly work local, but not always. All right, so I think we'll leave it there for the data science, Gigantum conversation. It's been really interesting, but before you get out of here, let me ask you the final two questions. If you're going to write some Python code, what editor do you use?

1:04:00 Dean Kleissas: I use PyCharm. We like PyCharm a lot. I started using PyCharm a long time ago, so I've just always used it, really. I know there's a lot of VS Code love right now, but I do love PyCharm, and their Docker integration has been really convenient for us too, because, you know, we're building in Docker. What's really nice is we've got all of our build tooling, and our application runs in Docker, and we have PyCharm connect to that, so when you run a test or want to debug code, you're running it in the actual product you built, container and everything. That's really convenient, so we enjoy that a lot.

1:04:37 Michael Kennedy: Yeah, that's a good one. Then a notable PyPI package?

1:04:40 Dean Kleissas: Recently I've been using a new one. We use GraphQL for our APIs; it's a really interesting way to write an API. Graphene is the big library we've been using a lot, but I started using a new one recently, Ariadne, which is a schema-first GraphQL library that I like a lot. You write your schema, and it makes it really easy to just wire everything up, and it's all asynchronous, which I've been playing with a lot.
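For a flavor of the schema-first style being described, a minimal Ariadne example looks roughly like this; the schema and resolver are made up for illustration, not taken from Gigantum's API.

```python
# Minimal schema-first GraphQL API with Ariadne; serve it with any ASGI server, e.g. uvicorn.
from ariadne import QueryType, gql, make_executable_schema
from ariadne.asgi import GraphQL

# The schema is written first, as SDL text.
type_defs = gql("""
    type Query {
        hello(name: String): String!
    }
""")

query = QueryType()

@query.field("hello")
def resolve_hello(_obj, _info, name: str = "world") -> str:
    # Resolvers receive GraphQL arguments as keyword arguments.
    return f"Hello, {name}!"

schema = make_executable_schema(type_defs, query)
app = GraphQL(schema, debug=True)  # run with: uvicorn your_module:app
```

Resolvers can also be `async def` functions, which is where the asynchronous angle comes in.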

1:05:04 Michael Kennedy: Oh that sounds cool.

1:05:05 Dean Kleissas: I love all the new Python 3 stuff. I've been really getting into asynchronous Python and typing and mypy; all that stuff has been awesome. So I like this library a lot, and I've been playing with it a lot lately, so it's been good.

1:05:16 Michael Kennedy: That's a good recommendation, and I hadn't heard of it, awesome. All right, well, Dean, final call to action: people who want to check out Gigantum and this kind of stuff, what do you say to them?

1:05:25 Dean Kleissas: Just gigantum.com, check that out. That's where you go to learn about what it is, to explore people's projects, to just click a button and try it out. It's the place to go.

1:05:35 Michael Kennedy: Right on, alright, well thanks for sharing what you guys are up to, it's been good to talk to you.

1:05:40 Dean Kleissas: Yeah thanks so much, it's been great.

1:05:41 Michael Kennedy: You bet, bye. This has been another episode of Talk Python To Me. Our guest on this episode was Dean Kleissas, and it's been brought to you by Linode and Tidelift. Linode is your go-to hosting for whatever you're building with Python; get four months free at talkpython.fm/linode. That's L-I-N-O-D-E. If you run an open source project, Tidelift wants to help you get paid for keeping it going strong. Just visit talkpython.fm/tidelift, search for your package, and get started today. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course. Or, if you're looking for something more advanced, check out our new Async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code.
