Transcript for Episode #2:
Python and MongoDB

Recorded on Sunday, Apr 5, 2015.

[music]

:36 Hello and welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I am @mkennedy, and keep up with the show and listen to past episodes at talkpythontome.com.

:54 This episode we will be talking to Jesse Davis from MongoDB about PyMongo and, of course, MongoDB. Before we get to the interview, I have a quick message to share. Since we launched a week ago, the response has been overwhelming. I received many tweets and emails with positive feedback. I want to thank everyone who contacted the show. However, I could use your help to make sure the show continues to grow and thrive. If you know someone who would be interested in listening to the show, please send them a link to talkpythontome.com or share this on Twitter or Facebook. Do you know of someone who would make a great guest, or have a great show topic in mind? Send me a note and I will set it up. In other excellent news, we have a show sponsor. I want to thank Python Gear from pythongear.com for sponsoring this episode, and you will hear more about them later. If you would like to sponsor a future episode, please contact us at talkpythontome.com/sponsor.

1:47 Now, onto the show. Let me introduce Jesse. Jesse Davis is a staff engineer at MongoDB in New York City. He works on the MongoDB driver team, and develops PyMongo and the Mongo C Driver. He is the author of the async MongoDB Driver called Motor and he contributes to Tornado and asyncio.

2:09 Jesse welcome to the show.

2:12 Thanks Michael.

2:13 It's really great to have you here on the show. We have known each other as acquaintances for a couple of years; as you know, I am a MongoDB Master, which is kind of like an MVP community expert program that you guys run. So yearly we will come up there and have some really interesting conversations, and we have always enjoyed the sessions where you come down and talk to the external experts about working with MongoDB from Python.

2:40 Yeah, we've been doing that about once a year and we have got the next one coming up in a month, and I really look forward to those. We get some of our best ideas there; each one of those sessions definitely creates a year's worth of ideas, if not more, to mull over and implement.

2:58 Yeah, those are really fantastic meetings. I really, really enjoy them. So I have seen that you have done a ton of stuff with Python, but before we get into the details of MongoDB and PyMongo and all that, how did you get started?

3:10 So it's a funny story, really, as they say. When I graduated from Oberlin College 15 years ago, I was a C++ guy, to the extent that I knew any programming language particularly well at the age of 22. I thought that I was a C++ and graphics guy. So I went to Austin, Texas and I worked for a little company called Austin Digital, which does really neat stuff, still with C++ and 3D graphics, and statistical analyses of flight data recorder stats for airlines to analyze their flight safety and their flight patterns.

4:00 Well, that sounds like a really interesting thing to jump into.

4:03 It was a great gig and I did really, really poorly at it. So, I spent about two years there and realized that I wasn't yet a grown-up, I was really screwing up my life, and I was a bad software engineer. So I quit the job and went out into the world to try to get my head straight. I went biking through France and then I spent a year at a Zen monastery in southern California, and when I checked back with Austin Digital about whether they wanted me to come work for them again, they said no, because I had not proven myself there. So I came to New York to continue my Zen study with a place called the Village Zendo and to start being a software professional, and there were no C++ jobs in New York at the time.

5:04 What kind of jobs were there?

5:05 Well, this was fall 2004, so the whole market was kind of in bad shape. There was the crash in July of 2001 and then there was September 11th, and New York hadn't recovered from that yet. So all the C++ people from all of the banks were still unemployed and I couldn't compete with them. What I did find was an educational startup called Wireless Generation in Brooklyn (in recent years it has become Amplify Education), and they were willing to take a shot at me even though the job was in Python and Oracle and I didn't know either of those.

5:53 That is a long way, from graphics and C++.

5:57 Yeah, it was a huge leap and it was pretty tough but I had good mentors and I started using Python there.

6:07 That's excellent. So then you carried on in Python, and I saw that you have a ton of open source projects on GitHub that are successful, or that you contribute to. And somehow you found your way over to MongoDB from there?

6:22 Yes, so after a few years working for Wireless Generation I wanted to get a little more breadth, and I also wanted to make sure that I didn't become so senior that I couldn't continue to program, which was the pressure that I experienced there, to move into management-

6:41 That is kind of the curse of success for some programmers... You are really good at this, stop doing it now go manage people, right?

6:48 Yeah, exactly. And to jump ahead a little bit, now at MongoDB we have figured out how to do that by creating this whole separate track of staff engineers, which is the track I am on now. But at the time at Wireless Generation I couldn't see how to do that, so I quit and I freelanced around. And at gig after gig, what people were doing was making these JSON-over-HTTP RESTful APIs, and for obvious reasons they kept choosing MongoDB to do that. So at one gig after another I was exposed to MongoDB as the natural data storage layer for applications like that, even though MongoDB was brand new at the time. I started using it at version 0.8 or something.

7:38 Yes, this was like 2009, 2010 time frame or something like that

7:43 Yeah, exactly. And it was such a cool product, and it was such a rarity as a big infrastructure systems project in a New York City startup, that when I finally got tired of freelancing and wanted to settle down and make a substantial contribution to a single product, I called Eliot and said I'm ready to come in from the cold, and he said great.

8:13 That's excellent. That is Eliot Horowitz who is the CTO of MongoDB right?

8:17 Yeah exactly.

8:18 Yeah, so, I suspect most people who are listening to this show have heard of MongoDB although not everybody and they might maybe just know it as a buzzword. Can you give us a quick elevator pitch about what MongoDB is?

8:32 Sure. So it stores your data; it's a database. And it stores your data not in rows and columns but in a non-relational document format. The format is called BSON, which is the Binary JSON format. So if you know JSON, MongoDB's data format is very familiar: it consists of objects which have a set of key-value pairs, and these documents can also contain arrays, strings, numbers, dates, and about a dozen primitive data types. MongoDB lets you index and query this kind of object-oriented data in a very rich way. So among the document databases that we compete with, we have a particular advantage when it comes to our ability to declare multiple indexes on a collection, the sophistication of our query language, our statistical aggregation capabilities, and our ability to let you do very complex update operations, where you can add a member to a set within a document or do math on numbers within those documents.
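
As an illustration of the update operations described here, the following is a minimal sketch rather than anything shown on the episode. `$addToSet` and `$inc` are MongoDB update operators and `update_one` is a PyMongo 3.0 collection method; the `books` collection, the field names, and the helper `tag_and_count` are hypothetical. The helper only needs an object with an `update_one` method, so a small recording stand-in lets it run without a server.

```python
def tag_and_count(books, title, tag):
    """Add `tag` to a book's tag set and bump a counter in a single update.

    `books` is assumed to be a PyMongo Collection (or anything exposing
    update_one). The field names "title", "tags" and "views" are made up.
    """
    query = {"title": title}
    update = {
        "$addToSet": {"tags": tag},  # add only if the tag is not already in the array
        "$inc": {"views": 1},        # arithmetic performed by the server
    }
    return books.update_one(query, update)


class RecordingCollection:
    """Stand-in for a real collection: it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def update_one(self, query, update):
        self.calls.append((query, update))


demo = RecordingCollection()
tag_and_count(demo, "Example Book", "tutorial")
print(demo.calls[0])
```

With a real collection the same call would modify the stored document in place, without reading it into Python first.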

9:56 One of the things that I find with people coming from a more relational database background is that a lot of times they are like, "Well, it's really cool you can have these kinds of hierarchical structures that more closely match the way your objects look in memory in your program, but you probably can't query properly deep down into that stuff." So let's take a super simple example like a book store: the book store has books and the books have reviews nested as a nested array. Well, what if I just want to know all the books that have five star reviews? Could I query that?

10:31 Right, exactly. And we do provide that, and that distinguishes us somewhat from the much simpler key-value stores or other simplified document database products.
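
A sketch of what the book store question could look like from Python, under assumptions: the `books` collection and the `reviews` and `stars` field names are hypothetical. The filter uses MongoDB's dot notation, which reaches into each element of the nested array. The pure-Python function is only a rough equivalent for the sample data, included so the idea can be tried without a server.

```python
# Filter document: match books where any nested review has stars == 5.
FIVE_STAR_FILTER = {"reviews.stars": 5}


def find_five_star_books(books):
    """`books` is assumed to be a PyMongo Collection; find() returns a cursor."""
    return books.find(FIVE_STAR_FILTER)


def has_five_star_review(book):
    """Rough in-memory equivalent of the filter for one document."""
    return any(review.get("stars") == 5 for review in book.get("reviews", []))


sample_books = [
    {"title": "A", "reviews": [{"stars": 5}, {"stars": 3}]},
    {"title": "B", "reviews": [{"stars": 4}]},
    {"title": "C"},  # no reviews at all
]

matching_titles = [b["title"] for b in sample_books if has_five_star_review(b)]
print(matching_titles)
```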

10:46 Yeah, definitely. I think of all the NoSQL databases MongoDB is one of the few that would reasonably be something you could consider as your standard general purpose database; not just some kind of high scale special use case.

11:02 Yeah, that's exactly right. And MongoDB is not the best answer to every single question, obviously. There is data that is naturally relational, and then there is data that should naturally be put in some other simpler, more specialized NoSQL database. But MongoDB is very much targeted to be the best answer to many questions, and a pretty good answer to an even broader set of questions. So you can use it as your default database in the way that you might in the past have used MySQL or Postgres, as a pretty good answer to many questions and the best answer to many others.

11:42 This episode is sponsored by Python Gear. We know you are a huge fan of Python, and Python Gear has an excellent way to put your enthusiasm for Python on display. Visit pythongear.com and pick up Python or Django T-shirts, stickers, and more. Hand screen printed on American Apparel, these are shirts that are made to last and are very comfortable. What is more, a portion of all sales will benefit the Python Software Foundation or the Django Software Foundation. Help me thank Python Gear for sponsoring this podcast by visiting their site at pythongear.com and ordering a T-shirt. They are also helping us with a small contest: we are giving away a free T-shirt to one lucky listener. Visit talkpython.com, click on friends of the show, enter your email address, and we will pick a winner before the next episode. Now, back to the show.

12:34 So, I've got MongoDB, and by the way, in case people do not know, it is open source; you can go to GitHub and check it out or see the progress. I've got it, probably I have downloaded it from MongoDB.org, and it is running. Now I've got my Python app; what do I do?

12:51 Right. So MongoDB has a network protocol called the MongoDB wire protocol. It is a basic TCP protocol, so you need something that knows how to talk that protocol and knows how to convert between your Python data structures (your dicts and lists and strings and numbers) and BSON and back. So you need a driver. The standard driver for MongoDB is called PyMongo, and you install it from PyPI with pip install pymongo. The current version is about to be 3.0, which will be released in just about a week, which is very exciting-
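
To give a feel for the dict-to-BSON conversion a driver performs, here is a toy encoder written for this transcript. It is not PyMongo's code (PyMongo ships its own `bson` package for this) and it handles only two of the BSON types, UTF-8 strings and 32-bit integers. The function names are hypothetical; the byte layout follows the public BSON specification.

```python
import struct


def encode_element(name, value):
    """Encode one key/value pair as a BSON element (strings and int32 only)."""
    key = name.encode("utf-8") + b"\x00"          # element names are C strings
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # type 0x02: int32 byte length (including the trailing NUL), then the bytes
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    if isinstance(value, int) and not isinstance(value, bool):
        # type 0x10: little-endian int32; struct raises if the value does not fit
        return b"\x10" + key + struct.pack("<i", value)
    raise TypeError("toy encoder supports only str and int32 values")


def encode_document(doc):
    """Encode a flat dict: int32 total size, the elements, then a NUL terminator."""
    body = b"".join(encode_element(k, v) for k, v in doc.items())
    total_size = 4 + len(body) + 1
    return struct.pack("<i", total_size) + body + b"\x00"


encoded = encode_document({"hello": "world"})
print(encoded)
```

A real driver also handles nested documents, arrays, dates, ObjectIds and the rest of the type list, and decodes in the other direction.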

13:36 Yeah, that's big news. You guys have been trying to have a sort of major unification of all the different drivers for all the different languages; is this part of that effort?

13:45 Yeah, that is exactly right. So PyMongo 3.0 has big behavioral and API improvements and standardization, and those changes are matched by the MongoDB Ruby driver 2.0, C driver 1.2, Node driver 2.0 and so on, and much more than ever before we are all converging on the same set of behaviors and the same set of APIs.

14:13 That's really cool. One of the real benefits of Mongo, I think, is that it has great support for so many languages, so if you choose your database and then you are like, "Oh wait, maybe this is better from Java for some reason?", it still has good data access. So that is fantastic, and it is getting even better.

14:27 Yes, right, that is exactly right. So we have drivers in ten programming languages. Plus, even if you are using something weird like R or Haskell or Perl, there is something out there in the community for you. We are really focused on making sure that these drivers feel right to experts in that language, so PyMongo is very Pythonic. It is written by Python experts, and its style and its documentation and so on are all very Pythony, while at the same time balancing that with some degree of consistency with the nine other programming languages that we support.

15:10 Yeah. That has got to be some interesting tension there.

15:14 Huge, huge tension. It has been the toughest problem that we have faced, and it is just in the last year that we are really figuring out how to tackle it and make those decisions correctly.

15:25 Yeah, cool. So you play a pretty big part in PyMongo, right?

15:28 Yes, I have. Bernie Hackett, my boss in Palo Alto, is the PyMongo developer and maintainer, and I have been assisting him through the last three years as the second in command. My main contributions to the driver are its concurrency design, its implementation of distributed-systems-type problem solving, and the connection pool. And with the 3.0 release that is actually kind of done for the moment, so I'm putting a lot of that work to rest now and moving on to become the primary maintainer of the C driver for MongoDB so that part of the team can move into the 16:22 and make contributions there.

16:23 Oh that's excellent. And that is even a little bit back to your roots from Austin? With the C++ story?

16:30 Yeah, exactly. Parts of my brain that have been idle for a decade are coming back online, and it is a really fun feeling. For audience members, if you have been programming Python for ten years straight like I have, I really cannot recommend enough learning a very different programming language or reviving one; it is incredibly satisfying.

16:57 Yeah, and gives you interesting problem solving skills that you do not necessarily develop if you stay in just one language, so that's great. Now there is a bunch of ways to talk to MongoDB, even just from Python, right? So there is PyMongo, what else is there?

17:11 So PyMongo is the general purpose driver, and it is the most featureful, the most standard, and the best maintained, but it is not optimized for some special use cases. And you can think of these as CPU-bound versus I/O-bound use cases.

17:34 Right, ok.

17:35 So, for the I/O-bound cases, where you've got a web application that has a huge number of client connections but they are often kind of idle or sleepy connections, like if you are implementing a chat server or something with web sockets, you want to use an async framework in Python, like Tornado or Twisted, or now asyncio in the Python 3.4 standard library-

18:06 Yeah, that's a cool new feature...

18:06 Right, so these are awesome async frameworks and they solved that problem brilliantly but they've got a gigantic compatibility issue- none of the existing libraries work with them. So-

18:21 Do you mean none of the existing libraries outside MongoDB work with them, or that PyMongo itself doesn't work with them? How do you mean?

18:27 Well, I mean both of those. If you've got a driver for any database that's not written specifically for one of these async frameworks, then it won't work with that async framework. So you need a specialized database driver for Tornado and MySQL, you need a specialized database driver for Tornado and Postgres, and you need a specialized driver for Tornado and MongoDB. So I wrote that over the last few years and it's called Motor, because it takes the beginnings of Mongo and Tornado.
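
The reason a driver has to be written for the event loop can be sketched with the standard library alone. This is an illustration, not Motor code: `time.sleep` stands in for a blocking database call and `asyncio.sleep` for a non-blocking one, and all function names are hypothetical. `asyncio.run` needs Python 3.7 or later, which is newer than the versions discussed in the episode.

```python
import asyncio
import time


async def blocking_query(name, seconds, log):
    log.append("start " + name)
    time.sleep(seconds)            # blocks the whole event loop, like a synchronous driver call
    log.append("end " + name)


async def async_query(name, seconds, log):
    log.append("start " + name)
    await asyncio.sleep(seconds)   # yields to the loop, like an async driver call
    log.append("end " + name)


async def run_pair(query):
    """Run a slow query A and a fast query B concurrently; return the event order."""
    log = []
    await asyncio.gather(query("A", 0.05, log), query("B", 0.01, log))
    return log


blocking_log = asyncio.run(run_pair(blocking_query))
async_log = asyncio.run(run_pair(async_query))
print(blocking_log)
print(async_log)
```

With the blocking call, B cannot even start until A has finished. With the awaitable call, B starts while A is waiting and finishes first.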

19:03 Excellent.

19:03 Plus it is a cool name, and somehow it was not yet taken on PyPI. So Motor is now the standard official async driver for MongoDB and Tornado, and over the next year I am going to be expanding it out to cover asyncio next and then eventually Twisted as well. So that will just integrate with whatever you are using right now.

19:27 Nice. And does that work with Python 3 and 2, or is that sort of a Python 2 thing for now? What's the story there?

19:32 That will work with Python 2.6+ so 2.6, 2.7, 3.3, 3.4 and 3.5.

19:42 So how does that work with different implementations like PyPy for example?

19:45 Sure. In the past Motor and PyPy didn't work together very well, but it was about a year ago that I last personally tested them. It was correct but it was slow, due to some very specific details about PyPy. In recent months, somebody that I don't know posted benchmarks that show that Tornado, Motor and PyPy were actually blazingly fast, but I haven't personally reproduced that, so at the moment it's just kind of a hopeful sign rather than something that I would especially endorse.

20:20 Sure. That's really good news though. It looks like PyPy is moving on and has a lot of activity there, so that's really cool.

20:27 I agree.

20:28 Yeah. I've also heard of something called Monary what's that?

20:30 Right. So we've got this other branch of specialization. The three categories that I think of are general purpose; I/O bound, and that is Motor; and then there is CPU bound, and that is what Monary is for. Monary is a numpy driver for MongoDB.

20:50 Oh that's interesting.

20:53 Isn't it? I found out about it a few years ago. It was written by a quantitative analyst named David Beach, who needed it for some specific financial application that he was doing. He noticed that if you stream BSON data through PyMongo and then into numpy from there, it is pretty slow, and your data conversion is typically your bottleneck. MongoDB is fast, numpy is fast, but converting each number from one data format to the next is very expensive and it is a lot of wasted work. So he wrote a little bit of C code which queries MongoDB using the C driver rather than PyMongo, converts the BSON data directly into numpy arrays without passing through any Python data structure, and then hands you giant buffers of numbers that it got from MongoDB. And then you can use numpy's incredibly fast statistical methods on that data.
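
The idea of skipping per-number Python objects can be sketched with the standard library. This is not Monary's API or implementation: the stdlib `array` type stands in for a numpy array (with numpy, `numpy.frombuffer` plays a similar role), and the packed bytes stand in for numeric data arriving from the server.

```python
import struct
from array import array

values = [1.5, 2.25, 3.0, 4.75]

# Packed native doubles, standing in for a buffer of numbers received off the wire.
raw = struct.pack("%dd" % len(values), *values)

# Slow path: decode one number at a time, creating a Python float for each.
per_value = [x for (x,) in struct.iter_unpack("d", raw)]

# Fast path: copy the bytes into a typed buffer in one call, with no
# intermediate Python object per number.
bulk = array("d")
bulk.frombytes(raw)

print(per_value == list(bulk))
```

Both paths yield the same numbers. The difference is how much per-element work happens in the interpreter, which is what dominates on large result sets.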

22:00 That's really fantastic. So if maybe you are storing a bunch of data in Mongo for big data on numerical type stuff, this should be the thing for you from Python.

22:10 Exactly. And that is a pretty common use case among financial institutions. There are also a lot of universities doing big bioinformatics with MongoDB, and there is generally a lot of use within the scientific community for storing numeric data in MongoDB. Since numpy has such a rich set of statistical routines that you can just take off the shelf, being able to go between the two in Python incredibly fast is an awesome feature. So Monary can do up to a million queries per second on commodity hardware, or query upwards of a million documents per second.

22:56 That's amazing. We've been adding features to it over the years. We had a couple of interns last summer, Matt Cotter and Kyle Suarez, and now a new hire working for me, Anna Herlihy, is adding more and more features to Monary every month. So it is becoming better documented; it is now read-write, so you can insert numpy arrays into MongoDB, and it is similarly optimized along that path; and we are adding SSL and authentication, which financial institutions will probably want before they start analyzing financial data. So that fills in the other portion of the environment, where you are doing single-threaded CPU-bound calculations on numeric data within MongoDB.

23:49 Yeah, that really does open up the whole science story for MongoDB a little bit more from Python anyway, that's really cool. So what surprises you most about what you see people do with Mongo from Python or even with PyMongo specifically?

24:02 Well, what surprises me most is that there are certain mistakes that are incredibly common, and I wish we could figure out how to stamp them out. The main mistake that I see people make is that they create a new MongoClient instance for every HTTP request. So they pay the price of TCP setup very often, as well as SSL and authentication setup and then the TCP slow-start algorithm, all of this incredible overhead involved in opening the socket, and then they do one query and they shut it all down. Oh, and they defeat connection pooling as well. And there is this strange resistance to just creating a single global variable which is the MongoClient, and I don't understand it; I think they are coming from Java, where there is no module-level global variable. But it means that among Python programmers who make this mistake, which is a huge number of people, their performance is probably a third or a quarter or a fifth of what they should be seeing.
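
The difference between the two patterns can be shown with a small runnable sketch. `FakeClient` is a hypothetical stand-in for `pymongo.MongoClient` so that the example runs without a server; it only counts how many clients get constructed. In real code the shared object would be a single module-level `MongoClient`, which keeps its own connection pool.

```python
class FakeClient:
    """Stand-in for pymongo.MongoClient that counts how many clients are constructed."""

    created = 0

    def __init__(self):
        # A real client would pay for TCP, SSL and authentication setup around here.
        FakeClient.created += 1

    def query(self):
        return {"ok": 1}


def handle_request_wrong():
    client = FakeClient()        # anti-pattern: a brand new client for every request
    return client.query()


shared_client = FakeClient()     # created once at import time, reused by every request


def handle_request_right():
    return shared_client.query()


def clients_created_by(handler, requests=100):
    """Count how many new clients are constructed while serving `requests` requests."""
    before = FakeClient.created
    for _ in range(requests):
        handler()
    return FakeClient.created - before


wrong_count = clients_created_by(handle_request_wrong)
right_count = clients_created_by(handle_request_right)
print(wrong_count, right_count)
```

Serving 100 requests constructs 100 clients in the first pattern and no additional clients in the second, which is the overhead being described.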

25:21 Wow, that is amazing. And it is easy to fix.

25:26 It's incredibly easy to fix.

25:27 That's the thing, right.

25:27 Right. Doing it correctly is easier than doing it wrong and yet doing it wrong is very common.

25:31 Yeah. Well, hopefully people out there listening will go make a global Mongo client.

25:36 I hope so, and we are also adding a best practices document to the PyMongo documentation in the next release, which I hope will further dissuade people from making this error.

25:48 Yeah, that would be great. It would be cool to have a list like "these are the top 5 worst things you could do that we see out in the wild, so don't do them." You know, something like that would be nice to put up somewhere.

25:58 That is a good idea, although honestly, it is really only that one. If you don't do that, you are probably doing a good job.

26:05 Excellent. So you guys just had a big release as well; you have kind of changed the entire underlying file system with something called WiredTiger, and you released version 3.0? What is the quick rundown of that? That is pretty exciting.

26:19 Right. So it's not the default at the moment, it's opt-in, but with MongoDB 3.0 we made the option available to 26:30 old storage engine with WiredTiger. The performance characteristics are still a little complex, so I am not going to make any inexpert pronouncements on that, but for a large number of use cases WiredTiger seems to be much, much higher performance, and it especially has much better concurrency. So if you are doing a lot of 26:59 on certain kinds of hardware, you should see a lot better throughput with WiredTiger. So it is really exciting, and we also acquired that company. The WiredTiger open source project continues, but it means that we have that expertise in-house to make sure that WiredTiger and MongoDB work perfectly together.

27:20 They were from Berkeley DB or something like that originally, right?

27:24 I think a lot of the WiredTiger people were among the Berkeley DB developers. Yes.

27:29 Yeah, excellent. You have some opinions on editors, right, like Vim versus PyCharm. What are your thoughts there?

27:37 Yeah, so in recent years I've spent a lot of time mentoring young people or just working together with a lot of junior developers, and what I particularly noticed is that everybody uses 27:50 for everything. And it is kind of sad to watch, because I say "Go find this method" and then I just have to sit around for a little bit, drinking tea and watching them grepping the files and doing an incremental search through each file, searching for, like, "def create_collection" or something in order to find this method. And it is just sad, because with PyCharm you hit, I think it's Command-Option-O, and you start typing the method and you jump right there. And when people watch me use PyCharm, this isn't bragging, it's just that if you use the right tool for the job you can leap around really easily. And it's not just navigation. For huge search-and-replace jobs, PyCharm has all of these modes and all of these wonderful ways of breaking down big code change tasks to make sure that you complete them correctly. Using a visual debugger is also completely invaluable. Like, when somebody says "I have this bug" and I say "Well, what have you tried?" and they say "Well, I added 100 print statements," I know that something is wrong. And invariably, even though they spent hours on it, I say "Well, let's get it into PyCharm," and I use a couple of breakpoints and a few watch points and we find it, because that's what the debugger is for.

29:23 That's right, a picture is worth a thousand words. And you can kind of say the same about a visual debugger, right.

29:28 Yeah, exactly. You don't have to ask the print statement "Is this what I expect?" You just have everything displayed for you and you can immediately see what it is about the code that is not matching your expectations.

29:45 Yeah, that's awesome. I totally agree with you, and I am a big fan of PyCharm as well. I use it for all my Python work, and you can just hit Shift two times and type something and it will take you to a file or to a method or to a variable. It's fantastic.

30:00 Yes, I agree. It also shows you your unused imports and it checks your style while you go so you stop making that error as well.

30:12 Awesome. All right, well, let me ask you one more question. So out there on PyPI there are tons and tons of packages, you know, 65,000 or 56,000, something like that. There are some great ones out there; do you have any favorites? PyMongo and Motor maybe, ha?

30:29 Ha ha ha, yes, PyMongo is a great package, it's one of the most popular. And Motor is a fun little project written by some smart guy somewhere, I wonder who that was. Another package I wrote that I am very fond of is called Toro, and it's a set of things like locks, queues, conditions, and event variables. But it is not for multithreading, it's for Tornado coroutines. So it kind of extends the analogy between coroutines and threads, and it makes it possible for your asynchronous coroutines, which are optimized for I/O-bound applications in something like Tornado, to coordinate using the same kind of patterns you are used to with threads. So you can make producer-consumer coroutines using the exact same pattern as you used before with producer-consumer threads. And I am working with Tornado's author, Ben Darnell, to contribute Toro piece by piece into Tornado so that Tornado will have those features built in with Tornado 4.2. It has nothing to do with MongoDB. I think one of the things I like about it is that it's just a separate project that I kind of use recreationally, and it also has a lot of really fun patterns that are fun to think about.
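
The producer-consumer pattern for coroutines can be sketched with the standard library. This example uses `asyncio.Queue` rather than Toro or Tornado, whose own APIs are not shown here; the function names and the sentinel convention are assumptions made for the illustration. `asyncio.run` needs Python 3.7 or later.

```python
import asyncio


async def producer(queue, items):
    for item in items:
        await queue.put(item)      # waits, without blocking the loop, while the queue is full
    await queue.put(None)          # sentinel: nothing more to produce


async def consumer(queue, results):
    while True:
        item = await queue.get()   # waits until the producer has put something
        if item is None:
            break
        results.append(item * 2)


async def main():
    queue = asyncio.Queue(maxsize=2)   # a small bound forces the two coroutines to take turns
    results = []
    await asyncio.gather(producer(queue, [1, 2, 3, 4]), consumer(queue, results))
    return results


results = asyncio.run(main())
print(results)
```

The shape is the same as the classic version with `threading` and `queue.Queue`: only the waiting is done with `await` instead of blocking a thread.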

32:02 Yeah, that's really cool. That kind of programming trying to coordinate asynchronous coroutine, sounds like a really interesting sort of problem, or puzzle to solve, you know.

32:12 Right. And a lot of those questions have already been answered for threads, so if you have the same primitives, or analogous primitives to use with coroutines, you can use the same insights that people have developed for threads over the last few decades.

32:28 Fantastic. So yeah, people should check out Toro. Very cool. People who have been listening to us talking about MongoDB and in case you guys do not know I am a huge fan of MongoDB as well I use it for many projects. If people are excited and they want to get started what do they need to do?

32:42 So you can pip install PyMongo and then you can go to mongodb.org and download the MongoDB server. Among the other awesome things about MongoDB, you install it by untarring it and then starting it, period. And if you want to learn more, the PyMongo documentation online, at api.mongodb.org/python, has a complete tutorial for you. If you want to go a little deeper, there is an excellent online course called M101P, so that is MongoDB 101 for Python, and the next section of that class starts on May 26th.

33:26 Excellent. And is that education.mongodb.org?

33:30 Yeah, that's correct. Exactly.

33:33 Excellent. Ok, well is there anything else you want to sort of give a shout out to or get people's attention focused on?

33:37 I am just really excited to be on this podcast and I think that this podcast is excellent and I am looking forward to listening to the next episodes while I'm on the subway.

33:50 Excellent. Thank you, Jesse, it's been a really great conversation and I think people are going to be super interested in this stuff. MongoDB is such a fun project to work with. When I was working with relational databases I always felt like the database was a necessary evil that was kind of resisting what I was trying to build in my app, and then you switch to Mongo and it is just sort of frictionless design and evolution of the app. So thanks for all your work on making that a possibility.

34:14 Thanks very much Michael.

34:16 You bet. Talk to you later.

34:20 This has been another episode of Talk Python To Me. I want to thank our sponsor Python Gear for making this show possible. Please visit pythongear.com, get an awesome T-shirt or sticker, and let them know you heard about them on Talk Python To Me. Smixx, take us out of here.

[music]
