#109: MongoDB Applied Design Patterns Transcript
00:00 Michael Kennedy: Database design and decisions used to be fairly straightforward. Pick your relational database engine, map out the general entities, apply the third normal form to them and you're basically done. With the Cambrian explosion of database options and variations created from about 2009 to present, it's way harder to even choose the database, much less follow the well-worn path of third normal form for modeling. On this episode, you'll meet Rick Copeland, a fellow MongoDB Master and author of the book MongoDB Applied Design Patterns. We'll discuss modeling data using documents in a document database such as MongoDB and some techniques and situations that apply particularly to MongoDB's implementation. This is Talk Python To Me, Episode 109, recorded April 26, 2017. Welcome to Talk Python To Me, a weekly podcast on Python. The language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython. This episode is brought to you by Advance Digital and Hired. Please check out what they're offering during their segments, it helps support the show. Rick, welcome to Talk Python.
01:37 Rick Copeland: Thanks for having me on Michael.
01:38 Michael Kennedy: I'm very excited to have you on. Anytime I get to talk about MongoDB, it really makes me happy, so we're gonna have a lot of fun doing that. I think the work that you've done in MongoDB in a couple areas, in your ODM as well as your book that we're gonna talk about, it's super work. So I'm looking forward to sharing it with everyone and talking to you about it.
01:55 Rick Copeland: Awesome.
01:55 Michael Kennedy: Yeah. Of course, before we get into that, we gotta hear your story. How'd you get into programming and Python?
02:02 Rick Copeland: When I was a kid, my dad got an Apple II, and I guess that dates me a little bit, but I learned how to program on Basic, starting there and then ended up just getting into Computer Science in college. I was pretty hardcore in C, C++. Out of college, I did some systems programming and various other types of things with Visual C++ and ended up having a little period when I was exploring new programming languages and I ran across an essay by Eric Raymond about why Python. It was kind of like his new favorite programming language. And I'd seen a little bit of Python before. Like I'd run across Gentoo and I saw it's got this crazy indentation syntax and I kinda dismissed it but then when someone credible
02:44 Michael Kennedy: This is a weird language, it uses whitespace. Keep going.
02:47 Rick Copeland: Yeah. When someone credible said this is really cool, I figured that I would get into it. I kinda taught myself Python, started to introduce it at that time into an enterprise programming environment which was interesting. They're using mainly C# and Visual Basic. It was kind of one of those things where whenever there was a problem and they needed something solved really quick, I would say, oh, I'll use Python for that, and they said, don't tell me what you're doing. Just fix the problem. So kind of flying in and out of the radar and then discovered that I really loved it, so the next job, I sought out Python programming positions after that. I think Python was around version 2.3 then, so obviously it's grown a lot of new bells and whistles since then, and it's just a lot more fun to program in even than it was then.
03:33 Michael Kennedy: Yeah. Both in terms of the language and what not but also the ecosystem and all the packages. It just keeps getting cooler. I think it's a fun place to be for sure.
03:44 Rick Copeland: Yeah. Even back in 2002 or '03 when I started doing Python, it was so far, like what you got when you just downloaded the standard library was so much more than you got in any of the other languages that I was familiar with. If you wanted to do anything in C++, you had to go out and find something that you would configure, make and make install, and then you'd be able to get those development libraries. With Python, you can download URLs. You can create an FTP server. Or things like that. And it's just built in. That was a nice aspect too, that batteries-included approach at that time.
04:20 Michael Kennedy: Yeah, absolutely. So, what are you doing with Python these days? What's your day job?
04:24 Rick Copeland: These days, I'm a consultant. Which basically means that you kind of do a little bit of everything. Sometimes I'm out training different companies in how to do Python. I've been to DC and California in the last two weeks. I sometimes do custom development for folks. I've got I think three active projects going on for that. And I'm working on a startup on the side. Everybody's gotta have their little side hustle that they wanna eventually make it into something. I guess I got more of a three-side hustle that's going at this point.
04:56 Michael Kennedy: Yeah, that sounds fun. It's a challenge to do all these different things, I know, but it's also fun to have a wide variety and not just be doing the one thing, right?
05:05 Rick Copeland: Yes. Yeah, it is. It's never boring. Sometimes it can be a little bit overwhelming like what am I supposed to work on today, and the context switch can get a little bit much but it's all good.
05:16 Michael Kennedy: Yeah, it's better than the alternative, I think. Although you're right, it's definitely overwhelming. You're still using MongoDB for some of these projects?
05:23 Rick Copeland: I am. One of the things that I discovered is that it's kind of my go-to at this point. I know most people, they learn relational databases 'cause it's much more widespread use but I just kinda got used to using MongoDB. In my last business, it was the main database of choice. I have all the tooling and I have the familiarity with it so that's just the first thing I reach for whenever I'm writing something for someone.
05:48 Michael Kennedy: Yeah, I'm with you. I feel like it just has so much more flexibility and what not. I feel like a lot of people fall into using relational databases because that's considered the safe choice or that's what they already know, what they were taught in college, or like the people they're working with, they already know that, but it's not necessarily 'cause it's the best choice.
06:08 Rick Copeland: Right, yes.
06:10 Michael Kennedy: Absolutely. I thought maybe we're mostly gonna focus on advanced MongoDB design patterns and implementation concepts, but not everybody listening to this is totally familiar with Mongo or NoSQL or document databases. Maybe you could give us a quick view just into the summary of what is NoSQL? What is a document database? I'm also interested to hear your thoughts on what is NoSQL 'cause everybody seems to have a slightly different definition of that.
06:42 Rick Copeland: Okay, sure. I would say NoSQL is kind of anything besides SQL. So, if you want to drop some of the constraints that SQL puts on you if you're programming. If you look at things like transactions, do you want to be atomic, consistent, isolated, and durable. If you drop some of those, then maybe you're on the NoSQL land. A lot of the NoSQL databases or I guess I should say they kind of run the gamut. A NoSQL database could be just a key value store, so you're able to look up things very quickly. It could be something that's more complicated, has an exotic model like Cassandra with the column data store or MongoDB which is a document data store. That's more like the storage model than the programming model. What are you actually putting into this database? I guess how I describe MongoDB or its document model is if you think of the relational databases, you've got tables which are made up of rows and columns. In MongoDB, we don't call them tables. We call them collections. What you put in those collections are JSON objects. If you've done web programming, you've probably run into JSON. If you haven't, then it's Python dictionaries and lists embedded in each other. Basically, that's the data model you're looking at. You just have a collection of these documents and we call those JSON objects documents. You can kinda query into those collections. You can say, give me all the restaurants that are bakeries in this collection. And so you have some field in each JSON document that says the type of cuisine and it's a bakery. You could do that sort of a query on a MongoDB database which makes it a little bit different from a key value store because you don't have to just query based on the key of that document.
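The bakery query Rick describes can be sketched without a database at all. The snippet below is only an illustration of the document model: the `restaurants` list, its field names, and the `find` helper are made up for this example and stand in for a collection and a query by field value.

```python
# Documents are plain dicts and lists nested in each other;
# a "collection" here is just a list of such documents (illustrative data).
restaurants = [
    {"_id": 1, "name": "Rise and Shine", "cuisine": "Bakery", "address": {"city": "Atlanta"}},
    {"_id": 2, "name": "Noodle House", "cuisine": "Thai", "address": {"city": "Atlanta"}},
    {"_id": 3, "name": "Crumb", "cuisine": "Bakery", "address": {"city": "Decatur"}},
]


def find(collection, query):
    """Return documents whose top-level fields equal every key/value in query."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in query.items())
    ]


# Query on a field other than the key, which a pure key-value store cannot do.
bakeries = find(restaurants, {"cuisine": "Bakery"})
print([doc["name"] for doc in bakeries])
```

With PyMongo against a real server, the same filter dictionary would be passed to `collection.find({"cuisine": "Bakery"})`; the in-memory helper above only mimics equality matching on top-level fields.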
08:33 Michael Kennedy: Right, key value stores are like the most basic, fastest, most scalable, but also the most limiting, right? Because you've got the key, the primary key like an ID or maybe you could use an email address or whatever if it's a user, but then it makes it really hard to ask interesting questions. Like, the rest of the data is fairly an opaque BLOB. I know there are ways to add extra stuff around some of the databases, but still.
08:58 Rick Copeland: Yeah. And a lot of the times, those things, you have the key value store and then people will create their manual indexes around it. When you've got kind of the natural key of something, maybe it's some restaurant ID, and you wanna look up things by cuisine, so what you have is maybe a whole bunch of things in another collection but it's like cuisine is the key. Then those point to the restaurant ID. They build their own indexes out of these things. MongoDB takes care of that for you in a lot of cases.
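The hand-built index Rick mentions might look roughly like this in a plain key-value setting. This is a sketch under assumptions: two Python dictionaries stand in for the key-value store, and the names `put_restaurant` and `find_by_cuisine` are hypothetical.

```python
from collections import defaultdict

# Primary key-value store: restaurant ID -> record.
restaurants_by_id = {}
# Hand-maintained secondary index: cuisine -> set of restaurant IDs.
ids_by_cuisine = defaultdict(set)


def put_restaurant(restaurant_id, restaurant):
    """Write the record and keep the cuisine index in step with it."""
    old = restaurants_by_id.get(restaurant_id)
    if old is not None:
        # Remove the stale index entry before adding the new one.
        ids_by_cuisine[old["cuisine"]].discard(restaurant_id)
    restaurants_by_id[restaurant_id] = restaurant
    ids_by_cuisine[restaurant["cuisine"]].add(restaurant_id)


def find_by_cuisine(cuisine):
    """Two lookups: the index first, then each record by its key."""
    return [restaurants_by_id[rid] for rid in sorted(ids_by_cuisine.get(cuisine, ()))]


put_restaurant(1, {"name": "Crumb", "cuisine": "Bakery"})
put_restaurant(2, {"name": "Noodle House", "cuisine": "Thai"})
print(find_by_cuisine("Bakery"))
```

Every write has to update both structures, and forgetting the cleanup step leaves stale index entries. That bookkeeping is what a database-maintained index removes.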
09:27 Michael Kennedy: Right. One of the big distinctions I think that it takes people who are new to this idea of document databases, and MongoDB is not the only one, just the most popular, probably the best, but there's things like Azure DocumentDB. There's CouchDB. There's a variety of them, right? I think it's easy to look at these databases and go, oh, it's kind of like a JSON field embedded in another database or like storing a BLOB of data or something like that, right? You know like, I could do this in say Microsoft SQL Server and just make like a text field or a JSON field and stick a BLOB in there, but the big difference is you can index deep into these things, you can do rich queries into them. They're not just hierarchical things you can store but even look at some of the object databases, things like ZODB, things like that, right?
10:18 Rick Copeland: Yes. That is the big difference between this and a key value store I would say is that you can index and do these things. If you're using something like, you could take MySQL and throw a BLOB column and an ID column into a table and call it a NoSQL database if you want to, but you don't really get the ability to do these sorts of rapid lookups on something other than the key. That's what MongoDB gives you. That, plus it gives you some scaling advantages as well. My background, I haven't taken advantage of that as much, because I've just taken advantage of the fact that you can do reasonable things very quickly as opposed to doing unreasonable things reasonably quickly.
11:00 Michael Kennedy: Right. I think that's a really interesting comment you made about the performance. One of the things you can do with a lot of these NoSQL databases, Mongo included, is you can do lots of replication. You can do specifically I'm thinking of sharding, like I could set up a 10-sharded cluster and so we do insert sort of like crazy fast and we can parallelize our queries and our aggregation and mapReduce stuff, all those kinds of things, right? That is like the thing that draws people to it, they're like, look what we can do with performance, but at the same time, very few people actually end up needing that much performance. I mean, I've seen a few places where it was really needed, but 99% or more, the use cases don't. But everybody has a, my relational schema is a pain to deal with. It's hard to add columns. It's hard to change the shape of it. It's like a pain, it's slowing me down. Everybody has a complexity problem with their software and I feel like modeling in these documents solves that complexity problem for everyone, not just the one person.
12:02 Rick Copeland: Right. To me, the sharding and the ability to scale up horizontally has always been kind of a safety feature. If things go really super well, my system's not going to fall over. I'm not gonna have to come back and do manual sharding and partitioning my database and re-architect my whole application. I know that there is a path forward if something like that happens, but for now, I can get things done faster than I could with any of the relational approaches.
12:29 Michael Kennedy: Right, yeah. I totally agree, totally agree. There's a bunch of different ways to access MongoDB from Python, right? You've got the official Python driver, PyMongo from the MongoDB folks. You've got this new thing, BSON, NumPy BSON, I don't remember the order of it, that goes straight into the data science type structures in NumPy. Then on top of PyMongo, we've got things like MongoEngine, Ming, MongoKit, and all these ODMs, right?
13:03 Rick Copeland: Right.
13:04 Michael Kennedy: One of these, named Ming, actually is one that you created.
13:08 Rick Copeland: Yes.
13:09 Michael Kennedy: Maybe give us a quick overview of what are the trade-offs, when would you consider using one of these ODMs? What the heck is an ODM anyway?
13:17 Rick Copeland: It really comes down to the idea that, I said we don't have tables. One of the other things that you don't really have when you're dealing with MongoDB is you don't have a database enforced schema. It's kinda like this big bag of things. I said they're kind of like Python dictionaries. I'll have to say this very carefully, but you don't want to end up with a big bag of dicts when you're working on this. What you need to actually have is some sort of a schema that tells you the sorts of things that you're gonna put on these collections. Because it turns out, whatever you're putting into them, when you read them back out, your code's gonna have to do something with that data. You can't just say, well, I'm just gonna store everything in there and so it's magically going to reappear. Your code is making certain assumptions about what fields are in those dictionaries, what keys, and like what is the structure of the data that you're storing? That's really what these ODMs or object document mappers do. That tells you, in this collection, we're putting things in that look like this restaurant. Although MongoDB, until very recently, didn't have any form of enforcing schemas, this would be something in your code where you're documenting it. At the very least, you're documenting what sorts of dictionaries or BSON documents you wanna be putting into these collections.
14:36 Michael Kennedy: Right. If you go through one of these ODMs, object data mappers layers, you basically go through predefined classes and objects in Python which themselves have a fixed structure. And so you're kind of, you filter through a known layer of schema and that works pretty well, right?
14:55 Rick Copeland: Yeah, yeah. That allows you to kind of, you can get a long way without having any kind of a documented schema if you're the only programmer on the project. But once you start having multiple people, you need to have kind of a common understanding of, well, I'm going to write things that look like this to the collection and I'm going to read things and expect them to look like this other thing. That's kind of the base level of why you need something like this. You need a library or a data access layer. Sometimes people will write their own. That's a pretty common thing. If you don't use an ODM, then people will typically write Python modules that have getters and setters for different types of data that they wanna put in the database. That's the approach that MongoDB uses on their training materials, I think. They built this Python module that does these things. An object document mapper allows you to kind of abstract that out and write those more quickly. Rather than saying, I want to write a function that calls this call to get restaurant, then I can have a restaurant class that has a get method, but the get method is not something I have to write, it's something that the ODM provides me. I hope that makes sense.
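To make the "restaurant class that has a get method" idea concrete, here is a deliberately tiny sketch. It is not Ming's or MongoEngine's API: the `Document` base class, the `fields` declaration, and the dictionary used as a stand-in collection are all assumptions made for illustration.

```python
class Document:
    """Toy base class: subclasses declare `fields` as {name: type}."""

    fields = {}
    _store = None  # stand-in for a collection; each subclass sets its own dict

    def __init__(self, **values):
        unknown = set(values) - set(self.fields)
        if unknown:
            raise ValueError(f"unexpected fields: {sorted(unknown)}")
        for name, expected_type in self.fields.items():
            if name not in values:
                raise ValueError(f"missing field: {name}")
            if not isinstance(values[name], expected_type):
                raise TypeError(f"{name} must be {expected_type.__name__}")
        self.__dict__.update(values)

    def save(self):
        """Store the declared fields as a plain dict keyed by _id."""
        self._store[self._id] = {name: getattr(self, name) for name in self.fields}

    @classmethod
    def get(cls, _id):
        """Provided by the base class, so subclasses never write it by hand."""
        return cls(**cls._store[_id])


class Restaurant(Document):
    # The declaration doubles as documentation of what goes in the collection.
    fields = {"_id": int, "name": str, "cuisine": str}
    _store = {}


Restaurant(_id=1, name="Crumb", cuisine="Bakery").save()
print(Restaurant.get(1).name)
```

The point of the sketch is the division of labor: the subclass only states the shape of the data, and the shared base class supplies validation plus the `get` and `save` plumbing that a hand-written data access module would otherwise repeat per type.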
16:01 Michael Kennedy: Yeah, I think it does. A lot of these, I can't speak to Ming, you have to fill us in on the details. I don't remember exactly, but the one that I'm using right now is called MongoEngine which is also one of the more popular ones. It has a lot of additional things that helps you with like, you define a class and it's much like SQLAlchemy, you say these are the fields that go into the database, and this one's a string and it has to be unique. This one's an integer and I want an index on it. Things like that. And so it'll actually apply the uniqueness constraints, it'll apply the index, it'll create and enforce the indexes, all those sorts of things as well.
16:37 Rick Copeland: That's part of what Ming does definitely. You can put these constraints in there, these indexes that you wanted to define and it'll go ahead and create those indexes for you. Another thing that it's helpful or that it helps you with is your schema evolves as you're building your applications. Maybe you didn't need a zip code when you first started or you forgot that you were going to need a zip code. Maybe that's a bad example, but there's some fields that you want to add later on and you need a sensible default for the existing documents.
17:07 Michael Kennedy: Like let's say you have an account class and you eventually want to start verifying that they've verified their email address, and you didn't think of that at first, you don't have a, is email verified, or something like that.
17:17 Rick Copeland: Yeah, so maybe you'll want all of the existing documents that have that be default false, they haven't verified their email. You can write a validator or a type into your schema that says when this field is not found, then I want you to populate the Python object with false. It helps you do these sorts of on the fly data migrations.
17:41 Michael Kennedy: Right. You described it in your book as a lazy migration or lazy schema migrations because in a relational database that wouldn't fly very well, right? You'd have to say, we're going to have, we're going to do a schema transformation. We're gonna add a column and it's gonna be of type bool and it's gonna be default. There's some kinda script you gotta run probably.
18:01 Rick Copeland: Yeah. I guess that's the deal. When you're going to be changing the schema in a SQL database, you do it upfront. So you've got to make sure that all of the rows conform to the schema. The database is gonna enforce that. You do this alter table statement and it makes sure that everything conforms to it. That's one approach, and you can do that in MongoDB as well. You just go in and you overwrite all of the existing documents.
18:27 Michael Kennedy: Right, it maybe isn't a SQL script. It's just a JavaScript script or something like that.
18:31 Rick Copeland: Exactly.
18:33 Michael Kennedy: And you run that.
18:33 Rick Copeland: Yeah. Mongo gives you the option of kinda waiting until you actually load a particular document to make sure it conforms to your current schema. That's something that we built in to Ming as well so that you could actually read a document and then check to see, does that document actually conform to my current schema? If it doesn't, you have the ability to fall back and run a migration function on that document. So you can actually bring things forward at the moment when they're loaded out of the database.
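The lazy migration described here, using the email verification example from a moment earlier, could be sketched as follows. This is not Ming's actual API: the field name `is_email_verified`, the helper names, the write-back on load, and the dictionary standing in for a collection are all assumptions for illustration.

```python
# Fields the current code expects on every account document (assumed names).
CURRENT_FIELDS = {"_id", "email", "is_email_verified"}


def conforms(doc):
    """True when the document has exactly the fields the current code expects."""
    return set(doc) == CURRENT_FIELDS


def migrate(doc):
    """Bring an older document forward: unverified unless it says otherwise."""
    upgraded = dict(doc)
    upgraded.setdefault("is_email_verified", False)
    return upgraded


def load_account(collection, _id):
    """Read one document, migrating it only if it is out of date."""
    doc = collection[_id]
    if not conforms(doc):
        doc = migrate(doc)
        collection[_id] = doc  # optional write-back so the next read is already current
    return doc


accounts = {
    1: {"_id": 1, "email": "old@example.com"},  # written before the field existed
    2: {"_id": 2, "email": "new@example.com", "is_email_verified": True},
}
print(load_account(accounts, 1))
```

Documents that are never read are never rewritten, which is the trade-off against an upfront migration script: no big batch job, but old-shaped documents can linger until something loads them.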
18:58 Michael Kennedy: Yeah, that's really cool. One of the early success stories for MongoDB I think comes from SourceForge actually, and you're a part of this. I remember SourceForge, this is before there's GitHub or anything like this. SourceForge was used quite frequently. I remember it was getting painfully slow. Then one day it was fast again. You were involved in that somewhat and that's actually partly where Ming came from, right? You wanna tell us about that?
19:27 Rick Copeland: Yeah. When I came to SourceForge, I remember in the interview when I was about to come to SourceForge, I had worked on building a SQLAlchemy-like library for I guess you could call it a NoSQL database. It was a private thing that the company that I was working for had developed internally. They asked me about that and they said, we got this thing called MongoDB that we're thinking about working on and it kinda stores Python dictionaries, so what would your approach be to doing something like that? And so I kind of talked about it and I guess my answer was good enough and they hired me. They said, we did some performance evaluations, and at that time, it was like 2009, they looked at various different approaches and they said MongoDB is gonna give us a performance that we need and we're comfortable with the data model. So we like the idea of storing things that look like Python dictionaries into the database, but we would like to have something like some kind of a schema enforcement layer or ODM. Although I don't know that that was really a big term at that point.
20:25 Michael Kennedy: They may be called an ORM even.
20:27 Rick Copeland: Yeah, call it an ORM for a non-relational database.
20:31 Michael Kennedy: Yeah, ORM minus the R.
20:32 Rick Copeland: Yes. We started working on that. I was the main developer on Ming. Ming formed kind of the data layer of a complete rewrite of all of the SourceForge developer tools. When you think about SourceForge, there's kind of two sections of it, I mean, if you think about SourceForge these days, but there's sort of the, this is the site for the developers to build their software and this is the site for users to download software.
21:03 Michael Kennedy: This portion of Talk Python is brought to you by Advance Digital. How would you like to build one of the most visited news sites in the US? That sounds fun. The folks at Advance Digital would love to talk to you. They're primarily a Python shop located in beautiful Jersey City, just one subway stop from Lower Manhattan. Spend your time building an amazing web app with Python and do it with a small team of developers focused on agile development. Are you gonna miss PyCon this year 'cause your company wouldn't fund the travel and expense? If you join this team, they'll cover your conference and training initiatives. It's time to take your Python to the next level. Build an amazing web app. Get started by visiting python.advance.net right now.
21:42 Rick Copeland: We rewrote kind of all the developer tools and we rewrote a lot of the download side of things as well. That was actually a migration from PHP to Python and a migration from largely Postgres-backed to mostly MongoDB. We kinda did it in stages, but Ming was a big part of that, being able to kind of come in and say we've got a group of programmers working. What's our common understanding of the data we're storing in this weird database that none of us has seen before?
22:10 Michael Kennedy: Right. How did it go? I recall that there was some pretty major stats in like how much better the site got, how many fewer database servers there were, things like that, do you recall?
22:24 Rick Copeland: I remember we went from handling, it was something like 13 servers that were running a PHP front end, our first deployment, we went down to I believe four Python servers doing basically the same work. That was a nice thing for Python, and of course the PHP was backed by Postgres and the Python was backed by Mongo. One of the other things in the first version, and this is what a lot of people did with Mongo at that time and I guess still probably do is when you're introducing this new technology, you kinda take baby steps. Mongo was not our system of record initially. We would use it as kind of a cache for all of the Postgres data that was coming from the legacy system. So all of that went into Mongo and then as long as you obey a few little rules, make sure your working set fits into RAM, Mongo behaves. Its performance was closer to memcached than it was to a relational database. Super fast for a read mostly workload and that's why we were able to do nice things. Then we, like a lot of people who first deployed MongoDB, we think, oh, this is great. It can probably do anything I want it to do. So we wrote a little rate limiter in MongoDB. We did it in a really stupid way, it turns out, but basically just logging every request and then every time a request comes in, we would query to see how many requests from that IP in the last X seconds or minutes or whatever rate limit was. That worked until it didn't, which was when the index got bigger than our RAM and you got this nice cliff of performance. So, we reworked that. For the most part, it was a pretty good roll out. A lot of success moving from PHP to Python. There's still things that we have to run on or when last I read, there are still things that ran on Postgres at SourceForge, but it was primarily MongoDB later on.
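The log-everything rate limiter Rick describes can be illustrated with a short in-memory sketch. It is not SourceForge's code: the list stands in for the collection of logged requests, and the function name, `limit`, and `window_seconds` are hypothetical parameters chosen for the example.

```python
# Stand-in for the collection of logged requests; nothing is ever removed.
request_log = []


def allow_request(ip, now, limit=3, window_seconds=60):
    """Log the request, then count this IP's requests inside the window."""
    request_log.append({"ip": ip, "ts": now})
    recent = sum(
        1 for entry in request_log
        if entry["ip"] == ip and entry["ts"] > now - window_seconds
    )
    return recent <= limit


# Four requests in quick succession, then one after the window has passed.
for t in (0, 10, 20, 30, 100):
    print(t, allow_request("10.0.0.1", t))
print("logged entries:", len(request_log))
```

The check itself is simple, but the log only ever grows: every request adds an entry, and any index over it grows with it. That growth is the mechanism behind the cliff Rick mentions once the index no longer fits in RAM.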
24:20 Michael Kennedy: Yeah, okay, that makes a lot of sense. That's really cool. Is it still running Mongo, do you think? Do you know?
24:25 Rick Copeland: Well, it is. The first version that was rolled out was only for the download side. Then we ended up rewriting all of the developer tools in Python and MongoDB. That ended up being outsourced, not outsourced, open sourced as the Apache Allura Project, so it's now an official Apache Software Foundation project and anybody can run the same tools that SourceForge is running for developing software. There's a little bit of setup involved, but it's still out there. It's something that was kind of a goal early on that we wanted to make sure that we gave back to the community with what we were doing. Of course, Ming was always open source from the beginning. SourceForge has had its moments of evil but generally has been a good supporter of open source software.
25:12 Michael Kennedy: Yeah, I'd say historically it's probably got a positive grade all in all.
25:16 Rick Copeland: Yeah.
25:18 Michael Kennedy: One of the things I really wanna dig into while we're talking is your book called MongoDB Applied Design Patterns. But before we get to that, I just wanna quickly run an idea by you. Maybe make a plea to anyone who's either running or considering running Mongo. I think, I'd love to hear your opinion, one of the things, I think Mongo is super great, but I think they made a few fairly minor decisions that have come back to haunt them in certain ways that get amplified from the early days. I think one of those is by default not running encrypted connections and another is by default not running with authentication.
25:55 Rick Copeland: Yes. Their defaults have always been interesting. Maybe I'll use that word.
26:00 Michael Kennedy: I think they've optimized too much for performance and scalability and not enough for durability and safety nets. I'm thinking of the initial write concern defaults. I'm thinking of the lack of journaling in the early days. All these things. Each one of them maybe made sense in their original world but I think people have taken these and not knowing they need to be aware of them got themselves in trouble.
26:22 Rick Copeland: Absolutely. When we started out the default way that you wrote to MongoDB, if you didn't change any of the settings and you do an update or an insert or whatever, basically you got an acknowledgement from the server that hey I received your request to write this data to the database. What you didn't have was any assurance that it actually made it on to disk. We didn't even get an acknowledgement.
26:45 Michael Kennedy: Yeah, not even into the dataset and memory, just, the server's received your socket request basically.
26:50 Rick Copeland: I think we even didn't get that initially.
26:52 Michael Kennedy: Yeah, I think you might be right. Yeah, you could be right about that.
26:54 Rick Copeland: Everybody learned first of all that you needed to have this magic argument when you connected called safe equals true. By default, MongoDB was running in unsafe mode, which is kind of a silly thing to do when you think about it.
27:08 Michael Kennedy: It's cool, it's fast.
27:09 Rick Copeland: Yeah, it's certainly fast and somebody made a nice video about web scaleness from that. /dev/null is very fast, too.
27:16 Michael Kennedy: Yeah.
27:17 Rick Copeland: I could write an infinite amount of data to it, super great. We moved over to safe equals true, but even then you just got an acknowledgement that server received your request and maybe it didn't violate any unique key constraints. Great, that's some progress, but it might not make it to disk. And so they told you, well, you need to really run in replication so then you could get some, you could say, well, I wanna only consider my write to be complete once it's been also written to another server. Okay fine, that's pretty good. If you're actually getting verification of replication, then you're probably running in a slightly safer mode than most people are writing to MySQL. I would say that's a good place to be. But then they've also got this network issue that by default, you get MongoDB. You fire it up and it's going to bind to all of the IP addresses on the machine with no authentication and no encryption and anybody can connect to it, read, write any of the data that's on the database. That is not really a good default state to be in and it turns out a lot of people didn't read their docs when they had moved to production, and there was a big exploit recently where there were thousands of production MongoDB databases that were compromised because they were running completely wide open to the internet. Yeah, be careful.
28:32 Michael Kennedy: Yeah, absolutely. Basically I bring this up for two reasons. One is there's a lot of FUD about Mongo involving things about like this write concern and the journaling and all those are changed. The defaults are to do the right thing these days. So those are basically phased out. But this last thing about the security is not. If I were king of Mongo, I'm not a king of MongoDB, but if I were, I would make it a change that unless you set up authentication, it will only listen on local host by default. That would be my rule. That let's kind of say if you're running the server and like actually for web app or for dev, it's fine. And if you wanna do something production-wise, you gotta configure it a little better but that's not how it works. If you guys are listening, you wanna run Mongo, we both definitely recommend it, just make sure you turn on security or you don't listen just unprotected on the internet. Just take a few steps and enable encryption if you're gonna go across networks and security authentication, things like that.
29:34 Rick Copeland: Yeah, and I think the latest versions of the RPMs and the Debian packages do bind only to the local host, so at least they're a little bit more secure. But still, if you're just running the MongoDB binary by default, it's going to listen to anything. So, yeah, be careful.
29:51 Michael Kennedy: That also is not just the server, server production thing. That could be a dev issue. Your dev machine could be on the network and you could be running a dev version with live data and it could have the same problems. Just be careful about this.
30:05 Rick Copeland: Yes.
30:06 Michael Kennedy: Let's talk about your book, MongoDB Applied Design Patterns. That's the title, right?
30:11 Rick Copeland: Yes.
30:12 Michael Kennedy: Okay, it's not a paraphrasing. Okay, good. This is a book that looks at MongoDB from a Python developer's perspective. Really, I think it's a super book. The idea is to look at a bunch of different use cases and challenges and try to solve them, right?
30:28 Rick Copeland: Right. The genesis of the book is MongoDB needed to, they wanted to have something like a list of different use cases like how do you use MongoDB in this situation? And so I wrote up a bunch of use cases for them, and then they said, yeah, this could be a really good book, so let's see if we can introduce you to some people at O'Reilly and see if we can kind of flesh these out into a full O'Reilly title. That's what we ended up doing.
30:53 Michael Kennedy: That book came out in 2013, right?
30:56 Rick Copeland: Sounds right, yeah.
30:59 Michael Kennedy: MongoDB 2 point something, 2.2, 2.4 sort of time zone. How much of it do you think is still current and how much do you think is sort of slightly changed with the release of I'd say MongoDB 3?
31:12 Rick Copeland: There are definitely changes to some of the performance concerns that have to do with the way that the storage engine works since version three.
31:19 Michael Kennedy: Because they switched to WiredTiger by default and not mem-mapped files, yeah?
31:23 Rick Copeland: Yes. The nice thing, nice maybe in quotes here, for programming with MongoDB in the olden days before WiredTiger is, it was really easy to understand the memory model because what they did is they just took your whole database and they mapped it into RAM and they used the Linux virtual memory system to decide what was in and what was out. If you know how to modify memory, then you knew the most efficient way to modify MongoDB. With WiredTiger, that's changed. They have a real storage engine. It has multi-version concurrency control. It has some interesting, interesting in a good way, performance characteristics of being able to have multiple writers going at the same time. I would say some of the things that really optimized for in-place modification in my book don't really apply as much because there was a huge difference in performance in the old storage engine between writing to something in place on the disk and doing something that say changed the size of a document and required MongoDB to write a whole new copy of the document somewhere else on the disk.
32:29 Michael Kennedy: Right, and the way it works now, it's totally different. All right. My look when I went through it, I felt like this is really still quite current. I think you're right about probably the considerations around the memory-mapped files and what not, but other than that, it looked really good. Let me read a really quick excerpt from the book just to kinda set the stage. You say traditionally, relational databases while familiar will present significant challenges and complications when trying to scale up to big data needs. Into this world steps MongoDB to address the scaling and that around all of this hype and excitement, a bunch of sites grabbed the NoSQL database and MongoDB database and threw it out there and just started working with it without really understanding that it takes a different thinking about it, right?
33:18 Rick Copeland: Right
33:18 Michael Kennedy: It's paraphrasing, but that's basically it, and some of these we just talked about, like durability and security. But I think probably the biggest mind shift that you have to make in this world, and you dedicate a significant part of your book to this right at the beginning, and I think you should, is schema design, and talking about design relative to, say, first normal form and third normal form and all that.
33:46 Rick Copeland: Yes. I would say that the biggest mindset shift that you gotta get through to be effective at MongoDB schema design is to ask what happens when you get rid of joins and what happens when you get rid of transactions. So it's kinda the reads and the writes. MongoDB does not support the join operator. Well, there's a way to do it in the aggregation framework, but putting that aside, generally when you do a query in MongoDB, you can get a set of documents in your result set but you're not going to be talking to two different collections when you do that. You're gonna be making a query against a single collection and you'll get documents from that single collection. The question is how do you actually use that in an efficient way? If I was building a blog in a relational database, then maybe if I need to render that blog post, I would maybe fetch something from the post collection, I would fetch something from the author's collection, or not collection but the post table, the author's table, comments table, and then do a join of all these things, and you'd end up with all of the data that you need to represent that blog post to a web user. With MongoDB, you can do the same thing. You could have a post collection and a comments collection and an author's collection and you can do kind of the joining work in memory, but you've gotten rid of a lot of benefits of MongoDB because the nice thing about MongoDB is you can design your schema so that a single document can satisfy that web request. So you could have the post with the embedded author information with all of the comments, all in a single document, so basically what you're doing is single fetch, single round trip to the database. Even on the database, if you're using a disk or you're using SSD, whatever the case is, you've got all the locality right there. So the whole document is right where Mongo is looking at that time and so, it's able to basically just do things much more efficiently if you design your schema right.
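As a rough illustration of the single-document idea described here, the sketch below models a blog post as one Python dictionary with the author and comments embedded. The field names and the `render` helper are made up for this example and do not come from the book; a real application would fetch the dictionary from MongoDB with a single query.

```python
# Hypothetical blog post stored as ONE document: the author details and the
# comments are embedded, so a single fetch returns everything the page needs.
post = {
    "_id": "post-1",
    "title": "Schema design without joins",
    "body": "Embed what you read together.",
    "author": {"name": "Ada", "email": "ada@example.com"},
    "comments": [
        {"author": "Bob", "text": "Nice post"},
        {"author": "Cy", "text": "Agreed"},
    ],
}


def render(post):
    """Build the page text from the one document; no second lookup is needed."""
    lines = [f"{post['title']} by {post['author']['name']}"]
    lines += [f"  {c['author']}: {c['text']}" for c in post["comments"]]
    return "\n".join(lines)


print(render(post))
```

The trade-off is the one raised in the conversation: everything needed for the page arrives in one round trip, at the cost of duplicating author details across that author's posts.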
35:39 Michael Kennedy: Yeah, and I think it's very much a Shakespearean type of thing like to embed or not to embed. That is the question, right? Really, every time I sit down to design a new data model for MongoDB, it's like, what are all the pieces? What embeds where, and what shouldn't be embedded for various reasons? Like for example you mentioned, you could have your post and it could have the author embedded and it could have the comments embedded and so on. Maybe even there's categories, right? Like categories and things. And you could theoretically embed the category data into the post, but then you have to replicate that across all the different posts, right?
36:15 Rick Copeland: Sure, yes.
36:16 Michael Kennedy: That may or may not be something you want.
36:18 Rick Copeland: Yeah, so you still have relationships in your data. That's a logical concern, right? You can do an entity relationship diagram and you can still map that on to MongoDB. The difference is with Mongo, when you have one of these one to many relationships, all of a sudden you now have the option if it makes sense performance wise that you could take both of the entities and put them into a single collection. You can't do that in a relational database. Relational, kind of first normal form says you don't have multiple entries on a column. But with MongoDB, that's sort of a norm. You're allowed to have these array types that are being stored there. So now you gotta decide, does it make sense to put it there or if you've got a many to many relationship, the old way of doing or the SQL way of doing it is you gotta have a join table that's got IDs from table one and IDs from table two, and it tells you which ones match up with which ones. MongoDB, if you're doing a blog, again, it's just an easy example. So you've got tags or categories, a lot of the times, that will just be a list of strings that you put into the post, and there's no need to actually have that join collection or that join table that you would have in SQL.
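To make the tags example concrete, here is a minimal sketch in plain Python, with invented post data, of a many-to-many relationship stored as a list of strings inside each post rather than in a join table. The `titles_tagged` helper is hypothetical; it mirrors how a MongoDB query such as `{"tags": "mongodb"}` matches documents whose array field contains that value.

```python
# Invented sample posts: each one carries its tags as a plain list of strings.
posts = [
    {"_id": 1, "title": "Intro to documents", "tags": ["mongodb", "schema"]},
    {"_id": 2, "title": "Python tips", "tags": ["python"]},
    {"_id": 3, "title": "PyMongo basics", "tags": ["python", "mongodb"]},
]


def titles_tagged(posts, tag):
    """Return titles of posts whose tags list contains the given tag."""
    return [p["title"] for p in posts if tag in p["tags"]]


print(titles_tagged(posts, "mongodb"))
```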
37:25 Michael Kennedy: I think that's totally right. Even if your tag thing was more complicated, you can do these many to many relationships and maybe store like a list of tag IDs in every post and then reach back into other table, yeah.
37:39 Rick Copeland: Yeah, you'd almost never want to have something like a join table in MongoDB. I can't think of a good case. You'll almost always wanna either have a list of IDs in collection A or a list of IDs in collection B or both but you wouldn't wanna have a separate collection where the existence of a document means that these other two documents are joined.
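One way to picture the "list of IDs instead of a join table" approach is the sketch below, which uses plain dictionaries as stand-ins for two collections. All names and data are invented for illustration. Against a real database, the lookup in `tags_for` would typically be a single query using the `$in` operator on `_id`.

```python
# Stand-ins for two collections, keyed by _id. The relationship lives in the
# post's tag_ids list; there is no third "join" collection.
tags = {
    "t1": {"_id": "t1", "name": "mongodb", "description": "Document database"},
    "t2": {"_id": "t2", "name": "python", "description": "Programming language"},
}
posts = {
    "p1": {"_id": "p1", "title": "PyMongo basics", "tag_ids": ["t1", "t2"]},
    "p2": {"_id": "p2", "title": "Python tips", "tag_ids": ["t2"]},
}


def tags_for(post):
    """Follow the ID list from a post to its tag names, skipping stale IDs."""
    return [tags[tid]["name"] for tid in post["tag_ids"] if tid in tags]


def posts_for(tag_id):
    """Go the other way: find posts whose ID list contains this tag."""
    return [p["title"] for p in posts.values() if tag_id in p["tag_ids"]]


print(tags_for(posts["p1"]))
print(posts_for("t2"))
```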
37:59 Michael Kennedy: Yeah, I find that to be almost never, I don't think I've ever seen that in a well designed case either. I definitely have never made use of it in the apps that I built.
38:08 Rick Copeland: That was one of the problems with people coming from the SQL world is they know how to model things there and they just assume that if I take the same schema that I had in SQL, it's going to give me the, it's gonna be like that but faster if I do it in MongoDB.
38:19 Michael Kennedy: Yeah, 'cause I heard MongoDB is faster, so it'll be faster if I just put it over here.
38:23 Rick Copeland: Exactly, yeah.
38:24 Michael Kennedy: It probably is faster but not because you copied over your schema design from relational database.
38:30 Rick Copeland: Yeah, or in many cases, it would end up being slower because you're doing all of the logic of the join at that point but you're doing it in whatever your programming language is. I love Python but it's not this super high performance bare metal language. If you're building a join engine in Python, yeah, you can do that but you are now talking about introducing network latency to talk to the database. You're talking about, it's written in Python, it's not written in C++ like the MongoDB engine is, or like C database engines might be in other cases. If it's faster then it's an unusual situation.
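The cost of joining in application code can be sketched by counting lookups. The `CountingCollection` class below is a hypothetical in-memory stand-in, not a real driver class; each `find_one` call represents one network round trip to the database in a relational-style schema copied over unchanged.

```python
class CountingCollection:
    """Hypothetical in-memory stand-in for a collection that counts lookups."""

    def __init__(self, docs):
        self._docs = {d["_id"]: d for d in docs}
        self.round_trips = 0

    def find_one(self, _id):
        # Each call stands for one request/response cycle with the server.
        self.round_trips += 1
        return self._docs.get(_id)


posts = CountingCollection(
    [{"_id": 1, "title": "Hello", "author_id": 10, "comment_ids": [100, 101, 102]}]
)
authors = CountingCollection([{"_id": 10, "name": "Ada"}])
comments = CountingCollection(
    [{"_id": 100 + i, "text": f"comment {i}"} for i in range(3)]
)


def load_post_with_joins(post_id):
    """Assemble one page by joining three collections in Python."""
    post = posts.find_one(post_id)
    author = authors.find_one(post["author_id"])
    texts = [comments.find_one(cid)["text"] for cid in post["comment_ids"]]
    return {"title": post["title"], "author": author["name"], "comments": texts}


page = load_post_with_joins(1)
total = posts.round_trips + authors.round_trips + comments.round_trips
print(total)  # 1 post + 1 author + 3 comments = 5 lookups
```

With the author and comments embedded in the post document, the same page would need only the first lookup.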
39:05 Michael Kennedy: Yeah, for sure.
39:05 Rick Copeland: You're usually gonna end up killing yourself performance-wise.
39:08 Michael Kennedy: This portion of Talk Python To Me is brought to you by Hired. Hired is the platform for top Python developer jobs. Create your profile and instantly get access to thousands of companies who will compete to work with you. Take it from one of Hired's users who recently got a job and said, "I had my first offer within four days and I ended up getting eight offers in total. I've worked with recruiters in the past, but they were pretty hit and miss. I tried LinkedIn, but I found Hired to be the best." "I really like knowing the salary up front and privacy was also a huge seller for me." Well, that sounds pretty awesome, doesn't it? But wait until you hear about the signing bonus. Everyone who accepts a job from Hired gets a $300 signing bonus, and as Talk Python listeners, it gets even sweeter. Use the link talkpython.fm/hired and Hired will double the signing bonus to $600. Opportunity is knocking. Visit talkpython.fm/hired and answer the door. Yes, so one of the things while we're on this document design stuff is in MongoDB, there's no concept of a foreign key constraint or relationship, right? I can't have one document with a strict relationship to another document. I'm not really sure how much value you get. There's no joins and things like that. Often times, people think that means there's no relationships in MongoDB, right?
40:22 Rick Copeland: Yes.
40:22 Michael Kennedy: But I don't think that that's true. I think you can put them into these models. They just don't span documents, right?
40:28 Rick Copeland: Right, yeah, you can have, you know, the relationships can exist within a document and you get atomic updates and things like that. You get the database to enforce some consistency there and you can also model the relationships with, I mean it's not enforced by Mongo, but you can have a foreign key concept where you got an ID of a different document in another collection and you're storing that ID. Difference is that you always have to take into account the possibility that that document might not actually exist.
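A small sketch of what "take into account that the document might not exist" can look like in application code. The data and the `author_name` helper are invented for this example; a dictionary stands in for the authors collection.

```python
# Stand-in for an authors collection, keyed by _id.
authors = {10: {"_id": 10, "name": "Ada"}}

posts = [
    {"_id": 1, "title": "Hello", "author_id": 10},
    {"_id": 2, "title": "Orphaned", "author_id": 99},  # author was deleted
]


def author_name(post, authors):
    """Resolve an unenforced reference, tolerating a missing target."""
    author = authors.get(post.get("author_id"))
    return author["name"] if author else "(unknown author)"


for post in posts:
    print(post["title"], "-", author_name(post, authors))
```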
40:55 Michael Kennedy: That's right. Yeah, I think of them as two things. I have a slightly different name I've used over the years for it. For the stuff that's within your document, you've got a post and it has a list inside of it of comments. That is a super strong relationship. You can't have a comment without the post. It is the same thing. But if you're like reaching back to an author table through just a foreign key constraint that doesn't really exist but it's logically there, I call those soft foreign keys or something like that. They're not enforced, but they technically, they fill the same role, right?
41:26 Rick Copeland: Yeah, they fill the same role and sometimes people call them references or document references. Way back when I started with MongoDB, one of the patterns that they kind of promoted was storing the collection name along with the ID. I never found that super valuable, but that's another thing that you can do. If you wanna have a reference that could go to any collection, then you can just throw a collection name in there.
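The pattern of storing the collection name along with the ID can be sketched as below. The `$ref` and `$id` keys follow MongoDB's DBRef convention; the `database` dictionary and the `deref` helper are hypothetical stand-ins written for this example, not driver APIs.

```python
# Stand-in for a database: collection name -> {_id -> document}.
database = {
    "authors": {10: {"_id": 10, "name": "Ada"}},
    "editors": {7: {"_id": 7, "name": "Grace"}},
}

# A reference that names its target collection, so it can point anywhere.
post = {
    "_id": 1,
    "title": "Hello",
    "owner": {"$ref": "editors", "$id": 7},
}


def deref(database, ref):
    """Look up the referenced document, or None if it is missing."""
    return database.get(ref["$ref"], {}).get(ref["$id"])


print(deref(database, post["owner"]))
```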
41:46 Michael Kennedy: Yup, yup, interesting. That works well at the low level like the PyMongo level. Less good at the ODM level.
41:52 Rick Copeland: Right.
41:53 Michael Kennedy: So, let's talk about some of the use cases. We've kind of set up this, you talked a lot about like this is what modeling in this world looks like. You also talked about mimicking transactional behavior with a compensation model that work well with MongoDB, but let's kind of leave that as it is there. You kind of set the ground with some of these foundational things, and then you said, let's talk about six different use cases. All the performance considerations and how you model and everything, right?
42:19 Rick Copeland: Yes.
42:19 Michael Kennedy: If you wanna touch on some of your favorite ones there. Maybe what was non-obvious or maybe something like that?
42:24 Rick Copeland: Yeah. The first one has some of the more interesting parts or some of the things that I found really interesting, and I guess that's why I put it first, but that's the operational intelligence chapter and it's really focusing on analytics and dealing with high volume data that's coming in quickly. There were two different use cases in there or maybe there are three in there, but two in particular that I remember. One of them was incremental aggregations. This is, you've got something coming in, you've got these aggregate statistics that you want to report out immediately. One approach that you could do for aggregation is you can run a big MapReduce job on a Hadoop cluster and that'll come back in a few minutes. But if you actually want something that's up to the minute, then how do you do that in an efficient way? This relied a lot on the in-place updating and it was based on MongoDB's own, it's now called Cloud Manager, but their monitoring service which would actually monitor MongoDB performance for you, and they offer this as a free service. So it's like, how do we deal with this scale? Let me show you how you can build your schema to deal with that kind of scale and how you can keep the performance high even with an MMAP storage engine.
43:33 Michael Kennedy: No, I think it was, what I found interesting about this was you start from, like let's start with log file data, like something out of Apache web request or something like that. Let's put that in the database and then let's start doing processing and analysis of it. You have some really interesting graphs and various things that say like, let's look at if we design it this way, what are the trade-offs? What are the benefits? What are the drawbacks? There were a number of non-obvious ways in which things kind of slowed down or got out of control and you ended up with quite an interesting aggregation report database where you precomputed and preallocated a whole bunch of pieces and then used some of the in-place update operations to sort of like increment the numbers at the right levels as these things came in, right?
44:24 Rick Copeland: Yeah, that was the incremental aggregation. The problem there is it was storing the aggregates in these large documents and sometimes the documents would grow and that would cause performance problems and then you get into a secondary issue which is that even though you think of these things as Python dictionaries which are super fast to access any item in them, physically they're stored as a list of key value pairs on the disk. And so, it turns out it takes longer to access something towards the end than it does to access something towards the beginning, so how can we mitigate that issue? These were sorts of things that you could only see when you've actually run some performance metrics against it. Again, just to shout out to Python, I did all these with at that time IPython Notebook and printed out the graphs and just threw those into the book right there. I think those are actually screenshots from IPython/Jupyter Notebook.
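The preallocation and in-place increment idea discussed above could look roughly like the following. This is a sketch under assumptions, not the book's actual schema: the `_id` format, the field names, and both helper functions are invented here. The update document uses MongoDB's `$inc` operator with dotted field paths, and the pair returned by `hit_update` is the kind of filter and update you would hand to PyMongo's `update_one`.

```python
from datetime import datetime


def preallocated_day(page, day):
    """Daily stats document with every counter already present, so later
    increments never change the document's size."""
    return {
        "_id": f"{day:%Y%m%d}/{page}",
        "daily": 0,
        "hourly": {str(h): 0 for h in range(24)},
        # Minutes are nested under their hour, so reaching a counter means
        # skipping at most 24 + 60 keys rather than up to 1440 in a flat map.
        "minute": {
            str(h): {str(m): 0 for m in range(60)} for h in range(24)
        },
    }


def hit_update(page, ts):
    """Build the filter and the $inc update for one page view at time ts."""
    query = {"_id": f"{ts:%Y%m%d}/{page}"}
    update = {
        "$inc": {
            "daily": 1,
            f"hourly.{ts.hour}": 1,
            f"minute.{ts.hour}.{ts.minute}": 1,
        }
    }
    return query, update


doc = preallocated_day("home", datetime(2017, 4, 26))
query, update = hit_update("home", datetime(2017, 4, 26, 13, 42))
print(query)
print(update)
```

Nesting minutes under hours is one way to address the point about key position: since fields are scanned in order, a hierarchy shortens the walk to counters late in the day.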
45:18 Michael Kennedy: Yeah, they looked like some matplotlib graphs or something which is cool.
45:21 Rick Copeland: Yeah.
45:21 Michael Kennedy: All right. Another thing that people at least in the early days were like, you can't use MongoDB for this, was eCommerce, which I totally disagree with that statement but you have a section where you talked about using MongoDB for an eCommerce site, right?
45:34 Rick Copeland: Yeah. One of the big things or one of the difficulties with existing eCommerce, I guess the big one is Magento. Magento uses an entity attribute value store so they're still stuck on SQL but they use SQL in a way that makes it non-relational. Basically instead of keeping your products in a products table where each one of the attributes of that product is a column, they just say, I've got one big table that says for this entity, maybe it's a shirt, I have an attribute which is a size and it's an XL. For this entity which is a drill, it has some other attribute and it's 120 volts or whatever. And so out of that, they're able to get this very flexible schema. It's kinda like, well that's not really a fantastic way to map to the relational model but they kinda have to because you wanna deploy to a store that might have all sorts of different items in it that have different attributes that you wanna store. Nice thing about MongoDB is not all of your documents have to look like each other inside the collection so MongoDB lets you actually say, well, I wanna store drills and shirts in this collection. Can I do that? And it turns out you can. Maybe there's certain attributes that they all have in common. They have an SKU number. They have a price. They have maybe a quantity available. But then they've all got their other things that are custom to each one. So you can introduce this polymorphism with MongoDB in a much more natural way, I think, than using something like an entity attribute value schema in a relational database.
47:03 Michael Kennedy: Yeah, I think that's leveraging a pretty interesting aspect and in some sense, implementing inheritance for specialization, not exactly, but something to that effect, right? And that because the schema is really enforced at the application layer, not in the database layer, that flexibility pretty much just flows, so you end up with these sparse objects like maybe one document has a drill bit size or something and the other one has a shirt size, right? And those don't appear in both records so you don't waste the space.
47:34 Rick Copeland: Yeah, exactly. You can build your ODM to kind of take care of that. I haven't been doing a lot with Ming super recently but I'm not sure if we have the ability to kind of discriminate based on the data that it loads out as to which physical type of object it's creating. But that's certainly something that you can do with an ODM. I know that's something that SQLAlchemy does with relational databases but it requires you to either do a super complex schema in SQL or it requires you to waste a lot of columns. Those are kind of your two options to do this sort of object-oriented polymorphism.
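To illustrate the shirts-and-drills point, the sketch below puts two differently shaped product documents in one list standing in for a collection. The field names, values, and the `custom_attributes` helper are all invented for this example.

```python
# Two products in the same (stand-in) collection. Both share sku, price, and
# qty; each adds attributes that only make sense for its own kind of item.
products = [
    {"sku": "SH-001", "price": 19.99, "qty": 12, "type": "shirt", "size": "XL"},
    {"sku": "DR-042", "price": 89.00, "qty": 3, "type": "drill", "volts": 120},
]

COMMON_FIELDS = {"sku", "price", "qty", "type"}


def custom_attributes(product):
    """Return only the fields that are specific to this kind of product."""
    return {k: v for k, v in product.items() if k not in COMMON_FIELDS}


for product in products:
    print(product["sku"], custom_attributes(product))
```

Queries on the shared fields work across every product, while the item-specific fields simply do not exist on documents where they do not apply.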
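The idea of an ODM choosing a class based on the loaded data can be sketched with a discriminator field. This is a hand-rolled illustration using dataclasses, not how Ming or any other ODM actually implements it; the `type` field, the class names, and `load_product` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Product:
    sku: str
    price: float


@dataclass
class Shirt(Product):
    size: str


@dataclass
class Drill(Product):
    volts: int


# Hypothetical discriminator mapping: value of the "type" field -> class.
PRODUCT_TYPES = {"shirt": Shirt, "drill": Drill}


def load_product(doc):
    """Pick a class from the document's "type" field and instantiate it."""
    fields = dict(doc)  # copy so the caller's document is left untouched
    cls = PRODUCT_TYPES.get(fields.pop("type", None), Product)
    return cls(**fields)


shirt = load_product({"sku": "SH-001", "price": 19.99, "type": "shirt", "size": "XL"})
drill = load_product({"sku": "DR-042", "price": 89.0, "type": "drill", "volts": 120})
print(shirt)
print(drill)
```

In this sketch a document with an unknown type falls back to `Product`, which raises a `TypeError` if the document carries extra fields; a fuller version would decide how to handle that case.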
48:11 Michael Kennedy: Nice. What are some of the other ones that you cover that you really like?
48:15 Rick Copeland: I did have some fun with the online gaming chapter because that was just, I don't know, games are fun but kinda brainstorming out like what are some of the data structures that you might need when you're building this? How do you do these in say some massively multiplayer online game and how would you actually store this? How would you scale it? How would you do the sharding? The online advertising networks was also interesting just because it's a very high frequency sort of application and it's something that I had seen a little bit of at SourceForge. One of the things that you mentioned earlier on was SourceForge got slower and slower and slower, so part of that, we can blame on maybe PHP and Postgres, but part of that we just have to blame on the ad networks. Because SourceForge is an advertising supported site, a lot of these ad networks just took a long time to render the ad, and that's kind of slowing down your browsing experience and it can cause various other problems. So, what if we could speed those things up and deliver contextual advertising to people in a way that doesn't make them wanna pull their hair out? That was also an interesting one.
49:18 Michael Kennedy: Yeah, that's a fun one to work on. I know a couple of people working in this ad network space and they're using MongoDB and they have some pretty intense requirements around the traffic that they handle. 'Cause if you run ads on a site that gets a million views a day, and that's just one of the places, right? You all of a sudden are getting a million request a day.
49:40 Rick Copeland: You're getting a million request a day and you're trying to target those ads now based on some content that's going on in the article. So presumably, you've indexed that and you know something about the keywords but then you probably have some real-time bidding going on for those two. So, how do you actually choose the ad inside that request response cycle 'cause you know that your content people that are actually paying or that you're advertising on their site, they're not gonna like it if you slow down the experience for their viewers.
50:08 Michael Kennedy: No, absolutely not. That's definitely a cool example. There was a bunch of great examples and I learned a lot from looking at how you implemented them and the trade-offs and it's a great book. I definitely recommend if people are, if they know a little bit of MongoDB and they're like, I think I should be using this but I don't really know how to solve this problem. There's a lot of good stuff to study there around schema design and what not.
50:28 Rick Copeland: Thanks.
50:28 Michael Kennedy: Yeah, you bet. There's a couple of options on where you might run your MongoDB server and I guess it depends on how complicated of a situation you have on how much you wanna think about this or need to think about this. If you're just running a single server and it's just like there on a machine, then maybe you could run that on a VM, schedule backups and what not, but there's also hosted Mongo, they have MongoDB Atlas. What are your thoughts on like if somebody comes to you and says, hey, I wanna do this site, run maybe let's say a three node replicated cluster. What would you consider?
51:08 Rick Copeland: By default, I would hope that their budget would afford them to get Atlas. Atlas is actually the cloud service by MongoDB. They'll host your Mongo for you. They'll host the latest copy or the latest version, handle your backups and everything. If you're dealing with a large amount of data, the backups can start to get pretty pricey, so that might not be an option, but unless you have strong operations people on your team, I wouldn't immediately jump to saying, oh I need to self-host, I need to build it, I need to run it on my own VMs. So there's other options that you can go to. You can go to mLab is one that I've used in the past. I've really enjoyed working with them. They provide hosted MongoDB, Compose.io, ObjectRocket, these are all hosted MongoDB options that you can go with. Then, if you are going to decide to self-host, there's actually some MongoDB provided tools to do that. If you actually go on to the MongoDB Cloud Manager, provide them your EC2 account keys, for instance, and you say, I want to use these three servers or these three virtual machines that I've provisioned to make a three node replica set, then they can do that for you as well. That would probably be the next step is get your own VMs and then install Cloud Manager and go ahead and have Cloud Manager install that.
52:29 Michael Kennedy: Okay, cool. That Cloud Manager, that's from MongoDB themselves?
52:32 Rick Copeland: Yeah, that's also from MongoDB. All of these things kind of run in the same UI on MongoDB, I guess it's .com. I know they have .com and .org both.
52:41 Michael Kennedy: Yeah, there used to be a big confusion. You couldn't find the download link on .com. One thing I'd like to say is I have used MongoLab before mLab, it used to be MongoLab, they renamed it. I think they're one of the few options that has a free Mongo server. If you wanna just set up a little prototyping get started and play around, they have like a half a gig free server you can set up and use that there. That's pretty sweet.
53:10 Rick Copeland: They're great. I still use them today. I use Atlas a little bit, but I use mLab as well. One of the nice things about mLab is that there's an integration to Heroku as well. So if you're using Heroku, you can get the mLab plan for free and then it's just kind of like, I'm not running a server anywhere, somebody else is doing it for me, and I can play around with things and have them work and with authentication enabled as well.
53:32 Michael Kennedy: Yeah. Those all come set up correctly, let's say.
53:35 Rick Copeland: Yes.
53:36 Michael Kennedy: Perfect. All right, awesome. Just right now, I'm running my own MongoDB server on my own VM, but I've been working with Mongo for six years, so I feel like that's probably at that point where I can go run my own VM and do my own backups daily and things like that. These are all good options and I know that jumping on one of the hosted ones is pretty nice to get started. Let's talk about some other stuff that you've been up to. First of all, all of these MongoDB work, you now just came out with a MongoDB course for Python developers, right?
54:06 Rick Copeland: Yeah. I'm working with Packt Publishing and they wanted to put out some courses on MongoDB. I just came up with a video course called Developing with MongoDB and it's kind of a three-hour course that gives you an intro both with what is MongoDB, how does it work, how is it different from relational databases. It takes you through using it with Python. Takes you through some schema design. It doesn't get into some of the big data analytics, using it with Hadoop or some of the other things, but it does give you a good foundation in MongoDB. I'm happy to say that it was just published yesterday, which would be the 25th of April. We're recording on the 26th, so happy to see that out there.
54:48 Michael Kennedy: Yeah, how's that for timing? Perfect, huh?
54:51 Rick Copeland: Yeah.
54:52 Michael Kennedy: Nice, that's cool. That must have been fun to make. Speaking of ODMs, you wrote a book with the R instead of a D in there as well, the ORM, right?
55:00 Rick Copeland: I did. This is prior to my involvement with MongoDB and the name of the book is Essential SQLAlchemy. It's also an O'Reilly title. So SQLAlchemy. If you are using Python and you are using an SQL database, and you are not using SQLAlchemy, then you're missing out, I would say. And you're probably a Django developer because they have a really nice ORM themselves and it has a lot of other features if you're using Django that are nice, but SQLAlchemy is one of the best libraries, object relational mappers that, I mean it is the best I've ever seen.
55:33 Michael Kennedy: Yeah, it's really, really good. I've used it a lot and it's been perfect.
55:36 Rick Copeland: Yeah. A lot of the time when you get something like an object relational mapper, then you give up a lot of the goodness of, like a lot of the strengths of SQL. I think that Mike Bayer who's the author of SQLAlchemy really did a good job of giving you the abstractions of an ORM plus still allowing you to get the performance of raw SQL. I was really happy with that and the second edition of that came out in the last year. I didn't have a lot to do with second edition but because I wrote the first edition, I get to have my name on the cover.
56:06 Michael Kennedy: Nice. Perfect. I actually had Mike Bayer on one of the first episodes, episode five, so dug into that. Yeah, I like SQLAlchemy a lot.
56:15 Rick Copeland: He's a smart dude.
56:17 Michael Kennedy: Indeed. All right, Rick, we're about out of time. I don't wanna take all of your day up. Let me ask you two quick questions before I let you out of here. Then, one more thing after that. If you're gonna write some Python code, what editor do you open up?
56:34 Rick Copeland: I open up Sublime Text 3.
56:36 Michael Kennedy: Sublime Text, all right. Definitely a solid one. Do you have extra plugins or do you use like the Anaconda IDE thing that plugs in there, not the continuum thing but something else?
56:45 Rick Copeland: No, I pretty much use almost the default. Package Control is in there. Occasionally do some React programming, to mention a different programming language, but get the JSX plugin and things like that. But it's Sublime, pretty vanilla for me.
57:00 Michael Kennedy: Nice. There's a ton, over 100,000 packages on PyPI, is there one that's kind of notable you think maybe people haven't tried or heard of that you wanna recommend?
57:08 Rick Copeland: Other than things like PyMongo and SQLAlchemy that we've already mentioned, one of the ones that it just comes up over and over and people may have already, a lot of people have heard of is Requests. It's the most un-Googleable package name but if you're gonna do any web programming in Python as a client, you need the Requests library.
57:29 Michael Kennedy: Yeah, absolutely. I think it would be un-Googleable if it weren't so popular.
57:35 Rick Copeland: Yes, true. Python Requests is your best bet, yeah.
57:38 Michael Kennedy: Exactly, exactly. All right, well, that's about all the time we have to talk about Mongo for today. Any final call to actions? People who are excited about this stuff, how do they learn more and do more?
57:48 Rick Copeland: MongoDB.org can teach you a lot about MongoDB. Obviously the course which will be in the show notes, but there's also a MongoDB World coming up this summer in Chicago. That might be a good place if you're really interested in this database. It's probably the cheapest education that you can get. It's two days of talks and tutorials before that. I guess those are my calls to action.
58:10 Michael Kennedy: Yeah, cool, MongoDB World, that's like the PyCon of MongoDB.
58:14 Rick Copeland: Yes.
58:14 Michael Kennedy: It's the big one to go to. It's in Chicago and it's cool. It used to be in New York City every time.
58:21 Rick Copeland: Yeah, this is the first time that they've kinda ventured out of Manhattan, so it'll be interesting to see what goes on there.
58:26 Michael Kennedy: Yeah, indeed. All right, Rick, thank you so much for being in the show. It's been great to chat about Mongo.
58:31 Rick Copeland: All right, thank you.