MongoDB Applied Design Patterns

Episode #109, published Sat, Apr 29, 2017, recorded Wed, Apr 26, 2017


Database design and decisions used to be fairly straightforward. Pick your relational database engine, map out the general entities, apply third normal form (3NF) to them, and you're basically done.

With the Cambrian explosion of database options and variations created from 2009 to the present, it is much harder to even choose the database, much less follow the well-worn path of 3NF.

On this episode, you'll meet Rick Copeland, a fellow MongoDB Master and author of the book MongoDB Applied Design Patterns. We will discuss modeling data using documents in a document database such as MongoDB and some techniques that apply particularly to MongoDB's implementation.

Episode Deep Dive

Guest Introduction and Background

Rick Copeland is a seasoned Python developer with deep expertise in MongoDB, data modeling, and the Python ecosystem. He is the creator of the Ming ODM (Object Document Mapper) for MongoDB, and he authored MongoDB Applied Design Patterns as well as Essential SQLAlchemy. Rick has worked extensively on high-traffic sites such as SourceForge, helping them migrate to Python and MongoDB, and he continues to consult and train teams on Python and database topics.

What to Know If You’re New to Python

If you’re just getting started with Python, some foundational ideas will help you follow this episode. Knowing how Python dictionaries behave and having a general idea of how data structures map to objects are especially helpful here. A basic understanding of database concepts, such as tables and columns, will help you see the differences when Rick and Michael discuss documents and collections in MongoDB. Brush up on the idea of packages vs. frameworks in Python so you can see how libraries (like PyMongo or an ODM) fit into a project structure.

Key Points and Takeaways

Document Databases vs. Relational Databases: MongoDB is a document database, meaning data is stored in JSON-like documents rather than in rows and columns. This allows more flexible schemas and potentially fewer queries (no JOINs in most cases), but it also requires a different mindset from traditional SQL databases.
- Links and Tools:
  - MongoDB Documentation
  - PyMongo on PyPI
Modeling Relationships (Embed or Reference): A key design choice in MongoDB is whether to embed data (nested sub-documents) or reference another collection with an ID. Embedding is great for one-to-few relationships (like a blog post and its comments), while referencing is often best for many-to-many cases or when data duplication becomes too large or cumbersome to update.
- Links and Tools:
  - MongoEngine
  - Ming ODM
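The embed-or-reference choice can be sketched with plain Python dictionaries standing in for MongoDB documents. The field names (`comments`, `post_id`) and the helper `comments_for` are illustrative assumptions, not taken from the episode or from any library.

```python
# Embedding: a blog post carries its few comments inside the same document.
post_embedded = {
    "_id": 1,
    "title": "Why documents?",
    "comments": [
        {"author": "ann", "text": "Nice post"},
        {"author": "bob", "text": "Agreed"},
    ],
}

# Referencing: comments live in their own collection and point at the post by ID.
posts = {1: {"_id": 1, "title": "Why documents?"}}
comments = [
    {"_id": 10, "post_id": 1, "author": "ann", "text": "Nice post"},
    {"_id": 11, "post_id": 1, "author": "bob", "text": "Agreed"},
    {"_id": 12, "post_id": 2, "author": "cy", "text": "Different post"},
]


def comments_for(post_id, comment_docs):
    """The second lookup that referencing requires: comments pointing at post_id."""
    return [c for c in comment_docs if c["post_id"] == post_id]


# Embedded data arrives with the post; referenced data needs the extra lookup.
embedded_authors = [c["author"] for c in post_embedded["comments"]]
referenced_authors = [c["author"] for c in comments_for(1, comments)]
```

Both approaches yield the same authors here; the difference is that the embedded form needs one read, while the referenced form keeps each comment independently updatable at the cost of a second query.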
Schema Evolution and Lazy Migrations: MongoDB itself doesn’t enforce a schema at the database layer, so you can add or remove fields without strict migrations. However, Rick and Michael highlight that you should still enforce schemas in your application, often via an ODM, and possibly handle updates dynamically (lazy migrations) when documents are read.
- Links and Tools:
  - Essential SQLAlchemy (O’Reilly book by Rick Copeland)
Using an ODM (Object Document Mapper): Libraries such as MongoEngine or Ming provide a structured approach to data validation, relationships, indexing, and more. They help teams standardize and document how data is stored, avoiding the “bag of dicts” anti-pattern and adding features like default values, schema validation, and index setup.
- Links and Tools:
  - Ming ODM GitHub Repo
  - MongoEngine Documentation
SourceForge Migration Story: One of Rick’s major projects was migrating SourceForge’s backend from PHP and Postgres to Python and MongoDB. Their transition showed significant performance improvements, reduced server requirements, and demonstrated MongoDB’s strength in managing large-scale data for a busy site, especially when used as a caching layer initially.
- Links and Tools:
  - SourceForge
Operational Intelligence / Incremental Aggregation: For real-time analytics (e.g., log files or dashboards), MongoDB can incrementally update aggregated counters using in-place updates. This approach avoids large batch MapReduce jobs and makes data visible immediately, but it does require careful document design (especially on the pre-WiredTiger MMAPv1 storage engine, which had stricter document-sizing constraints).
- Links and Tools:
  - MongoDB Aggregation Docs
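A minimal sketch of incremental aggregation, assuming one counter document per page per day with hourly buckets in a sub-document. The document layout and the names `hourly_hit_update` and `record_hit` are hypothetical; `update_one(filter, update, upsert=True)` and the `$inc` operator are standard PyMongo and MongoDB features.

```python
from datetime import datetime, timezone


def hourly_hit_update(page, when):
    """Build the (filter, update) pair that bumps one page's counters for `when`.

    A single $inc touches both the daily total and the hourly bucket,
    so the dashboard document is updated in place on every hit.
    """
    filter_doc = {"page": page, "day": when.strftime("%Y-%m-%d")}
    update_doc = {"$inc": {"total": 1, f"hourly.{when.hour}": 1}}
    return filter_doc, update_doc


def record_hit(collection, page, when):
    """Apply the update; `collection` is assumed to be a PyMongo collection."""
    filter_doc, update_doc = hourly_hit_update(page, when)
    # upsert=True creates the day's document on the first hit.
    collection.update_one(filter_doc, update_doc, upsert=True)


flt, upd = hourly_hit_update(
    "/index", datetime(2017, 4, 26, 14, 5, tzinfo=timezone.utc)
)
```

Keeping the filter and update builders separate from the database call makes the document design easy to inspect and test without a running server.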
Hosted vs. Self-Managed Mongo: Rick points out that while you can install and manage your own MongoDB cluster, many dev teams choose hosted solutions (Atlas, mLab, ObjectRocket, etc.) for ease of use, security defaults, and built-in backups. If you do self-manage, pay close attention to authentication, encryption, backups, and replication or sharding strategies.
- Links and Tools:
  - MongoDB Atlas
  - mLab / MongoDB on Heroku
Default Settings and Security: Early versions of MongoDB famously lacked safe defaults, with no authentication required, no encryption, and minimal write concern. Modern versions do better, but you still need to ensure you’re not exposing MongoDB to the internet unprotected, and you must configure replication, journaling, and backups properly.
- Links and Tools:
  - MongoDB Security Documentation
MongoDB for E-Commerce: MongoDB’s flexible schema is well-suited for products with diverse attributes (e.g., clothing vs. electronics). Instead of complicated entity-attribute-value (EAV) tables in SQL, you can store items as self-describing documents. Just ensure you plan for indexing and the updates needed if product attributes change frequently.
- Links and Tools:
  - Ming or MongoEngine for Product Schemas
Examples of Python Libraries and Tools: Beyond MongoDB itself, Rick and Michael called out requests (for HTTP), various libraries for migrations, and the power of web frameworks like Flask combined with Mongo. These highlight how Python’s ecosystem, from small modules to bigger frameworks, pairs nicely with NoSQL solutions.

Interesting Quotes and Stories

"We didn't want to end up with a big bag of dicts when working in MongoDB." -- Rick Copeland

"You can do so much in-place updating that might give you near-real-time dashboards without resorting to huge MapReduce clusters." -- Rick Copeland

"The shift from 13 servers running PHP to just 4 running Python was a major leap for SourceForge." -- Rick Copeland

Key Definitions and Terms

ODM (Object Document Mapper): Similar to an ORM (Object Relational Mapper), but specifically for document databases. It maps Python objects to MongoDB documents, enforcing schemas and handling queries.
WiredTiger: The default storage engine in more recent MongoDB versions. It replaced the older MMAPv1 engine and improves concurrency and memory management.
In-Place Update: A MongoDB feature allowing modification of subfields (e.g., $inc, $set) inside a document without rewriting the entire document, leading to faster updates if used carefully.
Embedding vs. Referencing: Two ways to represent relationships in MongoDB. Embedding places related data inside a single document; referencing uses separate collections and links them via ID fields.

Learning Resources

Python for Absolute Beginners (Talk Python Training): For a foundational start to Python itself if you need the basics or a refresher.
MongoDB Quickstart with Python (Talk Python Training): A free mini-course to get hands-on with MongoDB using Python, including an AirBnB-like project.
Eve: Building RESTful APIs with MongoDB and Flask (Talk Python Training): For anyone wanting to create a Python-based REST API directly on top of MongoDB with minimal effort.
MongoDB with Async Python (Talk Python Training): Learn how to leverage async features in Python (via frameworks like FastAPI and Beanie) for working with MongoDB.

Overall Takeaway

MongoDB offers a highly flexible way of modeling data that fits naturally with Python’s fluid style and dictionary-based data structures. By carefully weighing embedding vs. referencing, using an ODM for schema clarity, and being mindful of security settings and operational considerations, developers can unlock both rapid prototyping and high-performance production workloads. Rick Copeland’s insights, drawing from high-scale projects like SourceForge, make clear that embracing the document model is less about replacing SQL wholesale and more about building solutions aligned with the data’s natural shape. MongoDB combined with Python’s expressive tooling can yield fast, scalable, and maintainable applications if approached with the right design strategies.

Links from the show

Rick on twitter: @rick446
Rick's blog: blog.pythonisito.com

O'Reilly: MongoDB Applied Design Patterns: oreilly.com/product/0636920027041
Amazon: MongoDB Applied Design Patterns: amzn.to/2qx47oL
Rick's Mongo Course: packtpub.com

Ming ODM: ming.readthedocs.io
PyMongo: api.mongodb.com/python
MongoEngine: github.com/MongoEngine
MongoKit: github.com/namlook/mongokit
mLab: mlab.com
MongoDB Atlas: mongodb.com/cloud/atlas

Sponsored Links
Advance Digital: python.advance.net
Hired: hired.com/talkpythontome
Talk Python Courses: training.talkpython.fm
Episode #109 deep-dive: talkpython.fm/109
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript

00:00 Database design and decisions used to be fairly straightforward.

00:02 Pick your relational database engine, map out the general entities, apply the third normal form to them, and you're basically done.

00:09 With the Cambrian explosion of database options and variations created from about 2009 until present,

00:15 it's way harder to even choose the database, much less follow the well-worn path of third normal form for modeling.

00:20 On this episode, you'll meet Rick Copeland, a fellow MongoDB master and author of the book, MongoDB Applied Design Patterns.

00:28 We'll discuss modeling data using documents in a document database such as MongoDB

00:33 and some techniques and situations that apply particularly to MongoDB's implementation.

00:38 This is Talk Python To Me, episode 109, recorded April 26, 2017.

00:57 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem,

01:15 and the personalities.

01:16 This is your host, Michael Kennedy.

01:18 Follow me on Twitter where I'm @mkennedy.

01:20 Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython.

01:26 This episode is brought to you by Advance Digital and Hired.

01:31 Please check out what they're offering during their segments.

01:33 It helps support the show.

01:35 Rick, welcome to Talk Python.

01:37 Thanks for having me on, Michael.

01:38 Oh, I'm very excited to have you on.

01:40 Anytime I get to talk about MongoDB, it really makes me happy.

01:43 So we're going to have a lot of fun doing that.

01:45 And I think the work that you've done in MongoDB in a couple areas, in your ODM, as well as

01:50 your book that we're going to talk about, it's super work.

01:52 So I'm looking forward to sharing it with everyone and talking to you about it.

01:55 Awesome.

01:55 Yeah.

01:56 Well, of course, before we get into that, we've got to hear your story.

01:59 How did you get into programming in Python?

02:01 Well, so when I was, I don't know, when I was a kid, my dad got an Apple II, and I guess

02:06 that dates me a little bit.

02:07 But I learned how to program on BASIC starting there and then ended up just getting into

02:11 computer science in college.

02:13 So I was pretty hardcore in C, C++.

02:16 Out of college, I did some systems programming and various other types of things with Visual

02:23 C++, and ended up having a little period when I was exploring new programming languages and

02:28 I ran across an essay by Eric Raymond about why Python.

02:33 And it was kind of like his new favorite programming language.

02:35 And I'd seen a little bit of Python before, like I'd run across Gentoo and I saw it's got

02:40 this crazy indentation syntax and I kind of dismissed it.

02:42 But then when someone credible...

02:44 This is a weird language.

02:45 It uses white space.

02:46 Let's keep going.

02:47 Yeah.

02:48 So when someone credible said, you know, this is really cool, I figured that I would get

02:52 into it.

02:52 And so I kind of taught myself Python, started to introduce it at that time into an enterprise

02:58 programming environment, which was interesting.

03:00 They were using mainly C# and Visual Basic.

03:04 It was kind of one of those things where whenever there was a problem and they needed something

03:07 solved really quick, I would say, oh, I'll use Python for that.

03:10 And they said, don't tell me what you're doing.

03:12 Just fix the problem.

03:13 So kind of flying it under the radar.

03:15 And then, you know, discovered that I really loved it.

03:17 So the next job, you know, I sought out Python programming positions after that.

03:22 I think I was in...

03:23 Python was around version 2.3 then.

03:26 So obviously it's grown a lot of new bells and whistles since then.

03:30 And just it's a lot more fun to program in even than it was then.

03:33 Yeah.

03:33 It's just both in terms of the language and whatnot, but also the ecosystem and all the

03:39 packages.

03:40 Right.

03:40 It just keeps getting cooler.

03:41 I think it's a fun place to be for sure.

03:44 Yeah.

03:44 Well, even back in, you know, 2002 or 2003 when I started doing Python, it was so far...

03:51 Like what you got when you just downloaded the standard library was so much more than you

03:55 got in any of the other languages that I was familiar with.

03:58 So, you know, if you wanted to do anything in C++, you had to go out and find something

04:03 that you would configure, make and make install.

04:06 And then, you know, you'd be able to get those development libraries.

04:09 But with Python, you know, you can download URLs, you can create an FTP server or, you know,

04:14 things like that.

04:15 And it's just built in.

04:16 So that was a nice aspect of the batteries included at that time.

04:20 Yeah, absolutely.

04:20 So what are you doing with Python these days?

04:23 What's your day job?

04:24 So these days I'm a consultant, which basically means I do kind of a little bit of everything.

04:28 So sometimes I'm out training different companies in how to do Python.

04:34 So I've been to D.C. and California in the last two weeks.

04:40 I sometimes do custom development for folks.

04:43 I've got, I think, three active projects going on for that.

04:46 And I'm working on a startup on the side.

04:48 So, you know, everybody's got to have their little side hustle that they want to eventually

04:51 make into something.

04:52 So I guess I've got more like three side hustles going at this point.

04:55 Yeah, that sounds fun.

04:57 It's a challenge to do all these different things.

04:59 I know.

05:00 But it's also fun to have a wide variety and not just be doing the one thing, right?

05:05 Yes.

05:05 Yeah, it is.

05:06 It's never boring.

05:08 Sometimes it can be a little bit overwhelming.

05:10 Like, what am I supposed to work on today?

05:12 And the context switching can get a little bit much.

05:14 But, you know, it's all good.

05:16 Yeah, it's better than the alternative, I think.

05:18 Although you're right.

05:18 It's definitely overwhelming.

05:20 And you're still using MongoDB for some of these projects?

05:23 I am.

05:23 One of the things that I discovered is just it's kind of my go-to at this point.

05:28 I know most people, they learn relational databases because it's much more widespread use.

05:34 But I just kind of got used to using MongoDB.

05:36 In my last business, it was the main database of choice.

05:40 So, you know, I just, I have all the tooling and I have the familiarity with it.

05:44 So that's just the first thing I reach for when I'm implementing something for someone.

05:47 Yeah, I'm with you.

05:49 I feel like it just has so much more flexibility and whatnot.

05:52 I feel like a lot of people fall into using relational databases because that's considered the safe choice or that's what they already know or what they were taught in college.

06:04 Or like the people they're working with, they already know that, but it's not necessarily because it's the best choice.

06:08 Right.

06:08 Yes.

06:09 Yeah, absolutely.

06:11 So I thought maybe, you know, we're mostly going to focus on sort of advanced MongoDB design patterns and implementation concepts.

06:20 But not everybody listening to this is totally familiar with Mongo or NoSQL or document databases.

06:26 So maybe you could give us a quick view just into the summary of what is NoSQL?

06:34 What is a document database?

06:36 And I'm also interested to hear your thoughts on what is NoSQL because everybody seems to have a slightly different definition of that.

06:42 Okay, sure.

06:42 So I would say, you know, NoSQL is kind of anything besides SQL.

06:47 So if you want to kind of drop some of the constraints that SQL puts on you if you're programming.

06:54 So if you look at things like transactions, do you want them to be atomic, consistent, isolated, and durable?

07:00 If you drop some of those, then maybe you're in the NoSQL land.

07:03 So a lot of the NoSQL databases, well, I guess I should say they kind of run the gamut.

07:09 So a NoSQL database could be just a key value store.

07:13 So you're able to look up things very quickly.

07:15 It could be something that's more complicated, has an exotic model like Cassandra with a column, you know, column data store or MongoDB, which is a document data store.

07:25 So, and that's more like the storage model, the programming model, what are you actually putting into this database?

07:31 So I guess the, how I describe MongoDB is, or its document model is, if you think of the relational databases, you've got tables, which are made up of rows and columns.

07:42 In MongoDB, we don't call them tables, we call them collections.

07:47 And what you put in those collections are JSON objects.

07:49 So if you've done web programming, you've probably run into JSON.

07:53 So, and if you haven't, then it's Python dictionaries and lists embedded in each other.

07:58 Basically, that's the data model you're looking at.

08:00 So you just have a collection of these documents, and we call those JSON objects documents.

08:06 And you can kind of query into those collections.

08:10 So you can say, give me, say, all of the restaurants that are bakeries in this collection.

08:16 And so you have some field in each JSON document that says, you know, the type of cuisine and it's a, it's a bakery.

08:22 So you could do that sort of a query on a MongoDB database, which makes it a little bit different from a key value store because you don't have to just query based on the key of that document.
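The bakery query Rick describes can be sketched as follows. The sample documents and the `matches` helper are hypothetical; `matches` is an equality-only, in-memory stand-in for the server-side matching MongoDB performs, which is far richer in practice.

```python
restaurants = [
    {"_id": 1, "name": "Crumb", "cuisine": "Bakery", "borough": "Brooklyn"},
    {"_id": 2, "name": "Noodle Bar", "cuisine": "Chinese", "borough": "Queens"},
    {"_id": 3, "name": "Rise", "cuisine": "Bakery", "borough": "Queens"},
]

# The filter is itself a document: field name -> required value.
bakery_filter = {"cuisine": "Bakery"}


def matches(doc, query):
    """Equality-only stand-in for MongoDB's query matching."""
    return all(doc.get(field) == value for field, value in query.items())


bakeries = [d["name"] for d in restaurants if matches(d, bakery_filter)]

# With PyMongo, the same filter would be sent to the server, for example:
#   db.restaurants.find({"cuisine": "Bakery"})
```

The point of the sketch is that the query targets a field inside the document (`cuisine`), not the document's key, which is what separates a document database from a plain key-value store.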

08:32 Right. Key value stores are like the most basic, fastest, most scalable, but also the most limiting, right?

08:40 Because you've got the key, the primary key, like an ID, or maybe you could use an email address or whatever if it's a user.

08:46 But then it makes it really hard to ask interesting questions.

08:50 Like the rest of the data is fairly an opaque blob.

08:53 I know there's ways to kind of like add extra stuff around some of the databases, but still.

08:58 Yeah.

08:58 Well, and a lot of the time, those things, you know, you have the key value store and then people will create their manual indexes around that.

09:05 So, you know, you've got kind of the natural key of something, maybe it's some restaurant ID and you want to look up things by cuisine.

09:11 So what you have is maybe a whole bunch of things in another collection that is like cuisine is the key.

09:18 So then those point to the restaurant ID.

09:20 So they build their own indexes out of these things.

09:22 MongoDB takes care of that for you in a lot of cases.
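A rough illustration of the hand-built index Rick describes for key-value stores, using plain dictionaries. The names `store`, `by_cuisine`, and `put` are assumptions made for the sketch.

```python
from collections import defaultdict

# Primary key-value store: restaurant ID -> value.
store = {
    "r1": {"name": "Crumb", "cuisine": "Bakery"},
    "r2": {"name": "Noodle Bar", "cuisine": "Chinese"},
    "r3": {"name": "Rise", "cuisine": "Bakery"},
}

# Hand-maintained secondary index: cuisine -> set of restaurant IDs.
by_cuisine = defaultdict(set)


def put(restaurant_id, value):
    """Write the value and keep the hand-built index in step with it."""
    old = store.get(restaurant_id)
    if old is not None:
        by_cuisine[old["cuisine"]].discard(restaurant_id)
    store[restaurant_id] = value
    by_cuisine[value["cuisine"]].add(restaurant_id)


# Build the index for the existing entries.
for rid, value in list(store.items()):
    put(rid, value)

bakery_ids = sorted(by_cuisine["Bakery"])
put("r3", {"name": "Rise", "cuisine": "Cafe"})  # every write must also fix the index
bakery_ids_after = sorted(by_cuisine["Bakery"])
```

Every write path has to remember to update `by_cuisine`. In MongoDB that bookkeeping is delegated to the database, for instance by calling PyMongo's `collection.create_index("cuisine")` once.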

09:26 Right.

09:27 You know, one of the big distinctions, I think, that it takes people who are new to this idea of document databases.

09:33 And MongoDB is not the only one, just the most popular and probably the best.

09:37 But there's, you know, things like Azure DocumentDB, there's CouchDB, there's a variety of them.

09:43 Right.

09:43 And I think it's easy to look at these databases and go, oh, it's kind of like a JSON field embedded in another database or like storing a blob of data or something like that.

09:55 Right.

09:56 You're like, oh, I could do this in, say, Microsoft SQL Server and just make like a text field or a JSON field and stick a blob in there.

10:02 But the big difference is you can index deep into these things.

10:06 You can do rich queries into them.

10:08 Right.

10:08 They're like, they're not just hierarchical things you can store.

10:12 But, you know, even look at some of the object databases, things like ZODB, things like that.

10:18 Right.

10:19 Yes.

10:19 So that is the big difference between this and a key value store, I would say, is that you can index into these things.

10:25 If you're using something like, like you could take MySQL and throw a, you know, a blob column and an ID column into a table and call it a NoSQL database if you want to.

10:36 But you don't really get the ability to do these sorts of rapid lookups on something other than the key.

10:43 And that's what MongoDB gives you.

10:45 That plus it gives you some scaling advantages as well.

10:48 But my background, I haven't taken advantage of that as much as I've just taken advantage of the fact that you can do reasonable things very quickly as opposed to doing unreasonable things reasonably quickly.

10:59 Right.

11:00 I think that's a really interesting comment you made about the performance.

11:03 Right.

11:03 Like one of the things you can do with a lot of these NoSQL databases, Mongo included, is you can do lots of replication.

11:11 You can do specifically, I'm thinking of like sharding, like I could set up a 10 sharded cluster.

11:17 And so then when we do inserts, they're like crazy fast and we can parallelize our queries and our aggregation and MapReduce stuff, all those kinds of things.

11:24 Right.

11:25 But that is like the thing that draws people to it.

11:28 They're like, oh, look what we can do with performance.

11:30 But at the same time, like very few people actually end up needing that much performance.

11:35 I mean, I've seen a few places where it was really needed, but 99% or more of the use cases don't.

11:41 But everybody has a, my relational schema is a pain to deal with.

11:47 It's hard to add columns.

11:48 It's hard to change the shape of it.

11:50 It's like a pain.

11:51 It's slowing me down.

11:52 Everybody has a complexity problem with their software.

11:56 And I feel like modeling in these documents solves that complexity problem for everyone, not just the 1%.

12:02 Right.

12:02 To me, the sharding and the ability to scale out horizontally has always been kind of a safety feature.

12:09 Like if things go really super well, my system's not going to fall over and I'm not going to have to come back and do manual sharding or partitioning in my database and re-architect my whole application.

12:19 I know that there is a path forward if something like that happens.

12:22 But for now, I can get things done faster than I could with any of the relational approaches.

12:28 Right.

12:29 Yeah.

12:29 I totally agree.

12:30 Totally agree.

12:30 So one of the, there's a bunch of different ways to access MongoDB from Python, right?

12:37 You've got the official Python driver, PyMongo from the MongoDB folks.

12:42 You've got this new thing, BSON, NumPy, NumPy, BSON.

12:47 I don't remember the order of it that goes straight into the data science type structures in NumPy.

12:52 And then on top of PyMongo, we've got things like MongoEngine, Ming, MongoKit, and all these ODMs, right?

13:03 Right.

13:03 And one of these, named Ming, actually, is one that you created.

13:08 Yes.

13:09 Maybe give us a quick overview of what are the trade-offs?

13:12 When would you consider using one of these ODMs?

13:15 Like, what the heck is an ODM anyway?

13:16 So it really comes down to the idea.

13:19 So I said we don't have tables.

13:21 And one of the other things that you don't really have when you're dealing with MongoDB is you don't have a database-enforced schema.

13:29 So you have, it's kind of like this big bag of things.

13:34 And I said that, you know, they're kind of like Python dictionaries.

13:37 So I'll have to say this very quickly or very carefully.

13:40 But you don't want to end up with a big bag of dicts when you're working on this.

13:45 So what you need to actually have is some sort of a schema that tells you the sorts of things that you're going to put in these collections.

13:51 Because it turns out whatever you're putting into them, when you read them back out, your code's going to have to do something with that data.

13:58 You can't just say, well, I'm just going to store everything in there and it's magically going to reappear.

14:03 Your code is making certain assumptions about what fields are in those dictionaries, what keys, and, like, what is the structure of the data that you're storing?

14:11 So that's really what these ODMs or object document mappers do.

14:16 That tells you, you know, in this collection, we're putting things in that look like this restaurant.

14:20 So although MongoDB, until very recently, didn't have any form of enforcing schemas, this would be something in your code where you're documenting it.

14:28 At the very least, you're documenting what sorts of dictionaries or BSON documents you want to be putting into these collections.

14:36 Right, and so these, if you go through one of these ODMs, object data mappers, layers, you basically go through predefined classes and objects in Python, which themselves have a fixed structure.

14:48 And so you're kind of, you filter through, like, a known layer of schema, and that works pretty well, right?

14:55 Yeah, yeah.

14:56 And that allows you to kind of, you can get a long way without having any kind of a documented schema if you're the only programmer on the project.

15:04 But once you start having multiple people, you need to have kind of a common understanding of, well, I'm going to write things that look like this to the collection, and I'm going to read things and expect them to look like this other thing.

15:14 So that's kind of the base level of why you need something like this.

15:17 You need a library or a data access layer.

15:19 Sometimes people will write their own.

15:20 That's a pretty common thing.

15:22 If you don't use an ODM, then people will typically write Python modules that have, you know, getters and setters for different types of data that they want to put into the database.

15:31 That's the approach that MongoDB uses on their training materials, I think.

15:34 They just build this Python module that does these things.

15:38 So an object document mapper allows you to kind of abstract that out and write those more quickly.

15:45 So rather than saying, you know, I want to get a restaurant or I want to write a function that calls, you know, this is called get restaurant, then I can have a restaurant class that has a get method.

15:56 But the get method is not something I have to write.

15:58 It's something that the ODM provides me.

16:00 I hope that makes sense.
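The idea of a `Restaurant` class whose `get` method comes from the ODM rather than from hand-written code can be sketched like this. The `Document` base class below is a toy written for illustration; it is not Ming's or MongoEngine's API, and an in-memory list stands in for a MongoDB collection.

```python
class Document:
    """Tiny stand-in for an ODM base class; not Ming's or MongoEngine's API."""

    collection = None  # subclasses point this at their storage

    def __init__(self, **fields):
        self.__dict__.update(fields)

    @classmethod
    def get(cls, **query):
        """Return the first stored document matching every key in `query`."""
        for raw in cls.collection:
            if all(raw.get(k) == v for k, v in query.items()):
                return cls(**raw)
        return None


class Restaurant(Document):
    # An in-memory list stands in for a MongoDB collection.
    collection = [
        {"_id": 1, "name": "Crumb", "cuisine": "Bakery"},
        {"_id": 2, "name": "Noodle Bar", "cuisine": "Chinese"},
    ]


# Restaurant never defines get(); it inherits it from the base class.
found = Restaurant.get(name="Crumb")
```

Adding another document type would mean declaring one more subclass, with no new lookup function to write.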

16:01 Yeah, I think it does.

16:02 And a lot of these, I can't speak to Ming, you'll have to fill us in on the details.

16:07 I don't remember exactly.

16:08 But the one that I'm using right now is called MongoEngine, which is also one of the more popular ones.

16:13 And it has a lot of additional things that it helps you with.

16:16 Like you define a class and it's much like SQLAlchemy.

16:19 You say these are the fields that go into the database.

16:22 And this one's a string and it has to be unique.

16:25 Like this one's an integer and I want an index on it and things like that.

16:29 And so it'll actually apply the uniqueness constraints.

16:31 It'll apply the index.

16:33 It'll create and enforce the indexes, all those sorts of things as well.

16:37 That's part of what Ming does.

16:38 Definitely.

16:39 You can put these constraints in there, these indexes that you want it to define, and it'll go ahead and create those indexes for you.

16:46 Another thing that it's helpful or that it helps you with is your schema evolves as you're building your application.

16:53 So maybe you didn't need a zip code when you first started or you forgot that you were going to need a zip code.

16:59 And maybe that's a bad example.

17:01 But there's some fields that you want to add later on and you need a sensible default for the existing documents.

17:07 Like let's say you have an account class and you eventually want to start verifying that they've verified their email address.

17:13 And you didn't think of that at first.

17:15 You don't have a is email verified or something like that, right?

17:17 Yeah.

17:18 So maybe you want all of the existing documents to have that be default false.

17:22 They haven't verified their email.

17:24 So you can write a validator or a type into your schema that says, you know, when this field is not found, then I want you to populate the Python object with false.

17:36 It helps you to do these sorts of on-the-fly data migrations.
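A minimal sketch of filling in a default when an older document lacks a field, as in the email-verification example. The schema-as-dictionary approach and the names `ACCOUNT_DEFAULTS` and `load_account` are assumptions for illustration; a real ODM would express this through its own field or validator types.

```python
# Schema as field name -> default used when a stored document lacks the field.
ACCOUNT_DEFAULTS = {"email": None, "is_email_verified": False}


def load_account(raw):
    """Return a copy of `raw` with missing fields filled from the schema defaults."""
    account = dict(ACCOUNT_DEFAULTS)
    account.update(raw)  # stored values win over defaults
    return account


# Written before the verification field existed.
old_doc = {"_id": 1, "email": "a@example.com"}
# Written after the field was added.
new_doc = {"_id": 2, "email": "b@example.com", "is_email_verified": True}

old_loaded = load_account(old_doc)
new_loaded = load_account(new_doc)
```

The stored document is left untouched; only the in-memory object gains the default, so nothing has to be rewritten in the database up front.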

17:41 Right.

17:41 And it can be kind of like you describe it in your book as a lazy, lazy migrations or lazy schema migrations.

17:49 Because in a relational database, that wouldn't fly very well, right?

17:53 You'd have to say, well, we're going to do a schema transformation and we're going to add a column and it's going to be a type bool and it's going to be default.

17:59 There's some kind of script you got to run probably.

18:01 Yeah.

18:01 And I guess that's the deal.

18:04 When you're going to be changing the schema in a SQL database, you do it up front.

18:10 So you've got to make sure that all of the rows conform to the schema.

18:13 And, you know, the database is going to enforce that.

18:15 So you do this alter table statement and it makes sure that everything conforms to it.

18:20 So that's one approach.

18:21 And you can do that in MongoDB as well.

18:23 You just go in and you overwrite all of the existing documents.

18:27 Right.

18:27 Maybe instead of a SQL script,

18:29 it's just a JavaScript script or something like that.

18:32 Right.

18:33 And you run that.

18:33 Yeah.

18:33 Mongo gives you the option of kind of waiting until you actually load a particular document to make sure it conforms to your current schema.

18:41 So that's something that we built into Ming as well so that you could actually like read a document and then check to see does that document actually conform to my current schema.

18:49 And if it doesn't, there was actually the ability to fall back and run a migration function on that document.

18:54 So you could actually bring things forward at the moment when they're loaded out of the database.
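Migrating a document at load time might look roughly like the sketch below. Rick describes Ming checking a document against the current schema and falling back to a migration function; this sketch swaps in a simpler, explicit `schema_version` field instead, and the name-splitting migration is a made-up example.

```python
CURRENT_VERSION = 2


def migrate_v1_to_v2(doc):
    """Hypothetical change: split a single `name` field into first and last."""
    first, _, last = doc.pop("name", "").partition(" ")
    doc.update(first_name=first, last_name=last, schema_version=2)
    return doc


# Maps a document's version to the function that upgrades it one step.
MIGRATIONS = {1: migrate_v1_to_v2}


def load(raw):
    """Bring a stored document up to CURRENT_VERSION at read time."""
    doc = dict(raw)  # leave the stored form untouched
    # Documents written before versioning are treated as version 1.
    version = doc.get("schema_version", 1)
    while version < CURRENT_VERSION:
        doc = MIGRATIONS[version](doc)
        version = doc["schema_version"]
    return doc


stored = {"_id": 7, "name": "Ada Lovelace"}
loaded = load(stored)
```

An application could then write `loaded` back to the collection so each document is migrated at most once, or leave old documents in place until they are next touched.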

18:58 Oh, yeah.

18:58 That's really cool.

19:00 So one of the early success stories from MongoDB, I think, comes from SourceForge, actually.

19:07 And you were part of this.

19:08 I remember SourceForge.

19:10 This is before there's a GitHub or anything like this, right?

19:13 And SourceForge was used quite frequently.

19:14 And I remember it was getting painfully slow.

19:18 And then one day it was fast again.

19:20 And you were involved in that somewhat.

19:23 And that's actually partly where Ming came from, right?

19:26 Do you want to tell us about that?

19:27 Yeah.

19:28 So when I came to SourceForge, I remember in the interview when I was about to come to SourceForge, I had worked on building a SQLAlchemy-like library for, I guess you could call it a NoSQL database.

19:41 It was a private thing that the company that I was working for had developed internally.

19:44 And so they asked me about that.

19:47 And they said, well, you know, we've got this thing called MongoDB that we're thinking about working on.

19:51 And it kind of stores Python dictionaries.

19:53 So what would your approach be to doing something like that?

19:56 And so, you know, I kind of talked about it.

19:58 And I guess my answer was good enough.

20:00 And they hired me.

20:00 And they said, you know, we did some performance evaluations.

20:03 And at that time, it was like 2009.

20:05 They looked at various different approaches.

20:07 And they said, MongoDB is going to give us the performance that we need.

20:09 And we're comfortable with the data model.

20:12 So we like the idea of storing things that look like Python dictionaries into the database.

20:16 But we would like to have something like some kind of a schema enforcement layer or an ODM.

20:22 Although I don't know that that was really a big term at that point.

20:25 They maybe called it an ORM even.

20:27 Yeah.

20:28 I call it an ORM for a non-relational database.

20:31 Yeah.

20:31 ORM minus the R.

20:32 Yes.

20:32 So, you know, we started working on that.

20:36 And I was the main developer on Ming.

20:40 And so Ming formed kind of the data layer of a complete rewrite of all of the SourceForge developer tools.

20:47 So when you think about SourceForge, there's kind of two sections of it.

20:52 I mean, if you think about SourceForge these days.

20:54 But there's sort of the, this is the site for the developers to build their software.

20:58 And this is the site for users to download software.

21:02 This portion of Talk Python is brought to you by Advance Digital.

21:06 How would you like to build one of the most visited news sites in the U.S.?

21:10 That sounds fun.

21:11 The folks at Advance Digital would love to talk to you.

21:13 They're primarily a Python shop located in beautiful Jersey City.

21:16 Just one subway stop from lower Manhattan.

21:19 Spend your time building an amazing web app with Python.

21:22 And do it with a small team of developers focused on agile development.

21:25 Are you going to miss PyCon this year because your company wouldn't fund the travel and expense?

21:29 If you join this team, they'll cover your conference and training initiatives.

21:33 It's time to take your Python to the next level.

21:35 Build an amazing web app.

21:37 Get started by visiting python.advance.net right now.

21:41 So we rewrote kind of all of the developer tools.

21:44 And we rewrote a lot of the download side of things as well.

21:49 And that was actually a migration from PHP to Python.

21:52 And a migration from largely Postgres backed to mostly MongoDB.

21:58 And we kind of did it in stages.

21:59 But Ming was a big part of that.

22:01 Being able to kind of come in and say, we've got a group of programmers working.

22:05 What's our common understanding of the data that we're storing in this weird database that none of us has seen before?

22:09 Right.

22:10 And how did it go?

22:12 I recall that there were some pretty major stats in how much better the site got.

22:18 How many fewer database servers there were.

22:22 Things like that.

22:23 Do you recall?

22:23 Well, I remember we went from handling, it was something like 13 servers that were running the PHP front end.

22:30 Our first deployment, we went down to, I believe, four Python servers doing basically the same work.

22:37 So that was a nice, nice thing for Python.

22:40 And of course, the PHP was backed by Postgres and the Python was backed by Mongo.

22:44 And one of the other things in the first version, this is what a lot of people did with Mongo at the time, and I guess still probably do, is when you're introducing this new technology, you kind of take baby steps.

22:54 So Mongo was not our system of record initially.

22:57 We would use it as kind of a cache for all of the Postgres data that was coming from the legacy system.

23:02 So all of that went into Mongo.

23:05 And then as long as you obey a few little rules like make sure your working set fits into RAM, Mongo behaved, its performance was closer to memcached than it was to a relational database.

23:17 So, you know, super fast for a read-mostly workload.

23:22 And that's why we were able to do nice things.

23:25 And then we, like a lot of people who first deploy MongoDB, we think, oh, this is great.

23:30 It can probably do anything I want it to do.

23:33 So we wrote a little rate limiter in MongoDB.

23:36 And we did it in a really stupid way, it turns out, by basically just logging every request.

23:41 And then every time a request comes in, we would query to see how many requests from that IP in the last X seconds or minutes or whatever a rate limit was.

23:50 And that worked until it didn't, which was when the index got bigger than our RAM.

23:55 And you got this nice cliff of performance.

23:59 So we reworked that.
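The transcript does not say how the rate limiter was reworked. As one possible alternative to logging every request, a fixed-window counter keeps a single small counter per IP per time window, so the data and its index stay small. The sketch below is an in-memory stand-in; the class and parameter names are hypothetical. In MongoDB the same step could plausibly be a single upsert using `$inc`.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Counts requests per (ip, window) instead of storing one record per request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)  # stand-in for a small counters collection

    def allow(self, ip, now):
        # Round the timestamp down to the start of its window.
        window_start = int(now) // self.window_seconds * self.window_seconds
        key = (ip, window_start)
        self.counters[key] += 1           # the increment-in-place step
        return self.counters[key] <= self.limit


limiter = FixedWindowLimiter(limit=3, window_seconds=60)
results = [limiter.allow("10.0.0.1", now=t) for t in (0, 10, 20, 30, 61)]
```

The fourth request in the first window is rejected, and the request at second 61 starts a fresh window. A real deployment would also need to expire old counters, which this sketch leaves out.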

24:01 But, you know, for the most part, it was a pretty good rollout.

24:04 And, you know, a lot of success moving from PHP to Python.

24:10 And there's still things that run on.

24:13 Last I heard, there were still things that ran on Postgres at SourceForge.

24:17 But it was primarily MongoDB later on.

24:20 Yeah, okay.

24:21 That makes a lot of sense.

24:22 That's really cool.

24:22 Is it still running in Mongo, do you think?

24:25 Do you know?

24:25 Well, it is.

24:26 So the first version that we rolled out was only for the download side.

24:31 And then we ended up rewriting all of the developer tools in Python and MongoDB.

24:34 And then that ended up being outsourced, not outsourced, open sourced as the Apache Allura project.

24:42 So it's now an official Apache Software Foundation project.

24:45 And anybody can run the same tools that SourceForge is running for developing software.

24:51 And there's a little bit of setup involved.

24:53 But it's still out there.

24:55 It's something that was kind of a goal early on that we wanted to make sure that we gave back to the community with what we were doing.

25:01 And, of course, Ming was always open source from the beginning.

25:03 SourceForge has had its moments of evil, but generally has been a good supporter of open source software.

25:12 Yeah, I say historically, it's probably got a positive grade, all in all.

25:16 Yeah.

25:16 All right.

25:17 So one of the things I really want to dig into while we're talking is your book called MongoDB Applied Design Patterns.

25:24 But before we get to that, I just want to quickly run an idea by you and maybe make a plea to anyone who is either running or considering running Mongo.

25:31 I think I'd love to hear your opinion.

25:34 One of the things, I think Mongo is super great, but I think they've made a few fairly minor decisions that have come back to haunt them in certain ways that get amplified from the early days.

25:47 And I think one of those is, by default, not running encrypted connections.

25:52 And another is, by default, not running with authentication.

25:55 Yes.

25:56 So their defaults have always been interesting.

25:58 Maybe I'll use that word.

26:00 I think they've optimized too much for performance and scalability and not enough for durability and safety nets.

26:07 I'm thinking of the initial write concern defaults.

26:10 I'm thinking of the lack of journaling in the early days.

26:13 You know, all these things.

26:15 And each one of them maybe made sense in their original world.

26:17 But I think people have taken these defaults and, not knowing they needed to be aware of them, gotten themselves in trouble.

26:22 Absolutely.

26:23 So when we started out, the default way that you wrote to MongoDB, if you didn't change any of the settings and you do an update or an insert or whatever, basically you got an acknowledgement from the server that, hey, I received your request to write this data to the database.

26:39 What you didn't have was any assurance that it actually made it onto disk.

26:43 We didn't even get an acknowledgement.

26:44 Yeah.

26:45 Not even into the data set in memory.

26:47 Just that the server received your socket request, basically.

26:50 I think we even didn't get that initially.

26:52 Yeah, I think you might be right.

26:53 Yeah, you could be right about that.

26:54 So everybody learned, first of all, that you needed to have this magic argument when you connected called safe equals true.

27:01 So by default, MongoDB was running in an unsafe mode, which is kind of a silly thing to do when you think about it.

27:08 It's cool.

27:08 It's fast.

27:09 Yeah, it was certainly fast.

27:10 And somebody made a nice web video about web scaleness from that.

27:15 /dev/null is very fast, too.

27:16 Yeah.

27:17 I can write an infinite amount of data to it super quick.

27:21 But so everybody, you know, we moved over to safe equals true.

27:23 But even then, you just got an acknowledgement that the server received your request and maybe it didn't violate any unique key constraints.

27:30 So, okay, great.

27:31 That's some progress, but it might not make it to disk.

27:33 And so they told you, well, you need to really run in replication.

27:37 So then you could say, well, I want to only consider my write to be complete once it's been also written to another server.

27:45 Okay, fine.

27:45 Well, that's pretty good.

27:46 If you're actually getting verification of replication, then you're probably running in a slightly safer mode than most people writing to MySQL are.

27:53 So I would say that's a good place to be.
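In current drivers, the "acknowledged and replicated" behavior described here is usually requested through write concern options rather than the old `safe=True` argument. A minimal sketch of building a connection string that asks for majority acknowledgement and journaling follows; the host names, replica set name, and database name are placeholders, and the `build_uri` helper is hypothetical.

```python
from urllib.parse import urlencode


def build_uri(hosts, replica_set, database):
    """Assemble a MongoDB connection string with explicit write concern options."""
    options = {
        "replicaSet": replica_set,
        "w": "majority",      # wait until a majority of replica set members have the write
        "journal": "true",    # wait until the write has reached the on-disk journal
    }
    return "mongodb://{}/{}?{}".format(",".join(hosts), database, urlencode(options))


uri = build_uri(["db1.example.com:27017", "db2.example.com:27017"], "rs0", "blog")
```

Stating these options explicitly keeps the durability trade-off visible in the application's configuration instead of depending on whatever a given server or driver version defaults to.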

27:56 But then they've also got this network issue that by default, you get MongoDB, you fire it up, and it's going to bind to all of the IP addresses on the machine with no authentication and no encryption.

28:07 And anybody can connect to it, read, write any of the data that's on the database.

28:11 So that is not really a good default state to be in.

28:16 And it turns out a lot of people didn't read their docs when they moved to production.

28:21 And there was a big exploit recently where there were thousands of production MongoDB databases that were compromised because they were running completely wide open to the internet.

28:30 So, yeah, be careful.

28:32 Yeah, absolutely.

28:32 So basically, I bring this up for two reasons.

28:35 One is there's a lot of FUD about Mongo involving things about this write concern and the journaling, and all those are changed, right?

28:45 The defaults are to do the right thing these days.

28:48 So those are basically phased out.

28:50 But this last thing about the security is not.

28:53 If I were king of Mongo, I'm not a king of MongoDB.

28:56 But if I were, I would make it a change that unless you set up authentication, it will only listen on localhost by default.

29:05 Right.

29:05 That would be my rule.

29:07 And that's kind of safe.

29:08 Like, if you're running the server next to your web app or for dev, it's fine.

29:12 And if you want to do something production-wise, you've got to configure it a little better.

29:15 But that's not how it works.

29:17 So just if you guys are listening and you want to run Mongo, definitely, we both definitely recommend it.

29:22 Just make sure you turn on security or you don't listen.

29:25 Just unprotected on the internet, right?

29:28 Just take a few steps to enable encryption if you're going to go across networks and security authentication, things like that.

29:34 Yeah.

29:34 And I think the latest versions of the RPMs and the Debian packages do bind only to localhost.

29:43 So at least they're a little bit more secure.

29:44 But still, if you're just running the MongoDB binary by default, it's going to listen to anything.

29:49 So, yeah, be careful.

29:50 Yep.

29:51 Yep.

29:51 And that also, it's not just a server production thing, right?

29:54 Like, that could be a dev issue.

29:57 Your dev machine could be on the network and you could be running a dev version with live data and it could have the same problem.

30:04 So just be careful about this.

30:05 Yes.

30:05 All right.

30:06 So let's talk about your book, MongoDB Applied Design Patterns.

30:10 That's the title, right?

30:11 Yes.

30:11 Okay.

30:12 I didn't copy.

30:12 It's not a paraphrasing.

30:13 Okay, good.

30:13 So this is a book that looks at MongoDB from a Python developer's perspective.

30:19 And really, I think it's a super book.

30:22 The idea is to look at a bunch of different use cases and challenges and try to solve them, right?

30:28 Right.

30:28 The genesis of the book is MongoDB needed to, or they wanted to have something like a list of different use cases.

30:35 Like, how do you use MongoDB in this situation?

30:38 And so I wrote up a bunch of use cases for them and then they said, you know, this would be a really good book.

30:43 So let's see if we can introduce you to some people at O'Reilly and see if we can kind of flesh these out into a full O'Reilly title.

30:51 And so that's what we ended up doing.

30:53 Yeah.

30:53 And that book came out in 2013, right?

30:56 Sounds right.

30:57 Yeah.

30:57 And MongoDB 2.

31:01 Something 2.2, 2.4 sort of timeframe.

31:04 How much of it do you think is still current?

31:06 And how much do you think is sort of slightly changed with a release like, say, MongoDB 3?

31:12 There are definitely changes to some of the performance concerns that have to do with the way that the storage engine works since version 3.

31:19 Because they switched to WiredTiger by default and not memory-mapped files, yeah?

31:23 Yeah.

31:24 So the nice thing, nice maybe in quotes here, for programming against MongoDB in the olden days before WiredTiger is it was really easy to understand the memory model.

31:34 Because what they did is they just took your whole database and they mapped it into RAM.

31:38 And they used the Linux virtual memory system to decide what was in and what was out.

31:44 So if you know how to modify memory, then you knew the most efficient way to modify MongoDB.

31:49 With WiredTiger, that's changed.

31:51 They have a real storage engine.

31:53 You know, it has multi-version concurrency control.

31:56 It has some interesting, interesting in a good way, performance characteristics of being able to, you know, have multiple writers going at the same time.

32:05 So I would say some of the things that really optimized for in-place modification in my book don't really apply as much.

32:14 Because there was a huge difference in performance in the old storage engine between writing to something in place on the disk and doing something that, say, changed the size of a document and required MongoDB to write a whole new copy of the document somewhere else on the disk.

32:29 Right.

32:29 And the way it works now, it's totally different.

32:32 So, all right.

32:32 But from my look, when I went through it, I felt like this is really still quite current.

32:37 I think you're right about probably the considerations around the memory-mapped files and whatnot.

32:41 But other than that, it looked really good.

32:44 So let me read a really quick excerpt from the book just to kind of set the stage.

32:49 So you say, traditionally, relational databases, while familiar, present significant challenges and complications when trying to scale up to big data needs.

32:59 And into this world steps MongoDB to address the scaling.

33:03 And around all of this hype and excitement, a bunch of sites grabbed a NoSQL database, MongoDB database, and threw it out there and just started working with it without really understanding that it takes a different way of thinking about it.

33:18 Right.

33:18 It's paraphrasing.

33:19 Right.

33:19 But it's basically – and some of these things we just talked about around the durability and security were some of those things.

33:27 But I think more – probably the biggest mind shift that you have to make in this world, and you start – you dedicate a significant part of your book to this right at the beginning, and I think you should, is schema design and document design relative to, say, first normal form and third normal form and all that.

33:46 Yes.

33:47 So I would say that the biggest mindset shift that you've got to get through to be effective at MongoDB schema design is to ask what happens when you get rid of joins and what happens when you get rid of transactions.

34:00 So it's kind of the reads and writes.

34:02 So MongoDB does not support the join operator.

34:05 Well, there's a way to do it in the aggregation framework.

34:07 But putting that aside, generally, when you do a query in MongoDB, you can get a set of documents in your result set, but you're not going to be talking to two different collections when you do that.

34:19 You're going to be making a query against a single collection.

34:22 And you'll get documents from that single collection.

34:24 And so the question is, how do you actually use that in an efficient way?

34:29 So if I was building a blog in a relational database, then maybe if I need to render that blog post, I would maybe fetch something from the posts collection.

34:37 I would fetch something from the authors collection.

34:40 I would fetch something from – or not collection, but the post table, the authors table, comments table.

34:45 And I'd do a join of all these things, and you'd end up with all of the data that you need to represent that blog post to a web user.

34:54 Well, with MongoDB, you can do the same thing.

34:56 You could have a posts collection and a comments collection and an authors collection, and you can do kind of the join-y work in memory.

35:03 But you've gotten rid of a lot of the benefits of MongoDB because the nice thing about MongoDB is you can design your schema so that a single document can satisfy that web request.

35:12 So you could have the post with the embedded author information with all of the comments all in a single document.

35:18 So basically, you're doing a single fetch, a single round trip to the database.

35:22 And even on the database, if you're using a disk or you're using an SSD, whatever the case is, you've got all the locality right there.

35:28 So the whole document is right where Mongo is looking at that time.

35:32 And so it's able to basically just do things much more efficiently if you design your schema right.
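To make the single-document idea concrete, here is a minimal sketch of a blog post with its author and comments embedded, written as a plain Python dictionary. The field names and the `render` helper are assumptions for illustration; the point is only that one fetched document carries everything the page needs.

```python
# One document holds the post, its author details, and all of its comments.
post = {
    "_id": "lazy-migrations",
    "title": "Lazy Migrations",
    "author": {"name": "Rick", "bio": "Python developer"},
    "comments": [
        {"user": "michael", "text": "Nice post."},
        {"user": "sam", "text": "Very helpful."},
    ],
}


def render(post):
    """Build the page text from a single document, with no second lookup."""
    lines = ["{} by {}".format(post["title"], post["author"]["name"])]
    for comment in post["comments"]:
        lines.append("  {}: {}".format(comment["user"], comment["text"]))
    return "\n".join(lines)


page = render(post)
```

With three separate collections, the same page would take three lookups plus the in-memory joining work mentioned above.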

35:38 Yeah, and I think it's very much a Shakespearean type of thing, like to embed or not to embed.

35:44 That is the question, right?

35:45 Like really, every time I sat down to design a new data model for MongoDB, it's like, what are all the pieces?

35:51 What embeds where and what shouldn't be embedded for various reasons, right?

35:55 So like, for example, you mentioned you could have your post and it could have the author embedded and it could have the comments embedded and so on.

36:04 Maybe even there's categories, right?

36:07 Like categories and things that you could theoretically embed the category data into the post, but then you have to replicate that across all the different posts, right?

36:15 Sure.

36:15 Yes.

36:16 That may or may not be something you want.

36:18 Yeah.

36:18 So you still have relationships in your data.

36:21 That's a logical concern, right?

36:22 You can do an entity relationship diagram and you can still map that onto MongoDB.

36:28 The difference is with Mongo, when you have one of these one-to-many relationships, all of a sudden you now have the option, if it makes sense performance-wise, that you could take both of the entities and put them into a single collection.

36:39 But you can't do that in a relational database, right?

36:43 Relational kind of first normal form says you don't have multiple entries in a column.

36:48 But with MongoDB, that's sort of the norm.

36:50 You're allowed to have these array types that are being stored there.

36:54 So now you've got to decide, does it make sense to put it there?

36:56 Or if you've got a many-to-many join or a many-to-many relationship, the old way of doing it or the SQL way of doing it is you've got to have a join table that's got IDs from table one and IDs from table two.

37:07 And it tells you which ones match up with which ones.

37:09 MongoDB, if you're doing a blog, again, it's just an easy example.

37:13 So you've got tags or categories.

37:15 A lot of the time, that'll just be a list of strings that you put into the post.

37:19 And there's no need to actually have that join collection or that join table that you would have in SQL.
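A small sketch of the tags-as-a-list idea: each post carries its own list of strings, and a filter on that field selects posts by tag. The data and the `find` helper are illustrative stand-ins that run in memory. In MongoDB, a filter such as `{"tags": "mongodb"}` matches documents whose array contains that value.

```python
posts = [
    {"_id": 1, "title": "Schema design", "tags": ["mongodb", "python"]},
    {"_id": 2, "title": "SQLAlchemy tips", "tags": ["python", "sql"]},
    {"_id": 3, "title": "Sharding", "tags": ["mongodb"]},
]

# Filter document of the kind a driver would send to the server.
tag_filter = {"tags": "mongodb"}


def find(posts, tag_filter):
    """In-memory imitation of the array-membership match, for illustration only."""
    wanted = tag_filter["tags"]
    return [p for p in posts if wanted in p["tags"]]


titles = [p["title"] for p in find(posts, tag_filter)]
```

No join table or join collection is involved; the relationship lives in the post document itself.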

37:25 I think that's totally right.

37:26 And even if your tag thing was more complicated, right, you can do these many-to-many relationships and maybe store a list of tag IDs in every post.

37:37 Right.

37:37 And then reach back into the other table.

37:39 Yeah.

37:40 You'd almost never want to have something like a join table in MongoDB.

37:44 I can't think of a good case.

37:46 You'll almost always want to either have a list of IDs in collection A or a list of IDs in collection B or both.

37:53 But you wouldn't want to have a separate collection where the existence of a document means that these other two documents are joined.

37:59 Yeah.

37:59 I find that to be almost never.

38:01 I don't think I've ever seen that in a well-designed case either.

38:05 I definitely have never made use of it in the apps that I built.

38:08 That was one of the problems with people coming from the SQL world is they know how to model things there.

38:12 And they just assume that if I take the same schema that I had in SQL, it's going to be like that but faster if I do it in MongoDB.

38:19 Yeah.

38:20 Because I heard Mongo is faster.

38:21 So it'll be faster if I just put this over here.

38:23 Exactly.

38:24 Yeah.

38:24 It probably is faster, but not because you copied over your schema design from a relational database.

38:30 Yeah.

38:30 Or in many cases, it would end up being slower because you're doing all of the logic of the join at that point, but you're doing it in whatever your programming language is.

38:39 So I love Python, but it's not this super high-performance bare metal language.

38:44 If you're building a join engine in Python, yeah, you can do that.

38:48 But you are now talking about introducing network latency to talk to the database.

38:52 You're talking about it's written in Python.

38:54 It's not written in C++ like the MongoDB engine is or like C database engines might be in other cases.

39:00 So you're kind of, if it's faster, then it's an unusual situation.

39:05 You're usually going to kill yourself performance-wise.

39:07 This portion of Talk Python To Me is brought to you by Hired.

39:11 Hired is the platform for top Python developer jobs.

39:15 Create your profile and instantly get access to thousands of companies who will compete to work with you.

39:20 Take it from one of Hired's users who recently got a job and said, I had my first offer within four days and I ended up getting eight offers in total.

39:27 I've worked with recruiters in the past, but they were pretty hit and miss.

39:31 I tried LinkedIn, but I found Hired to be the best.

39:33 I really like knowing the salary up front and privacy was also a huge seller for me.

39:38 Well, that sounds pretty awesome, doesn't it?

39:40 But wait until you hear about the signing bonus.

39:42 Everyone who accepts a job from Hired gets a $300 signing bonus.

39:45 And as Talk Python listeners, it gets even sweeter.

39:48 Use the link talkpython.fm/Hired and Hired will double the signing bonus to $600.

39:53 Opportunity is knocking.

39:56 Visit talkpython.fm/Hired and answer the door.

40:00 Yeah.

40:01 So one of the things while we're on this document design stuff is in MongoDB, there's no concept of a foreign key constraint or relationship.

40:09 Right.

40:09 I can't have one document with a strict relationship to another document.

40:13 Right.

40:14 I'm not really sure how much value you get.

40:15 There's no joins and things like that.

40:17 Like, so oftentimes people think that means there's no relationships in MongoDB.

40:22 Right.

40:23 Yeah.

40:23 But I don't think that that's true.

40:24 I think you can put them into these models.

40:27 They just don't span documents, right?

40:28 Right.

40:29 Yeah.

40:29 You can have, you know, the relationships can exist within a document and you get atomic updates and things like that.

40:35 So you get the database to enforce some consistency there.

40:38 And you can also model the relationships with, I mean, it's not enforced by Mongo, but you can have a foreign key concept where you've got an ID of a different document in another collection and you're storing that ID.

40:50 The difference is that you always have to take into account the possibility that that document might not actually exist.

40:55 That's right.
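Because nothing enforces such a reference, code that follows one has to handle the case where the target is gone. A minimal in-memory sketch follows; the collection contents, the `author_id` field, and the fallback label are assumptions for the example.

```python
# Stand-in for an authors collection, keyed by _id.
authors = {"a1": {"_id": "a1", "name": "Rick"}}

posts = [
    {"_id": "p1", "title": "Schema design", "author_id": "a1"},
    {"_id": "p2", "title": "Orphaned post", "author_id": "a9"},  # author no longer exists
]


def author_name(post, authors):
    """Follow the stored ID, tolerating a reference that points at nothing."""
    author = authors.get(post["author_id"])
    if author is None:
        return "(unknown author)"
    return author["name"]


names = [author_name(p, authors) for p in posts]
```

In a relational database the foreign key constraint would have rejected the second post or blocked the author's deletion. Here the application decides what a dangling reference means.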

40:56 Yeah.

40:56 I think of them as two things.

40:57 I have a slightly different name that I've used over the years for it.

41:00 Like for the stuff that's within your document, you've got a post and it has a list inside of it of comments.

41:06 Like that is a super strong relationship.

41:08 You can't have a comment without the post.

41:10 It is the same thing.

41:11 But if you were like reaching back to an author table through just a foreign key constraint, that doesn't really exist, but it's logically there.

41:18 I call those soft foreign keys or something like that.

41:23 Like they're not enforced, but they technically, they fill the same role, right?

41:26 Yeah.

41:26 They fill the same role.

41:27 And sometimes people call them references or document references.

41:30 Way back when I started with MongoDB, one of the patterns that they kind of promoted was storing the collection name along with the ID.

41:39 I never found that super valuable, but that's another thing that you can do.

41:42 If you want to have a reference that could go to any collection, then you can just throw the collection name in there.

41:46 Yeah.

41:47 Interesting.

41:47 That works well at the low level, at like the PyMongo level.

41:50 Less good at the ODM level.

41:52 Right.

41:52 So let's talk about some of the use cases.

41:55 So we've kind of set up this, you talk a lot about like, this is what modeling in this world looks like.

41:59 You also talk about mimicking transactional behavior with compensation models that work well in MongoDB.

42:05 But let's just kind of leave that as there.

42:07 So you kind of set the ground with some of these foundational things.

42:10 And then you say, let's talk about six different use cases, all the performance considerations and how you model it and everything.

42:18 Right.

42:18 So do you want to touch on some of your favorite ones there and maybe like what was non-obvious or maybe something like that?

42:24 Yeah.

42:24 So the first one has some of the more interesting parts, I think, or some of the things that I found really interesting.

42:30 And I guess that's why I put it first, but that's the operational intelligence chapter.

42:34 And it's really focusing on analytics and dealing with kind of high volume data that's coming in quickly.

42:40 There were two different use cases in there, or maybe there were three in there.

42:45 But two in particular that I remember were, one of them was incremental aggregation.

42:49 So this is, you've got something coming in, you've got these aggregate statistics that you want to report out immediately.

42:56 So one approach that you could do for aggregation is you can run a big MapReduce job on a Hadoop cluster, and that'll come back in a few minutes.

43:05 But if you actually want something that's up to the minute, then how do you do that in an efficient way?

43:10 And so this relied a lot on the in-place updating, and it was based on MongoDB's own, it's not called Cloud Manager, but their monitoring service, which would actually monitor MongoDB performance for you.

43:22 And they offered this as a free service.

43:23 So it was like, how do we deal with this scale?

43:25 So let me show you how you can build your schema to deal with that kind of scale and how you can keep the performance high, even with an mmap storage engine.

43:33 No, I just, I think it was, what I found interesting about this was you start from, like, let's start with log file data, like something out of Apache web request or something like that.

43:44 Let's put that in the database.

43:46 And then let's start doing, like, processing and analysis of it.

43:50 And you have some really interesting graphs and various things that say, like, let's look at, if we design it this way, what are the trade-offs?

43:58 What are the benefits, what are the drawbacks?

44:00 And there were a number of non-obvious ways in which things kind of slowed down or got out of control.

44:08 And you ended up with quite an interesting aggregation report database, right, where you pre-computed and pre-allocated a whole bunch of pieces and then used some of the in-place update operations to sort of, like, increment the numbers at the right levels as these things came in, right?

44:24 Yeah, that was the incremental aggregation one.
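The pre-allocate-then-increment pattern described here can be sketched as follows. The document layout (one document per page per day with 24 hourly counters) is an assumption for illustration, and `apply_inc` is only a tiny in-memory imitation of what a `$inc` update does on the server.

```python
def preallocate(page, day):
    """Create the full-size stats document up front so later updates never grow it."""
    return {
        "_id": "{}/{}".format(day, page),
        "daily": 0,
        "hourly": {str(h): 0 for h in range(24)},
    }


def hit_update(hour):
    """Update document for one page view, in the shape a $inc update takes."""
    return {"$inc": {"daily": 1, "hourly.{}".format(hour): 1}}


def apply_inc(doc, update):
    """In-memory imitation of $inc with dotted paths, for illustration only."""
    for path, amount in update["$inc"].items():
        *parents, leaf = path.split(".")
        target = doc
        for key in parents:
            target = target[key]
        target[leaf] += amount


stats = preallocate("/index.html", "2017-04-26")
for hour in (13, 13, 14):
    apply_inc(stats, hit_update(hour))
```

Because every counter already exists, each hit only changes numbers in place, which is the property the memory-mapped storage engine rewarded.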

44:26 So that was, the problem there is it was storing the aggregates in these large documents.

44:33 And sometimes the documents would grow and that would cause performance problems.

44:36 And then you get into a secondary issue, which is that even though you think of these things as Python dictionaries, which are super fast to access any item in them, physically they're stored as a list of key value pairs on the disk.

44:51 And so it turns out it takes longer to access something towards the end than it does to access something towards the beginning.

44:57 So how can we mitigate that issue?

44:58 And those were just some of the sorts of things that you can only see when you've actually run some performance metrics against it.
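A rough model of why position matters: if fields are stored as an ordered list of key-value pairs, finding a key means scanning from the front. One way to reduce the worst case, assumed here for illustration, is to nest the counters (hours containing minutes) instead of keeping one flat list of 1,440 minute fields. This simplified model counts keys examined and is not a measurement of MongoDB itself.

```python
def scan_cost(pairs, key):
    """Number of keys examined before `key` is found in an ordered list of pairs."""
    for examined, (k, _) in enumerate(pairs, start=1):
        if k == key:
            return examined
    raise KeyError(key)


# Flat layout: one field per minute of the day.
flat = [(str(m), 0) for m in range(24 * 60)]

# Nested layout: 24 hour fields, each holding 60 minute fields.
nested = [(str(h), [(str(m), 0) for m in range(60)]) for h in range(24)]


def nested_cost(hour, minute):
    """Keys examined to reach one minute counter in the nested layout."""
    hour_cost = scan_cost(nested, str(hour))
    minutes = nested[hour][1]
    return hour_cost + scan_cost(minutes, str(minute))


flat_worst = scan_cost(flat, str(23 * 60 + 59))   # last minute of the day
nested_worst = nested_cost(23, 59)                # same minute, nested layout
```

In this model the last minute of the day costs 1,440 key comparisons in the flat layout and 84 in the nested one.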

45:05 Again, just a shout out to Python.

45:06 I did all this with, at that time, IPython notebook and printed out the graphs and, you know, just threw those into the book right there.

45:14 So I think those are actually screenshots from IPython, now Jupyter notebook.

45:17 Yeah, they looked like some Matplotlib graphs or something, which is cool.

45:21 Yeah.

45:21 All right.

45:22 So another thing where people, at least in the early days, were like, oh, you can't use MongoDB for this, was e-commerce, and I totally disagree with that statement.

45:29 But you have a section where you talk about using MongoDB for like an e-commerce site, right?

45:34 Yeah.

45:34 So one of the big things or one of the difficulties with existing e-commerce, I guess the big one is Magento.

45:42 So Magento uses an entity attribute value store.

45:46 So they're still stuck on SQL, but they use SQL in a way that makes it non-relational.

45:51 Basically, instead of keeping your products in a products table where each one of the attributes of that product is a column, they just say, I've got one big table that says for this entity, maybe it's a shirt.

46:02 I have an attribute, which is a size, and it's an XL.

46:05 For this entity, which is a drill, it has, you know, some other attribute, and it's, you know, 120 volts or whatever.

46:11 And so out of that, they're able to get this very flexible schema.

46:15 So it's kind of like, well, that's not really a fantastic way to map to the relational model.

46:20 But they kind of have to because you want to deploy to a store that might have all sorts of different items in it that have different attributes that you want to store.

46:29 Nice thing about MongoDB is not all of your documents have to look like each other inside the collection.

46:35 So Mongo lets you actually say, well, I want to store drills and shirts in this collection.

46:39 Can I do that?

46:40 And it turns out you can.

46:41 Maybe there's certain attributes that they all have in common.

46:44 They have an SKU number.

46:46 They have a price.

46:47 They have maybe a quantity available.

46:50 But then they've all, you know, got their other things that are custom to each one.

46:54 And so you can introduce this polymorphism with MongoDB in a much more natural way, I think, than using something like an entity attribute value schema in a relational database.
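
To make the contrast concrete, here is a small sketch that folds entity-attribute-value rows into per-product documents of the kind a single MongoDB collection could hold. The row data, the field names (`sku`, `price`, `qty`, `size`, `voltage`), and the helper `rows_to_documents` are all invented for illustration.

```python
# Entity-attribute-value rows: one row per (entity, attribute, value).
eav_rows = [
    ("shirt-1", "sku", "SH-001"),
    ("shirt-1", "price", 19.99),
    ("shirt-1", "qty", 40),
    ("shirt-1", "size", "XL"),
    ("drill-1", "sku", "DR-001"),
    ("drill-1", "price", 89.00),
    ("drill-1", "qty", 7),
    ("drill-1", "voltage", 120),
]


def rows_to_documents(rows):
    """Fold EAV rows into one dictionary (document) per entity."""
    docs = {}
    for entity, attribute, value in rows:
        docs.setdefault(entity, {"_id": entity})[attribute] = value
    return list(docs.values())


products = rows_to_documents(eav_rows)

# Shared fields can be used across every product type...
in_stock_skus = [p["sku"] for p in products if p["qty"] > 0]

# ...while type-specific fields are simply absent where they do not apply.
sizes = [p.get("size") for p in products]

print(products)
print(in_stock_skus, sizes)
```

Each product reads back as one self-describing document, whereas the EAV form needs every attribute row gathered up before the product can be used.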

47:03 Yeah, I think that's leveraging a pretty interesting aspect.

47:06 And you're in some sense implementing inheritance for specialization.

47:12 Not exactly, but something to that effect, right?

47:14 And because the schema is really enforced at the application layer, not in the database layer, that flexibility pretty much just flows through.

47:24 And you end up with these sparse objects.

47:25 Like maybe one document has a drill bit size or something.

47:29 The other one has a shirt size, right?

47:31 And those don't appear in both records.

47:33 You don't waste the space.

47:34 Yeah, exactly.

47:34 And you can build your ODM to kind of take care of that.

47:40 I haven't been doing a lot with Ming super recently, and I'm not sure if we had the ability to kind of discriminate based on the data that it loads out as to which physical type of object it's creating.

47:52 But that's certainly something that you can do with an ODM.

47:55 And I know it's something that SQLAlchemy does with relational databases, but it requires you to either do a super complex schema in SQL or it requires you to waste a lot of columns.

48:07 And those are kind of your two options to do this sort of object-oriented polymorphism.
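
The idea of choosing the object type from the loaded data can be sketched in plain Python. This is not Ming's or SQLAlchemy's API: the `type` discriminator field, the class names, and `load_product` are hypothetical, and the sketch assumes each document carries exactly the fields its class expects.

```python
from dataclasses import dataclass


@dataclass
class Product:
    sku: str
    price: float


@dataclass
class Shirt(Product):
    size: str


@dataclass
class Drill(Product):
    voltage: int


# Maps the value of the (assumed) "type" field to the class to instantiate.
PRODUCT_TYPES = {"shirt": Shirt, "drill": Drill}


def load_product(doc: dict) -> Product:
    """Pick a class from the document's discriminator and build the object."""
    fields = dict(doc)  # copy so the caller's document is not modified
    cls = PRODUCT_TYPES.get(fields.pop("type", None), Product)
    return cls(**fields)


shirt = load_product({"type": "shirt", "sku": "SH-001", "price": 19.99, "size": "XL"})
drill = load_product({"type": "drill", "sku": "DR-001", "price": 89.0, "voltage": 120})
plain = load_product({"sku": "GEN-001", "price": 5.0})
print(shirt, drill, plain)
```

Documents without a recognized `type` fall back to the base `Product` class, so the shared fields stay usable even for items that have no specialized class.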

48:11 Nice.

48:11 So what are some of the other ones that you cover that you really like?

48:15 So I did have some fun with the online gaming chapter because that was just, I don't know, games are fun, but kind of like brainstorming out like what are some of the data structures that you might need when you're building this?

48:26 How do you do these in a, say, it's a massively multiplayer online game?

48:31 How would you actually store this?

48:32 How would you scale it?

48:33 How would you do the sharding?

48:35 The online advertising networks was also interesting just because it's a very high frequency sort of application.

48:42 And it's something that I had seen a little bit of at SourceForge.

48:45 And, you know, one of the things that you mentioned earlier on was, you know, SourceForge got slower and slower and slower.

48:52 So part of that we can blame on maybe PHP and Postgres, but part of it we just have to blame on the ad networks.

48:58 Because SourceForge is an advertising supported site, a lot of these ad networks just took a long time to render the ad.

49:04 And that's kind of slowing down your browsing experience and can cause various other problems.

49:09 So what if we could speed those things up and deliver contextual advertising to people in a way that doesn't make them want to pull their hair out?

49:16 So that was also an interesting one.

49:18 Yeah, that's a fun one to work on.

49:20 And I know a couple of people working in this ad network space and they're using Mongo and they have some pretty intense requirements around the traffic that they handle.

49:31 Because if you run ads on a site that gets, you know, a million views a day and that's just one of the places, right?

49:37 You all of a sudden are getting a million requests a day.

49:40 You're getting a million requests a day and you're trying to target those ads now based on some content, you know, that's going on in the article.

49:47 So presumably you've indexed that and you know something about the keywords, but then you probably have some real-time bidding going on for those too.

49:54 So how do you actually choose the ad inside that request-response cycle?

49:58 Because you know that your content people that are actually paying or that, you know, you're advertising on their site, they're not going to like it if you slow down the experience for their viewers.

50:08 No, absolutely not.

50:09 So, yeah, that's definitely a cool example.

50:11 So there was a bunch of great examples and I learned a lot from looking at how you implemented them and the trade-offs.

50:17 And it's a great book.

50:19 I definitely recommend if people are, they know a little bit of Mongo and they're like, I think I should be using this, but I don't really know how to solve this problem.

50:25 There's a lot of good stuff to study there around schema design and whatnot.

50:28 Well, thanks.

50:28 Yeah, you bet.

50:29 So there's a couple of options on where you might run your MongoDB server.

50:34 And I guess it depends on how complicated of a situation you have, on how much you want to think about this or need to think about this.

50:42 If you're just running a single server and it's just like there on a machine, maybe you can run that on a VM.

50:47 You still got to deal with backups and whatnot.

50:49 But there's also like hosted Mongo.

50:52 They have MongoDB Atlas.

50:56 What are your thoughts on like if somebody comes to you and says, hey, I want to do the site and run maybe let's say a three node replicated cluster?

51:06 Like what would you consider?

51:07 I would, by default, I would hope that their budget would afford them to get Atlas.

51:13 So Atlas is actually the cloud service by MongoDB.

51:15 They'll host your Mongo for you.

51:17 They'll host the latest copy or the latest version, handle your backups and everything.

51:23 Now, if you're dealing with a large amount of data, the backups can start to get pretty pricey.

51:27 So that might not be an option.

51:29 But unless you have strong operations people on your team, I wouldn't immediately jump to saying, oh, I need to self-host.

51:37 I need to build it.

51:38 I need to run it on my own VMs.

51:40 So there's other options that you can go to.

51:42 mLab is one that I've used in the past.

51:46 I've really enjoyed working with them.

51:47 They provide, you know, hosted MongoDB. There's also Compose.io and ObjectRocket.

51:52 These are all hosted MongoDB options that you can go with.

51:56 And then if you are going to decide to self-host, there's actually some MongoDB provided tools to do that.

52:03 So if you actually go into the MongoDB Cloud Manager, provide them your EC2 account keys, for instance, and you say, I want to use these three servers or these three virtual machines that I've provisioned to make a three node replica set, then they can do that for you as well.

52:19 So that would probably be, you know, the next step is get your own VM or get your own VMs and then install Cloud Manager and go ahead and have Cloud Manager install that.

52:29 Okay, cool.

52:30 And the Cloud Manager, that's from MongoDB themselves?

52:32 Yeah, that's also from MongoDB.

52:33 So all of these things kind of run in the same UI on MongoDB.

52:37 I guess it's .com.

52:39 I know they have .com and .org both.

52:41 Yep.

52:42 That used to be a big confusion.

52:43 You couldn't find the download link on .com.

52:45 Yeah.

52:45 One thing I'd like to say is I have used MongoLab before, mLab.

52:52 It used to be MongoLab.

52:53 They renamed it.

52:54 And I think they're one of the few options that has a free Mongo server.

52:59 So if you want to just set up a little prototype and get started and play around, they have like a half a gig free server you can set up and use there.

53:08 And so that's pretty sweet.

53:10 They're great.

53:10 I use them.

53:12 I still use them today.

53:13 I use Atlas a little bit, but I use mLab as well.

53:16 One of the nice things about mLab is that there's an integration to Heroku as well.

53:20 So if you're using Heroku, you can get the mLab plan for free, and then it's just kind of like I'm not running a server anywhere.

53:26 Somebody else is doing it for me, and I can play around with things and have them work and with authentication enabled as well.

53:32 Yeah.

53:32 Yeah, those all come set up correctly, let's say.

53:35 Yes.

53:36 Perfect.

53:37 All right.

53:38 Awesome.

53:38 So, yeah, just right now I'm running my own MongoDB server on my own VM, but I've been working with Mongo for six years.

53:45 So I feel like that's probably a point where I can go run my own VM and do my own backups daily, things like that.

53:51 But, yeah, these are all good options, and I know that jumping on one of the hosted ones is pretty nice to get started.

53:56 So let's talk about some other stuff that you've been up to.

54:00 First of all, like all of this MongoDB work, you now just came out with a MongoDB course for Python developers, right?

54:06 Yeah, so I'm working with Packt Publishing, and they wanted to put out some courses on MongoDB.

54:12 And I just came out with a video course called Developing with MongoDB, kind of a three-hour course that gives you an intro to what MongoDB is, how it works, and how it's different from relational databases.

54:25 It takes you through using it with Python and takes you through some schema design.

54:30 It doesn't get into some of the big data analytics, you know, using it with Hadoop or some of the other things, but it does give you a good foundation in MongoDB.

54:40 And, you know, I was happy to say that that was just published yesterday, which would be the 25th of April.

54:45 We're recording on the 26th, so happy to see that out there.

54:48 Yeah.

54:49 How's that for timing?

54:50 Perfect, huh?

54:51 Yeah.

54:51 Nice.

54:52 That's cool.

54:53 That must have been fun to make.

54:55 And you also, speaking of ODMs, you wrote a book with the R instead of a D in there as well, the ORM, right?

55:00 I did.

55:01 This is prior to my involvement with MongoDB, and the name of the book is Essential SQLAlchemy.

55:06 It's also an O'Reilly title.

55:07 So SQLAlchemy, if you are using Python and you are using an SQL database and you are not using SQLAlchemy, then you're missing out, I would say.

55:17 And you're probably a Django developer because they have a really nice ORM themselves, and it has a lot of other features, if you're using Django, that are nice.

55:25 But SQLAlchemy is one of the best libraries, object-relational mappers.

55:30 I mean, it is the best I've ever seen.

55:33 Yeah, it's really, really good.

55:34 I've used it a lot, and it's been perfect.

55:36 Yeah.

55:37 A lot of the time when you get something like an object-relational mapper, then you give up a lot of the goodness of, like, a lot of the strengths of SQL.

55:46 And I think that Mike Bayer, who is the author of SQLAlchemy, really did a good job of giving you the abstractions of an ORM while still allowing you to get the performance of raw SQL.

55:57 So I was really happy with that, and a second edition of that came out in the last year.

56:01 I didn't have a lot to do with the second edition, but because I wrote the first edition, I get to have my name on the cover.

56:06 Nice.

56:07 Perfect.

56:09 Yeah, and I actually had Mike Bayer on one of the first episodes, episode five, so dug into that.

56:14 Yeah, I like SQLAlchemy a lot.

56:16 He is a smart dude.

56:17 Indeed.

56:18 All right, Rick.

56:20 So we're about out of time.

56:21 I don't want to take all your day up, but so let me ask you two quick questions before I let you out of here, and then one more thing after that.

56:30 So if you're going to write some Python code, what editor do you open up?

56:34 I open up Sublime Text 3.

56:35 Sublime Text.

56:36 All right.

56:37 Definitely a solid one.

56:38 Do you have extra plugins, or do you use the Anaconda IDE thing that plugs in there?

56:43 Not the Continuum thing, but something else.

56:45 No, I pretty much use almost the default install.

56:49 I mean, package control is in there.

56:51 Occasionally I do some React programming, to mention a different programming language, so I get the JSX plugin and things like that.

56:58 But it's Sublime, pretty vanilla for me.

57:00 Nice.

57:01 There's a ton, over 100,000 packages on PyPI.

57:04 Is there one that's kind of notable you think maybe people haven't tried or heard of that you want to recommend?

57:08 Well, other than things like PyMongo and SQLAlchemy that we've already mentioned, one of the ones that just comes up over and over, and a lot of people may have already heard of it, is Requests.

57:17 It's the most un-Google-able package name.

57:22 But if you're going to do any web programming in Python as a client, you need the Requests library.

57:28 Yeah, absolutely.

57:29 So I think it would be un-Google-able if it weren't so popular.

57:34 Yeah, true.

57:35 So Python Requests is your best bet.

57:37 Yeah.

57:38 Exactly, exactly.

57:39 All right.

57:40 Well, that's about all the time we have to talk about Mongo for today.

57:43 Any final calls to action?

57:44 People are excited about this stuff.

57:46 How do they learn more, do more?

57:48 So MongoDB.org can teach you a lot about MongoDB.

57:52 You know, obviously the course, which will be in the show notes, but there's also MongoDB World coming up this summer in Chicago.

57:58 So that might be a good place if you're really interested in this database.

58:02 It's probably the cheapest education that you can get.

58:04 And it's, you know, two days of talks and tutorials before that.

58:08 So I guess those are my calls to action.

58:11 Yeah, cool.

58:11 MongoDB World, that's like the PyCon of MongoDB.

58:14 Yes.

58:15 That's the big one to go to.

58:17 It's in Chicago, and that's cool.

58:19 It used to be in New York City every time.

58:21 Yeah, this is the first time that they've kind of ventured out of Manhattan.

58:24 So it'll be interesting to see what goes on there.

58:26 Yeah, indeed.

58:27 All right.

58:28 Well, Rick, thank you so much for being on the show.

58:29 It's been great to chat about Mongo.

58:31 All right.

58:32 Well, thank you.

58:34 This has been another episode of Talk Python To Me.

58:37 Today's guest has been Rick Copeland.

58:40 And this episode has been sponsored by Advance Digital and Hired.

58:43 Advance Digital would love to work with you to build and extend one of the most visited websites in the U.S. in Python.

58:51 Reach out to them at python.advance.net to see if there's a fit.

58:57 Hired wants to help you find your next big thing.

58:59 Visit talkpython.fm/hired to get five or more offers with salary and equity presented right up front and a special listener signing bonus of $600.

59:08 Are you or your colleagues trying to learn Python?

59:10 Well, be sure to visit training.talkpython.fm.

59:13 We now have year-long course bundles and a couple of new classes released just this week.

59:19 Have a look around.

59:20 I'm sure you'll find a class you'll enjoy.

59:22 Be sure to subscribe to the show.

59:24 Open your favorite podcatcher and search for Python.

59:26 We should be right at the top.

59:28 You can also find the iTunes feed at /itunes, Google Play feed at /play, and direct RSS feed at /rss on talkpython.fm.

59:37 Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.

59:42 Corey just recently started selling his tracks on iTunes, so I recommend you check it out at talkpython.fm/music.

59:49 You can browse his tracks he has for sale on iTunes and listen to the full-length version of the theme song.

59:54 This is your host, Michael Kennedy.

59:56 Thanks so much for listening.

59:57 I really appreciate it.

59:59 Smix, let's get out of here.

01:00:01 Stating with my voice, there's no norm that I can feel within.

01:00:05 Haven't been sleeping.

01:00:07 I've been using lots of RSS.

01:00:08 I'll pass the mic back to who rocked his best.

01:00:11 First, developers, developers, developers, developers.

01:00:14 First, developers, developers, developers.

01:00:17 First, developers, developers.

01:00:20 First, developers, developers.

01:00:22 First, developers.
