#105: A Pythonic Database Tour Transcript

Recorded on Thursday, Mar 16, 2017.

00:00 Michael Kennedy: There are many reasons it's a great time to be a developer. One of them is because there are so many choices around data access and databases. So, this week, we take a tour with our guest, Jim Fulton (@j1mfulton), of some of the databases you may not have heard of or haven't given a try yet. You'll hear about the pure Python database, ZODB. There's Zero DB, an end-to-end encrypted database, in which the database knows nothing about the data it's even storing. And Newt DB, spanning the world of ZODB and JSON-friendly Postgres. This is Talk Python to Me. Episode 105, recorded Thursday, March 16, 2017. Welcome to Talk Python to Me. A weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you interruption free by GetStream. That's right, GetStream, a new sponsor of the show, has a really cool offer for you guys. If you're building an application that has some form of activity stream like you might see in Slack or Facebook or Instagram, and others, then you owe it to yourself to have a look at GetStream. They provide scalable, reliable, and personalizable hosted API feeds as a service. The feed is the most intensive component of these types of applications. Yet there's no need for you to reinvent the underlying feed technology when GetStream has the infrastructure and a Python API already in place. Go from zero to scalable feed in hours, not weeks or months. They even use advanced machine learning to serve up personalized results to each and every user. Stream powers the feeds for over 500 companies, including maker space and fabric, with a total of 70 million end users. Try the API yourself in a short five minute interactive tutorial at talkpython.fm/stream.
Be sure to create an account and try it for yourself and help support the show. Jim, welcome to Talk Python.

02:25 Jim Fulton: Thank you, it's nice to be here.

02:26 Michael Kennedy: It's great to have you here. We have a whole bunch of really cool topics, generally around data, but not all data, right? So we're gonna talk about ZODB, something called Zero DB, which is something I'd never heard of and really interesting, actually. Newt DB, and then a little bit more process. So it's some adroll concepts and continuous integration, and so on. But, of course, before we get to all those, let's start at the beginning. What's your story? How did you get into programming?

02:53 Jim Fulton: I was exposed to programming fairly young. Although back then, it wasn't very common or very accessible. I'd say I really got hooked in grad school when I was doing research on rainfall runoff model calibration. And I had to hack some alternate statistical techniques, calibration techniques into a rainfall runoff model. And I found that I enjoyed that quite a bit. That became for years I was a civil engineer/hydrologist, and the software aspect of it kept pulling me and pulling me until it finally extracted me.

03:29 Michael Kennedy: I think that's really interesting. A lot of people get into programming that way. Somewhat grudgingly, like, okay, I have to learn this programming thing to make whatever it is I'm doing, like, actually work, right? But I sort of went down that path myself to some degree. And after a few years, I realized, actually, what am I doing this other stuff for? This programming stuff is really great. I'm just gonna go do more of that.

03:50 Jim Fulton: Yup.

03:50 Michael Kennedy: It's funny how life is sort of serendipitous like that, but it's also good, right? So was that original bit of work, was that in Python, or was that in something else?

03:59 Jim Fulton: Oh, no, that was in FORTRAN.

04:01 Michael Kennedy: Oh, yeah, FORTRAN.

04:04 Jim Fulton: I mean, I went through a lot of languages over my career. That work was in 1981.

04:12 Michael Kennedy: Okay. So probably not Python.

04:14 Jim Fulton: Yeah, definitely not Python.

04:16 Michael Kennedy: Given it was 10 years before it released.

04:18 Jim Fulton: But, yeah, I've used a lot of different languages. I used FORTRAN for a long time. I used PL/1 for a little while. I used Ada for a little while. I really like OO languages. I couldn't afford Smalltalk for a long time, so I used a language called Actor for a while. And then much later I did an interesting application with GNU Smalltalk, which was an adventure in and of itself because it was a fledgling and the garbage collector was broken, so I had to use a special branch with a non-broken garbage collector. So, anyway, I've had lots of fun with different languages over the years.

04:50 Michael Kennedy: Yeah, that sounds, sounds like you've really been through a lot of them. So, are you doing mostly Python these days?

04:56 Jim Fulton: Mm-hmm, yeah. Although I did a couple years ago at Zope Corporation, we did a bunch of Android development, and I got to use Scala, which I really enjoyed. I like to describe Scala as a beautiful evil language. Because it just invites so much abuse, but it allows you to produce code within the JVM and it's just insane, mind-blowing notions of type-based development. People do interesting development tasks in the compiler.

05:27 Michael Kennedy: Wow, that sounds interesting and evil.

05:31 Jim Fulton: It was a lot of fun. I haven't done any of that. I did some Rust lately, which kind of reminded me of a light-weight version of that. This last year.

05:37 Michael Kennedy: Sure, sure, okay. Yeah, I've been wanting to learn Rust. But I haven't really gotten into it. I did look at Go recently this year. But, I don't know, I'm just not sold on Go. I still like Python a lot better. We'll see about that.

05:48 Jim Fulton: I'm actually very anti Go.

05:50 Michael Kennedy: Yeah.

05:51 Jim Fulton: I think it's bad on multiple levels. But I like Rust quite a bit.

05:54 Michael Kennedy: Okay, well, that's interesting. Maybe I'll be learning Rust eventually. But what do you do day to day these days? You're not still at the Zope Corporation doing Android development, right?

06:02 Jim Fulton: Nope, nope. At Zope, I did a ton of different things. But towards the end, we were doing some Android development among other things. But these days, I'm splitting my time between paid work to sort of keep the lights on, and open-source work. I got an opportunity to work with a company called Zero DB about a year ago. That and also my sons are grown and they've moved away, and so we were sort of downsizing. So that was an opportunity to sort of have enough money and reduce my run rate and focus on some open-source stuff for a while. And that's really what I'm doing right now. Is paying attention to some open-source projects that have been neglected for a while as well as exploring some new ideas.

06:44 Michael Kennedy: Oh, that's really, really great. And it must feel really good, it must just be great to just stop, look at these projects that are pretty mature, and say, okay, I'm gonna work on these things. And I don't have to go to meetings. I don't have to hit some silly deadline that's not realistic or work on some feature that I think adds no value, right? Just be able to focus on what you want, right?

07:06 Jim Fulton: Yup.

07:07 Michael Kennedy: Yeah. Excellent. So we'll be touching on some of these projects, I'm sure, so let's start with one of the older projects, I guess, that's been around since 1996, was ZODB. What is ZODB?

07:18 Jim Fulton: So, ZODB is an object-oriented database for Python. And when I say object-oriented, I contrast that with object-based because lots of people refer to databases that are object based that I don't really consider object oriented. The original goal of object-oriented databases, which were a pretty exciting thing back in the, I don't know, late 80s, maybe early 90s, was to try to reduce or eliminate the impedance mismatch between programming languages and databases. So, in databases, you know, you have a very different computational model than you do in a programming language, especially some of the, especially object-oriented programming languages.

07:58 Michael Kennedy: Yeah, you have hierarchies of object graphs in object-oriented languages. And you have highly normalized data that work to minimize duplication and let you approach the data from any angle you want. But there's always this, I pull it into Python and build it into an object graph, and then I tear it back apart into all the other tables and put it back again, right? So, these object databases, they try to just say, let's keep them in the same shape, something like that?

08:28 Jim Fulton: Well, again, I don't think there are many, I don't know, I'm not sure I know of any object-oriented databases today other than ZODB. I mean, I'm sure there are some. My sense is that a lot of the object-based databases let you get objects, but they don't necessarily avoid having to do queries and doing assembly. Like, for example, some databases we refer to as object based seem to be more like graph based, where you have the ability to query graphs, but it's not sort of, it's still somewhat of a foreign object.

08:58 Michael Kennedy: I see.

08:59 Jim Fulton: Like, when you use ZODB, there are some exceptions, but you have to subclass the special base class, and you have to identify transaction boundaries, which that latter aspect is usually automated, depending on your, you know, your situation. But beyond that, it's literally just as if you were working with objects in memory, you don't really query a database. You know, the way you query database is the way you query something in Python. You maybe look up a key in a mapping, or maybe you access an object's attribute. And accessing an object's attribute might cause data to be loaded from the database. But that's transparent to you.

09:34 Michael Kennedy: Okay, how interesting. So does it use interesting descriptors or something like that for attributes to do that?

09:41 Jim Fulton: That's where the base class comes in. So the base class, I can't remember, there must be, it's been so long since I've implemented it that I don't remember if there's a meta class lurking. I wouldn't be surprised if there was. But, basically, yeah, the base class does a couple of things. I've actually had a project that I've wanted to do for some time, which is to get rid of the base class. I have some hacks in mind involving weak reference data structures. But basically the base class, it watches attribute access, in a sense. So when you modify an attribute, it marks the object as dirty. And when you access an attribute, if the object is something we call a ghost, so in ZODB when you first load an object from the database, typically by referencing it from some other object, it's loaded as a ghost. And then when you actually access an attribute, which includes any method, then the ghost is activated. And its state is loaded into memory.
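
Editor's sketch: the ghost and dirty-tracking behavior described above can be illustrated with a toy base class. This is not ZODB's implementation. `ToyPersistent`, `load_state`, and `DATABASE` are hypothetical names invented for this example, and unlike the real thing it does not activate the ghost on method access.

```python
# Toy model of "ghost until touched, dirty once written". Hypothetical names only.
DATABASE = {1: {"name": "Ada", "balance": 10}}   # stands in for stored state
load_calls = []                                  # records each load, for inspection


def load_state(oid):
    """Pretend to fetch an object's state from storage."""
    load_calls.append(oid)
    return dict(DATABASE[oid])


class ToyPersistent:
    def __init__(self, oid, loader):
        # Use object.__setattr__ so bookkeeping does not mark the object dirty.
        object.__setattr__(self, "_oid", oid)
        object.__setattr__(self, "_loader", loader)
        object.__setattr__(self, "_ghost", True)    # no state loaded yet
        object.__setattr__(self, "_dirty", False)   # no unsaved changes yet

    def _activate(self):
        """Load state on first use; later calls do nothing."""
        if self._ghost:
            self.__dict__.update(self._loader(self._oid))
            object.__setattr__(self, "_ghost", False)

    def __getattr__(self, name):
        # Only called when normal lookup fails, which is the case for a ghost.
        if name.startswith("_") or not self._ghost:
            raise AttributeError(name)
        self._activate()
        try:
            return self.__dict__[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self._activate()
        object.__setattr__(self, name, value)
        object.__setattr__(self, "_dirty", True)    # remember to save at commit


account = ToyPersistent(1, load_state)
was_ghost = account._ghost     # nothing has been loaded at this point
name = account.name            # first attribute access triggers the load
account.balance = 25           # a write marks the object as changed
```

In this sketch a commit step would only need to look at objects whose `_dirty` flag is set, which is the point of watching attribute writes.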

10:33 Michael Kennedy: I see.

10:34 Jim Fulton: There's an in memory object cache that is effectively an incomplete database replica. At transaction boundaries, any changes that have been made in the database by other clients then cause any objects that were affected that are in your cache to be invalidated. And so, then the next time you access them, they're loaded automatically. So the data in memory is always consistent with the committed database as of some point in time.

10:59 Michael Kennedy: Okay, yeah, so you have transactions support in those sorts of things as well. That's pretty interesting. So I can, like, get an object from the database and pass it around and maybe, like, it was passed off to some other module. But eventually that reference will be updated because someone committed a transaction.

11:15 Jim Fulton: Right. The only sort of caveat there is that, so when you access a database, you open a connection. And on that connection is a root object. And then all other accesses you make are from that root object possibly through many steps. And then there's an object cache associated with that connection. And so that connection and its cache can only be accessed by one thread at a time. So you couldn't hand it off to a different process. And you couldn't hand it off to another thread and have both threads operating on it. But you could have multiple threads with their own database connections and they're essentially coordinating their activities via transaction commit. Very much in the way that software transactional memory either was going, I'm not sure of its current status, but I should have looked it up. Much in the way that software transactional memory is supposed to do that for PyPy.

12:02 Michael Kennedy: Interesting, okay, so, I don't think I've talked about software transactional memory previously. Maybe you could just give us the quick elevator pitch for what that is?

12:11 Jim Fulton: Well, it's like ZODB but not persistent.

12:15 Michael Kennedy: I see.

12:16 Jim Fulton: You know, there are lots of different sort of models for managing concurrency. And so some of the traditional models like locking are very expensive. And what a lot of systems have moved towards is something called the actor model, where you have different independent actors and message queues. And that's a model that works really well. But of course it's fairly invasive. You have to architect your model, your application around that. I think what the PyPy people were wanting to do is get rid of the GIL and try to find some way to get rid of the GIL without being crushed by all of the locking overhead of managing concurrency. So with transactions, you basically have multiple copies of the object space possibly with sharing and copy-on-write, et cetera. I'm not really super familiar with either their implementation or their status. But the idea is that you have basically different copies of memory, and those copies get synchronized when you reach a transaction boundary. And that means that at the transaction boundary, that's when you sync everything up. And everything else is completely independent, so you don't need any locks. Because you've only got one logical thread of control accessing the data.

13:26 Michael Kennedy: Right, and so, that sounds really cool. Basically, it's a very optimistic view of the world, right? We're gonna have all this stuff in memory. We're gonna try to make a bunch of changes. And it's probably fine, but if it's not fine, then we're actually gonna have to retry that function or whatever that was working it, right? So instead of taking locks, it'll basically restart parts of your code. Which is really quite a different way of thinking about solving this problem, isn't it?

13:50 Jim Fulton: It's how most modern databases work now. I know Postgres uses multiversion concurrency control, which is basically the same idea. I think Oracle does as well. But, yeah, and so, you sort of have to come to terms with what we call conflict errors.
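
Editor's sketch: the optimistic pattern discussed here (do the work without locks, detect a conflict at commit, then retry) can be shown with a tiny versioned store. `VersionedStore`, `ConflictError`, and `run_with_retries` are hypothetical names for illustration only. They are not the API of ZODB, PyPy's STM, or Postgres.

```python
class ConflictError(Exception):
    """Raised when someone else committed a newer version first."""


class VersionedStore:
    def __init__(self):
        self._data = {}                      # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, key, read_version, new_value):
        current_version, _ = self.read(key)
        if current_version != read_version:  # data changed since we read it
            raise ConflictError(key)
        self._data[key] = (current_version + 1, new_value)


def run_with_retries(store, key, update, attempts=3):
    """Re-run `update` from a fresh read each time a commit conflicts."""
    for _ in range(attempts):
        version, value = store.read(key)
        try:
            store.commit(key, version, update(value))
            return
        except ConflictError:
            continue                         # no locks held; just try again
    raise ConflictError(key)


store = VersionedStore()
store.commit("counter", 0, 10)
seen = []                                    # values each attempt started from


def add_one(value):
    if not seen:
        # Simulate another client committing between our read and our commit.
        store.commit("counter", store.read("counter")[0], value + 100)
    seen.append(value)
    return value + 1


run_with_retries(store, "counter", add_one)
```

The retry only works cleanly because `add_one` has no side effects outside the store, which is the caveat raised in the next exchange about web services and credit cards.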

14:03 Michael Kennedy: Yeah, yeah, you have to, instead of being blocked, you have to deal with, here's how I resolve it once I'm doing it wrong. I mean, it's fine if it's all within one database, or within your memory, but if I've called two web services and written a file, and then say, no, no, no, roll back. Well, now what, right?

14:20 Jim Fulton: Yup.

14:21 Michael Kennedy: Already charged your credit card. Roll back, what are you talking about?

14:23 Jim Fulton: Yup.

14:24 Michael Kennedy: So it's just a, yeah, it's interesting. So would you call ZODB a no SQL database? I mean, was it no SQL before no SQL was a thing?

14:33 Jim Fulton: Well, one of the stories I like to tell about how I learned Python was, I was at USGS, and we were using this system called Rand RDB, which was based on an earlier system, but basically was based on managing relational data as flat files. And since I stopped using that project, it sort of evolved, and it called itself no SQL because it didn't use SQL. Really, no SQL is a terrible name. To me, in my mind, modern no SQL databases have nothing to do with not having SQL. In fact, some of them do have SQL. You know, really a better characterization of most no SQL databases that I'm familiar with is that they're no transaction. And they're no transaction because transactions, at some point, do limit scalability. Although there's continuing to be work to make databases like Oracle and Postgres scalable even with transactions. But the no SQL databases have much weaker notions of consistency and really are optimized to allow very fast writes. And the sort of problem domain that I think they're really well suited to is collecting massive amounts of data that you collect and analyze but never really have to update and aren't really part of even business processes.

15:50 Michael Kennedy: Some kind of analytics or something.

15:51 Jim Fulton: Right, and so, in that sense, ZODB is not a no SQL database. But it doesn't use SQL. Although with Newt, you can now start to leverage SQL and ZODB.

16:03 Michael Kennedy: Yeah, we'll talk about Newt as well. Yeah, I saw a really great quote by somebody who was trying to talk about no SQL. And said something like, my toaster doesn't use SQL. Is it no SQL? No. They have to have a better definition. And I really feel like my definition and your definition are probably quite similar from what you said. I feel like no SQL databases are the ones that give up some relational features in order to be more scalable, possibly more horizontally scalable, things like that, right? A lot of 'em give up joins. A lot of 'em give up transactions. But not all of them, right? They give up different things here and there for different things they're optimizing.

16:40 Jim Fulton: I'm not aware of many. There was FoundationDB, which is no longer a thing, that had transactions. But some databases will talk about atomicity. But their notion of atomicity is kind of laughable because, well, we update a single record atomically.

16:57 Michael Kennedy: Yes, exactly.

16:58 Jim Fulton: But that's not really, I'm sure there are some no SQL databases out there that are transactional, but if there are, they're probably not scalable the way that some of the other ones are. I mean, to me in my mind, transactions are the big trade-off. And I think it's a trade-off that most people don't really understand.

17:16 Michael Kennedy: Yeah, I think you're probably right if I think of them. The thing they give up first is probably transactions. The thing they give up second is probably joins. I mean, Mongo DB does have the isolated operator, which does let you work on multiple documents. But it's not quite the same as the just global isolation level serializable that you get in a lot of relational databases.

17:36 Jim Fulton: Well, and, in fact, the giving up of joins is really closely related to that because giving up joins means that there are more problems for which only being able to do one operation at a time atomically makes sense.

17:48 Michael Kennedy: Yep, absolutely. Nice. Okay, so if I want to use ZODB, how do I get started? Can I just pip install it?

17:53 Jim Fulton: Yup.

17:54 Michael Kennedy: Okay, what's it written in? Is it written in Python?

17:56 Jim Fulton: Yup, it has some C extensions. It has some C extensions but it also works with PyPy. All of the C extensions have Python versions.

18:03 Michael Kennedy: Right.

18:04 Jim Fulton: So if you run it with PyPy, then it'll use the Python versions. And zodb.org has some pretty decent documentation. I was noticing yesterday some topics that I need to add. But getting started is pretty easy. You can run it with an in memory database if you want, just while you're playing around.

18:21 Michael Kennedy: Okay, yeah, that's really nice. And nice for testing as well, right?

18:24 Jim Fulton: Well, its testing story is especially strong. ZODB has what I call a pluggable storage architecture. So there's a defined API or set of APIs that storages can provide. And then there are a bunch of different storage implementations ranging from an in memory implementation to a file-based implementation to a client-server implementation to an implementation that sits on top of, well, there are a couple of client-server implementations, actually. And then there's an implementation that sits on top of a relational database. And then there are also, we sort of follow a pattern of layering those with adapters. And so, one of the interesting adapters for testing is something called a demo storage. And a demo storage wraps two storages: a base storage and a changes storage.

19:13 Michael Kennedy: Okay.

19:15 Jim Fulton: And so, in testing, what you'll typically do is you'll have for a suite or a setup test, you might set up a base database. And then each test will use a demo storage on top of that. And then whatever changes are made are made in the changes. And then the demo storage is discarded. And then the next test creates a new one.

19:33 Michael Kennedy: Oh, that's a really cool feature because one of the super painful things of testing is, ugh, how do I load up the test data? How much is enough to be representative? Et cetera, et cetera. So you can put a snapshot on top of the data, in a sense, right?

19:46 Jim Fulton: Basically. And you can layer that as many layers deep as you want. In fact, we've had, I've written Selenium tests where basically they were sort of push and pop operations on your database. So you make some changes, and then push another demo storage on top of that. And then for staging, what we've often done was to, one of the layers you can add is something called a before storage. And what it does is it wraps a writeable storage, like our client-server storage, but it says, okay, only show me the data as of this point in time. And then that becomes the base for demo storage. And then you have a file storage as your changes. And now you can stage a large production database, make substantial changes to it, but it's all in this sort of layered snapshot, which you can then discard after staging, and it doesn't affect any of the actual production data.
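
Editor's sketch: the base-plus-changes layering described for demo storage can be modeled with `collections.ChainMap` from the standard library, where writes land in the top layer and reads fall through to the layers beneath. This only illustrates the idea. It is not ZODB's DemoStorage, and `fresh_test_view` and the sample keys are made up for the example.

```python
from collections import ChainMap

# Shared fixture data, built once per test suite and never written to.
base = {"user:1": "alice", "user:2": "bob"}


def fresh_test_view():
    """Give each test a new, empty changes layer on top of the shared base."""
    return ChainMap({}, base)


# One test makes changes; they land only in its own top layer.
view = fresh_test_view()
view["user:1"] = "ALICE"
view["user:3"] = "carol"

# "Push" a further layer on top, change something, then "pop" it by going back to view.
deeper = view.new_child()
deeper["user:2"] = "BOB"

# The next test starts from the untouched base again.
next_view = fresh_test_view()
```

Discarding `view` or `deeper` throws away their changes without touching `base`, which mirrors the per-test and staging workflows described above.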

20:37 Michael Kennedy: Yeah, that's really cool. Okay, so that sounds like the storage system is really robust there. And, of course, that's gonna play into Zero DB when we get into it. But let me ask you two quick questions on ZODB before we move on. When is it a good idea to use ZODB? What's the ideal use case for this?

20:57 Jim Fulton: I think a really good time to use it is when you don't wanna spend a lot of time writing software.

21:03 Michael Kennedy: Yeah, sure.

21:04 Jim Fulton: So, it makes writing software a lot easier in a lot of ways because, again, you don't have that database impedance mismatch.

21:11 Michael Kennedy: Does it store things in basically pickled form or something to that effect?

21:16 Jim Fulton: Right.

21:16 Michael Kennedy: Okay. So you could just say, these are the things I want in the database, put them in the database. And they're in the database, right?

21:22 Jim Fulton: As long as they're pickleable. And we could or could not have a discussion about pickle. Pickle has a bit of a bad reputation that's a little bit fudtastic.

21:33 Michael Kennedy: Yeah, sure.

21:34 Jim Fulton: But, anyway, yeah, it basically uses pickle. So you can store anything that's pickleable.

21:39 Michael Kennedy: All right, nice, and you talked about the good testing story as well. When should we not use ZODB?

21:45 Jim Fulton: You shouldn't use ZODB, and I think this is changing, actually, but...

21:49 Michael Kennedy: And maybe it's being changed by things like Newt DB, right?

21:52 Jim Fulton: Well, it's being changed by, for example, I think Newt can help quite a bit, yeah. Because ZODB can sit on top of, say, Postgres or Oracle, it can scale, more or less, as far as they can scale.

22:07 Michael Kennedy: Right, okay.

22:08 Jim Fulton: Traditionally, ZODB has managed its own, provided its own search facilities on the client side. So when you do that aggressively, you end up with lots of extra objects in the database to support indexing. So there are ZODB-based implementations of B-trees. And then on top of that, various sorts of indexes, like inverted indexes, and regular B-tree indexes, and things that are sort of like Postgres GIN indexes a little bit. But that tends to bloat the database quite a bit and cause lots of extra writes and lots of opportunities for conflicts. So I've, in the past, sort of said, well, don't necessarily use, if your application is very search intensive, then maybe you don't wanna use ZODB. If your application is sort of object intensive and you're primarily working on application objects and traversing application objects, then it's a much better fit. But I think especially with Newt, by pushing the search back into the relational database, it can greatly reduce some of the challenges. And plus you just have a much more powerful search engine.

23:13 Michael Kennedy: Sure, so let's talk about Newt DB a little bit. Then we'll come back to Zero DB. So, Newt is kind of a marriage between ZODB, which we talked about a lot of features there. And one of its shortcomings, I guess you could say, is that it's really hard to query. You talked about how if your app is search intensive, then maybe you don't wanna use it, because it's not really normalized. It's not flat text and integers and stuff in columns, but it's object graphs as binary stuff, right? So doing that is challenging. But something like Postgres is really good at storing that data and querying it. So Newt DB, you called this the amphibious database, which I think is really interesting.

23:55 Jim Fulton: Right.

23:56 Michael Kennedy: What is it?

23:57 Jim Fulton: I'd like to argue with your previous assertion. But let's come back to that.

24:00 Michael Kennedy: Which one is that?

24:02 Jim Fulton: Well, so, in terms of searching, the issue isn't so much that the search capabilities of ZODB catalogs, which are sort of the common pattern for this, are not really that different from a lot of the no SQL database search mechanisms. In fact, a lot of the no SQL mechanisms, even something like SQLAlchemy, to a fault, I think, tries to express searches as data. And so, the catalog is often quite good at that. In fact, if your indexing data fits in memory, the searching in ZODB. I mean, I've seen it actually smoke Postgres.

24:39 Michael Kennedy: Wow, okay.

24:40 Jim Fulton: But for larger databases where it's not all in memory, then Postgres ends up being a win. But it's not so much that it's hard to search, other than that you can't use SQL. But I think most humans can't use SQL anyway.

24:53 Michael Kennedy: Right.

24:54 Jim Fulton: But, anyway, so the ease of search is debatable. But I think it's reasonable to expect that on average Postgres is gonna do a lot better. So the reason I called Newt the amphibious database was that it sort of gives you two views on your data. It gives you a very Python-centric, object-oriented view on your data via ZODB. One of the problems that traditional object-oriented databases have had in terms of what they've been criticized for is that they're kind of closed. They're limited to a single language. And they may even depend very heavily on the classes. I mean, that's the whole point is in ZODB when you're storing objects, they're objects that have specific classes. And traditionally in ZODB, if you wanted to access the data, you had to have the class around. So it's a little bit more of a restricted environment. And so.

25:41 Michael Kennedy: Right, so if I want to call from my JavaScript, it's not gonna be fun.

25:44 Jim Fulton: Right, right. So, what the idea is in Newt, you've got your regular OO Python view of your data. And then you also have a Postgres view of your data. And in Postgres, you can see your data as JSON. You can access it from anything that can access JSON in Postgres. So you could conceivably write reporting applications that reported against it. You can index it, and you can search it using Postgres SQL.

26:09 Michael Kennedy: Okay, interesting, so basically it stores two copies of any given record. And it keeps them in sync? One pickled version and one JSON version. And then you leverage the JSON capabilities of Postgres to work with that thing from other languages. Okay.

26:23 Jim Fulton: There's also sort of lurking around there some interesting patterns about synchronizing your data. So, Newt has sort of two modes you can use it in. It has the sort of default mode where it writes the JSON data as it's committing the transaction. But there's also sort of an asynchronous mode where you can run a separate updater process that watches the database and generates the JSON asynchronously. And one of the things I think that's interesting about that is that you can generalize that. So you could, for example, eventually instead of updating JSON in Postgres, you could update an Elasticsearch database. Or you could even conceivably, asynchronously update a relational representation of the data.

27:06 Michael Kennedy: Right, exactly. Right now, you're just taking it and turning it into JSON because it's such a close fit, but theoretically, you could have a SQLAlchemy type representation as well. Something to that effect, right? So what flexibility does the Postgres add besides just other clients, or other technologies? Is there better searching? Can I work with more data? What's the story?

27:32 Jim Fulton: Postgres has a large community behind it. So there are lots of people working on scaling Postgres in various ways. So, yes, I'm sure you can work with more data than you can with, say, the built-in client-server storage in ZODB. Although there is a project called Neo where they're doing some interesting things in terms of scaling database without Postgres. But also, Postgres has this interesting model. I don't know if Oracle does this. But in Postgres, when you create an index, you can index expressions rather than indexing columns.

28:09 Michael Kennedy: Okay.

28:09 Jim Fulton: And that gives you a lot of power. So, for example, if you're building a text index, you can have instead of saying I want to index this column, you can say, well, here's a Postgres function which could be written in Postgres, Postgres's stored procedure language. Or it could conceivably be written in Python. But here's a function that will extract the text from this data record. And this function that's extracting the text from the data record could actually make queries and get text from related data records. We're actually using that in a project. And then what happens is then you say, okay, now I want to build an index on this function. And what happens is that at index time, it goes through the data and calls that function, gets the result of that function, and builds the index based on that. And so, the function could be doing pretty interesting, possibly expensive things. And none of that has to happen at search time. It can all happen at index time.

29:09 Michael Kennedy: Right, okay, that's really interesting. So basically, your inserts might get a little slower. And your updates might get a little slower. But it could be really worth it if it dramatically improves your query speed. Something that came to mind, I was thinking, is, well, if you have, say, an e-mail address, you could index just the domain part of the e-mail address. I want to find everybody in this company which has this google.com, or whatever in there, their e-mail address, right? Something like that?

29:36 Jim Fulton: Absolutely.

29:37 Michael Kennedy: Okay.

29:37 Jim Fulton: Absolutely. Well, for example, in Postgres, when you have a text column, and it's not free text, it's like a person's name or a city name or something like that, I think most people tend to index those incorrectly. Because they index it just by creating an index on the column. And, A, there's a certain way that you build those indexes so that they are usable in LIKE queries. But, also, if you want to be able to search it case-insensitively, what you really need to do is index by calling lower on it.

30:15 Michael Kennedy: Exactly, yeah. Yeah, I find that lower case, lower case, or case insensitivity, in a lot of databases, can be really challenging if you want to index the thing that has to be case insensitive, right? You've got to maybe even change your scheme a little bit, like store the original and a lower case version and put the index on the lower case version, or something funky like that, right?

30:36 Jim Fulton: But you don't have to do that. See, that's the beauty of this feature of Postgres. And I have to be careful to say this feature of Postgres because I don't know that it's not in other databases. But this pattern of indexing expressions is wildly powerful. And it's one of these things that people should zen up on because once you start thinking about it that way, then lots of doors open up.
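Editor's sketch of the expression-index idea from this exchange. It is a minimal illustration, not the setup Jim describes: SQLite from Python's standard library stands in for Postgres (both support indexes on expressions), and the table, index, and function names are made up for the example. It indexes `lower(name)` for case-insensitive lookups and the domain part of an e-mail address, then asks the planner how it would run the query.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        ("Ada Lovelace", "ada@example.com"),
        ("GRACE HOPPER", "grace@navy.example"),
        ("alan turing", "alan@example.com"),
    ],
)

# Index the lowercased name rather than the raw column.
conn.execute("CREATE INDEX people_lower_name ON people (lower(name))")
# Index only the domain part of the e-mail address.
conn.execute(
    "CREATE INDEX people_email_domain "
    "ON people (substr(email, instr(email, '@') + 1))"
)


def find_by_name(name):
    """Case-insensitive lookup; the WHERE clause repeats the indexed expression."""
    rows = conn.execute(
        "SELECT name FROM people WHERE lower(name) = ?", (name.lower(),)
    )
    return [row[0] for row in rows]


def find_by_domain(domain):
    """Find people by e-mail domain using the same expression as the index."""
    rows = conn.execute(
        "SELECT name FROM people "
        "WHERE substr(email, instr(email, '@') + 1) = ? ORDER BY name",
        (domain,),
    )
    return [row[0] for row in rows]


def query_plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)


print(find_by_name("grace hopper"))
print(find_by_domain("example.com"))
print(query_plan("SELECT name FROM people WHERE lower(name) = ?", ("x",)))
```

The original mixed-case value is stored once, with no extra lowercased column. The cost of computing the expression is paid when rows are written, and the query only benefits if its WHERE clause uses the same expression as the index.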

31:00 Michael Kennedy: Yeah, it sounds really powerful to me. And I can certainly think of some places I would have used it had I had it available. But I don't.

31:09 Jim Fulton: Another interesting example is that in a lot of applications that I work with, the data are hierarchical. Think of a content management system where the content is arranged hierarchically, possibly by organization, and there are often interesting security policies about what you can access based both on who you are and where you are in the tree. And so, the sort of most common case is to ask, can you view this document? And so, you can write a function that says, okay, for any particular piece of content, which principals can view this document? And you can write a function that returns an array of principals, and index the document on that. And then create something called a GIN index, which is basically an inverted index that allows you to say, okay, here's a set of principals, can any of them view this document? Where the set of principals may be the user and the groups that they're in.

32:03 Michael Kennedy: Yeah.

32:03 Jim Fulton: And you basically can say, okay, can this set of principals access this particular record, and that can be an index query, even though in order to make that decision, at some point you have to walk the tree to find all those security assertions.
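Editor's sketch of the access-control index described above, in plain Python rather than Postgres. The content tree, principal names, and function names are all hypothetical. The tree walk happens once, when the inverted index is built, and the question "can any of these principals view this document?" then becomes a set lookup. In Postgres the analogous pieces would be a function returning an array of principals and a GIN index over it, queried with the array overlap operator.

```python
from collections import defaultdict

# Hypothetical content tree: each node names its parent and the principals
# granted "view" at that level. Grants flow down the tree.
TREE = {
    "root":        {"parent": None,    "viewers": {"group:admins"}},
    "sales":       {"parent": "root",  "viewers": {"group:sales"}},
    "sales/q1":    {"parent": "sales", "viewers": set()},
    "hr":          {"parent": "root",  "viewers": {"group:hr"}},
    "hr/salaries": {"parent": "hr",    "viewers": {"user:cfo"}},
}


def allowed_principals(doc_id):
    """Walk up the tree and collect every principal that may view doc_id."""
    allowed = set()
    node_id = doc_id
    while node_id is not None:
        node = TREE[node_id]
        allowed |= node["viewers"]
        node_id = node["parent"]
    return allowed


def build_inverted_index():
    """Index time: map each principal to the documents it can view."""
    index = defaultdict(set)
    for doc_id in TREE:
        for principal in allowed_principals(doc_id):
            index[principal].add(doc_id)
    return index


INDEX = build_inverted_index()


def can_view(principals, doc_id):
    """Search time: no tree walk, only index lookups."""
    return any(doc_id in INDEX.get(p, ()) for p in principals)


def viewable_docs(principals):
    """All documents visible to a user plus the groups they are in."""
    docs = set()
    for p in principals:
        docs |= INDEX.get(p, set())
    return sorted(docs)


print(can_view({"user:alice", "group:sales"}, "sales/q1"))
print(viewable_docs({"user:cfo"}))
```

The trade-off is the one discussed earlier in the conversation: changing a grant high in the tree means re-indexing everything beneath it, in exchange for cheap checks at query time.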

32:19 Michael Kennedy: Yeah, yeah, you could have sort of inherited security stuff that flows down the tree and use your little function to build the index without actually putting it on every single level. Okay, that sounds awesome. All right, so Newt DB is definitely an interesting project. How does its ideal use case vary from, say, ZODB?

32:43 Jim Fulton: Well, it addresses two of the major objections to ZODB. I would say the major objections to ZODB would be it's transactional, which I believe limits scalability at some point, although, again, that limit is getting higher and higher all the time. But the other, and I actually think that's a limitation that most people should ignore. But then I'd say the two biggest objections are searchability and the overhead associated with trying to support that, and the complexity associated with trying to support that, and access from outside. So people with ZODB databases, there's a temptation to feel that their data is imprisoned. Especially if you're not very familiar with the technology. So Newt basically gives you, you know, sort of makes the data accessible without Python, without any special skills. It's just sort of sitting there in JSON. You can search it using a much more powerful search mechanism. Now, you still, you know, there's no free lunch. So, you can search it using clever tricks like indexing functions against the JSON, which you have to learn how to do that. And you have to understand how to use Postgres's explain. So you can see how the query optimizer is analyzing the query.

33:55 Michael Kennedy: Sure, that's a good thing to do, anyway, if you're working with data. Know how to ask, are you using an index? Which index are you using? And so on.

34:03 Jim Fulton: Right.

34:04 Michael Kennedy: But still, interesting. Okay, can you update the JSON and have those updates flow to the ZODB Python side? Or is it read only on the JSON, and read-write on the ZODB side?

34:16 Jim Fulton: The latter.

34:17 Michael Kennedy: Okay.

34:18 Jim Fulton: The JSON's a read-only representation.

34:19 Michael Kennedy: Gotcha. Okay. That seems pretty reasonable. All right, very, very nice. So, let's come back and talk about Zero DB. So the ZODB stuff that you'd been doing kind of led you to work with Zero DB. And they actually were the catalyst for a really cool move for you. But let's start with just what is Zero DB?

34:39 Jim Fulton: Well, so, Zero DB was about trying to have your data be encrypted at rest. So with Zero DB, the goal was that only the database client, the application, would be able to decrypt the data and access the data, because the encryption would happen on the client.

34:57 Michael Kennedy: Right, there's different levels of encrypted at rest. But you're talking about even encrypted in the memory of the database and the database itself can't get it, right? That's a different level than, I've set up a file system where when I save the data finally to disk, that part is encrypted. There's more to it than just that, right?

35:15 Jim Fulton: Well, not much more to it than that. I mean, it was certainly encrypted in the memory of the database server. So the database server itself couldn't see the data. But by the time it reached the application, it was unencrypted in the application's memory.

35:28 Michael Kennedy: Sure. So, they sell this, they position this as a really great database for the cloud because your data might live in the cloud. But even if somebody were to get access to it and walk away with your virtual machine in some unknown way, or even just log into the database server, potentially your data is still safe, right? Okay, that's pretty unique. I don't really know of a lot of other databases that have that.

35:55 Jim Fulton: And the fact that, you know, one of ZODB's, a decision that I made a long time ago with ZODB was that the search, basically, all of the sort of application logic would happen on the client that the server was really dumb. That was partly a reflection of my ability to write a smarter server. But that actually fit Zero DB's use cases really well because by doing everything on the client, only the client needs to have unencrypted data.

36:26 Michael Kennedy: I see, so, basically the client, or the application, even if it's like a web app, it has some kind of private key that it can decrypt its data with. So how does it do queries and things like that?

36:37 Jim Fulton: In ZODB, in doing a query the sort of traditional way, you're accessing B-trees and higher level facilities built on B-trees that are regular database objects just like any other object. So they're encrypted. They're part of your database. Let's say that you want to look up something in a B-tree. What happens is you access the top of the B-tree, and that gets loaded from the server. And then you start walking the nodes of the B-tree to find the value you're looking for. And those nodes get loaded from the server as necessary. And then they're all cached locally.
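Editor's sketch of the lookup Jim describes: the client walks a tree whose nodes are fetched from a server on demand and cached locally, while the server only ever handles opaque bytes. This is a toy model, not ZODB's or Zero DB's code. The class names and node layout are invented, and the XOR step is only a placeholder marking where client-side encryption would sit. It is not real encryption.

```python
import json

KEY = 0x5A  # toy key; XOR is a placeholder, NOT real encryption


def encode(obj):
    """Stand-in for client-side encryption: serialize, then XOR each byte."""
    return bytes(b ^ KEY for b in json.dumps(obj).encode("utf-8"))


def decode(blob):
    """Reverse of encode(); only the client knows how to do this."""
    return json.loads(bytes(b ^ KEY for b in blob).decode("utf-8"))


class DumbServer:
    """Stores opaque blobs by id and counts loads; it never decodes them."""

    def __init__(self):
        self.blobs = {}
        self.loads = 0

    def load(self, oid):
        self.loads += 1
        return self.blobs[oid]


class BTreeClient:
    def __init__(self, server, root_oid):
        self.server = server
        self.root_oid = root_oid
        self.cache = {}  # oid -> decoded node

    def _node(self, oid):
        # Fetch and decode a node only the first time it is needed.
        if oid not in self.cache:
            self.cache[oid] = decode(self.server.load(oid))
        return self.cache[oid]

    def lookup(self, key):
        """Walk from the root to a leaf, loading nodes only as needed."""
        node = self._node(self.root_oid)
        while node["kind"] == "branch":
            # children[i] covers keys below separators[i]; the last child
            # covers everything else.
            for sep, child in zip(node["separators"], node["children"]):
                if key < sep:
                    break
            else:
                child = node["children"][-1]
            node = self._node(child)
        return node["items"].get(key)


server = DumbServer()
server.blobs = {
    "root": encode({"kind": "branch", "separators": ["m"],
                    "children": ["left", "right"]}),
    "left": encode({"kind": "leaf", "items": {"apple": 1, "kiwi": 2}}),
    "right": encode({"kind": "leaf", "items": {"pear": 3, "zucchini": 4}}),
}
client = BTreeClient(server, "root")

print(client.lookup("kiwi"), server.loads)   # root and left leaf fetched
print(client.lookup("apple"), server.loads)  # served entirely from the cache
```

All of the comparison logic runs on the client, which is why a server that cannot read the data is still enough to answer queries.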

37:13 Michael Kennedy: I see, then the execution of the actual where clause, or whatever, happens on the client. And so, you said it was the Zero DB guys that made it possible for you to make this transition to sort of being independent, working on these open-source projects, and so on. Yeah, you want to tell us that story?

37:31 Jim Fulton: Well, I don't know that there's much to tell. They needed some scalability help. And also they didn't have a lot of deep knowledge of ZODB, so I could sort of provide a lot of help in terms of how they're architecting their application. The sort of client-server part of ZODB was written a long, long time ago, and it used asyncore. It really needed to be modernized for performance, for maintainability, and also to facilitate adding SSL support.

38:06 Michael Kennedy: Sure.

38:06 Jim Fulton: And so, they funded that along with a bunch of other work.

38:09 Michael Kennedy: Nice, as I saw they released Zero DB on GitHub not too long ago, so that's pretty cool.

38:14 Jim Fulton: They've really sort of switched gears. In fact, I think they've renamed the company. So I don't think that the project of Zero DB on top of ZODB, I don't think it's actually active at this point.

38:24 Michael Kennedy: Okay.

38:25 Jim Fulton: Their customers were banks and that sort of financial people. And so, having a Python database wasn't really all that interesting to them.

38:34 Michael Kennedy: Sure.

38:35 Jim Fulton: And so, they've changed their focus towards dealing with big data. And I don't really know all the details. But basically it's the same sort of thing. Your data is encrypted at rest, but while you're processing it, then it's encrypted in the processing pipelines.

38:53 Michael Kennedy: Sure, okay, I see maybe they've changed the underlying storage engine. But the general idea is still probably more or less the same. There's three databases that are probably not super familiar to people, four if you count Postgres, but that one's more familiar to folks. I think it's really interesting to look at all these different trade-offs and study the different databases. It gives you a sense for what the value of the trade-offs are, right?

39:19 Jim Fulton: Yep.

39:20 Michael Kennedy: Yeah, cool. All right, so let's switch gears just a little bit towards the process side of things and talk about two projects that you're working on. One, a tool for continuous integration like things. And one that's more about Kanban type stuff. So, first one I want to talk about is Buildout. So, this is an automation tool written in and extended with Python. So is this a continuous integration server, or is this more than that? What is Buildout?

39:49 Jim Fulton: It's something different than that, so.

39:52 Michael Kennedy: Okay.

39:53 Jim Fulton: It's really about, let's say you want to work on a Python project. So you check out the code and now you want to actually run stuff. And so, for a lot of people, what they do is there's a requirements.txt file sitting around. Maybe they created a virtualenv. And then they run pip against the requirements.txt file. Or sadly, what many people will do is they'll just run pip from their machine's system Python and install a bunch of things in there. And then they'll have things in there. And then they'll run whatever scripts are generated. And if the scripts need configuration files, well, maybe they'll write them and they'll check them into version control. And if they need extra processes on top of that, it's sort of outside the realm of pip. And then the question is, well, what do you do to automate all of that? And so, Buildout, when we were working on projects many years ago at Zope Corporation, and this was actually before there was even distutils, we were in a mode for a while of creating applications for customers. And then the customers would run them on their machines. And their environments were totally different. Their environments were typically completely uncontrolled. And usually bad things would happen. And so, we needed to automate that. And in those days, the automation typically involved building Python from source because most people's Python environments are in an unpredictable state.

41:18 Michael Kennedy: Okay. So you would get some well-known version and download it and compile it and say we're gonna start from here?

41:24 Jim Fulton: Well, not just, the biggest problem isn't the well-known version, although that certainly is part of it. But the contents and site packages.

41:30 Michael Kennedy: Right, okay.

41:31 Jim Fulton: Over the years, that evolves. And so, Buildout was very much geared towards installing exactly the packages you need and then generating the artifacts around that. So, for example, I have a project related to the Kanban where the JavaScript client is significant, and I need to assemble all those artifacts. And maybe I'm old-fashioned, but it offends me to check them into version control.

41:54 Michael Kennedy: Yeah.

41:55 Jim Fulton: I have a Buildout configuration that among the things it does is it runs Grunt to, was it Grunt? Or I forget what it runs, maybe grub. It runs some JavaScript tools to assemble all of the JavaScript requirements. And, of course, it uses Buildout's own mechanisms to assemble the Python requirements. It generates configuration files that something like Paste Script would need to use. It generates daemon configs. So, for example, when I run the process, I usually don't run it in the foreground. I mean, I may, but I may want to run it in the background. And so, there's a tool called zdaemon, which is kind of like supervisord, but a little bit more...

42:34 Michael Kennedy: Okay.

42:35 Jim Fulton: ...a bit simpler. And so, that has a configuration file. Or if you were using supervisor, you would want to have a supervisor configuration file. And those files may depend on things that are specific to your environment. They might depend on, you know, files that are outside of the environment that have paths in them. They're all sort of reasons why you may not be able to have static configurations that are just checked in.

42:59 Michael Kennedy: Right, okay. Buildout will look at the system, look at all the requirements and put it together in just the way needed for that location, huh?

43:07 Jim Fulton: That's one way of putting it. Basically, with Buildout, you give it a single configuration file that represents all of the parts of what you're trying to deploy, whether you're trying to deploy it to development or to CI or to staging or to production, and it basically says, okay, I've got all these parts that I need to build, and it just basically builds them. And it also keeps track of what it's built so that it can unbuild them. And if a part's specification changes, it knows to uninstall what it did before, and then reinstall it.
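Editor's sketch of the bookkeeping Jim describes: remember the specification each part was built from, and when the configuration changes, uninstall and reinstall only the parts whose specification differs. This is a simplified model written for this transcript, not Buildout's actual implementation. The function name is invented, and apart from `zc.recipe.egg` the recipe names are hypothetical.

```python
def plan_changes(installed, wanted):
    """Compare recorded part specs with the new configuration.

    installed, wanted: dict mapping part name -> dict of options.
    Returns (to_uninstall, to_install) as sorted lists of part names.
    """
    to_uninstall = sorted(
        name for name, spec in installed.items()
        if wanted.get(name) != spec  # part was removed or its spec changed
    )
    to_install = sorted(
        name for name, spec in wanted.items()
        if installed.get(name) != spec  # part is new or its spec changed
    )
    return to_uninstall, to_install


# Hypothetical record of what the previous run built.
installed = {
    "app": {"recipe": "zc.recipe.egg", "eggs": "myapp"},
    "daemon-conf": {"recipe": "example.template", "port": "8080"},
    "old-cron": {"recipe": "example.cron", "schedule": "daily"},
}

# The edited configuration: the port changed, one part was dropped,
# and a JavaScript assembly part was added.
wanted = {
    "app": {"recipe": "zc.recipe.egg", "eggs": "myapp"},
    "daemon-conf": {"recipe": "example.template", "port": "9090"},
    "js-assets": {"recipe": "example.command", "command": "build-js"},
}

to_uninstall, to_install = plan_changes(installed, wanted)
print("uninstall:", to_uninstall)
print("install:", to_install)
```

The unchanged `app` part is left alone, which is what makes re-running the tool after a small configuration edit cheap.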

43:37 Michael Kennedy: Okay, that's really cool. How much of this is a general software assembly tool and how much of this is for Python projects? Like, could I work on a C++ project only with Buildout?

43:48 Jim Fulton: You could, and there are people using Buildout in non-python environments. But the vast majority is Python.

43:54 Michael Kennedy: Right, 'cause it's written in Python, Python folks are automatically attracted disproportionately to it.

44:00 Jim Fulton: Right, and, of course, it has built in support for assembling Python applications in a particular way that's interesting.

44:06 Michael Kennedy: Right, okay.

44:07 Jim Fulton: There's a project called SlapOS, which seeks to be a lightweight virtualization environment that's built on top of Buildout, and of the things that they deploy in that environment, the vast majority are not Python.

44:21 Michael Kennedy: All right, yeah, that sounds really interesting. Cool. One of the comments you made on the page is that software deployment should be highly automated. And, really, you should be able to run one or two commands and just you're ready to go. And I feel like the more of that we can do, the better. The more frequently released smaller versions because it's not such a challenge for people to get the new version and all sorts of stuff. So I think that's a great philosophy there.

44:48 Jim Fulton: I think the sort of dev ops movement has kind of gotten stalled in too much of an ops rut. So I see way too little automation in a lot of things that I see. At Zope Corporation, we had things to the point where basically we had a representation of our system as a tree that we stored in ZooKeeper. Each service was, you know, anywhere from two or three to 10 lines of very high level specification. And then we had textual models of our entire system for multiple customers and multiple services and multiple applications and how they interconnected. And when we wanted to deploy a change, all we did was modify that tree and check it into Git.

45:34 Michael Kennedy: That's really cool.

45:35 Jim Fulton: A few minutes later, it would be deployed.

45:37 Michael Kennedy: Yeah, that's the way it should be, right? It's definitely the way it should be. Cool, okay, so let's talk about your final project called two-tiered Kanban. So, I suspect most people know what Kanban is. But maybe you just give us the elevator pitch. And then we can talk about the two-tiered version.

45:55 Jim Fulton: Sure. The compelling thing about, well, there are sort of two compelling things about Kanban. And one is sort of philosophical, which is that it's very focused on providing value as quickly as possible. Whereas in contrast to something like Scrum that I think focuses on doing work.

45:55 Michael Kennedy: Sure.

46:11 Jim Fulton: So, the concept of providing value as quickly as possible. We sort of grew this culture at Zope Corporation both as part of trying to be better software developers as well as trying to follow some lean startup kinds of ideas. And part of that was related to the fact that we could develop software and check it into Git, but until it was actually in front of customers, it wasn't really providing any value. And then the other part of it is really sort of old-fashioned common sense of finish what you start.

46:46 Michael Kennedy: Right, right.

46:47 Jim Fulton: Which Kanban has the highfalutin' term of work in progress, limiting work in progress. But that's just a fancy-pants way of saying, finish what you start before you start something else.

46:57 Michael Kennedy: Right, don't put more stuff on the board. Get to the end. And then put something on the board, right? This is kinda like Trello boards, if people haven't seen Kanban boards. You've got the columns, you move the cards from left to right. Like from planned to assigned to in dev, and test, whatever. But you said, or the project says, that typical Kanban boards focus on development. And products, just because they've had development done on them, don't provide value. They provide value when features land in customers' hands. Hopefully through a single button press to deploy them, right?

47:28 Jim Fulton: And, actually, when we started doing this, we were nowhere near a single button press. So being able to track things beyond development was actually pretty valuable. And often even with a single button press, there are things that you have to do. Like, for example, if your schema changes, you may have to migrate the schema. And you might have to do that before the software is deployed. And there are things. But I'd put it a slightly different way. So, a traditional Trello board or a traditional Kanban board, or even a Scrum board, you have all these trees sitting on the board, but you can't really see the forest. Scrum addresses that a little bit through sprints. So perhaps in a sprint, you're all focused on a single goal, which is good. Whereas the problem with Kanban is it's always been just sort of this sea of separate tasks. And it's hard to know how they relate. It's hard to know how they relate to value. This idea of two-tiered Kanban, which, you know, I read about as I was learning about Kanban, but have to this day never really found an implementation of. Although I've heard rumors of implementations. The basic idea of a two-tiered Kanban is that you have a high level Kanban that represents units of value, typically, you know, features.

48:40 Michael Kennedy: Right.

48:41 Jim Fulton: Where a feature may require a number of development tasks. And ideally as soon as possible. But sometimes, for example, there might be a new feature that requires lots of UI components, and then lots more sort of below the water line.

48:54 Michael Kennedy: Right, like the designer work, the database work, the APIs to make it go.

48:59 Jim Fulton: Yup.

49:00 Michael Kennedy: The data, the backup, the management. There can be many things, right?

49:02 Jim Fulton: Right.

49:03 Michael Kennedy: Of course.

49:04 Jim Fulton: So, the idea is that you want to be able to represent the feature as a whole, the value as a whole, and really focus on moving that value to completion and getting the benefit of it. But you also need to be able to manage the things that make that up. And so, you have this high tier, which is the features, and then the low tier, which is, you know, once you've entered development, all the things you need to do to actually implement that feature. And so, typically what you have is you have a board where you have features that move across various columns. And then they hit the development column, and then they explode to the various pieces that make up that feature.
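Editor's sketch of the two-tier structure as a data model. It is an illustration of the idea as described here, not code from Jim's project. The class names and column names are assumptions. Features move across the upper tier, and a feature's tasks get their own lower-tier board only while the feature sits in the development column.

```python
from dataclasses import dataclass, field

# Hypothetical column names for the two tiers.
FEATURE_STATES = ["backlog", "ready", "development", "deploying", "deployed"]
TASK_STATES = ["todo", "doing", "review", "done"]


@dataclass
class Task:
    title: str
    state: str = "todo"


@dataclass
class Feature:
    title: str
    state: str = "backlog"
    tasks: list = field(default_factory=list)

    def move_to(self, state):
        """Upper tier: move the whole feature to another column."""
        if state not in FEATURE_STATES:
            raise ValueError(f"unknown feature state: {state}")
        self.state = state

    def task_board(self):
        """Lower tier: tasks by column, shown only while in development."""
        if self.state != "development":
            return {}
        return {
            s: [t.title for t in self.tasks if t.state == s]
            for s in TASK_STATES
        }

    def ready_to_leave_development(self):
        """A feature can move on once every one of its tasks is done."""
        return bool(self.tasks) and all(t.state == "done" for t in self.tasks)


feature = Feature("Export to CSV", tasks=[Task("UI button"), Task("API endpoint")])
feature.move_to("development")
feature.tasks[0].state = "done"
print(feature.task_board())
print(feature.ready_to_leave_development())
```

Keeping the columns after development on the feature, not on its tasks, reflects the point made earlier that value only counts once a feature is in front of customers.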

49:40 Michael Kennedy: Okay, and each one of the things that moves down the board that is a feature, that's basically its own Kanban board as well, right?

49:48 Jim Fulton: Essentially, yeah.

49:49 Michael Kennedy: Yeah, okay. That's the two-tier part. Yeah, it sounds really valuable to me. I always find these hierarchical things in Scrum or in Kanban really hard to deal with. Like, okay, well, this feature cost this much. But the thing I'm working on actually caused this other thing. And someone else has to work on the data part of it, and they need to estimate that. And, you know, it's challenging to represent those. So this seems like a nice way to organize it.

50:13 Jim Fulton: And it provides a little bit of automation around that. I mean, most Kanban people will sort of pooh-pooh estimation, and I've been around enough people who needed estimates to know that you can't sort of completely punt on that. But I really am a fan of really low rent estimation, and then automation to track the low rent estimates. And basically keeping the process really simple. I've been exposed to some environments, to some Scrum environments where, and I think this is actually the norm, people sort of go through a bunch of motions and there's a lot of ceremony. And a heck of a lot of time gets sucked up in ceremony.

50:53 Michael Kennedy: Right. Yup, I've seen it as well. Okay, cool, so we'll definitely include a link. The link goes to a GitHub project that looks like it is executable code. What do you actually get when you go to that GitHub repo?

51:05 Jim Fulton: Well, right now you get a substantial amount of bit rot.

51:10 Michael Kennedy: Okay.

51:11 Jim Fulton: But I want to get back to it. Some of the bit rot is because initially I punted on authentication and used Persona, the Mozilla Persona project, which actually worked really well, but it relied on Mozilla doing a bunch of work. And they finally got tired of doing that work, and so they no longer run that service. And so, I have to go back and...

51:34 Michael Kennedy: That's a challenge for your authentication and identity management there.

51:38 Jim Fulton: Yeah, so, I need to go back. And I want to add hooks to be able to use things like, I don't really wanna manage, I don't really particularly wanna manage user names and passwords, so I wanna be able to work with Google auth and various others, Facebook auth, et cetera, and let people choose that. Some other bit rot that I've sort of got is that it was written for ZODB Four and ZODB Five changed in ways, there's actually a discussion on the ZODB list right now about, I don't know if you're familiar with RethinkDB.

52:10 Michael Kennedy: Yeah.

52:11 Jim Fulton: So there's this idea of having data pushed to you. And that's actually how ZODB works under the hood. But that's never really been exposed.

52:20 Michael Kennedy: Right, with the transactions and commits. And sort of refreshing the objects people have in memory, right? Yeah.

52:26 Jim Fulton: Right, so when you use a number of the ZODB storages, when a transaction commits, then the IDs of all the objects that were modified are pushed to all of the other clients. And they're invalidated. So, there's already interesting information being pushed to clients. But that's never really been surfaced at the application level. And in ZODB Four, it was really easy with a small monkey patch to get at that. And the Kanban relied on that, but now on ZODB Five, that's no longer possible, so I'm in the process of adding that feature to ZODB Five. Adding it as an official feature.
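Editor's sketch of the push-based invalidation Jim describes: on commit, the ids of modified objects go to every other client, which drops them from its cache. This is a toy in-process model, not ZODB's API. The class names and the `on_change` hook are hypothetical, standing in for the application-level feature he says he is adding.

```python
class Storage:
    """Toy stand-in for a shared storage server."""

    def __init__(self):
        self.data = {}
        self.clients = []

    def commit(self, sender, changes):
        self.data.update(changes)
        # Push only the ids of modified objects to every other client.
        for client in self.clients:
            if client is not sender:
                client.invalidate(set(changes))


class CachingClient:
    def __init__(self, storage, on_change=None):
        self.storage = storage
        self.cache = {}
        self.on_change = on_change  # hypothetical application-level hook
        storage.clients.append(self)

    def load(self, oid):
        # Read through the local cache, fetching from storage on a miss.
        if oid not in self.cache:
            self.cache[oid] = self.storage.data[oid]
        return self.cache[oid]

    def commit(self, changes):
        self.cache.update(changes)
        self.storage.commit(self, changes)

    def invalidate(self, oids):
        # Drop stale copies, then surface the ids to the application.
        for oid in oids:
            self.cache.pop(oid, None)
        if self.on_change is not None:
            self.on_change(sorted(oids))


storage = Storage()
seen = []
writer = CachingClient(storage)
board = CachingClient(storage, on_change=seen.append)

writer.commit({"card-1": "todo"})
print(board.load("card-1"))
writer.commit({"card-1": "doing"})
print(board.load("card-1"), seen)
```

A Kanban board built this way could redraw a card as soon as its id shows up in the hook, without polling.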

53:01 Michael Kennedy: Yeah, that's the way to do it anyway, right? Officially.

53:04 Jim Fulton: Yeah, right. Well, the Kanban has been, the original version that we used at Zope Corporation was actually a client server thing on top of the Asana API. And so, the one that we used there was built on top of Asana, and Asana's API became really, really slow. They too got tired of providing an expensive service for free.

53:26 Michael Kennedy: Yeah, we'll run on this one $10 server over there.

53:28 Jim Fulton: Exactly. So since that, it's been kind of an R and D side project. And I'd like to really push it to completion and maybe even try to offer some sort of, offer it as a service 'cause I wouldn't care so much, especially at my last job, which, you know, the company was a great company, but they really struggle with process. And I think they would have liked to have used the Kanban, but it wasn't quite ready, and that was really frustrating for them and for me. So I'd like to soon take some time to actually, you know, get it much closer to completion.

54:04 Michael Kennedy: Yeah, it sounds like a great software as a service type thing, so. Hopefully you can do that. All right, very cool. Well, it looks like we should probably leave it there. We've covered a lot of ground on this episode. But we're pretty much out of time. So before we move on, let me ask you two final questions. We now have over 100,000 packages on PyPI, so hurray for that. And there's many that I'm sure you've come across that are noteworthy that are not necessarily the most popular, but would be really cool to find out about. So what one would you like to recommend people look into?

54:32 Jim Fulton: Well, it really depends, I mean, obviously, it depends on what you do. But Boto has delighted me over the years. Whenever I've had to touch AWS. I'm a big fan of Boto.

54:43 Michael Kennedy: Yeah, I use Boto as well.

54:44 Jim Fulton: I'm also a huge fan of Mock.

54:46 Michael Kennedy: Right, okay.

54:47 Jim Fulton: I think he did a really nice job of balancing dynamism and functionality. I could go on and on, but those are a couple that sort of come to mind. Then of course ZODB.

54:58 Michael Kennedy: Of course. Yeah, and Newt DB as well, right? Very nice, very nice. Okay, cool. So thanks for that. And, finally, when you write some Python code, what editor do you open up?

55:08 Jim Fulton: Emacs, of course.

55:10 Michael Kennedy: Emacs, all right. Right on. That's definitely a popular one.

55:14 Jim Fulton: I'm giving a webinar next week on PyCharm, and I have to say, I'm actually pretty impressed with PyCharm. As a straight text editor, I still like Emacs a lot. But they really assemble a nice package of things along with that, like, you know, database access, and REST clients, and it's an interesting pile of functionality.

55:37 Michael Kennedy: Yeah, absolutely. And when you give that one, maybe if they have it recorded by the time we release this, we can put the link to your webcast in there. That'd be cool. Okay, awesome. All right, well, that all sounds great. Any final call to action for the listeners? Anything you want them to check out or do?

55:53 Jim Fulton: Learn about transactions. And then check out Newt.

55:56 Michael Kennedy: I'll definitely have Newt DB and all the other ones in the show notes so people should be able to get right to them.

56:00 Jim Fulton: Cool.

56:01 Michael Kennedy: Yeah, Jim, thank you for being on the show. It's been great to learn about all these different projects with you.

56:05 Jim Fulton: Thank you for having me.
