
#204: StaticFrame, like Pandas but safer Transcript

Recorded on Thursday, Feb 7, 2019.

00:00 Remember back in math class when you'd take a test?

00:02 It wasn't enough to just write down the answer.

00:04 What's the limit of that infinite summation?

00:07 Pi over 2.

00:08 Yes, but how did you get to that number?

00:11 Some problems in programming are just like this.

00:13 We want to keep track of the computations done and only add more steps to the results.

00:18 Heck, that's basically the entire premise of functional programming.

00:22 On this episode, you'll meet Christopher Ariza, who created a project called Static Frame.

00:27 Think of it like pandas and NumPy, but it never changes the computations it's already performed.

00:32 It just adds to them.

00:33 This is Talk Python to Me, episode 204, recorded February 7th, 2019.

00:39 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

00:58 This is your host, Michael Kennedy.

01:00 Follow me on Twitter, where I'm @mkennedy.

01:02 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython.

01:09 Chris, welcome to Talk Python.

01:10 Hi, Michael. Glad to be on the show.

01:12 Yeah, it's great to have you on the show.

01:13 You have a really interesting library that you've been working on, and it's an interesting sort of data-safe take on the whole Pandas API,

01:22 which I think is going to be a lot of fun for all the data scientists and other people working with pandas out there.

01:28 Great. I look forward to talking about it.

01:29 Absolutely. But before we do, let's get started on your story.

01:33 How did you get into programming in Python?

01:34 Sure. I started programming in Python in the year 2000.

01:38 I was a graduate student at NYU, and I was doing a lot of work in computer music

01:42 and algorithmic composition specifically.

01:45 And I was looking for a way to extend my capacity, you know, with these very high-level synthesis languages that I was using.

01:55 So I decided to learn programming in more depth.

01:58 And I was a graduate student there, so I took a course in C programming.

02:02 And I had a graduate advisor who was supposed to oversee my work at the time.

02:07 And I had this great idea to build this system in C.

02:10 And so I sat with this advisor, and I was like, I really want to do this thing in C.

02:13 And he said to me, well, if you wanted to build a car, would you start by building every screw?

02:18 And I said, no.

02:20 And so he said, use Python.

02:22 And I was like, what's Python?

02:23 He said, it's this new great language.

02:25 Go try it out.

02:26 And so I walked over to Barnes & Noble in Astor Place in New York City, back when there were bookstores at Barnes & Noble.

02:34 Yeah, they used to have a great computer section, right?

02:37 You could go and browse and see what was interesting.

02:39 That was a way you learned about stuff, and not so much these days, right?

02:42 Yeah, exactly.

02:43 Yeah, so you went over and got this book, yeah.

02:45 I got Learning Python, actually.

02:46 There was a version of Learning Python out in the year 2000, and I picked it up and started learning Python,

02:52 and started building a system I called athenaCL.

02:54 This was a tool for algorithmic composition, closely tied to a synthesis language called Csound,

03:00 and did a bunch of work in that, culminating in my dissertation.

03:03 Wow, that's really cool.

03:04 So your dissertation is music and algorithmic composition, basically?

03:09 That's right.

03:09 Okay, so cool.

03:11 So when you started working on this library, and you were doing the programming around it,

03:14 was this to have the computer generate music, or try to use the computer to understand music?

03:21 What was the goal there?

03:22 I was studying, I was getting a PhD in music composition and theory, and so I was using these synthesis languages

03:29 that took text-based input of event data.

03:32 So Csound is an ancient, well, it's not ancient, but it's a very old synthesis language.

03:37 In computer terms.

03:38 Yeah, well, it actually comes from an ancient lineage.

03:41 It comes from the very first synthesis languages that Max Matthews invented in Bell Labs in the 60s.

03:48 Wow.

03:48 Called Music 1 and Music 2 and up to Music 5.

03:52 Those happened at Bell Labs in the 60s and 70s.

03:54 And Csound is the modern version of that.

03:57 And it takes basically a text input file defining your events and your parameters.

04:04 And it became quite clear that you could do really cool things if you could use code to generate these text input files.

04:10 And of course, it's quite straightforward to do in Python.

04:13 So I wasn't using Python to do synthesis.

04:16 I was using Python to generate control data that I would then feed into Csound.

04:20 Okay.

04:20 How interesting.

04:21 Have you seen some of the programmatic generated music that people are doing with Python lately?

04:28 I'm not sure.

04:29 Is there something specific you're thinking of?

04:30 Yeah.

04:31 There's been a couple of presentations around...

04:33 Gosh, I wish I could remember the library.

04:35 But there's a couple of libraries that you can use to basically live-program songs

04:44 and interactions in the REPL.

04:44 And yeah, it's pretty wild.

04:47 So maybe I'll find a link and throw it into the show notes because I can't remember the name.

04:50 But there's been a few really good conference presentations that are basically live musical performances

04:54 done by programming Python.

04:56 I think I saw one at PyCon last year.

04:59 I believe that one worked on top of the SuperCollider language.

05:02 So SuperCollider is a synthesis language that's much more modern than C-sound

05:07 that has a synthesis server independent from the language.

05:11 And it's possible to use other languages to control the server.

05:14 So I haven't been following it too closely, but I suspect that some people are using Python

05:18 to control the SuperCollider server, which is a great idea.

05:22 Yeah.

05:22 It's definitely interesting to see.

05:24 I'll throw in the video for people because if you haven't seen it, it's pretty creative.

05:27 All right.

05:28 So today you're not doing that much music theory, right?

05:31 You're working in a different discipline.

05:32 Tell us about what you do day to day.

05:34 I worked in academia for a while at a few places and was continuing to do my work in algorithmic composition,

05:40 generative music, that sort of work.

05:41 But, you know, decided to look for something else and found a job at a firm called Research Affiliates.

05:46 We're a finance firm.

05:48 We define and build strategies for investment that we license to many parties around the world.

05:54 That's cool.

05:54 Is this like the so-called algorithmic trading type of stuff?

05:57 Essentially, our strategies are, many of our strategies are what we call passive investment vehicles,

06:03 which means that there is an algorithm, a specific procedure that's used to generate

06:07 the portfolio constituents and the weights.

06:10 Our strategies are fairly slow moving.

06:13 It's not high frequency trading.

06:14 It's not anything like that.

06:15 Yeah, you're not looking for sub-millisecond advantages.

06:18 You're looking for just applying algorithms to long-term investing.

06:23 That's right.

06:24 All right.

06:24 So if like Warren Buffett were a programmer, he might be doing stuff like that.

06:27 Yeah, that's right.

06:28 I guess.

06:29 It's funny.

06:30 All right.

06:30 Cool.

06:31 So that brings us sort of full circle back to this idea of pandas and your variation, your take on a slightly different library

06:41 that is pandas-like.

06:43 So pandas also comes from the whole finance space, right?

06:49 Like it's very popular in data science, but of course, you know, it originated out of finance,

06:55 which is maybe one part of data science, I guess, right?

06:57 Maybe.

06:58 I see data science as, and there's certainly discussions and presentations on this issue,

07:04 I see it as, in some ways, more speculative research into data.

07:09 As opposed to using these tools for the systematic application of an algorithm or a procedure.

07:14 Right.

07:14 So what you're thinking is more that like what you guys do day to day is more programming,

07:19 using these tools that maybe originated out of data science, but you're deploying production systems that are running and doing stuff,

07:27 not so much coming up with graphs and inferring stuff with Jupyter.

07:31 Yeah, exactly.

07:32 Right.

07:32 Okay.

07:33 Yeah.

07:33 But at my firm, we have finance researchers, and that's what they do.

07:37 They comb over data and study research and, you know, try to make observations from the data.

07:42 You know, that is closer to data science.

07:44 But by the time strategies come to us, they are well-defined.

07:48 So we are, you know, implementing the production strategy and don't really have that sort of discovery exploration need.

07:55 Okay.

07:55 So let's maybe start from the beginning.

07:59 You told me about the idea of building these strategies, and you've created this library called StaticFrame,

08:05 which is really interesting.

08:06 We're going to talk a lot about it.

08:07 But you started with Pandas, right?

08:09 You're like, let's use Pandas and other Python libraries to solve this problem

08:13 before you decided I'm going to replace Pandas with my own library, right?

08:16 Yeah, that's right.

08:17 Maybe talk about that journey.

08:18 Sure, sure.

08:18 Well, actually, when I started at Research Affiliates in the year 2012, Pandas was still quite young at that point.

08:27 And my predecessor had created his own library to model data transformations

08:34 and basically storing data in a table and then efficiently being able to add new data to that table by column addition,

08:42 kind of an Excel-like data model, but implemented in Python.

08:46 And his implementation was very straightforward.

08:48 It was simply a dictionary that held rows where the rows themselves were a dictionary.

08:54 So kind of like a JSON representation of a table, if you will, a dictionary for the rows and then a dictionary for each row.

09:02 And there, of course, we weren't using NumPy as the back end, but it was actually reasonably efficient.

09:08 And we still use it in some places.

09:11 Then, in about 2013, I spent some time looking at pandas and started to use it because, of course,

09:18 the underlying performance in large part due to NumPy and being able to use the vector operations of NumPy

09:25 gave a significant advantage over using our own table model.

09:29 Yeah.

09:29 There's such an advantage to using things like NumPy where you hand a little data off to the C layer and that layer can do all the computation.

09:37 There was a really interesting analogy or observation made by Alessandro Molina from the last couple episodes ago.

09:44 And he was talking about how Python is one of these languages that is a little bit counterintuitive.

09:50 Like in most languages, if you want something to go really fast, you make it go really fast by writing and implementing the details yourself in C or some other low-level language.

09:57 And so the more you can kind of control that, the more precise you can be.

10:01 Whereas Python gets faster the more high-level you treat it.

10:05 So if you tried to implement those algorithms in pure Python, they'd be slow.

10:08 But if you just call like a high level NumPy function, boom, it's fast, right?

10:13 And so it's like this sort of inverted understanding of like where the performance is in this language compared to others.

10:20 Yeah.

10:20 And, you know, I have to admit, I had looked at NumPy before.

10:24 I'd been using Python for a long time.

10:26 But the context of using it through pandas as a wrapper to NumPy really started me thinking, oh, when I want to scale a vector, I just multiply by the value.

10:38 And I scale the whole vector.

10:39 And this happens amazingly fast.

10:41 And there's no loops.

10:41 And you begin to take on that mindset that wherever you have a loop, you're doing something wrong.

10:46 You know, you want all your loops to be in the NumPy layer.

10:49 And it takes a bit of conceptual work to get there.
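
To make that concrete, here is a minimal sketch of the mindset Chris describes, scaling a vector with a pure-Python loop versus letting NumPy run the loop in its C layer (the array size is arbitrary):

```python
import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# Loop in Python: one multiplication at a time, slow.
scaled_loop = np.array([v * 1.08 for v in values])

# Vectorized: one expression, the loop happens inside NumPy's C layer.
scaled_vec = values * 1.08

assert np.allclose(scaled_loop, scaled_vec)
```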

10:52 Yeah, that's such an interesting observation.

10:54 I definitely think that that's true.

10:56 And you want to let it do that for you.

10:58 But to think, oh, there's a loop.

11:00 Where are we missing the opportunity to make this work the way pandas and NumPy should?

11:05 Yeah, absolutely.

11:07 I mean, yeah, that's exactly in our team code reviews.

11:11 That's exactly one of the things that we sort of look out for.

11:14 We see, you know, I see some pandas code that somebody wrote.

11:16 And I see a couple loops or a loop in a loop.

11:19 And I'm like, oh, there's got to be a better way.

11:21 Quote Raymond.

11:22 Yeah, absolutely.

11:24 There's got to be a better way.

11:25 That's cool.

11:26 All right.

11:27 This transition over to NumPy and pandas, this was pretty successful, right?

11:30 Like you guys were able to replace that library and do more of your work in these libraries and these packages?

11:36 Yeah, that's right.

11:37 We didn't entirely replace it because the old table model worked reasonably well in a few cases.

11:41 But in implementing some new strategies, some new tools, I started working with pandas.

11:46 And it's funny because although you have a bunch of utilities on pandas, it's hard to figure out what to do with it or how to use it, really, I think.

11:54 But I already had a precedent.

11:55 The precedent I had from our old library was that you start with data in a table.

12:00 You load up initial data, initial observations about companies, for example.

12:05 And you may maybe have 40 columns on a table of 10,000 companies.

12:10 And that's your initial inputs.

12:12 And then you add new data by doing operations, applying functions to those rows or previous columns, and adding new columns.

12:20 And the previous library we used was actually very much aligned to that workflow.

12:25 So moving to pandas was actually quite smooth because pandas very easily supports growing a data frame by adding columns.

12:33 And those column additions can easily be performed by doing operations on columns that are already on the table or doing function application to rows already in the table.
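
A small pandas sketch of that grow-by-adding-columns workflow (the column names here are invented for illustration):

```python
import pandas as pd

# Initial observations loaded into a table.
df = pd.DataFrame({
    "price": [10.0, 25.0, 4.0],
    "shares": [100, 40, 500],
})

# New data is added as columns derived from existing columns...
df["market_cap"] = df["price"] * df["shares"]

# ...or by applying a function to each existing row.
df["bucket"] = df.apply(lambda row: "large" if row["market_cap"] > 900 else "small", axis=1)
```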

12:43 That makes a lot of sense.

12:43 We discussed this a little bit before.

12:46 You talked then about how, as your data flows down the pipeline that is doing all the calculations and eventually comes to a decision to invest in this, not in that, or this much in that area, it's really important to keep a history and keep track of what's happening.

13:05 Right?

13:06 Yeah, exactly.

13:07 That paradigm was established before we ever moved to pandas.

13:10 And it was a large part of the approach and ethos of my company and how we do our work.

13:15 Our strategies are not black boxes.

13:17 We don't use esoteric machine learning to discover results.

13:22 We use very explicit approaches that we want to be transparent.

13:26 And we do human quality control on everything that we release.

13:31 So it's my obligation to expose as much of the internal calculations as possible.

13:38 A lot of the intermediates, values, groupings, labels, everything that is necessary for a human to understand.

13:45 The calculation we try to expose in our final output.

13:49 So the table starts with 20 or 40 or more columns as your initial data.

13:55 And then numerous columns that are intermediate calculations, intermediate results, reducing the opportunity set through screens and some other processes.

14:04 And then finally getting to the actual result, which in our case is weights and constituents.

14:11 Yeah.

14:11 It sounds very inspired by what you almost might do in Excel or Google Sheets.

14:16 Right?

14:17 You have your data and then you create a formula here.

14:20 And then that formula is based on the previous one.

14:23 But you would never go and like replace the original data and change it with a formula in like some weird iterative way.

14:29 It's always just kind of like to the right and down.

14:31 That would be disciplined use of Excel.

14:34 Unfortunately, there's no discipline inherently in Excel.

14:40 So you see all sorts of things.

14:41 Yeah, that's true.

14:43 That's true.

14:43 But, you know, I guess inspired at least by the proper use.

14:47 So that's pretty cool.

14:49 So you did all this in Pandas and that was working really well.

14:51 Why create a new library?

14:53 Like what were the pain points, or what were you thinking, you know, if we redesign this to be pandas-like but not pandas, what could you gain?

15:01 Right.

15:01 The initial inspiration was, you know, recognizing this workflow that we had where we would start with initial data and some columns and then add columns as we go.

15:11 That initial workflow, we found that it worked really well and it was relatively safe as long as you followed this sort of grow-only paradigm where the table only gets bigger.

15:24 Yeah, exactly.

15:25 Exactly.

15:26 Yeah.

15:26 So that's the discipline that we were doing implicitly.

15:30 But we started to speculate, well, you know, it would be really nice if we could actually enforce this grow-only paradigm and in doing so remove a lot of opportunities for error.

15:40 Now, by convention, we would never mid-process go and update in place a value we had already used.

15:48 But we were certainly sensitive to that danger and in particular for teaching our paradigm to new members of our team and the rest.

15:56 We had this strong desire, oh, it would be so great if we could sort of enforce in some way this grow-only paradigm.

16:03 And that led naturally to thinking, well, what if the frame data itself could be immutable?

16:09 More than just enforcing a grow-only paradigm, what if you had immutable frames?

16:13 There's places where we open up a – we use a table as a reference data set and we might bring that in as a data frame.

16:20 So I might bring in FX rates, currency conversion rates as a series indexed by currency code and the currency conversion value.

16:29 And that's a reference value that I'm using in many, many, many places.

16:33 And the opportunity for error if any of those values gets mutated is significant.

16:37 So we kept on coming back to this.

16:40 Like, it would be so great if we could freeze a series or freeze a table, like we have frozenset, for example, and treat it as an immutable collection.

16:51 Yeah.

16:51 And, of course, it completely simplifies the whole debugging and validation, right?

16:57 Because you no longer have to look for these weird references where somebody still has a pointer to the data frame and they call a function that changes the values or, you know, does some other odd thing where they're off by a column index or something.

17:13 And, you know, it seems like debugging that would be really hard.

17:17 And, of course, making financial decisions on it might be really bad.

17:20 Yeah, that's right.

17:21 It reduces what I often say is it reduces the opportunity for error.

17:25 There's many ways that you can things can go wrong and you can get very confusing, unexpected results by mutating your inputs and your values as you go.

17:35 Yeah.

17:36 There's the safety side of things.

17:37 And that makes perfect sense.

17:39 That's probably primary.

17:40 Another thing that immutable data really opens up, I don't know if this matters at all to you guys, but anytime you have immutable data, you start to have incredible opportunities for parallelism, right?

17:53 Like if you're sharing it, you don't have to worry about, oh, I got to lock on this and make sure that's not changed.

17:57 You just riff on it because it's immutable.

18:00 It's not going to change.

18:00 Yeah, that's a really interesting potential that I haven't really explored.

18:04 The one way I have explored it with Static Frame is that our function application iterators expose an interface to multiprocessing or multithreading function application to columns and rows.

18:18 So I've experimented with a little bit, but there's definitely more opportunities to look out for that.

18:23 Yeah, it sounds like for sure.

18:25 And maybe you could even mix in some Cython in there so it releases the GIL for the threaded side of the story.

18:31 And it just seems like there's a lot of cool possibilities to dig into that.

18:34 Is performance at all something you care about?

18:36 Or is it like, it takes two minutes and we run it once a day or once a week, so it's fine?

18:40 Yeah, definitely.

18:41 Performance is a very significant concern.

18:43 As I started working on this in May of 2017, I started very small with this speculation.

18:53 I wasn't sure if I could do this in native Python.

18:56 That was the thing: for years prior, my team and I, who shared these convictions and these goals, had speculated on something like this.

19:06 And I always thought I was going to have to implement it in C.

19:08 It's time to build the screws and the nuts and everything.

19:10 Exactly, exactly.

19:11 Yeah.

19:12 So I was going to have to implement this in C.

19:14 And maybe I'd done some work in C++.

19:16 So, you know, I was like, maybe I can implement this in C++ and build off the STL vectors.

19:21 And then I realized, oh, man, I'm re-implementing NumPy.

19:25 I don't want to do that.

19:26 And it was after a PyCon, I think it was two years ago, yeah, because it was in May of 2017, something at that PyCon triggered for me that, you know, why don't I just try it in native Python?

19:36 And if I hit bottlenecks, I can use Cython, but I should just see what I can do and just use NumPy and Python.

19:43 And I set out on that goal, and I found that performance is very good.

19:48 I mean, for many operations I can do as well as or better than pandas.

19:52 Some operations I'm slower.

19:54 Some operations I'm better or significantly better.

19:56 The aggregate performance is very hard to measure.

19:59 It's very dependent on use cases.

20:00 There's some things that are definitely slower than Pandas.

20:03 But at this point, it's just pure Python, pure NumPy.

20:06 We haven't done anything in Cython or C extensions or Numba or anything like that.

20:11 This portion of Talk Python to me is brought to you by Linode.

20:16 Are you looking for hosting that's fast, simple, and incredibly affordable?

20:20 Well, look past that bookstore and check out Linode at talkpython.fm/Linode.

20:25 That's L-I-N-O-D-E.

20:27 Plans start at just $5 a month for a dedicated server with a gig of RAM.

20:31 They have 10 data centers across the globe.

20:33 So no matter where you are or where your users are, there's a data center for you.

20:37 Whether you want to run a Python web app, host a private Git server, or just a file server,

20:42 you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24-7 friendly

20:48 support, even on holidays, and a seven-day money-back guarantee.

20:52 Need a little help with your infrastructure?

20:53 They even offer professional services to help you with architecture, migrations, and more.

20:58 Do you want a dedicated server for free for the next four months?

21:01 Just visit talkpython.fm/Linode.

21:05 So you're getting great performance out of it already, and then there's all these low-hanging

21:10 fruit opportunities if needed.

21:12 Yeah, that's right.

21:12 Yeah.

21:12 So it's interesting you talk about the performance and could I do it this way?

21:17 And I think people, programmers, are really bad at judging what's going to be fast and what's

21:22 going to be slow.

21:24 You look at some code, you're like, oh, this is definitely the problem.

21:27 Maybe it's slower, but it's sub-millisecond and who cares?

21:31 Or it's actually not even that part.

21:34 It's something totally different.

21:34 Did you do profiling and stuff like that to really try to dial that in, or did it just work

21:40 out?

21:40 Early on, I started benchmarking against pandas for certain operations.

21:43 And so I don't think of my...

21:46 It's actually a huge debt to pandas that they've provided this great framework that does so much

21:51 and really sets the foundation.

21:53 Of course, it's descended from R.

21:54 So pandas inherited a bunch of things from R in terms of the concept of the data frame.

21:59 And I think compared to what I know of the R model, they refined the interface and unified

22:04 it in quite a nice way.

22:05 And in doing so, they've really defined a set of expectations for using libraries like this.

22:10 One example is the dropna method on a series or frame, like the idea that given a series

22:17 or frame, there should be an easy way to remove missing values.

22:20 We have to do this kind of thing all the time.

22:22 So with that model in mind, I could start to implement those things and test them.
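
For example, the expectation pandas established with dropna; StaticFrame offers the same idea, returning a new container rather than mutating in place, though exact signatures are worth checking against its docs:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

s.dropna()   # a new Series with label "b" removed; "a" and "c" keep their labels
```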

22:26 And the performance metric that is relevant to me is my ratio to pandas.

22:30 So that's what I know.

22:31 Like I know for this operation, oh, I'm 0.6.

22:35 I'm faster than pandas.

22:36 Or for this operation, I'm 10 times slower than pandas.

22:38 And I do it at this very granular level for one-to-one comparisons.

22:43 That's a really interesting metric to think about.

22:45 But I guess it makes sense because you're like, I would like to have this other model, this

22:49 other data model, this other programming model that's data frame-like that has the safety

22:53 immutability thing.

22:54 It used to be pandas.

22:55 As long as I don't wreck the performance too much.

22:58 And if there's a benefit, then hooray.

23:01 Like, we're good.

23:02 Yep.

23:02 That's exactly right.

23:03 Yeah, that's a cool way to think about it.

23:05 So let's talk about how static frame deviates from pandas.

23:10 You know, so the overall idea is this immutable data grow to the right sort of story.

23:16 But there's a lot of details here.

23:18 Do you want to maybe talk us through them?

23:19 The biggest insight is, I mean, one of the biggest changes really is the underlying numpy

23:25 arrays are made immutable.

23:26 This was one of the key observations that led me to start, you know, developing this and

23:30 realize I didn't have to write this thing in C or C++ myself in that I found there's a

23:35 flag on the NumPy array.

23:36 Each NumPy array has a flags attribute.

23:38 And on that flags attribute are a number of properties.

23:40 One of them is writeable.

23:42 And it's a Boolean.

23:43 And you can flip it.

23:44 And in doing so, if you try to assign values into the NumPy array, it gives you

23:49 an exception.

23:50 And numpy arrays, of course, are already fixed in size and shape.

23:55 NumPy arrays out of the box are mutable in terms of the values contained within

23:59 that size and shape.

24:00 But when I found this, I was amazed.

24:02 I was like, oh my God, this is what I've been looking for.
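
That flag is a couple of lines of NumPy; a minimal demonstration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
a.flags.writeable = False     # flip the WRITEABLE flag off

a[0] = 99.0                   # raises ValueError: assignment destination is read-only
```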

24:04 So with that insight, I began writing the core piece of the library, which is the internal

24:10 component called TypeBlocks, which manages the heterogeneously typed arrays and exposes a

24:18 unified interface to external clients.

24:22 So that first piece really of making all the internal arrays immutable and what I describe

24:28 as fully managing the array.

24:30 That is, if you create a StaticFrame object with a NumPy array, if that array happens to

24:37 be immutable, I can take it and I can use it and I don't have to make a copy.

24:40 But if StaticFrame is given a mutable array, I make a copy and I make that copy immutable.

24:47 And from there on, we're safe.

24:48 Yeah, that's really cool.

24:49 So obviously, if you're given immutable data, problem solved, right?

24:53 But if you're not, then you want to take ownership of that data, take it inside of your library

24:59 and say, yeah, you gave me this.

25:01 I've read it.

25:01 Now we have a safe version of it.

25:04 It's really cool that you were able to just leverage that built-in feature of numpy because

25:08 that meant that whole layer down there.

25:11 You could just build on what numpy is doing and not have to go, we're starting from scratch

25:16 with nuts and bolts, right?

25:17 Yeah, exactly.

25:18 And I'm still quite curious why it's there because it's not really advertised anywhere

25:24 in the numpy docs.

25:25 I don't see information as to suggestions of using this or whatnot.

25:30 I found little bits of discussion here and there where I've seen other evidence of people

25:33 using it.

25:34 It is certainly documented in the flags as part of the flags for an array, but I'm actually

25:39 eager to find out more information of how it got there and how the numpy developers imagined

25:44 it would be used.

25:45 Yeah, maybe if someone's listening, they know they could come and put a comment on

25:48 the show.

25:48 Yeah, that'd be great.

25:49 On the show page.

25:50 That'd be cool.

25:50 We'll all learn from it.

25:51 Yeah.

25:52 Yeah, great.

25:53 So how much having it based on numpy was it, I guess, able to more or less stay the same

26:01 as before?

26:02 Like, did it make moving from pandas a lot easier?

26:05 Yeah, that's right.

26:06 Because, of course, pandas, at least in its present state, all data is stored in numpy arrays.

26:12 So basic expectations about how that data would work are the same.

26:17 Our goal, though, with static frame was to be closer to numpy.

26:20 And what that means is that every time that we do a calculation, like produce a standard

26:26 deviation or a mean or something else, we use numpy operations.

26:31 I feel like pandas is a bit ambivalent about this, and they have probably reasons probably

26:36 for performance for doing this.

26:37 But sometimes if you call the std method on a series, you're not actually executing NumPy,

26:45 or you're executing numpy in an unexpected way.

26:48 I have a lot of respect for numpy's stability and over the years, over their versions.

26:54 And I trust numpy in terms of their approaches to doing these calculations, their defaults,

26:58 et cetera.

26:59 And I don't want to make those decisions.

27:00 So I'm happy to rely on numpy entirely for those sorts of calculations.

27:05 And then the numpy type system is something also that pandas has sort of struggled against

27:09 or is ambivalent about or actually increasingly seem to want to get away from.

27:14 Rather than try to create my own type system or augment NumPy's types, I took the efficient

27:21 approach for the resources for the project, which is just, OK, we'll just use numpy types.

27:26 Which means one very clear way this shows up is that if you create a series out of

27:32 character codes, like FX three-character currency codes, you will get a series

27:39 of fixed-width Unicode, three Unicode characters per element, which is what NumPy does by default.

27:44 So whereas pandas will convert that into an object type.

27:47 So I just let numpy use its types pretty much as it would naturally do and avoid getting involved

27:54 in that.
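
The currency-code example looks like this; StaticFrame keeps the fixed-width NumPy dtype, while pandas (at least in the versions discussed here) falls back to an object column:

```python
import numpy as np
import pandas as pd

codes = ["USD", "EUR", "JPY"]

np.array(codes).dtype    # dtype('<U3'): fixed-width, three-character Unicode
pd.Series(codes).dtype   # dtype('O'):   a column of Python objects
```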

27:54 Yeah.

27:55 Yeah.

27:55 And of course, you carry over all the validation and testing to make sure that all the calculations

28:00 were done as accurately as possible.

28:02 And that matters quite a bit as well.

28:04 Yeah, that's right.

28:04 So yeah.

28:05 So some other things that look like differences for StaticFrame relative to pandas: one

28:11 is around unique indices.

28:12 Yeah.

28:13 So there was, you know, a couple of things that we as a team would constantly

28:18 be frustrated with in terms of pandas.

28:21 And one really obvious one is this ambivalence about whether an index should be unique or not.

28:26 This is when I think of an index, and maybe most people think of an index, they think of

28:30 it as a mapping similar to a Python dictionary where keys have to be unique.

28:35 And very often, that's how people use indices.

28:37 But in pandas, indices don't have to be unique.

28:40 And we would constantly be surprised when we found that a column in a table was set as the

28:48 index.

28:48 And without us realizing it, those values in that column were not unique.

28:53 And we ended up with a non-unique index.

28:55 And if you try to select a row from a non-unique index using a .loc call, where you expect to

29:02 get a series representing a row, now suddenly you get a data frame representing two rows.

29:07 And that's very confusing and surprising.

29:09 Pandas has an option to enforce uniqueness when you create an index from a column, with an amusingly

29:16 named parameter called verify_integrity.

29:18 And verify_integrity is by default set to False in pandas' set_index operation, which I understand

29:26 the desire to be accommodating, which I think is the motive here.

29:30 But I do not want to be accommodating.

29:33 I want to say that an index is a unique collection.

29:36 And if you try to create an index out of a non-unique collection, you'll get an error.

29:41 Right, you should get zero or one things, not zero, one, or some other number.
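
A quick pandas sketch of the surprise Chris is describing, and the opt-in check:

```python
import pandas as pd

df = pd.DataFrame({"code": ["AAA", "AAA", "BBB"], "value": [1, 2, 3]})

# Duplicates are accepted silently by default...
by_code = df.set_index("code")
by_code.loc["AAA"]      # a two-row DataFrame, not the single Series you expected

# ...unless you ask for the check; StaticFrame makes this the only behavior.
df.set_index("code", verify_integrity=True)   # raises ValueError about duplicate keys
```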

29:46 Yeah, interesting.

29:47 Another one is around dot access.

29:49 So basically, dunder getitem mapping over to, like, pulling items out by index.

29:55 Yeah, and I think this was motivated by the ancestry of R.

29:59 So in the R language, I'm not exactly certain, but I believe that R's data frame exposed

30:06 columns through dot-attribute-like lookup.

30:09 And I suspected that in early versions of pandas, there was a big pull to try to move R people

30:15 over to Python, and having that similar syntax, I assume, was desirable.

30:20 But of course, there's other attributes other than columns on the data frame object.

30:24 And so inevitably, there's some sort of naming collision that's going to come up with getting

30:30 columns from dot attribute.

30:31 So with static frame, we simply say the only way to get a column is by using the get item

30:36 syntax.

30:36 And there's no dot access.

30:38 And that's sort of the general theme of trying to have there be one and only one way to do

30:43 things.

30:43 And in terms of getting a column, we say, okay, you use the get item syntax.
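
The collision problem in a couple of lines of pandas; StaticFrame simply drops the attribute form:

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3], "value": [4, 5, 6]})

df.count      # the DataFrame.count method, not the "count" column
df["count"]   # unambiguous: always the column
```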

30:46 Yeah, that's, I think that makes sense.

30:49 You know, one of the overriding themes of StaticFrame, it sounds like, is sort

30:53 of safety, predictability.

30:55 Right?

30:56 Not like, oh, we asked for the load.

30:58 And that's not the load on the system column.

31:00 That's the load data method or some weird thing that happens, right?

31:04 When you interact with it that way.

31:05 Yeah, exactly.

31:06 And, you know, of course, it's a huge benefit to have pandas.

31:10 And, you know, there's actually a family of data frame like interfaces around these days.

31:15 Not only is there the pandas data frame, but I think some other library, I think of Xarray.

31:20 Xarray has kind of a pandas like thing.

31:23 And there's a few other libraries out there.

31:25 So it's just a huge benefit to be able to look at, you know, the hard work of all of these

31:30 contributors over the years, be in the luxurious position of picking and choosing.

31:34 And so, you know, it's a great debt we have to those other packages and pandas in particular

31:40 to be able to look at those libraries and see, okay, I see why they made all those choices.

31:44 But we can consolidate all that into one thing and remove a bunch of ambiguity, remove a bunch

31:49 of opportunities for error.

31:50 Yeah.

31:50 Does the growth of Python in the data science data exploration space and the popularity of

31:57 pandas make this an easier sell at your company?

31:59 Like, do you feel like you don't have to cheerlead and like make the case for Python so much?

32:04 When you go and talk to people, the management or whatever, say, yeah, we're building it this

32:07 way.

32:08 Yeah.

32:08 Well, in terms of Python in general, the growth and popularity of Python, as I'm sure all of

32:13 your listeners know, has been extraordinary in the last five, 10 years.

32:18 And the role of Python in data science is probably largely due to pandas.

32:25 And I would even go further to say specifically pandas' read_csv, which is just extraordinarily

32:31 fast, blows NumPy, blows everybody else out of the water and is such an awesome thing that

32:37 I think it's the gateway into Python for data science.

32:40 Your question specifically about using Python within our firm, it's been a gradual move.

32:45 My team was the first to use Python, but more and more, nearly every other area of the firm

32:50 that's doing something with software engineering is using Python.

32:53 And of course, everybody starts with pandas because that's what you see.

32:57 Because it can load a CSV.

32:58 Yeah.

32:58 Yeah.

33:00 And the idea with Static Frame was like, well, that's great for data exploration.

33:04 But if you're going to build something that you want to last and you want to reduce opportunities

33:08 for error, take what you know from that library and try it out with this thing and see what

33:13 you can do.

33:13 That's cool.

33:13 You said it was basically becoming increasingly popular.

33:16 What technologies was it displacing?

33:19 Like what else were you using to the extent you can say?

33:21 Sure.

33:21 Yeah.

33:21 I mean, within our firm, we were using SAS.

33:24 We were using R.

33:25 And those were the primary two languages, which are still quite common in finance firms

33:29 and the like.

33:30 And to a certain extent, people still use those.

33:32 But you can see the effort in the Python community, both SciPy, Pandas, NumPy, Matplotlib, many

33:40 others moving in the last five, 10 years to provide all of that functionality that R had or almost

33:45 all of it and many other platforms.

33:47 So it's quite an easy transition.

33:49 Well, not easy, but it's a directed transition from those other languages.

33:54 Yeah, it's definitely not like going from that to C or something crazy.

33:58 Yeah, for sure.

33:59 So another difference has to do with iterating StaticFrame, right?

34:03 And Pandas, when you iterate, you get the values.

34:05 And here, it's more dictionary-like, right?

34:08 Oh, OK.

34:08 Yes.

34:09 There's two elements to sort of the iteration thing.

34:11 The first has to do with the StaticFrame series.

34:15 So the frame and the series, in both pandas and StaticFrame, are dictionary-like containers.

34:21 Both StaticFrame and pandas define a keys method and an items method that work in a way that

34:28 we know well from Python dictionaries.

34:31 With static frame, the difference, though, has to do with the series.

34:34 When you iterate a Pandas series, it iterates over the values, which makes some sense if

34:42 you think of it as a wrapper around a NumPy array.

34:44 But if you call dot items on a Pandas series, you're going to get pairs of the index, the key,

34:51 and the value.

34:52 So again, in this effort to try to be consistent, if you iterate a static frame series, you are

34:58 going to iterate over the keys, just like you would with a Python dictionary.

35:01 So you actually get the index values.

35:03 And if you want to get the values, you have to use the dot values attribute.
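
A side-by-side sketch of that difference; the StaticFrame behavior here is as Chris describes it, and the constructor details are worth checking against the current docs:

```python
import pandas as pd
import static_frame as sf

s_pd = pd.Series([1.0, 2.0], index=["USD", "EUR"])
s_sf = sf.Series([1.0, 2.0], index=["USD", "EUR"])

list(s_pd)          # [1.0, 2.0]       -- pandas iterates the values
list(s_sf)          # ['USD', 'EUR']   -- StaticFrame iterates the keys, like a dict
list(s_sf.values)   # [1.0, 2.0]       -- values are explicit via .values
```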

35:06 Right.

35:07 That's one difference in terms of iteration.

35:08 The other is, while that dictionary-like interface, we try to be really consistent there, the other

35:15 place is recognizing that Pandas has a number of different approaches to iterating over columns

35:22 or rows in a frame and function application on those iterations.

35:27 So Pandas has an apply function, and it has various iteration functions.

35:31 Like iterrows or itertuples.

35:33 And I saw an opportunity to unify all of those.

35:36 So the series and the frame all have different families of iterators.

35:41 And all of those iterators return objects that themselves have function application methods

35:47 on them.

35:47 So the same tool you use for iterating exposes an opportunity to do function application.

35:52 And that descends from the old library that we use, where function application across the

35:57 table was a really common move.

35:59 And so making that sort of a first-class element in the library was really important to us.
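
A hedged sketch of that unified iterate-then-apply interface; the iterator names (iter_series, iter_element) and the meaning of the axis argument follow StaticFrame's documented naming, but treat the exact signatures as assumptions to confirm against the current API:

```python
import static_frame as sf

f = sf.Frame.from_dict({"a": (1, 2, 3), "b": (10, 20, 30)})

# The object returned by an iterator also exposes function application,
# so iteration and apply share one interface.
sums = f.iter_series(axis=0).apply(lambda s: s.sum())

# Element-wise application comes from another member of the same iterator family.
doubled = f.iter_element().apply(lambda e: e * 2)
```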

36:04 Yeah.

36:05 It sounds great.

36:06 Also, you talked about the sorting.

36:08 The default sorting is stable in static frame.

36:12 What's the story there?

36:13 This is very simple because, fortunately, NumPy did all the work here.

36:16 NumPy's sort method provides a number of options for which sorting algorithm to use.

36:22 And again, in the spirit of safety and repeatability and stability, the default sort method for static

36:32 frame is set to NumPy's merge sort, which is indeed stable.

36:36 For pandas' sort, you can switch it to be merge sort.

36:40 But by default, I forget exactly what it is.

36:42 But it is not a stable sort.

36:45 Like quicksort or something like that.

36:46 Yeah.

36:47 I believe the default is quicksort.

36:49 Now, why they chose quicksort, I don't know if there was any reasoning behind it.

36:52 Maybe quicksort is faster in certain cases.

36:54 But merge sort is reasonably fast.

36:56 And if I can make a choice to ensure that the sort is stable to the order entering the

37:02 sort, that seems like a benefit to me.
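
Both libraries expose the underlying NumPy choice; the difference is only the default:

```python
import numpy as np
import pandas as pd

a = np.array([3, 1, 2])

np.sort(a)                      # default kind is quicksort (not stable)
np.sort(a, kind="mergesort")    # stable: equal keys keep their input order

df = pd.DataFrame({"key": [1, 1, 0], "val": ["x", "y", "z"]})
df.sort_values("key", kind="mergesort")   # opting in to a stable sort in pandas
```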

37:04 Yeah.

37:04 It comes back to this predictability, safety, overriding theme, right?

37:09 Exactly.

37:09 Yeah.

37:09 So I guess maybe another area is how it's tied to the NumPy defaults for calculations.

37:19 And things like that.

37:20 Yeah.

37:20 So that gets back to the spirit of, you know, being close to NumPy.

37:24 And I have an example of this where, you know, you take the standard deviation of three values

37:31 without any arguments with a pandas series.

37:35 And you get a different value if you do the same thing with a NumPy array.

37:40 If you use NumPy's STD function, you get a different value.

37:43 And that's very confusing.

37:45 And it has to do with the DDOF, the delta degrees of freedom argument to the standard deviation.

37:52 Now, people that have played with standard deviations are well aware of this parameter.

37:54 But some people may not be.

37:56 And that's quite confusing.

37:58 And I just, I don't see a need for that heterogeneity.

38:01 I'm fine to stick with NumPy.
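
The standard-deviation example concretely: pandas defaults to the sample statistic (ddof=1), NumPy defaults to the population statistic (ddof=0), and StaticFrame sides with NumPy:

```python
import numpy as np
import pandas as pd

data = [1.0, 2.0, 3.0]

np.std(data)                  # 0.816... (ddof=0, NumPy's default)
pd.Series(data).std()         # 1.0      (ddof=1, pandas' default)
pd.Series(data).std(ddof=0)   # 0.816... -- the two agree once ddof matches
```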

38:03 Yeah, that makes a lot of sense.

38:08 This portion of Talk Python to Me is brought to you by Stellares, the AI-powered talent agent for top tech talent.

38:14 Hate your job or feeling just kind of meh about it?

38:17 Stellares will help you find a new job you'll actually be excited to go to.

38:22 Stellares knows that a job is much more than just how it sounds in a job description.

38:26 So they built their AI-powered talent agent to help you find the ideal job.

38:31 Stellares does all the work and screening for you, scouting out the best companies and roles

38:36 and introducing you to opportunities outside your network that you wouldn't have otherwise found.

38:40 Combining deep AI matching with human support, Stellares pares things down to a maximum of five opportunities

38:46 that tightly match your goals, like compensation, work-life balance, working on products you're passionate about, and team chemistry.

38:53 They then facilitate warm intros.

38:56 And there's never any pressure, just opportunities to explore what's out there.

39:00 To get started and find a job that's just right for you, visit talkpython.fm/stellares.

39:06 That's talkpython.fm/S-T-E-L-L-A-R-E-S.

39:11 Or just click the link in your show notes in your podcast player.

39:14 Let's see, another one is discrete functions rather than branching parameters.

39:20 So like trying to, is that like breaking stuff apart so there's functions that are simpler to understand rather than taking a bunch of parameters?

39:26 Yeah.

39:27 We've tried to systematically design an interface that has functions that have orthogonal parameters.

39:34 So I think when all of us write functions, that should be our goal.

39:39 That is, the relevance of one parameter to a function shouldn't depend on another parameter.

39:44 That's quite confusing and can lead to mistakes.

39:47 What you get instead is more functions.

39:49 But the functions are more specific.

39:51 And I believe that leads to more clear code.

39:55 And it also aids in refactoring, actually.

39:58 One example of that that I think is nice is the set index method.

40:02 So on pandas, there's a set index method that if you give it one column as the argument, it will set that one column as an index.

40:12 If you give that argument a list of column names, it will give you a hierarchical index.

40:19 And all you did was change your input.

40:22 And now you have a very different structure coming out of this.

40:25 Right.

40:25 Not even necessarily keyword arguments, but you just change the type that you're passing.

40:29 Right?

40:29 Yes, yes, yes.

40:30 So there's many places in pandas where there is this sensitive dependency to the type of an argument that results in a different output, which is very problematic.

40:40 So in static frame, we have two methods.

40:42 We have set_index, and we have another method called set_index_hierarchy.

40:46 And with set_index_hierarchy, you're expected to give an iterable of columns.

40:51 And you can't give it a single column and vice versa.

40:54 So we split the functionality into two different functions.

40:58 And now it's completely clear to the reader what was intended.

41:01 And if later on you need to do some refactoring and you need to find all of the places where you created a hierarchical index, well, you just search for the function name.

41:11 You don't have to search for the function and then probe the type of that argument to know whether or not you're getting a hierarchical index.
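
Roughly how the two styles compare; the pandas calls are standard, and the two explicit StaticFrame methods are the ones Chris names, though the exact parameters are something to confirm in the docs:

```python
import pandas as pd
import static_frame as sf

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

df.set_index("a")           # plain index
df.set_index(["a", "b"])    # same method, but now a hierarchical (Multi) index

f = sf.Frame.from_dict({"a": (1, 2), "b": (3, 4), "c": (5, 6)})

f.set_index("a")                     # plain index, returned as a new Frame
f.set_index_hierarchy(["a", "b"])    # hierarchical index, spelled out explicitly
```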

41:17 Yeah, that's a tremendous difference.

41:19 And, you know, you go to your fancy IDE, you right-click, you say find usages, it'll say there are six.

41:25 They are here.

41:26 Yeah, exactly.

41:28 Exactly.

41:28 That's way better than they're here, but only sometimes.

41:33 Like that's a little sketchy for sure.

41:35 That's right.

41:36 All right.

41:36 So it sounds like there's a lot of maybe familiarity if you're coming from Pandas, but there's enough difference that this is really something on its own and special.

41:46 And there's good reasons to use it.

41:48 One of the key things from Pandas is the, well, I mean, in Pandas, it took Pandas to figure this out too, is that there's three types of selection.

41:56 When we're selecting data, there is the root get item selection, which in Pandas overwhelmingly is used for column selection, but in some rare cases can be used for row selection.

42:06 That's something we changed, but I'll get back to that.

42:08 So there's the getitem, there's the .loc selection, which can take one argument for row selection, two arguments for row and column selection, and the .iloc selection, which uses integers instead of the labels of the index.

42:21 So that family of those three selectors really gives you everything you need.

42:26 Now, Pandas at various times had other types of selectors.

42:30 There's this .ix method, and there's a few other variants, but they seem to be getting rid of those.

42:35 Recognizing that there's these three types of selection really is one of the fundamental things to bridge the gap for people coming from Pandas to static frame.

42:45 Those are relatively the same.

42:46 One of the key differences we made in line with consistency and having only one way to do things is the root get item selection interface is only a column selector.

42:56 It is never a row selector, which is a shortcut you can do in Pandas, but again, it's undesirable, is not clear for readability, and is difficult for refactoring.
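
In pandas terms, the three selection interfaces look like this; StaticFrame keeps all three but reserves the bare bracket strictly for columns:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

df["x"]            # root getitem: column selection
df.loc["b", "y"]   # loc: label-based row (and optionally column) selection
df.iloc[1, 1]      # iloc: the same selection, by integer position
```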

43:06 Yeah.

43:06 Interesting.

43:08 Okay, cool.

43:09 There are three types of selection, the root getitem, the .loc, and the .iloc, and then we expose them in sub-interfaces, if you will.

43:16 So a relevant question is, if I have an immutable data frame, how do I do assignment?

43:21 Well, you don't.

43:23 But Pandas and also NumPy have these really powerful ways of doing an assignment.

43:29 I can do an assignment with Pandas.

43:31 I can do an assignment in a .loc call, in an .iloc call.

43:36 And I can assign to an entire column.

43:38 I can assign to an entire row.

43:40 I can assign to a mixture of columns and rows by using the same syntax I use for selection.

43:45 That's an awesome feature.

43:46 I wanted to maintain that same expressive interface, but you can't do in-place mutation.

43:54 So how do you do it?

43:55 Well, on StaticFrame, there's a .assign attribute.

43:58 And that .assign attribute exposes a root getitem, a .loc, and an .iloc.

44:05 So under that assign attribute, you can do all of the same type of assignment moves you

44:10 used to do, only you get back a new frame, and you're not mutating the old frame in place.
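
A sketch of that assign interface as Chris describes it; the selection syntax mirrors loc/iloc and the new value is passed as a call, returning a new Frame (confirm the exact form against the StaticFrame docs):

```python
import static_frame as sf

f = sf.Frame.from_dict({"a": (1, 2, 3), "b": (4, 5, 6)})

# "Assign" to column b: the original f is untouched; f2 is a new Frame.
f2 = f.assign["b"](f["b"] * 10)

# The same idea through the loc sub-interface.
f3 = f.assign.loc[1, "a"](99)
```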

44:15 That's a great feature.

44:16 I love it.

44:16 So let's talk about testing for a little bit.

44:19 I saw that you have performance tests, unit tests, things like that, which is

44:25 great.

44:25 One of the things that really stood out to me when I was looking at it was that you actually

44:31 were using hypothesis, which is an interesting library.

44:36 I covered, I had Austin Bingham on the show long ago talking about hypothesis.

44:41 It's probably been three years.

44:42 But you want to just tell us roughly, really high level, what that is and why you decided

44:47 to use it?

44:48 I saw, I think it was at last year's PyCon, a presentation on, maybe it wasn't specifically

44:53 hypothesis, but it was related to that.

44:56 Property-based testing in general.

44:57 Yeah.

44:58 And I just was so impressed.

44:59 I was like, oh man, all that time I spend trying to find corner cases and trying to make my

45:05 unit tests have sufficient coverage can be automated for me by using a tool that you control and you

45:14 shape the random generation of values to meet the expectations of finding these extreme corner

45:20 cases.

45:21 I took that away and was like, wow, I really want to do more of that.

45:23 One of my colleagues here at Research Affiliates who does some work in Haskell set off on trying

45:28 to use this a little bit more in depth.

45:30 And this whole idea of property testing, in fact, comes out of Haskell.

45:34 I forget the name of the library that originated it, but the whole library was published as one

45:42 page in the paper that introduced the concept.

45:44 It's really amusing, but the implementation of the original sort of property-based testing tool

45:49 is just one page of Haskell code.

45:52 But through his example, my colleagues' examples and starting to look at it, I'm like, man, this is

45:57 exactly what I need for static frame because, you know, you're trying to build a general purpose

46:01 library.

46:01 There's no way I'm going to be able to anticipate the things that people are going to want to put

46:05 into a series or a frame.

46:07 There's no idea that I can anticipate all the possible values someone is going to try to put

46:11 in an index.

46:13 So with property-based testing, with using hypothesis, you open the door to just defining

46:19 the properties that you expect to have.

46:22 You know, namely that if you create an index with 20 integers, the resultant index is going

46:29 to have 20 values.

46:31 Well, that's true unless you've duplicated any values.

46:33 It's not true if you duplicated values,

46:37 or if something else went wrong in reading those values.
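
A small property-based test in that spirit, using Hypothesis to generate the inputs; the property and the unique-values caveat are the ones Chris mentions, and the strategy details are just one way to express them:

```python
from hypothesis import given, strategies as st
import static_frame as sf

# Generate lists of unique integers so the index-length property holds.
@given(st.lists(st.integers(), unique=True, max_size=20))
def test_index_preserves_length(values):
    index = sf.Index(values)
    assert len(index) == len(values)
```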

46:41 So I think of hypothesis in the context of static frame as a way of simulating my user.

46:46 It's, you know, the user.

46:48 It's thousands of users who are throwing everything into these containers.

46:51 And hypothesis really nicely gives you a way to model that and really changes the way you

46:57 think about testing.

46:58 Again, my same colleague, you know, was like, you know, I enjoy testing again.

47:02 I enjoy writing tests so much more when using this because it just forces you to think about

47:08 it in a different way.

47:09 And it's very refreshing compared to the task of writing unit tests.

47:12 Yeah, it's cool.

47:13 It's almost like writing a meta test, right?

47:15 Instead of going like, here are the seven cases.

47:17 Here's one where the value's in the middle.

47:18 Here's the edge of the array I'm trying to test.

47:20 Here's one that's out of the bounds.

47:22 You can just go, this is the general type of stuff that goes in.

47:26 These are the general types of things I want to verify.

47:29 Go make that happen and vary a bunch of stuff for me, right?

47:33 Yeah, yeah, that's right.

47:34 Yeah, it's pretty cool.

47:35 So I was really thrilled to see that you had put that in there for some of the testing stuff.

47:39 It's cool.

47:39 People can check it out in the GitHub repo.

47:41 Yeah, I have a lot more to do there.

47:42 But again, it's like you have to go into it.

47:45 What's really startling about it is you really have to be in a different mindset.

47:48 So you have to give yourself the time to get into that mindset.

47:51 There's much more I need to do with that.

47:52 But it's a refreshing and pleasurable place to be in.

47:57 So yeah, I highly recommend it.

47:58 Yeah, I bet.

47:59 It seems super cool.

48:00 It definitely seems like you can't just bring your main way of thinking about testing.

48:04 Like, I'm going to test this one case and see if it works.

48:06 You've got to sort of step back a level.

48:08 Yeah, that's exactly right.

48:09 Yeah, nice.

48:10 Another thing I wanted to ask you about that I didn't before when we talked about Python

48:14 and finance is that we're coming up on 2020.

48:18 The death clock for Python 2 is ticking at pythonclock.org.

48:22 I think it is.

48:23 It's ticking down.

48:25 The time is getting short on it.

48:27 What is it like?

48:28 First of all, does StaticFrame support Python 3?

48:31 Oh, it's built entirely in Python 3.

48:33 We're at 3.5 now.

48:35 No support for 2.

48:37 So that was a huge benefit from my predecessor here

48:40 at Research Affiliates.

48:41 He set out building our code base in Python 3 back in 2012 or even 2011, which some people

48:47 would have said, might have said was, you know, kind of a questionable choice.

48:51 But at that point, we had NumPy and we had Pandas soon after that.

48:55 So given that foundation of Python 3, we've been using Python 3 entirely and have never looked

48:59 back.

48:59 Yeah, that's super.

49:00 And then what do you see that transition looking like in the larger finance space?

49:07 Not necessarily just for your firm, but the other folks you interact with as well.

49:11 In terms of moving to Python 3?

49:13 Yeah.

49:13 Like, do people just have their head in the ground and go, we're just not doing it?

49:16 Are they going like, oh my gosh, here it comes.

49:17 This is going to be like Y2K again.

49:19 Or do they, are they ambivalent?

49:21 What's the finance vibe around that?

49:23 I can't speak broadly.

49:24 I do know that there's a very large bank that employs a very large number of Python developers

49:30 who use a lot of extensive systems built entirely in Python 2.

49:35 And I don't know if they're even on 2.7 or 2.5.

49:38 Yeah, I think the bank that you were talking about, I think I know.

49:41 And I don't even think they're on 2.7.

49:42 Yeah, I think they're stuck on 2.5 is what I heard.

49:44 But it's going to be very hard, I would expect.

49:47 Maybe they've built their frameworks in such a way that they're okay.

49:52 One of the things I've heard about that very large bank is that their Python tools, to some extent,

49:58 are enforcing immutability.

50:00 And for the same motivations that we have, they may have put constraints on the language

50:05 in a way to help reduce risk that they can keep for a little while.

50:09 But certainly, it's going to require a transition at some time, and that's going to be hard.
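
As a generic illustration of what enforcing immutability to reduce risk can look like at the NumPy level (a sketch of the idea only, not a description of that bank's internal tooling), marking arrays read-only makes accidental writes fail loudly; this is the same NumPy mechanism StaticFrame relies on for its own containers.

    import numpy as np

    # Freeze an array so any in-place write raises instead of silently
    # changing shared data.
    prices = np.array([101.5, 102.0, 99.75])
    prices.flags.writeable = False

    try:
        prices[0] = 0.0
    except ValueError as exc:
        print(f"blocked write: {exc}")  # assignment destination is read-only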

50:13 Yeah, I agree.

50:14 I guess two thoughts.

50:15 One, do you feel like maybe that is a failure of engineering leadership, to say,

50:23 we've put ourselves in a corner, you guys, and we have to fix it?

50:26 I know it's not building features or driving the investment engine, but we have to keep moving

50:32 forward, or we get stuck, not just on 2.7, but on 2.5.

50:36 And all these libraries, they can't use anything NumPy is doing, or Pandas, now or in the future,

50:41 right, as they drop support, right?

50:42 You know, Pandas already announced they're dropping Python 2 support.

50:44 Right, I saw that.

50:45 Yeah, it's definitely a challenge.

50:47 And it's technical debt, right?

50:48 As you know, my own team, we're on 3.5, and we're in the process of jumping to 3.7.

50:55 You know, even that, for us, as a relatively small team with a decent but modest-sized code base,

51:01 you know, it takes work and it takes time.

51:03 And just as you say, it doesn't deliver immediate features.

51:05 It doesn't deliver obvious benefits.

51:07 It is technical debt.

51:08 And it's often very difficult to prioritize that work appropriately, and also to communicate

51:14 the value to upper management and others that are considering what your developers are doing.

51:20 But, I mean, the important thing is that it's called debt for a reason.

51:25 You have to pay it, or your survivors will pay it.

51:28 There is no debt forgiveness in technical debt, other than abandonment.

51:34 I mean, you can abandon the code and start over.

51:37 There's no too big to fail, sort of, really.

51:39 Yeah, so it's definitely something to pay attention to.

51:43 I mean, even with Pandas versions, we've struggled to keep up with Pandas updates.

51:48 We're still presently using Pandas 0.17.

51:51 We are transitioning to Pandas 0.23 or 0.24.

51:54 I think we're going to 0.24 now.

51:55 I think 0.25 just came out.

51:56 But even before we were on 0.17, we suffered and spent quite a bit of time accommodating the changes to the API

52:04 and changes downstream of Pandas changes.

52:08 So it's painful, but you just have to do it.

52:11 Yeah, in their defense, right?

52:13 That's a lot of money.

52:14 If you're rewriting the code significantly, that's touching money versus just driving the website or whatever.

52:20 I can understand the hesitation.

52:22 I don't want to mess with that.

52:23 But at some point, maybe it's not in 2020.

52:26 Maybe it's 2025.

52:27 At some point, it's going to be a problem.

52:29 People are going to go, I don't want to work there.

52:31 You mean really?

52:32 That version from that long ago, with that little library support?

52:36 No, thank you, right?

52:37 Like, it's going to be a problem.

52:38 It's going to be like COBOL.

52:39 Yeah, yeah.

52:40 COBOL.

52:40 I made a joke about COBOL the other day with some of my colleagues, and I was quickly corrected that there is –

52:47 apparently there still is quite a bit of COBOL in production.

52:49 Yes.

52:49 So I was like – I thought it was like a dinosaur, but I guess there's still a lot of COBOL in production.

52:55 But you can get away with it for so long.

52:57 But at a certain point, yeah, you're exactly right.

53:00 It's a huge detriment to recruiting.

53:03 We're a small firm located in Newport Beach.

53:06 Not exactly a tech hub, although Irvine's trying a little bit.

53:10 But for as long as we've been recruiting for this team, I've been – less so now, but a few years ago, I would say to people, yeah, and we're working in Python 3.

53:18 And they would say, oh, really?

53:19 You're working in Python 3?

53:20 I'm stuck in 2.7 or 2.5.

53:22 I'm so excited.

53:23 That would be awesome.

53:24 I am so excited.

53:25 So a few years ago, the fact that we were entirely in Python 3 was explicitly a highly desirable feature for prospective candidates to our team.

53:36 A little bit less so now, but it's something that we always say up front.

53:40 Yeah.

53:40 Well, it's definitely a good thing.

53:43 I think it's less so only because other people have started to go down that path, right?

53:50 Yeah, that's right.

53:51 I mean, when I was in – I went to a PyCon, I believe it was in 2013, and I believe it was at Guido's keynote.

53:59 Maybe it was somebody else.

54:01 But the question was asked to the general assembly, you know, when there's all thousands of – however many thousands of people are in that room.

54:07 And they asked a show of hands of how many people are using Python 3 in production.

54:11 And me and my colleague raised our hands and looked around.

54:15 And there's just – I mean, it was far less than 10%.

54:18 Yeah.

54:18 But I think they did that exercise again at a recent PyCon, and it was – looks like it was more than half, you know?

54:23 Oh, yeah.

54:24 Yeah.

54:24 The community is definitely moving, and, you know, it's good to see.

54:27 It's great to see.

54:28 It's great to see.

54:29 All right.

54:30 Well, I think we're getting short on time, so we're going to have to leave it there.

54:34 People should definitely check out Static Frame.

54:36 If Pandas is something that you're doing, maybe this will apply.

54:40 I guess maybe one final question I could ask for you, Chris, is how does somebody know that they have a problem that Static Frame will solve better than Pandas is solving?

54:48 Yeah.

54:49 I mean, often the advice will be like, hey, yeah, use Pandas, right?

54:52 Load CSV, all that kind of stuff.

54:54 But, like, when would you say, actually, you should consider this because it'll solve your problem better?

54:59 I would say there's a couple signs.

55:00 One might be that you keep on making mistakes.

55:04 You make mistakes because you reach for the wrong interface, or you get a surprising result because there's a type sensitivity to an argument, or you accidentally mutate data you didn't intend to, or you get a multi-index when you expected a unique index.

55:21 You know, those are the kinds of things that are the telltale signs that maybe the kind of work you're doing, you know, requires a different package with a different set of constraints.
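
A short sketch of the accidental-mutation sign just described: in pandas a helper can quietly change data its caller still holds, while a StaticFrame Series refuses item assignment outright. The normalize helper is made up for illustration, and the exact exception StaticFrame raises is assumed here to be a TypeError-style refusal.

    import pandas as pd
    import static_frame as sf

    # In pandas, a helper can silently mutate data its caller still holds.
    # (normalize is a made-up helper, purely for illustration.)
    def normalize(s: pd.Series) -> pd.Series:
        s[:] = s / s.max()  # in-place write on the caller's data
        return s

    ps = pd.Series([10.0, 20.0, 40.0])
    normalize(ps)
    print(ps.tolist())  # [0.25, 0.5, 1.0] -- the original Series changed

    # StaticFrame containers are immutable, so the same slip cannot happen.
    ss = sf.Series([10.0, 20.0, 40.0])
    try:
        ss[0] = 0.0
    except TypeError as exc:  # item assignment is refused
        print(f"blocked: {exc}")
    normalized = ss / ss.max()  # you must produce a new Series instead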

55:30 Yeah, that's right.

55:30 That's a great description.

55:31 Thanks.

55:32 All right.

55:32 Now, before you get out of here, I've got the final two questions.

55:34 Sure.

55:35 If you're going to write some Python code or work on Static Frame, what editor do you use?

55:38 I recently moved over to VS Code. Like many people, I had some apprehension about Microsoft products for some time.

55:46 And now there's a Microsoft product that I use every day and really enjoy.

55:50 Prior to that, I used a few different editors, but I've been really happy with VS Code.

55:54 In large part, I don't really ask a lot of my IDE.

55:57 I really want it to get out of my way, and I don't debug in the IDE.

56:02 I don't lint in the IDE.

56:04 I prefer to do those things from the command line.

56:06 I just like my IDE to be something close to a Zen mode that gets everything out of the way, and I'm very aesthetically inclined,

56:14 so I'm very sensitive to my colors and whatnot.

56:16 So with VS Code, I was able to quickly, with a very low transition cost, get it to be visually, aesthetically, sort of ergonomically comfortable for me.

56:27 And in subsequent updates, it hasn't made it worse.

56:30 It's been good.

56:31 So I've been very happy with VS Code.

56:33 That's cool.

56:33 Yeah, they're doing great stuff with that, so I definitely hear that a lot.

56:36 All right.

56:37 And then notable PyPI package.

56:39 I'll go ahead and throw Static Frame out there for you.

56:41 People can pip install that, right?

56:42 Yep.

56:43 Yep.

56:43 It's there.

56:44 Ready to go.
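
For anyone who wants to try it right away, installation and a quick smoke test look roughly like this; the PyPI distribution name is static-frame, imported as static_frame.

    # Install from PyPI, then import under the static_frame name:
    #   pip install static-frame
    import static_frame as sf

    s = sf.Series((1, 2, 3), index=('a', 'b', 'c'))
    print(s['b'])                     # label-based selection, much like pandas
    print(s.values.flags.writeable)   # False: the backing array is immutable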

56:44 All right.

56:44 Other ones that you're like, oh, I heard about this the other day.

56:47 Maybe you don't know about it, but it's really cool.

56:49 It solves this problem uniquely or whatever.

56:51 Any come to mind?

56:51 I should plug the project I worked on before I started at Research Affiliates, which is Music21.

56:56 Music21 is a Python package that I co-created and founded, and did the initial three years of work on at MIT with a former colleague of mine there.

57:07 It's a really fun tool for examining what we call symbolic music.

57:11 So music represented as XML or music represented as MIDI files.

57:15 Music21 allows you to take in these musical representations and play with them as an object model and ask questions about them.

57:24 Like, for example, given all of Mozart's string quartets, how often does he use a modified pitch on the third beat or something like that?

57:34 Uh-huh.

57:34 Awesome.

57:35 So it's a really fun toolkit if you know anything about music and you want to start experimenting with generating or analyzing musical notation.
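
To give a flavor of that kind of object-model query, here is a small sketch using music21's bundled corpus. The piece (a Bach chorale that ships with the library and appears throughout its documentation) and the accidental-counting question are illustrative stand-ins for the Mozart example.

    from music21 import corpus

    # Parse a piece from music21's bundled corpus, then count "modified"
    # pitches, i.e. notes carrying an accidental such as a sharp or flat.
    score = corpus.parse('bach/bwv66.6')

    total = 0
    modified = 0
    for note in score.recurse().getElementsByClass('Note'):
        total += 1
        if note.pitch.accidental is not None:
            modified += 1

    print(f"{modified} of {total} notes carry an accidental")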

57:42 Okay.

57:43 That's a great recommendation.

57:44 That's very cool.

57:44 All right.

57:45 Final call to action.

57:46 People want to get started with static frame.

57:48 What do they do?

57:48 I did the essential thing recently.

57:50 I made a quick start guide.

57:51 So I started to write API documentation, and that was kind of tough.

57:57 And it's not a pleasurable read and not a good introduction.

58:00 So I fairly recently wrote a little quick start guide.

58:02 You can find it on GitHub in the readme.

58:04 You can find it in the documentation. It's a little tutorial using data available from a JSON endpoint that will walk you through some of the key features and main differences from Pandas.

58:14 And hopefully will be enough to get people excited about the package.
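
The quick start's JSON-endpoint workflow looks roughly like the following. The URL is a placeholder rather than the endpoint the guide actually uses, and the payload is assumed to be a JSON array of flat record objects suitable for StaticFrame's from_dict_records constructor.

    import json
    from urllib.request import urlopen

    import static_frame as sf

    # Placeholder URL; the quick start guide points at its own endpoint.
    URL = 'https://example.com/records.json'

    with urlopen(URL) as response:
        records = json.load(response)  # e.g. [{"name": "a", "value": 1}, ...]

    # Build an immutable Frame from the list of dictionaries.
    frame = sf.Frame.from_dict_records(records)
    print(frame.shape)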

58:18 Yeah, very cool.

58:19 And you also gave a presentation at PyCon, which was recorded.

58:21 I'll link to that so people can check that out.

58:23 Final question.

58:24 Are you looking for open source contributors, people to jump on this project, or is it kind of baked?

58:29 What's the status there?

58:30 Oh, absolutely.

58:30 So while this tool is being used internally within my firm, and its use will grow there, we are absolutely looking for contributors and users and testers to give us some feedback.

58:45 I've been fortunate in the development of this in that I've had my team constantly give me feedback and tell me when I'm being too nice, as they like to do, pushing to make our interfaces discrete and precise.

59:01 So I owe a huge debt to my team and to the context of our work here that supports that.

59:05 But we need more users.

59:06 We need more testers.

59:07 We need more feedback.

59:08 So at a basic level, people using the tool and giving us some feedback, they may not be ready to move it into their production systems.

59:15 And I certainly understand that.

59:16 But some good dabbling, starting to play with it would be really helpful for us in getting some feedback.

59:21 And of course, I'm pretty happy with the code itself, so I would encourage people to look at the code.

59:25 If they see opportunities to add things and make things better,

59:28 that would be fantastic as well.

59:29 Yeah, super.

59:29 All right.

59:30 Well, thanks for giving us the whole story and history of Static Frame.

59:34 It looks like a really cool project.

59:35 Great.

59:35 Thank you for your time.

59:36 Happy to be on the show.

59:37 Yep.

59:37 Happy to have you.

59:38 Bye.

59:38 Bye-bye.

59:38 This has been another episode of Talk Python to Me.

59:42 Our guest on this episode was Christopher Ariza, and it's been brought to you by Linode and Stellares.

59:47 Linode is your go-to hosting for whatever you're building with Python.

59:51 Get four months free at talkpython.fm/Linode.

59:55 That's L-I-N-O-D-E.

59:57 Find the right job for you with Stellares, the AI-powered talent agent for the top tech talent.

01:00:03 Visit talkpython.fm/stellares to get started.

01:00:06 That's talkpython.fm/S-T-E-L-L-A-R-E-S.

01:00:11 Stellares.

01:00:12 Want to level up your Python?

01:00:14 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

01:00:19 Or if you're looking for something more advanced, check out our new Async course that digs into all the different types of Async programming you can do in Python.

01:00:27 And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle.

01:00:32 It's like a subscription that never expires.

01:00:34 Be sure to subscribe to the show.

01:00:36 Open your favorite podcatcher and search for Python.

01:00:39 We should be right at the top.

01:00:40 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:00:49 This is your host, Michael Kennedy.

01:00:51 Thanks so much for listening.

01:00:52 I really appreciate it.

01:00:53 Now get out there and write some Python code.
