Learn Python with Talk Python's 270 hours of courses

#60: Scaling Python to 1000's of cores with Ufora Transcript

Recorded on Monday, May 2, 2016.

00:00 You've heard me talk previously about scaling Python and Python performance on the show,

00:03 but on this episode, I'm bringing you a very interesting project pushing the upper bound of

00:07 Python performance for a certain class of applications. You'll meet Braxton McKee from

00:12 Ufora. They are developing an entirely new Python runtime that is focused on horizontally

00:17 scaling Python applications across thousands of CPU cores and even across GPUs. They describe it

00:24 as compiled automatically parallel Python for data science. Let's dig into it on Talk Python to Me,

00:30 episode 60, recorded May 2nd, 2016.

00:33 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the

01:02 ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter,

01:07 where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm

01:12 and follow the show on Twitter via @talkpython. This episode is brought to you by SnapCI and

01:19 OpBeat. Thank them for supporting the show on Twitter via @snap_ci and @opbeat.

01:24 Braxton, welcome to the show.

01:27 Oh, thanks, Michael. I'm excited to be here.

01:28 Yeah, we're going to take Python to a whole other level of scalability here with sort of

01:33 computational Python, which is where Python has actually struggled the most in parallelism. So

01:38 I'm really excited to talk about that.

01:39 Well, it's a complicated topic. It's something I spent a lot of years on and it's actually really

01:46 fun. So we'll have a lot of good stuff to talk about.

01:48 Yeah, I don't know how many people have heard of your project, but I'm really excited to share

01:52 with them because it sounds super promising. But before we get into all of this, let's start with

01:57 your story. How did you get into Python and programming?

01:59 Well, I've been programming for pretty much my entire life. I think I started when I was eight

02:04 years old and I'm like 36 now. So it's kind of the way I think. I studied math in school, but

02:12 writing code has always been the way I think about approaching problems. So like when I was doing math

02:17 assignments, if I wanted to understand what was going on, I'd usually go write some kind of program to

02:22 try to understand what was happening, do the numeric integral by hand or whatever. And then when I

02:27 left college, I went and I started working in the hedge fund industry, which was a great experience

02:33 as a young person. I got to write a lot of code. I had to solve some really interesting problems. And it was really

02:40 there that my sort of desire to work on tools came around because I was just consistently frustrated with

02:47 how much effort it took me to actually implement these solutions. I would talk with my colleagues about

02:53 some computation that we wanted to do and we could describe in principle what it was in five minutes.

02:58 And then they would ask me, all right, how long is this going to take? And I said, all right, come back to me in three months.

03:03 Because, you know, getting things to actually scale up is really challenging.

03:09 I started writing Python code actually after I left the hedge fund industry, which was in 2008. So, you know, Python really got big in, you know, '05, '06, '07. And it got my attention.

03:21 And then when I was doing, I was actually doing some speech recognition work. And I spent a lot of time doing the same thing that, you know, most people who are frustrated with Python's performance end up doing, which is you, you try and figure out the little bit of it that's slow and stick that in C++.

03:37 And then write wrappers for that, and then try and do all the rest of your work in Python.
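
A minimal sketch of that pattern, assuming a hypothetical compiled library libfast.so that exports a C-style dot function; the names here are illustrative only:

```python
# Sketch of the "slow bit in C/C++, wrapper in Python" pattern (hypothetical library).
import ctypes

# Assume libfast.so was compiled from C/C++ and exports:
#   double dot(const double *a, const double *b, size_t n);
_lib = ctypes.CDLL("./libfast.so")
_lib.dot.restype = ctypes.c_double
_lib.dot.argtypes = [ctypes.POINTER(ctypes.c_double),
                     ctypes.POINTER(ctypes.c_double),
                     ctypes.c_size_t]

def dot(a, b):
    """Python-facing wrapper; the heavy loop runs in the compiled code."""
    n = len(a)
    ArrayN = ctypes.c_double * n
    return _lib.dot(ArrayN(*a), ArrayN(*b), n)
```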

03:42 And a lot of the things that I've been interested in recently, and I think a lot of the pretty awesome projects that are out there for speeding Python up, are essentially trying to address that same problem.

03:53 Saying, how can we get more and more of what you're doing to live in Python, because you pick Python for a reason, and be able to just use the tool the way you want and not have to pay a performance penalty?

04:01 Yeah, you know, I've noticed in the industry, there seems to be more and more attention given to the speed of Python.

04:08 I know the NumPy guys have been around for a while, the PyPy guys have been around for a while, really working on this issue.

04:15 But it seems like more and more people are trying different angles of approach.

04:19 So in Python 3.6, they have some really interesting stuff to speed up method invocation in Python, which is one of the slower bits.

04:26 There's the Pyjion stuff coming from Microsoft.

04:30 There's a lot of really interesting work.

04:32 And so yours kind of lands in that space as well, right?

04:36 Yeah, absolutely.

04:37 And I think that what's going on is, you know, Python is a really lovely language to program in, but it's a very hard language for an implementer.

04:43 It's got a lot of weird corner cases, and it's a very hard thing for computers to reason about.

04:49 And, you know, the core of what makes it possible for systems to speed up code is their ability to reason about the code that you've written.

04:58 So if you look at, like, a C++ compiler, it does all of these fancy tricks inside to transform your code into something that's faster.

05:05 And it does that because it can make really strong assumptions about what you're doing that it can apply to the machine language that it's generating.

05:13 And that's really hard to do in Python because Python gives you so much flexibility.

05:18 And so I think that the reason you're seeing so many different approaches is because there's a million different ways to look at the problem and say, okay, we can look at what people have written and start to identify particular things that we want to speed up or make, I think this is the more common case, start to put some constraints around what we're going to tell people is going to be fast.

05:36 And as soon as you put constraints, now you give these optimizers something to work with.

05:41 And I bet you if you looked in the internals of all of these different things, each one of them would have a different way of looking at Python and picking a subset of it saying, this is the stuff that we're going to make fast.

05:51 Like NumPy, if you look at its roots, the whole point of NumPy, why it's fast, is that you can really just stick floats and ints in there.

06:00 It's not trying to be fast for arbitrary Python.

06:02 It's trying to make Python fast for matrix algebra, linear algebra kind of stuff.

06:10 And adding that constraint is what allows it to be fast.

06:14 Yeah, that's interesting.

06:14 I think a lot of those things I mentioned do add constraints.

06:18 PyPy sort of gave up the C module integration in order to get its speed.

06:23 Whereas the Pyjion stuff from Microsoft, they're trying to make sure they're 100% compatible, but that puts different kinds of pressure on them and makes it harder.

06:32 The NumPy stuff, like you said, is focused basically on matrix algebra and matrix multiplication and stuff.

06:38 Yeah, absolutely.

06:39 And I have a huge amount of respect for guys like the PyPy guys for what they're doing, because they're trying to solve a really, really brutal problem.

06:48 Which is, let's make anything written in Python orders of magnitude faster.

06:52 For most of my work, I've focused on a slightly restricted problem, which is to say, let's try and make Python programs that are written in this sort of purely functional style faster.

07:06 Our original goal was not just to get a compiler to be faster, but also to get code to be able to scale, right?

07:14 To be able to write idiomatic simple Python and get something that can use thousands of cores effectively without the programmer having to really think about how that's happening.

07:24 One of the crucial ways that you can do that is you say, okay, I'm going to make it so that if you make a list of integers, you can't change it again.

07:31 That seems like a pretty trivial change, but it's the underpinning for an enormous amount of different optimizations that we can do inside the program.

07:38 And if you look at systems like Hadoop and Spark, they have as a core tenet this idea that the datasets themselves are immutable and you make transformations on those datasets.

07:49 And that immutability is sort of the key to them being able to do things in a distributed context.

07:55 We said we can take that same idea and apply it not just at the outermost level of the operations we're allowing, but we can let that filter all the way through the program.

08:06 And you can take that same constraint and apply it not just to make things distributed, but also to make the compiled machine code faster.

08:14 Because if I make a list of integers, I can now assume that it's going to be a list of integers forever because you can't change it.

08:20 That lets the compiler generate much faster machine code, because it can start to make really strong assumptions about what's going on inside of it.
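
A tiny illustration of the style difference being described; the snippet is a sketch, not pyfora code:

```python
# Mutable style: the list's contents (and element types) can change at any time,
# so a compiler can assume very little about it.
values = []
for i in range(10):
    values.append(i * i)

# Immutable style: the list is built once and never modified, so "list of ints"
# holds for its whole lifetime -- the property the compiler exploits.
values = [i * i for i in range(10)]
squared_plus_one = [v + 1 for v in values]   # a new list, not an edit to the old one
```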

08:28 Okay, that's awesome.

08:29 And you said a few things that I'm sure have piqued people's interest.

08:32 Python, thousands of cores, and machine instructions and compiled code.

08:36 So maybe we can take a step back just for a minute and tell us about your project, Ufora, and what you guys are up to there.

08:41 And then we can dig into the details.

08:43 Yeah, absolutely.

08:44 So Ufora is a platform for executing Python code at massive scale.

08:51 And the basic idea is that you should be able to write simple idiomatic Python that expresses what you want and get the same performance that you get,

09:00 not just if you had rewritten the code in C, but if you had rewritten the code in C and used threads and message passing to generate a parallel implementation.

09:13 You should be able to do this directly from the Python source code without making any modifications to it and having the compiler really see everything going on inside of your code and use that to generate a fast form.

09:24 So we've been working on this for about five years.

09:28 We have a totally separate runtime and compiler framework that we've developed to do this.

09:34 And we actually think of the languages almost like front ends.

09:36 So we built a VM.

09:37 And I mean VM in the sense of like the Java virtual machine or Microsoft CLR.

09:42 It sort of has enough structure for it to be able to reason about programs in this context and make them fast.

09:50 And then what we've basically built is a cross compiler from Python down into that virtual machine.

09:57 And we're planning on doing other languages at some point in the future.

10:01 But there's so much goodness in Python that we really wanted to start there because that was what our roots were.

10:08 The project is 100% open source.

10:11 So you can just go get it on GitHub.

10:13 And the Python front end for this thing has really only been around for about nine months.

10:19 So it's still a work in progress.

10:20 But it's evolving pretty rapidly.

10:22 And, you know, we've been able to get cases where you can take relatively naive looking Python programs, dump this into this thing,

10:29 and get them to work actually pretty efficiently on thousands of cores.

10:34 Our biggest installations have gone up to six or 7,000 cores running on Amazon Web Services.

10:39 And actually being able to use those things efficiently without really having to think about it is pretty fun.

10:44 Yeah, that's really amazing.

10:45 Distributed sort of grid computing type frameworks are not totally new.

10:50 But I think two of the things that are super interesting about what you've told me,

10:55 maybe the first and most interesting thing is that you don't have to write special distributed code, right?

11:03 You just write regular code and the system reasons about how to parallelize it.

11:07 Is that correct?

11:07 Right, absolutely.

11:09 And that comes, as I mentioned, from this idea that we put some constraints on the language.

11:15 So the idea is it's immutable Python.

11:17 So you make a list and you don't modify the list.

11:20 You make a new list by writing a list comprehension that scans over the old one and makes a transformation of it.

11:26 Because of that constraint, the system can now see, okay, like you're doing all of these operations on these data structures.

11:33 And I can see you're doing this function now, but you're going to call this other function next.

11:39 And it can reason about the flow of data inside of that and say, okay, you know, these things are independent of each other.

11:46 I can schedule them separately if I want to.

11:48 That freedom then gives the runtime the ability to say, okay, now it's trying to solve this problem of what's the most efficient way to lay things out.

11:57 But the end result is that if you say, okay, here's a list comprehension or I've got a divide and conquer algorithm where I take something, I scan over it, I do some computation, cut it in half, and then recurse.

12:10 It can see that that's what you're doing and parallelize that.

12:14 And it turns out that if you look at regular people's code, there's all kinds of opportunities to parallelize it that they probably weren't thinking about.

12:21 They would only have thought about it if they really hit a performance bottleneck, but it's in there.

12:26 Like, if you're computing the covariance of two time series, you end up having three different dot products that you need to do, and they can all be done independently of each other.

12:36 And, you know, if you write it out the naive way, the pyfora compiler can actually see that parallelism.

12:42 And if those tasks are big enough, it will split them into three and actually do those on separate threads for you, and you don't have to actually explicitly say that.
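
Written out the naive way, the three independent reductions look roughly like this (a sketch, not code from the project):

```python
def covariance(xs, ys):
    """Population covariance via three independent reductions."""
    n = len(xs)
    # These three sums don't depend on each other, so a runtime that can see
    # this structure is free to evaluate them on separate threads.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return sum_xy / n - (sum_x / n) * (sum_y / n)
```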

12:51 And, you know, in my thinking, there's been a continual movement in computing to raise the level of abstraction at which we work.

12:59 So, like, we used to write in assembler, I guess, 40 years ago, right around when I was being born.

13:05 And then they came up with C and Fortran, and these really made it possible to write more complicated programs because more and more of the details were abstracted away.

13:13 And if you look at Python, Python is like the natural extension of that idea on single-threaded computing as far as it can go.

13:20 You can be so concise and clear in Python because so many low-level details have been abstracted away from you.

13:26 But we're still not there on parallel computing.

13:29 Like, people still spend a huge amount of time writing custom code to do parallel compute.

13:34 I think the reason that technologies like Hadoop and Spark have been so successful is that within certain problem domains, they can make a lot of that complexity go away.

13:43 And I wanted to take it a step further and say, let's not just do it for specific patterns of computing, MapReduce jobs or whatever.

13:51 But, like, let's do this for all of your code, everything that you write.

13:54 There should be enough structure in it for us to see and make it fast.

13:57 That's great.

13:57 Maybe you could just tell folks who are not familiar with Spark and Hadoop how they work and then, you know, like, what you mean by taking a little farther to general code with yours because I'm sure a lot of people know about Hadoop, but not everyone.

14:09 Sure. So, Hadoop began out of Yahoo as the technology they developed to index the web.

14:17 And, you know, the core component of it is you take your data set and you split it across a bunch of machines.

14:24 And you're given two basic primitives.

14:28 You're given map and reduce.

14:30 And the basic idea is that if you think hard enough, you can fit any big parallel compute job into that pattern.

14:37 Mappers basically allow you to take every element in your data set and run a little function on it to get a new element.

14:43 And reduce allows you to take collections of elements and jam them together.

14:48 And for a lot of big data parallel applications, this turns out to be a very natural way to express things.
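
In plain Python terms, the two primitives amount to something like this word-count sketch (schematic, not Hadoop's actual API):

```python
from functools import reduce

records = ["the cat", "the dog", "the cat sat"]

# Map: run a small function over every element to get new elements.
mapped = [pair for rec in records
          for pair in ((word, 1) for word in rec.split())]

# Reduce: jam collections of elements together, here summing counts per word.
def combine(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(combine, mapped, {})
# {'the': 3, 'cat': 2, 'dog': 1, 'sat': 1}
```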

14:55 And the reason these technologies have been successful is because if you can figure out how to fit what you're doing into MapReduce, it's actually a very easy way to program.

15:05 And the infrastructure, the Hadoop ecosystem, can give you all these benefits.

15:09 So, Hadoop is very fault tolerant.

15:12 You can turn the power off on any one of the machines or a rack of machines running Hadoop.

15:17 And because of the programming model, the central scheduler that's making decisions about what to do can react to that and still ensure that your calculation completes because it can schedule the map and reduce jobs that were running on that machine on another machine.

15:32 As an example.

15:33 So, if you work within this framework, you suddenly make a bunch of problems that are sort of endemic to parallel computing go away.

15:42 But the problem is that map and reduce and the other primitives that the community has since added to those systems are still relatively clumsy ways of describing computations.

15:54 There's a bunch of things that fit really naturally into them.

15:56 So, page rank and a lot of the things that people do with web scale data fit very naturally.

16:02 But scientific computing, numerical computing, things where the calculations don't just fall naturally into this very embarrassingly parallel structure don't fit that well.

16:13 And it can take some real thinking to jury rig your application into that model.

16:19 So, when I talk about wanting to be able to do this from the arbitrary code, what I really mean is that you should be able to write code the way you're thinking about solving your problem.

16:27 And the infrastructure should be able to generate that parallel implementation from your code as opposed to you saying, look, here's the place in my code where there's a million little tasks, each one of which can be done in parallel.

16:39 Go do them in parallel, which is what you're essentially doing when you break things down to fit them into the MapReduce model, which I think pretty much what 90% of people doing large-scale computing right now are doing, at least the ones that I'm talking to.

16:52 Yeah, you really have to change the way you approach the problem to fit it into MapReduce, right?

16:57 Yeah, and a lot of people are not comfortable with that.

17:01 Like, my favorite example of that is nested parallelism.

17:05 So, like, imagine you have a data set that has some tree structure to it.

17:11 So, you say, well, I've got a bunch of companies, and then for each company I have a bunch of transactions.

17:16 And now you've kind of got two layers to your hierarchy.

17:20 And if you say, I want to do something in parallel first over the companies, but then within each company in parallel again over all of the transactions, that's like a very unnatural thing to say in MapReduce because which thing are you mapping over?

17:32 Are you mapping over the companies?

17:34 Are you mapping over the records?

17:35 You need to kind of jury-rig it in order to make it work.

17:38 But if I told you to do that in Python, you would just have two loops.

17:41 It would be totally obvious to you how to do it.
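
The two-loop version being described is roughly this; the data layout and score function are hypothetical stand-ins:

```python
def total_scores(companies, score):
    """Nested parallelism written the obvious way.

    companies: a list of dicts like {"name": ..., "transactions": [...]} (hypothetical shape).
    score: any per-transaction function.
    Parallelism is available across the outer loop and, within each company,
    across the inner loop -- with no MapReduce contortions.
    """
    results = []
    for company in companies:                                        # outer layer: one unit per company
        per_txn = [score(txn) for txn in company["transactions"]]    # inner layer: per transaction
        results.append(sum(per_txn))
    return results
```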

17:43 And the only reason you would ever go to the MapReduce land is because you really needed that performance.

17:49 And at the end of the day, this gets back to this original frustration I had, like, I think it was almost 10 years ago when I left the hedge fund industry thinking about this stuff.

17:56 Like, thinking to myself, why is it so hard for the system to see that I have two loops and that there's parallelism in both of them and just make it parallel?

18:06 Why am I having to, like, rejigger everything in order to fit the model that's been given to me?

18:11 And that turns out to be an easy thing to say and a hard thing to implement.

18:14 That's what we've been doing.

18:16 Yeah, that's really amazing.

18:17 So maybe it's worth digging into that a little bit.

18:19 You said in the beginning you thought that you would have to use a different language other than Python, right?

18:25 Yeah, and this has been a really interesting evolution for us.

18:29 So that immutability idea, the idea that, like, if I make a data structure, I can't change it, is at the heart of pretty much any system that's scaled.

18:39 And the reason is because it gets rid of all of these concurrency issues.

18:44 It's like if the data element changes on one machine and you allow that, then if you have a distributed system, you have to make sure that that right operation propagates to everybody else who needs it.

18:54 And that can create real serious problems.

18:57 And if you don't handle it correctly, you end up with programs that are unstable or give you the wrong answer or really slow because there's a huge amount of locking.

19:05 But if you look at Python, Python as a language is a very immutable thing.

19:10 When you make a class, you start with an empty class and you start by pushing methods into it, right?

19:17 And the way people build up lists of operations is they make lists that are empty and they keep appending things to them.

19:22 So when I was first coming to this problem, thinking to myself, okay, I want everything that's good in Python, namely this ability to just pass functions around and not think about typing.

19:32 You know, all of the good things about Python's object model.

19:35 But I don't want that mutability.

19:37 My first thought was, okay, this is going to be completely impossible to do with vanilla Python.

19:43 I also thought to myself, well, actually there's a bunch of features in other languages that are purely functional.

19:49 Languages like OCaml have really nice features that Python doesn't have.

19:53 And we might actually need those things when we take away the mutability.

19:57 Yeah, that's interesting because, you know, Python is super mutable.

20:01 It's not just the data itself is mutable, but the structures that make up the execution of your program, like the classes and functions themselves are mutable, right?

20:11 So this is a really hard problem to solve with Python.

20:13 If you actually look at good Python programs, like they go through two phases.

20:17 They go through one phase where they set everything up and then they go through another phase where you don't change things.

20:22 Like the worst bugs I've ever seen in live Python programs are the ones where people are changing the methods on classes, like as they're running their program.

20:31 And so now, you know, you've got some class and it's got some function on it and the meaning of that function changes as the program is running.

20:39 You know, it's impossible for a human being to reason about those programs.

20:43 So like large scale stuff, people don't do that.

20:45 Actually, the bigger their program gets, the more closely it tends to adhere to the immutable style, because that actually makes it possible for human beings to reason about the code.

20:56 And let me put it this way.

20:58 If a human being can't reason about your program, the chances that a computer can reason about it are pretty low, at least right now, maybe in two or three more years.

21:05 That won't be true.

21:06 But they haven't figured that one out yet.

21:07 That's right.

21:08 That's right.

21:09 Continuous delivery isn't just a buzzword.

21:25 It's a shift in productivity that will help your whole team become more efficient.

21:29 With SnapCI's continuous delivery tool, you can test, debug, and deploy your code quickly and reliably.

21:35 Get your product in the hands of your users faster and deploy from just about anywhere at any time.

21:41 And did you know that ThoughtWorks literally wrote the book on continuous integration and continuous delivery?

21:47 Connect Snap to your GitHub repo and they'll build and run your first pipeline automagically.

21:51 Thank SnapCI for sponsoring this episode by trying them for free at snap.ci/talkpython.

21:58 So yeah, that immutability piece, my initial reaction was that it was going to be impossible and that we needed new language constructs.

22:13 But after we built the language out, I realized, wow, this is basically Python.

22:17 Like there's no, it didn't actually change that much.

22:20 And what we ended up with is basically that our language is now like bitcode.

22:26 Like actually mapping Python down into our language, which is called Fora.

22:31 And saying Fora is now bitcode for all of this other stuff.

22:35 It's like, it's like a language for describing scalable parallel scripting.

22:40 That frees it from needing to be a language that you would program in every day because we have that.

22:46 We have Python.

22:47 And so it was actually a fairly nice transition for us.

22:50 The reason I chose to do this, it came out of a conversation I had with someone who's done a lot of modified Python.

22:57 So my original, if you talked to me five years ago, I would have said, look, no one wants a language that has weird, messed up semantics.

23:04 They want Python to be Python and it's going to give them something different.

23:08 It should be its own very pure thing.

23:10 But I talked to this guy named Mike Dubno.

23:12 He was the CTO of Goldman Sachs for many years.

23:15 He's the godfather of new language stuff on Wall Street.

23:19 The team that worked for him at Goldman in the 80s, they've all gone to all the other financial institutions in New York and all re-implemented really interesting large-scale computation systems.

23:29 Usually they have a language component to them.

23:31 Yeah, there's a lot of interesting language work happening on Wall Street.

23:34 A friend of mine just got hired at a startup to build a new language to express ideas just slightly differently.

23:41 You know, it's so, yeah, that's cool.

23:43 Yeah, no.

23:43 And I bet you if you trace it back, it traces back to Mike Dubno at some level.

23:47 He was the one that convinced me that actually Python with slightly modified semantics is actually something people are totally comfortable with.

23:57 And his evidence was that, you know, he built a system at Bank of America that's got hundreds of millions of lines of Python code.

24:03 And it's a big graph calculation thing.

24:06 Like you make one node that represents the price of oil and another node that's some trade that some trader needs to have repriced.

24:13 And you describe these things in Python.

24:15 And they're updated.

24:16 They can run on different machines from each other, but they kind of look like they're on one machine.

24:20 And there are some rules about how that needs to work.

24:24 It's not vanilla Python.

24:26 And his point was that Python programmers are totally happy having some constraints around their work as long as it's mostly Python.

24:34 And his evidence was there's a hundred million lines of code written against this thing.

24:38 And I thought that was pretty good evidence.

24:39 And that convinced me that...

24:40 That's good evidence, yeah.

24:41 Right.

24:41 That I could make a go of this idea of restricted, you know, of immutable Python, that that was something that people could work in.

24:49 Yeah.

24:50 And you were able to basically take those ideas and then map them to your custom runtime, right?

24:54 Yeah.

24:55 Yeah.

24:55 And it actually didn't even take that long.

24:57 Like, there's been a...

24:58 Most of the work has been around libraries and tooling.

25:01 But we had the core mapping from, like, the Python object model and runtime down into...

25:08 Down into Fora done in just a few weeks.

25:11 Mostly because the object models are really not that different.

25:14 Most of the pain is around all of the little idiosyncrasies of Python.

25:18 Like, if you divide two integers in Python 2, you get an int.

25:22 But if you divide them in Python 3, you get a float.

25:25 Like, these little details, making sure that all that stuff works perfectly tends to be a lot of work.

25:29 But that's not the meat of the intellectual problem, right?

25:33 That's the details of getting a working system that can run existing Python code and get the same answer as what you get in a regular Python interpreter.

25:40 That totally makes sense.

25:41 So there's a bunch of questions I have.

25:43 The first one is, you know, what kind of problems are you guys solving?

25:48 And what kind of problems do you see other people solving with this system?

25:51 Like, where are you focused on applying it?

25:54 Sure. So for the most part, we're interested in data science and machine learning.

25:59 So because of the immutable nature of this version of Python, it's really not a good fit for systems programming tasks.

26:06 Like, I wouldn't write a real-time transaction processing thing in it because it's designed to take really big data sets, do really big computations on them efficiently.

26:15 You could think of it almost like we optimized for throughput of calculations, not for latency.

26:21 Like, anything you stick into this thing is going to take at least a half second.

26:23 But the goal of it is to be able to use 10,000 cores to take something that might take five hours down to five seconds or something like that.

26:31 Right.

26:31 We're obviously good at solving all of the standard data parallel things.

26:35 So, like, what you're saying is I've got a bunch of data on Amazon S3, and I want to pull it into memory and parse it and, you know, slice it the way you would in Pandas.

26:45 All that kind of stuff works pretty well.

26:47 And, you know, the goal of the platform is to actually make it so you can write regular idiomatic Pandas code and just have that scale.

26:55 We haven't done all of the functions, but we're working towards that.

26:58 I think the place where you really see differentiation is when you have a more complicated algorithm where you've got, as I said before, some kind of structure that doesn't fit nicely into MapReduce.

27:09 Like, I'll give you an example of one project that we did.

27:12 One of our customers does Bayesian modeling of retail transactions.

27:16 So they have all of these transactions from different human beings at different vendors, and they build these models to try and predict whether they're going to be, you know, good customers for these vendors in the future.

27:28 And they're trying to ask questions like, well, you know, what does it mean about you for your purchases at Home Depot?

27:35 So the fact that I can tell that you go to Dunkin' Donuts every single day in the morning for coffee.

27:39 So they're interested in relationships across vendors.

27:41 And so you end up with this structure of observations where you can group the data by person or you can group the data by vendor.

27:51 And you have variables that have to do with, like, how much people care about that particular vendor because, you know, the way people interact with Amazon.com is very different than the way they interact with Lowe's.

28:01 And then you also have this information about the individual person.

28:06 And so this ends up wanting to suck about 100 gigabytes of data, keep it in memory, and then pass over that data in different orders depending on whether you're trying to update the variables about the individuals or whether you're updating the variables about the vendors.

28:23 It turns out that it's very easy to describe this in Python.

28:26 You just have two different loops, one in one direction, one in the other.

28:30 And you kind of alternate back and forth between them as you're updating the model.

28:35 This would have been very hard to express as pure MapReduce, but, you know, the whole project is like a couple hundred lines of Python code, which is pretty nice.
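
Schematically, that alternation looks something like the sketch below; the data shapes and update functions are hypothetical stand-ins:

```python
def fit(transactions, person_params, vendor_params,
        update_person, update_vendor, n_sweeps=10):
    """Alternate between two passes over the same observations:
    one grouped by person, one grouped by vendor (all shapes hypothetical)."""
    for _ in range(n_sweeps):
        # Pass 1: update each person's variables from that person's transactions.
        person_params = {
            person: update_person(params,
                                  [t for t in transactions if t["person"] == person])
            for person, params in person_params.items()
        }
        # Pass 2: update each vendor's variables from that vendor's transactions.
        vendor_params = {
            vendor: update_vendor(params,
                                  [t for t in transactions if t["vendor"] == vendor])
            for vendor, params in vendor_params.items()
        }
    return person_params, vendor_params
```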

28:43 And this is all distributed and whatnot, right, across the different cores and machines?

28:48 Exactly.

28:49 The reason it's only a couple hundred lines of code is because all of the painful stuff, which is making it fast, getting the numeric likelihood calculation to be fast, doing the gradient descent minimization, like, all of that is handled by the infrastructure as opposed to being explicitly

29:04 described by the user.

29:05 So the workflow is literally you run a command line thing that boots machines in AWS.

29:12 We're a big fan of using spot instances in Amazon's infrastructure.

29:16 If you're not familiar with that, it allows you to basically bid against the market price for compute.

29:23 If the market price goes above your level, then you don't get the machine, but if it goes below, then you do.

29:28 And it is usually around 10% of the regular cost of buying a machine, which means that I can get something like a thousand cores for something like 10 bucks, which is a pretty insane price for that level of hardware.

29:42 That's great.

29:43 And spot instances are really good for when you want to spin up some stuff, do some work for 30 minutes and throw them away.

29:48 It's less good for like running them indefinitely as a web server, right?

29:52 Yeah, exactly.

29:53 Although there are people who are figuring out that if you spread your spot instance request over enough different zones in Amazon, that the chances that the price spikes in more than two or three of those zones is relatively low.

30:07 So like I know there are people running real time ad bidding networks entirely on spot and the pricing is so cheap that that's feasible, but it takes a lot of careful design to make it work.

30:18 But for the kind of analytics workloads we're talking about, it's pretty perfect because usually as a programmer, what you're hoping to do is, you know, tune your algorithm, get it so that you like it on a small scale, and then fire it off and just have as many machines as is required to get the calculation done, you know, in an amount of time that allows you to like look at the answer and solve whatever business objective you have.

30:41 And so spot is perfect for that because you can literally just say, all right, how many machines do I need to get this done in a half an hour?

30:47 You know, you look at the workload, you look at the throughput, you divide, and it will just happen.
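
The "you divide" step is just this kind of arithmetic; the numbers and core counts below are illustrative only:

```python
import math

def machines_needed(total_core_seconds, cores_per_machine, deadline_seconds):
    """How many machines to finish a job of a given size by a deadline."""
    core_seconds_per_machine = cores_per_machine * deadline_seconds
    return math.ceil(total_core_seconds / core_seconds_per_machine)

# e.g. 100,000 core-seconds of work, 36-core machines, a 30-minute deadline:
print(machines_needed(100_000, 36, 30 * 60))   # -> 2
```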

30:52 And then you're not committed to holding onto that hardware for any longer than you need it.

30:55 And at least in our case, because the backend infrastructure is fault tolerant, if the market price happens to change and you lose the machines, you can just go and raise your bid price or wait for the price to drop again and continue where you left off.

31:11 You don't really lose anything from having the volatility of the pricing shutting your machines down.

31:16 Yeah, that's really interesting from a distributed computing perspective.

31:19 It sounds to me with the restriction on sort of read-only type of operations that not every package out there is going to be suitable.

31:28 So can I go to your system and say pip install something and start working that way?

31:34 And are there packages that will run or does it have to kind of all be from scratch?

31:38 It's certainly the case that if you do it from scratch, you can control everything and that you have the highest likelihood that it works.

31:45 We adopted this hybrid approach where we said we're going to have two kinds of code.

31:50 There's going to be code that we understand how to translate, and then there's going to be code where we know there's no hope of translating it, mostly because it's written in C.

31:59 Like if you look at the core of NumPy and Pandas, they're all written in C.

32:02 There's nothing to even analyze there.

32:04 So you don't need to do anything to make them fast, but you would need to do something in order to make them parallel.

32:09 So what we have is a library translation approach.

32:13 So we've rewritten the core of NumPy and Pandas in pure Python.

32:19 And then whenever we see that you're using the libraries, we replace your calls to the C versions of those libraries with calls to the pure Python versions.

32:29 And that allows our infrastructure to see the library and make it scale out.

32:35 The downside is that we obviously have to translate any libraries that are written in C that way back into Python.

32:42 It's not as much work as you'd think because part of the reason those libraries are so complicated is, in fact, because they're written in C.

32:48 Like NumPy has to have separate implementations of every routine for both floats and doubles and ints and everything else.

32:55 But if you just write it as Python, it's obviously a lot simpler.

32:58 And the compiler is responsible for generating all of those specializations.
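
To make the contrast concrete, here is what a generic pure-Python routine looks like when specialization is left to the compiler; this is a sketch, not the project's actual translated library code:

```python
def dot(xs, ys):
    """One generic definition; specializing it for floats, ints, and so on
    is the compiler's job rather than the library author's."""
    if len(xs) != len(ys):
        raise ValueError("length mismatch")
    total = 0
    for x, y in zip(xs, ys):
        total = total + x * y
    return total
```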

33:03 As a user, this means that, you know, you're better off if your algorithm either uses really vanilla stuff that we've already translated, or if you're willing to write the algorithm in regular Python yourself.

33:17 So most of the work where we've had really good success scaling this up is cases where people wrote their own algorithm anyways, because it didn't work with something off the shelf.

33:27 But our vision over time is that we could get most of the major algorithms, like the scikit-learn stuff, pandas, NumPy, all of the core stuff that people use, translated back into pure Python in a way that works with the compiler.

33:39 And it's pretty easy to do these translations.

33:43 So, you know, one of the things I'd say to the community is if you're using the system and you run into a function that's not there, you know, go and do the translation.

33:51 In most cases, it's pretty easy, and you can contribute it back, and it's a nice way of extending the system.

33:58 We're working on that kind of thing all the time, and we just tend to do the functions as we run into them, as we need them for our other work.

34:05 This episode is brought to you by OpBeat.

34:20 OpBeat is application monitoring for developers.

34:23 It's performance monitoring, error logging, release tracking, and workflow in one simple product.

34:27 OpBeat is integrated with your code base and makes monitoring and debugging of your production apps much faster and your code better.

34:35 OpBeat is free for an unlimited number of users.

34:37 And starting today, December 1st, OpBeat is announcing that their Flask support is graduating from beta to a full commercial product.

34:45 Visit opbeat.com/flask to get started today.

34:56 We also have plans that should be released later this summer of letting you run stuff out of process in regular Python so that if you don't want to translate it and you're willing to say, look, here's some crazy model that somebody wrote.

35:11 It's in C. I'm never going to translate it.

35:13 I don't want to run this model in parallel.

35:15 I want to run thousands of copies of it.

35:17 The distinction is I don't need this model itself to scale.

35:20 I just need to have lots and lots of different versions of it running on different machines.

35:24 In that case, we'll be able to let you run that out of process basically using the same kind of thing that you would do if you were solving this problem by hand, which is multiprocessing.

35:36 And this is basically the same approach as what PySpark has taken, which is to say you chunk the problem up into smaller chunks yourself.

35:44 We'll ship each slice of the problem to a different Python interpreter on a different machine, and then it'll just run in vanilla Python the way it runs.
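
On a single machine, the by-hand version of that idea is ordinary multiprocessing; the sketch below uses a placeholder model function and only illustrates the chunk-and-ship pattern, not the distributed version described here:

```python
from multiprocessing import Pool

def run_model(chunk):
    """Stand-in for the opaque, C-backed model: each copy runs independently."""
    return sum(chunk)   # placeholder work

def run_in_parallel(data, n_chunks=8):
    # Chunk the problem up yourself, then hand each slice to a separate interpreter.
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    with Pool(processes=n_chunks) as pool:
        return pool.map(run_model, chunks)

if __name__ == "__main__":
    print(run_in_parallel(list(range(1_000_000))))
```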

35:52 The upside of that approach is that it makes it possible to access all of this content that may never fit nicely into one of these models.

36:00 The downside of that approach is that you end up with all of the headaches that are usually associated with things like PySpark, which is like if that process runs out of memory and crashes, you don't have any idea what happened.

36:13 PySpark can't know.

36:14 You know, we won't really be able to know why that happened.

36:16 And then the responsibility is now back on the user of the system to figure out, well, why did it run out of memory?

36:23 What do I need to do to get it to fit?

36:24 Which is all stuff that we can make go away when we can actually see the source code in itself and translate it.

36:30 So I think it's kind of a necessary step to getting everything working for a lot of use cases.

36:35 Yeah, absolutely.

36:36 Because you don't want to try to translate the long tail of all these libraries and stuff.

36:41 So talk to me about how you share memory.

36:43 So, for example, if I've got a list in memory and it's, as far as my program is concerned, has like a billion objects in it, what does that look like to the system, really?

36:54 Sure.

36:55 This is a super interesting question.

36:57 So the idea is we put it in chunks, each chunk being a reasonably small amount of memory, like 50 or 100 megabytes.

37:04 And then the idea is that these chunks are scattered throughout memory.

37:08 So, like, imagine that you had a billion strings and each string is like, on average, a kilobyte.

37:13 So you've got a terabyte of stuff in that list.

37:18 But, like, the strings are different lengths, right?

37:20 So, you know, you might have some chunks might have 50,000 records and some other chunks might have 5,000 records and some other might have 500,000 records, depending on how you and how your data actually looks.

37:32 So the system chunks it up.

37:34 And then what it does is, as your program is running, it thinks about this sort of at the page level.

37:40 It says, okay, I can see your program is running.

37:42 And depending on what it's doing, it needs different blocks of that data to be located together on the same machine.

37:49 So if you do a really simple thing, like, imagine you say, okay, let's make a list comprehension.

37:53 I'll scan over my list of strings and take the length of each one.

37:57 Well, that's a really simple operation.

37:59 You know, I don't need anything more than each individual string at once to do that.

38:04 So it doesn't have to move any data around.

38:06 It just literally goes to every chunk and says, okay, take that chunk and apply the length function to it, and now you get back a bunch of integers.
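
A toy single-machine version of that chunk-local operation, with a purely illustrative chunk size:

```python
# Pretend each chunk is the slice of the big list that happens to live on one machine.
strings = ["log line %d" % i for i in range(1_000_000)]
chunk_size = 50_000
chunks = [strings[i:i + chunk_size] for i in range(0, len(strings), chunk_size)]

# len() only needs the one string it's looking at, so every chunk can be
# processed where it already sits -- no data has to move between machines.
lengths_per_chunk = [[len(s) for s in chunk] for chunk in chunks]
```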

38:12 But you could do something more complicated.

38:14 You could say, all right, imagine that these strings are log messages and, you know, the first tenth of them are from one machine and the second tenth are from another machine.

38:25 You might say, okay, I'm actually going to go figure out the indices of the different machines,

38:28 and I want to do something where I'm scanning through them and looking at them at different timestamps across all the different machines.

38:35 So now you actually have a more complicated relationship.

38:39 You need different chunks.

38:40 You might need the first hundred thousand strings and string 500 million to 503 million.

38:47 You might need those two blocks together because you're actually doing something with the two of them.

38:50 One partitioning might be really evenly distributed across the machines.

38:55 Then you ask the question the right way, and it just parallelizes it perfectly.

38:57 But you might cross-cut it really badly, right?

38:59 There's two problems here, right?

39:01 One of them is, did you write something where I can get enough data in memory to solve your problem without having to load a lot of stuff?

39:09 And then the other problem is, like, which things need to be together.

39:12 So, like, what I was just describing is a case where if you just naively put the chunks on different machines,

39:18 it's very easy to ask a query where the chunks are in the wrong place.

39:22 And our system can handle that, right?

39:24 So it'll see, like, oh, you're accessing these two chunks together because you're using both of them.

39:30 That problem it solves well because you can actually see that co-location and break your problem down.

39:36 It's basically active working sets of pages.

39:39 And so it'll just shuffle everything around in memory to meet the requirements of your problem.

39:43 There's a second issue, which is, like, imagine you said, look, I've got a terabyte of strings.

39:47 I'm going to start indexing randomly into the set of terabytes of strings,

39:52 and you're just not going to be able to predict anything about what I'm going to do.

39:56 I'm just going to grab them randomly in sequence.

39:58 You know, there's kind of no way to make that fast.

40:01 What you'll end up doing is repeatedly waiting on the network while the system goes and fetches, you know,

40:07 string 1 million and then string 99 million and then, you know, whatever.

40:10 It's going around and grabbing.

40:11 And this is just because you've written an algorithm that doesn't have any cache locality to it whatsoever.

40:17 And so there isn't really a good solution around that other than trying to minimize the amount of network latency.

40:24 But the way I try and articulate it to people is to say, try and think about it so that your program in your mind,

40:30 you could break it down into sets of Python function calls that use, you know, a couple of gigs of data at a time.

40:37 And it will figure out how to get those gigs of data to the right place.

40:41 You know, it'll see what that structure is by basically through actually running your program and seeing where you have cache misses.

40:48 Yeah, that's pretty interesting because one of the things you can do to make your code much more high performance,

40:54 especially around parallelism, but even in general, is to think about caching cache misses on the CPU cache, right?

41:02 Like locality of data, things like that.

41:05 And that's, you know, running on my own machine.

41:08 But it's interesting that that also applies on the distributed sense for you guys.

41:12 Yeah, well, in many senses, it's exactly the same problem.

41:15 It's just that it's a much, much, much worse problem.

41:18 Like when you cache miss on L1 cache in your CPU, you're doing something and the data is not present.

42:25 Like it's like 100 CPU cycles to go and fetch that data from main memory, which is like much slower than what your CPU is used to.

41:34 But that's still incredibly, incredibly fast.

41:36 Like if you have a job running on one computer on a network and it says, hey, I need this 50 meg of data on another machine.

41:43 Like go get that for me.

41:45 Like if you have 10 gig Ethernet, that means you can move like a gigabyte around per second on that network between two machines.

41:52 That's like a 20th of a second that you're going to have to wait to get that 50 meg.

41:55 You know, four orders of magnitude, five, I don't even know.

41:59 It's some stupidly larger number of time to wait for that data to be present.

42:03 So it's the same basic problem.

42:05 It's just that you're way more sensitive to the cache locality issue when you're operating in a distributed context.

42:11 And it's one of the reasons why the MapReduce Spark model has been so successful is like you're giving the framework a very clear idea of what the cache locality is going to be.

42:23 When you say take this function and run it on every element of this data set, what you're implicitly telling it is like copy the data for this function to every single machine before you do the job.

42:33 And now you'll never be waiting for any data.

42:36 And like we're saying, look, you can take that even a step further.

42:39 You can infer which data actually needs to go there for arbitrarily complicated patterns and make sure that you're not waiting on cache misses.

42:49 But it's all boiling down to like trying to fit the program into a model where threads are not waiting on data, where you're getting the most out of the CPUs as you possibly can.

43:00 Okay.

43:00 Yeah, that makes a lot of sense.

43:01 So let me ask you a little bit about your business model.

43:05 Ufora is open source.

43:08 You can go to github.com/ufora and check it out, right?

43:12 But obviously you guys are doing this as a job.

43:15 So what's the story there?

43:16 Sure.

43:17 So we're basically a data science and engineering consultancy at this point.

43:23 We deploy Ufora as part of our work for our clients.

43:29 We open sourced it because we wanted technology to get to as many people's hands as possible.

43:36 You know, I didn't leave the hedge fund industry in order to purely just make a pile of money, right?

43:41 I wanted to build stuff that people could use that would enable the world to do interesting things.

43:47 So we put it out there and we're hoping that we're hoping other people will pick it up.

43:51 But our business model mostly revolves around us actually doing data science and engineering work for our clients.

43:59 And so in some cases, the Ufora platform is really only 25% or 30% of the solution because you have algorithmic work that needs to be done regardless of what platform you're doing it in.

44:10 And then you have data integration work that needs to be done, you know, making all these various databases talk to each other.

44:17 And so the infrastructure ends up being sort of a part of it.

44:21 I think over time we're anticipating building out additional products and services on top of the Ufora platform and selling them as standalone applications.

44:32 But infrastructure like this really wants to be open source.

44:35 Like it's very hard to charge for it successfully because you're asking people to take infrastructure and then build a lot of things on top of it.

44:44 And nobody likes to do that if they don't actually control the infrastructure.

44:47 And it's also one of the things that really benefits from community involvement.

44:53 Again, like this idea of the library work.

44:56 We as a small company are not going to be able to port every library.

45:00 But if I put this infrastructure out there and it's useful for people, then each marginal developer can look at it and say, hey, there's some piece of content in NumPy or some crazy scikit-learn algorithm that I'm missing.

45:11 It's not that much work for me to go port that one thing and contribute it back.

45:15 And that spirit of like everybody working together is the kind of thing that makes these big platform systems actually run, which is, I think, again, why you've seen so much success in open source for this kind of platform infrastructure.

45:29 That sounds really nice.

45:30 I think the one thing I could see that you guys could make sort of directly charge for is if you actually managed the servers and you kind of sold it as computation as a service or something like this.

45:42 I thought about that, but it's a little like there was a company called PiCloud that raised a bunch of venture funding.

45:50 Yeah, I met some of those guys.

45:51 Yeah, and their stuff was super awesome.

45:53 And I think they ultimately shut that down because they didn't feel like they could generate the kind of return that they wanted.

46:00 And, you know, the Databricks guys are doing hosted Spark as a service.

46:05 But then Google came out with hosted Spark as a service, and Google's cost to be able to provide that infrastructure is super low.

46:13 So, you know, I looked at it and it's a nice idea.

46:16 But I think that you just you run into this problem where if you're successful, Amazon or Google will just run the same thing and their cost advantage will eat your lunch.

46:25 You know, I think in enterprise land, there's a lot more value to solving the LDAP problem, which is, you know, most of the people listening to this podcast are probably big users of open source software.

46:36 And a lot of them, you know, will pick this kind of technology up to use personally.

46:40 When you bring this kind of technology into the enterprise, you need all these additional support services that regular people don't need.

46:47 It needs to integrate with some crazy Microsoft legacy product from 10 years ago.

46:52 It needs to be able to talk to all kinds of funny databases that startups don't have, as an example.

46:57 I think that a lot of the monetization in open source happens around that kind of problem.

47:03 Yeah, I think you're right.

47:04 Enterprise definitely has a lot of its own special challenges, let's say.

47:08 And that's great.

47:09 Right.

47:09 And those challenges, they don't affect adoption.

47:11 They affect how big organizations control and lock this kind of technology down.

47:18 And so it makes sense to basically build products and services around that and to try and make the core infrastructure as widely used as possible in the community so that it's as good as it can be.

47:30 Yeah.

47:30 Okay, great.

47:31 So you said you have some interesting things going on and sort of what's coming as well.

47:35 And one of them is you had recently added iPython notebook integration, right?

47:40 Yeah, so this is something that had been on our to-do list for a while.

47:44 It doesn't turn out that it's actually that much work.

47:46 You just need to make sure that you understand where to get the source code.

47:50 Because instead of living on disk, like most Python modules, it's in memory in these specialized Python notebook cells.

47:58 We also did some work around getting feedback from the cluster.

48:01 So you can fire something off now and it will tell you while the thing is running, you're using 500 cores.

48:07 You can see that number go up and down as it parallelizes your computation because you can write something that is naturally single-threaded.

48:15 And then you'll get feedback saying, hey, like this thing that you wrote, it's really only using one core because you've got some loop that can't be parallelized, something like that.

48:22 And this is, I think, a place we're planning on doing more, giving more feedback, more introspection, like some profiling tools and stuff like that.

48:30 So that as you're running your calculation, you can see, okay, it's really spending a lot of time inside of these functions.

48:35 This place is like thrashing because it's using too much data or whatever.

48:40 So I think there's a lot that we can do there to sort of help give more feedback to people about what's going on in their calculations.

48:47 This is a common problem in any distributed programming context, which is like when you're working on a regular Python interpreter, you can just put print statements and everything is good.

48:55 But when you're working on across hundreds of machines, even if you could get all the print statements, you'd be drowning in print statements.

49:03 You wouldn't be able to get anything productive out of it.

49:05 And it can be really hard for people to understand, like, you know, okay, in a distributed context, why is this slow or fast?

49:12 So I think there's a lot to do to give people more clarity.

49:16 I think the other things that we're working on right now that I think are super interesting is we just finished doing something where we're actually making an estimate of how long your computation is going to run for.

49:30 So what we do is as we're running your computation, we're constantly breaking it up into little pieces.

49:35 And we actually look at what functions you're calling and what values are on the stack.

49:39 So, you know, I can notice that if you call f with the number 10, it takes one second.

49:44 If you call it with the number 1,000, it takes 100 seconds.

49:48 I can build a little model for all your function calls about how long they take.

49:51 And this means that on subsequent runs, I can actually make a projection for how long the different pieces of the computation are going to take.
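
A toy version of that kind of per-function timing model, fitting a power law to observed (argument, runtime) pairs; the fitting choice here is an assumption for illustration, not how the actual scheduler models it:

```python
import math

# Observed samples from earlier runs: f(10) took ~1 second, f(1000) took ~100 seconds.
samples = [(10, 1.0), (1000, 100.0)]

def fit_power_law(samples):
    """Fit time ~= a * n**b by least squares in log-log space."""
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(t) for _, t in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

a, b = fit_power_law(samples)

def predicted_seconds(n):
    return a * n ** b

print(predicted_seconds(100))   # projects roughly 10 seconds for an unseen argument
```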

49:58 And this is useful in two ways.

50:01 This is useful for the scheduler because now I can say, okay, well, this super long computation, I'd better schedule that first because that will improve the overall runtime of the computation, gives it a sense of how to spread the calculations across the system better.

50:15 But it's also useful in that I think over the long run, we'll actually be able to give you accurate estimates of how long your calculation is going to take from the beginning and actually make recommendations to you about how much hardware you should use, which would move us towards a more automated way of using the cloud.

50:32 Where instead of booting machines and thinking of them as your cluster, you just literally fire the thing off and think purely in terms of cost.

50:39 That's great.

50:40 So you have like some machine that would just be sort of in charge of orchestrating this and you say, go run this job.

50:46 And it would go, all right, that's 100 reserved or 100 spot instances at this price with this many cores, and go.

50:52 Something like that?

50:53 Yeah, exactly.

50:54 Like it could actually come back to you and say, I think this is going to take, you know, 100,000 compute seconds, and that it's parallelizable up to 1,000 cores.

51:05 So this is how much it's going to cost.

51:07 And you could think of it totally that way and then really completely abstract away the actual machines.
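
Working through the numbers in that example (100,000 compute-seconds, parallelizable up to 1,000 cores), the "think purely in terms of cost" view is just arithmetic once you assume a per-core price; the spot price below is made up for illustration.

    compute_seconds = 100_000        # total single-core work for the job
    max_parallelism = 1_000          # cores the job can usefully occupy
    price_per_core_hour = 0.01       # assumed spot price in dollars (illustrative)

    wall_clock_seconds = compute_seconds / max_parallelism        # about 100 seconds
    core_hours = compute_seconds / 3600.0                         # about 27.8 core-hours
    estimated_dollars = core_hours * price_per_core_hour          # about $0.28

    print(f"~{wall_clock_seconds:.0f}s wall clock, ~${estimated_dollars:.2f} at the assumed rate")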

51:13 But you can't really do that until you have a good estimate, because the problem with computer programs is that you can keep adding zeros and quickly get to runtimes that are astronomical.

51:23 So you don't want to do that naively because then you'll end up with some system that just says, hey, like I'm just going to run on a million cores and send you an astronomical Amazon bill, which you really don't want.

51:33 So actually getting a good estimate is pretty important if you're going to make that feature work.

51:39 But like that model also has incredibly useful features because it's essentially a profile of your code.

51:44 It can tell you, hey, by the way, the reason why your program is taking so long is that like when you're calling this function, it's taking 3,000 compute seconds.

51:51 And then that tells you, oh, I should go look there and maybe see why it is that it's so slow.

51:55 Yeah.

51:56 Yeah.

51:56 That's really cool.

51:57 And so far what we've been talking about is running this on CPUs, on dedicated VMs in the cloud.

52:04 But if you're going to do computational stuff like pure mathy type stuff, some of the fastest hardware you can get a hold of are actually graphics units, GPUs, right?

52:14 Yeah, absolutely.

52:15 I think the thing that's driving the current AI renaissance right now is the fact that these graphics processors aren't just good for playing video games.

52:26 They're good for doing general math.

52:27 And in many cases, they can do several hundred times more math operations than your regular CPU can.

52:33 The biggest issue with them is that they're really, really challenging to program.

52:38 Unlike your CPU, where each of the cores on your machine can do a completely different thing at once, and you can just program as if they're totally independent of each other.

52:47 The threads on a GPU all have to do exactly the same thing simultaneously.

52:51 They can kind of try and hide that from you a little bit.

52:54 But as soon as you do something that the GPU doesn't like, it suddenly gets hundreds of times slower and there's no benefit to using it.

53:00 It's really hard to write code.

53:01 that's efficient on a GPU.
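
The lockstep constraint is easiest to see by analogy in ordinary Python and NumPy: a per-element branch means different elements take different paths (the kind of divergence that serializes GPU threads), while a data-parallel select lets every "thread" do the same operation. This is only an analogy in CPU code, not actual GPU code.

    import numpy as np

    values = np.random.randn(1_000_000)

    # Branchy, element-at-a-time version: each element picks its own path.
    # On a GPU, threads in the same group would diverge here and slow down.
    def branchy(xs):
        return [x * 2.0 if x > 0 else x * 0.5 for x in xs]

    # Lockstep-friendly version: the branch becomes a data-parallel select,
    # so every lane performs the same operation on its own element.
    def lockstep(xs):
        return np.where(xs > 0, xs * 2.0, xs * 0.5)

    assert np.allclose(branchy(values), lockstep(values))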

53:05 And one of the things that we're spending a lot of time on at Ufora is figuring out how to get Python code to run natively on the GPU and to solve some of these programming problems.

53:15 And so the idea is, in the same way that we're able to make fast programs scaling out on the cluster by actually running them and learning from the way they're behaving, like which functions are slow and how they're accessing data, we can apply those same techniques to solve some of the problems that people have with GPUs.

53:33 So as an example, we can identify places where threads are in fact all doing something together in lockstep and say, hey, we've noticed that this is a good piece of code to run on a GPU and schedule that automatically.

53:45 And then there are more aggressive program transformations you can do where you can detect that if you've modified the program slightly, you would end up with something that actually would run efficiently on GPU.

53:59 I was reading some blog post by a guy at NVIDIA where he pointed out that if you move an array from local memory to shared memory in a GPU, suddenly it speeds up 25 times, which is like an enormous performance difference, right?

54:14 And when I was reading it, I didn't even remember exactly what the difference between local and shared memory was.

54:19 And my point is that this is the kind of thing that like an optimizer that actually had a statistical model of your program would be perfectly capable of doing in an automated way, which would free you from the burden of trying to figure that out.

54:32 And in some cases might do optimizations that no programmer had ever thought of because it's such a hard thing to optimize and they didn't realize, wow, if I move this one little thing over, it's going to be faster.

54:41 So we're just getting started with this, but I think there's an enormous amount that we can do and you should expect to see more commits related to that coming out over the summer.

54:50 Yeah, I think that's really amazing.

54:52 And just for those of you guys who don't know out there, you can go to Amazon, AWS EC2 and say, I would like to get a GPU cluster with NVIDIA or ATI Radeon or whatever type of graphics cards, right?

55:04 Yeah.

55:05 So what's even better is you can do this with Spot.

55:09 So you can go and get machines that have four Teslas on them and you can pay like 25 cents an hour for that.

55:17 And so you can put a hundred machines with four Teslas on them for 25 bucks an hour, which is just incredible. That's a stupid amount of computing hardware.

55:27 The GPU instance prices on Amazon fluctuate a little bit more wildly because I think there are some researchers doing a whole lot of deep learning research on there.

55:36 So it's actually kind of funny.

55:38 Sometimes the smaller GPU instances actually cost a lot more than the larger ones because there's pricing pressure on those.

55:45 But it is an amazing ecosystem.

55:47 And a lot of our work is focused on this idea that like not only is it easy to boot those machines, but it should be easy to use 100 GPUs without having to think about it, right?

55:56 There's actually some great software out there right now for using a single GPU effectively.

56:02 But what happens when you now say, okay, well, I have data on one GPU and it wants to talk to data on a completely different GPU on a completely different machine, right?

56:10 That's the thing that I'm trying to make totally transparent.

56:12 That's a really awesome problem.

56:13 And so obviously the GPUs are fast, and this would maybe make it possible for you to get an answer to your question sooner.

56:21 But would it actually make it cheaper as well to answer your question in general?

56:25 No question. There are a whole host of problems where, if you move them from CPU to GPU, your total cost of computation goes down by a couple orders of magnitude.

56:36 So actually as an example, this maximum likelihood calculation I was doing on the retail stuff for one of our customers, one of the reasons we're pushing into the GPU space is that we currently use about 40 Amazon machines,

56:50 the really big instances, to get their calculation done in about an hour, but they have to do this fairly frequently to update the model when new data comes in.

56:58 And, you know, I mean, that costs them 10 bucks every time they run it.

57:01 And if they, you know, do that every day, four times a day for a long enough time, that starts to become real money again.

57:06 You know, we've estimated that it would be about 5% of the cost if we can get that same calculation to run as efficiently as we think it could using GPU.

57:17 Now the problem with that is, as soon as you do that, it's not clear that people will take the cost savings and not just reinvest it into solving a bigger problem.

57:25 But like, look, that's great for them if they want to do it.

57:27 Yeah, that's interesting.

57:29 So 20 times cheaper when you say 5%.

57:31 Yeah, something like that.
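
For scale, here is the back-of-the-envelope version of that comparison using the figures mentioned above; the annualization is only there to show how the savings compound, and the 5% figure is Braxton's estimate, not a measurement.

    cost_per_run_cpu = 10.00     # roughly 40 big instances for about an hour
    runs_per_day = 4
    gpu_cost_fraction = 0.05     # estimated GPU cost relative to the CPU cluster

    cpu_cost_per_year = cost_per_run_cpu * runs_per_day * 365    # about $14,600
    gpu_cost_per_year = cpu_cost_per_year * gpu_cost_fraction    # about $730

    print(cpu_cost_per_year, gpu_cost_per_year,
          cpu_cost_per_year / gpu_cost_per_year)                 # ratio works out to 20x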

57:32 Yeah, that's awesome.

57:33 I mean, it's extremely dependent on the problem.

57:35 Sometimes it's only like two times faster.

57:38 Sometimes it's slower.

57:39 And sometimes it's like, look, for the neural network stuff that people are doing, it's at least 100 times faster slash cheaper.

57:45 I mean, the neural network people all think in terms of electricity costs, right?

57:50 They literally say how many like training cycles they can do per kilowatt hour.

57:53 Wow.

57:54 Okay, that's really amazing.

57:57 So is there a way with your system to sort of extrapolate and estimate how much something would cost?

58:04 Is there a way to say, I have these three problems?

58:07 If I were to give it this much data, how much would each one of these cost?

58:10 Because I can only afford to answer one.

58:12 Well, so that's one of the things that we're hoping to answer with some of this extrapolatory runtime stuff, this ability to predict what runtime is actually going to be and how much compute power you need.

58:25 So the idea would be that you would run all three problems at several smaller scales.

58:30 The system could see how the runtime of all the little functions was changing as you're changing the problem size.

58:37 And then it would be able to extrapolate from there.

58:40 At the end of the day, though, it all depends on how accurate you want your estimate to be.

58:46 If you really want a perfect estimate, you should actually do that experiment yourself and make a decision.

58:51 And like, honestly, the way I usually do this problem is doing that process by hand.

58:56 I'll run it with some smaller input and I'll just keep adding zeros until it becomes appreciably slow.

59:01 And then I'll try and understand why it's changing.
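
That by-hand process ("keep adding zeros until it becomes appreciably slow") looks roughly like this; the 10x step and the five-second threshold are arbitrary choices for the sketch.

    import time

    def scale_until_slow(f, start_n=1_000, threshold_seconds=5.0):
        """Run f on 10x larger inputs until a single run gets appreciably slow."""
        n = start_n
        while True:
            start = time.time()
            f(n)
            elapsed = time.time() - start
            print(f"n={n:>14,d}  took {elapsed:.3f}s")
            if elapsed > threshold_seconds:
                return n, elapsed
            n *= 10  # "keep adding zeros"

    scale_until_slow(lambda n: sorted(range(n, 0, -1)))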

59:03 But yeah, in principle, that's totally the direction that we're moving in where you can literally just get a price estimate for each of the three things.

59:12 And there might be, you know, a plus or minus factor of two or whatever, depending on how accurate the estimates are.

59:18 Yeah.

59:19 Okay.

59:19 That sounds great.

59:20 So I guess the last thing we have time to cover is how do you get started?

59:24 Like, I know it's on GitHub, but there's a lot of moving pieces with distributed computing.

59:29 And so can you just walk me through going from nothing until I have an answer, maybe, what that looks like?

59:35 Yeah, absolutely.

59:36 So we publish Docker images with the software backend every time we do a release.

59:42 So if you want to run this on your local machine, you pip install the front end, which is called Pyfora.

59:49 And that's just vanilla Python code that, like, has the thing that takes your Python code and sends it to the server.

59:56 And then you need to get some nodes that can actually run work.

01:00:00 And you've got two options.

01:00:01 One of them is to run it just on your local machine.

01:00:03 And doing this gets you the benefit of using the cores on your local machine.

01:00:07 So it's not trivial.

01:00:08 So to do that, you just run docker run ufora/service:latest.

01:00:15 That, you know, pulls the latest version of the Ufora service and runs it.

01:00:18 And then if you want to run this on Amazon, which is the way I do everything, we have a little command line utility called Pyfora AWS.

01:00:24 And that thing can just boot instances.

01:00:28 You have to have your AWS credentials exposed in the environment the same way you would if you were using boto to interact with AWS.

01:00:35 But if you do that, it knows how to start machines and stop them.

01:00:39 And it will make sure everything is configured on there correctly.

01:00:41 You do need to make some decisions about the security model.

01:00:45 Personally, I prefer to boot machines and then use SSH tunneling.

01:00:49 So then I'm connecting to a server that looks like it's on local host and SSH is taking care of all of the security.

01:00:56 But there are other ways to do it that are documented in Pyfora AWS.

01:00:59 But the basic idea is you use AWS or you use Docker locally to get something going, you know, one of these machines.

01:01:06 And that gives you an IP address that you can talk to.

01:01:09 In your Python program, you connect to that IP address.

01:01:12 You get a little connection object.

01:01:14 And then any code that you want to execute in Ufora, you just put inside of a with block that references that connection.

01:01:21 So you say with my connection colon.

01:01:24 And then for anything inside of there, when your Python interpreter gets to it, instead of executing it locally, it will take that code and all the relevant objects and ship them over to the server.

01:01:34 The server will execute them and then ship the results back.
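
Putting those pieces together, end to end, looks roughly like this. The package name, the Docker image, and the with-block usage come from the conversation; the connection URL, port, and the exact connect call are assumptions, so the project's README is the place to check the real API.

    # One-time setup, roughly as described above:
    #   pip install pyfora
    #   docker run -d -p 30000:30000 ufora/service:latest   # port mapping is an assumption

    import pyfora

    # Connect to the backend: a local Docker container or a booted AWS node.
    # The exact signature here is an assumption based on the description above.
    conn = pyfora.connect("http://localhost:30000")

    # Code inside the with block is shipped to the cluster along with the
    # objects it needs, executed there, and the result comes back locally.
    with conn:
        total = sum(i ** 0.5 for i in range(100_000_000))

    print(total)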

01:01:38 You know, now that we have Python notebook integration, I'm going to be publishing a bunch of example Python notebooks.

01:01:45 So you should be able to go to our website and just pull down a few examples of those things operating.

01:01:49 But, you know, the basic point is that if you are a user of Amazon already, you can get up and running in the amount of time it takes to boot an AWS instance,

01:02:00 plus however long it takes you to pip install Pyfora.

01:02:03 Wow, that sounds really cool.

01:02:05 Nice work on that.

01:02:06 And props for using the context manager, the with block.

01:02:10 It turns out it's a really nice way of doing it there.

01:02:13 You have to do some clever trickery under the hood to make that work, especially when it comes time to propagate exceptions.

01:02:22 Because if you produce an exception on the server, you've got to take that state, move it back over, and then kind of rebuild the appropriate stack trace objects on the client, which is an interesting thing to do.

01:02:35 But the end result is a really nice integration.
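
Independent of Ufora's internals, the general trick for propagating an exception across machines is to capture the traceback where the error happened, ship it back as data, and raise a local exception that carries the remote trace. A bare-bones sketch:

    import traceback

    class RemoteError(Exception):
        """Raised locally, but carrying the stack trace from the remote worker."""

    def run_on_worker(fn, *args):
        # Imagine this part executing on the remote machine.
        try:
            return {"ok": True, "value": fn(*args)}
        except Exception as exc:
            return {"ok": False,
                    "type": type(exc).__name__,
                    "remote_traceback": traceback.format_exc()}

    def unwrap(result):
        # Back on the client: either hand over the value or re-raise with context.
        if result["ok"]:
            return result["value"]
        raise RemoteError(f"{result['type']} on worker:\n{result['remote_traceback']}")

    print(unwrap(run_on_worker(lambda x: x + 1, 41)))   # 42
    # unwrap(run_on_worker(lambda: 1 / 0))              # raises RemoteError with the remote trace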

01:02:37 And it gives you this nice ability to pick and choose where you want to use Pyfora, you know, where to use the technology.

01:02:45 So it means that if there are parts of your code that are really never going to be parallelized and you don't want to move them, or if you're touching things on your file system or you're reading things off of the Internet or whatever, that stuff can all happen on your local box.

01:02:57 It's just the heavy compute stuff that can happen remotely.

01:03:00 It also gives you a nice way of deciding, like, which objects are going to live remotely and which objects are going to live locally.

01:03:06 So you can do a calculation that ends up producing something huge. You know, think about the example we were talking about earlier.

01:03:12 We have a list of a billion strings inside the with block.

01:03:15 That list of a billion strings, you can do whatever you want in your local Python interpreter.

01:03:20 You'll end up with like a reference to that list.

01:03:22 It's like a proxy object.

01:03:24 You can't do anything with it locally because you obviously can't bring a terabyte of strings back into your local Python process without crashing it or waiting for hours and hours.

01:03:33 But you can then pass that back into a subsequent with block and do something there.

01:03:37 Cut it down to something smaller if you want to pull back a slice of it or whatever.

01:03:42 And so it makes a really nice workflow for kind of describing which things are going to be in your local process and which things are going to be remote.
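
That proxy-handle workflow can be mocked locally to show its shape: operations on the handle stay "remote" (here a plain dict stands in for the cluster), and data only crosses the wire when you explicitly pull it back. The class and method names below are toys, not Pyfora's actual API.

    class RemoteProxy:
        """Toy stand-in for a handle to data that lives on the cluster."""

        def __init__(self, cluster_store, key):
            self._store = cluster_store   # pretend this dict is the cluster
            self._key = key

        def head(self, n):
            # Produces another proxy; nothing is downloaded yet.
            new_key = f"{self._key}[:{n}]"
            self._store[new_key] = self._store[self._key][:n]
            return RemoteProxy(self._store, new_key)

        def download(self):
            # Only now does data cross the wire into the local process.
            return self._store[self._key]

    cluster = {"big": [str(i) for i in range(10_000)]}   # stand-in for a billion strings
    handle = RemoteProxy(cluster, "big")
    print(handle.head(5).download())                     # pull back just a small slice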

01:03:47 That sounds awesome.

01:03:48 This is really cool.

01:03:50 If you've got big data processing, it sounds like people should check it out.

01:03:53 So I think we're going to have to leave it there.

01:03:55 We're just about out of time.

01:03:56 Let me ask you just a couple questions to wrap things up.

01:04:00 I always ask my guests what their favorite PyPI package is.

01:04:05 There's 75,000, 80,000 of them out there, and we all get experience with different ones that we'd want to tell people about.

01:04:10 What's yours?

01:04:11 Well, I hate to not be super interesting on this, but I have to say I love Pandas.

01:04:16 I think Wes did a great job.

01:04:18 I would argue that the resurgence of Python in the financial services community is basically due to the existence of Pandas pulling people out of the R mindset and into Python.

01:04:29 So if you're not familiar with it, check it out.

01:04:31 Probably everybody listening to this knows about it.

01:04:33 Yeah, Pandas is very cool.

01:04:34 And how about an editor?

01:04:36 What do you open up if you're going to write some code?

01:04:37 I am a strict adherent to Sublime.

01:04:40 I even paid for it, although I didn't have the energy to actually paste in the key, so it still keeps asking me for the key.

01:04:47 But I did pay for it.

01:04:48 I think he did a great job with that editor.

01:04:50 Yeah, Sublime is great.

01:04:51 All right, so final call to action.

01:04:54 What should people do to get started with Ufora?

01:04:56 So go and check us out on GitHub.

01:04:59 Try running some code.

01:05:01 We've tried to make it as easy as possible to get started.

01:05:04 As I said before, we're going to be posting a bunch of IPython notebooks with examples.

01:05:07 And then, you know, please give us feedback.

01:05:11 Tell us what libraries you want ported, what problems you're running into using it.

01:05:17 And, you know, like, honestly, if you run into some NumPy function that we didn't get to yet, take a crack at implementing it yourself.

01:05:23 It's a pretty straightforward model.

01:05:25 And send us a pull request.

01:05:26 We'd love to include it.

01:05:27 All right.

01:05:28 Excellent.

01:05:29 I think this is a great project you're on, and I'm happy to share with everyone.

01:05:32 So thanks for being on the show, Braxton.

01:05:34 Thank you so much for having me.

01:05:35 It's a pleasure to talk about this stuff, as always.

01:05:37 This has been another episode of Talk Python to Me.

01:05:41 Today's guest was Braxton McKee, and this episode has been sponsored by SnapCI and OpBeat.

01:05:46 Thank you guys for supporting the show.

01:05:48 SnapCI is modern, continuous integration and delivery.

01:05:51 Build, test, and deploy your code directly from GitHub, all in your browser with debugging, Docker, and parallelism included.

01:05:57 Try them for free at snap.ci/talkpython.

01:06:00 OpBeat is mission control for your Python web applications.

01:06:05 Keep an eye on errors, performance, profiling, and more in your Django, Flask, and Pyramid web apps.

01:06:09 Tell them thanks for supporting the show on Twitter, where they're at, OpBeat.

01:06:13 Are you or a colleague trying to learn Python?

01:06:15 Have you tried books and videos that left you bored by just covering topics point by point?

01:06:20 Well, check out my online course, Python Jumpstart by Building 10 Apps, at talkpython.fm/course,

01:06:26 to experience a more engaging way to learn Python.

01:06:28 You can find the links from this show at talkpython.fm/episodes/show/60.

01:06:35 Be sure to subscribe to the show.

01:06:37 Open your favorite podcatcher and search for Python.

01:06:39 We should be right at the top.

01:06:40 You can also find the iTunes feed at /itunes, Google Play feed at /play,

01:06:45 and direct RSS feed at /rss on talkpython.fm.

01:06:49 Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.

01:06:54 You can hear the entire song at talkpython.fm/music.

01:06:58 This is your host, Michael Kennedy.

01:07:00 Thanks so much for listening.

01:07:01 I really appreciate it.

01:07:02 Smix, let's get out of here.

