WEBVTT

00:00:00.001 --> 00:00:04.920
When you think data exploration using Python, Jupyter Notebooks likely come to mind.

00:00:04.920 --> 00:00:08.300
They are excellent for those of us who gravitate towards Python.

00:00:08.300 --> 00:00:10.680
But what about your everyday power user?

00:00:10.680 --> 00:00:14.780
Think of that person who is really good at Excel but has never written a line of code.

00:00:14.780 --> 00:00:20.540
They can still harness the power of modern Python using a cool application called Superset.

00:00:20.540 --> 00:00:25.060
This open-source Python-based web app is all about connecting to live data

00:00:25.060 --> 00:00:28.820
and creating charts and dashboards based on it using only UI tools.

00:00:28.820 --> 00:00:32.440
It's super popular too, with almost 50,000 GitHub stars.

00:00:32.440 --> 00:00:36.160
Its creator, Max Beauchemin, is here to introduce it to us all.

00:00:36.160 --> 00:00:41.460
This is Talk Python To Me, episode 382, recorded September 19th, 2022.

00:00:41.460 --> 00:00:57.400
Welcome to Talk Python To Me, a weekly podcast on Python.

00:00:57.400 --> 00:00:59.120
This is your host, Michael Kennedy.

00:00:59.120 --> 00:01:04.740
Follow me on Twitter where I'm @mkennedy, and keep up with the show and listen to past episodes at talkpython.fm.

00:01:04.740 --> 00:01:08.320
And follow the show on Twitter via @talkpython.

00:01:08.320 --> 00:01:11.960
We've started streaming most of our episodes live on YouTube.

00:01:11.960 --> 00:01:19.560
Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

00:01:20.560 --> 00:01:24.860
Did you know one of the best ways to support the show is by taking one or more of our courses?

00:01:24.860 --> 00:01:32.080
In fact, we have one of the largest libraries of Python courses out there with over 240 hours of videos.

00:01:32.080 --> 00:01:37.340
Before we get to the conversation, I want to quickly let you know that we just released three new ones.

00:01:37.340 --> 00:01:42.560
Django Getting Started, Getting Started with pytest, and Python Data Visualization.

00:01:42.740 --> 00:01:47.500
All three are excellent courses, and their landing pages each have a video introducing the course.

00:01:47.500 --> 00:01:52.140
Visit talkpython.fm and click on Courses in the nav bar to learn more.

00:01:52.140 --> 00:01:56.020
Thank you for making Talk Python training part of your career journey.

00:01:56.020 --> 00:01:57.200
Now on to the show.

00:01:57.200 --> 00:02:00.020
Max, welcome to Talk Python To Me.

00:02:00.020 --> 00:02:00.640
Well, thank you.

00:02:00.640 --> 00:02:01.540
Exciting to be here.

00:02:01.540 --> 00:02:04.940
Love to talk about Apache Superset, so hit me up.

00:02:05.020 --> 00:02:10.000
Yeah, it's quite the thing that you've created, and it looks like it's really going strong.

00:02:10.000 --> 00:02:18.060
So we're going to talk about tools for data exploration in general, and then we'll dive in and focus on Superset, which is what you've created.

00:02:18.060 --> 00:02:19.220
So I'm really excited to do that.

00:02:19.220 --> 00:02:20.120
Excited to do it, too.

00:02:20.120 --> 00:02:29.340
I've been kind of bathing and kind of swimming in this whole world of data, data orchestration, exploration, visualization for the past 20 years or so.

00:02:29.340 --> 00:02:35.600
So that's been really my focus, so I should have a lot to say about everything related to this space.

00:02:35.600 --> 00:02:36.320
Yeah, fantastic.

00:02:36.320 --> 00:02:43.200
And you've got a lot of experience at many of the big tech companies that people would think of as having lots of interesting data to look at.

00:02:43.200 --> 00:02:45.580
So we can dive into that just a bit at the beginning here.

00:02:45.580 --> 00:02:49.260
But before we get to any of those things, let me kick it off with a beginning question.

00:02:49.260 --> 00:02:52.320
How did you get into programming and Python and all these things?

00:02:52.320 --> 00:02:52.900
Oh, goodness.

00:02:52.900 --> 00:02:59.040
Yes, I did the equivalent of an associate degree back around like in the late 90s.

00:02:59.100 --> 00:03:03.220
So that kind of, you know, says about how long I've been doing this.

00:03:03.220 --> 00:03:04.580
But I never finished it, too.

00:03:04.580 --> 00:03:08.980
So I never finished and got my actual diploma for it, too.

00:03:08.980 --> 00:03:12.180
So I got into an internship to join a company called Ubisoft.

00:03:12.180 --> 00:03:13.340
So it's a video game company.

00:03:13.340 --> 00:03:16.000
It's one of the major video game companies out there.

00:03:16.000 --> 00:03:20.940
And I went on to my first internship and never looked back and never finished a program.

00:03:20.940 --> 00:03:22.660
So that's where my career started.

00:03:22.660 --> 00:03:23.300
That's awesome.

00:03:23.300 --> 00:03:27.280
This program was like very, it's called Technique en informatique.

00:03:27.480 --> 00:03:28.880
I'm from Quebec City originally.

00:03:28.880 --> 00:03:31.660
So I grew up speaking French and it was a program in French.

00:03:31.660 --> 00:03:33.380
And it's a technical program.

00:03:33.380 --> 00:03:39.800
The goal of the program was to send, you know, technicians, as they call them, people that are very technical, really focused.

00:03:40.180 --> 00:03:44.540
And then give them the skills that they need to be effective joining companies.

00:03:44.540 --> 00:03:49.560
So some programming, some data modeling, a little bit of SQL.

00:03:49.560 --> 00:03:54.300
And then really, like, the skills that you need to get started and start coding.

00:03:54.300 --> 00:03:57.900
Not necessarily thinking about, like, computer science and, like, data structures.

00:03:57.900 --> 00:04:01.120
Like, much more, like, what do you need to get started?

00:04:01.260 --> 00:04:02.780
Let me interrupt you for just a second there.

00:04:02.780 --> 00:04:05.360
And we can maybe talk just a bit about this.

00:04:05.360 --> 00:04:12.500
I feel like a lot of people looking in from the outside feel like, oh, I need a computer science degree in order to do X, Y, or Z.

00:04:12.500 --> 00:04:16.360
Whatever it is, you know, create APIs, create a business, do data science or whatever.

00:04:16.720 --> 00:04:23.620
And so much of the focus of CS degree seems to be on algorithms, on operating systems.

00:04:23.620 --> 00:04:30.620
And while those are really good to know, they're not necessarily the skills you sit down and go, let me remember my algorithms things.

00:04:30.620 --> 00:04:33.140
Like, you just call a function on a data structure.

00:04:33.140 --> 00:04:35.120
Or let me remember my operating system stuff.

00:04:35.120 --> 00:04:36.180
Like, you just run the code.

00:04:36.180 --> 00:04:37.860
I mean, it's helpful to have that.

00:04:37.860 --> 00:04:39.820
But I don't feel like it's that necessary.

00:04:39.820 --> 00:04:44.200
I don't want people out there listening to think, oh, I've got to go get a CS degree or I'm not going to do anything, right?

00:04:44.200 --> 00:04:54.820
Yeah, I think, you know, the boot camps have, like, flipped that upside down to say, like, oh, no, all you need is the technical skills to get started to, I don't know, build an app, you know?

00:04:54.820 --> 00:04:59.040
And then you don't need those fundamentals or maybe the premise that you need them later.

00:04:59.040 --> 00:05:02.200
I think there needs to be some balance there, you know?

00:05:02.200 --> 00:05:06.820
So the CS approach, like, let's start with the foundation and how we got here.

00:05:06.820 --> 00:05:08.860
And then, you know, the rest should follow.

00:05:08.860 --> 00:05:10.700
I don't think that's right.

00:05:10.700 --> 00:05:16.360
So to me, you don't really have that curiosity about how you got there until you've been a practitioner.

00:05:16.360 --> 00:05:22.380
So to me, I'm like, hey, teach people the skills they need to be successful and useful to employers.

00:05:22.380 --> 00:05:27.500
It seems like the way that university in general or education should be oriented towards.

00:05:27.500 --> 00:05:32.680
Let's teach people the skills they need to contribute effectively to the market.

00:05:32.680 --> 00:05:41.080
And then I think maybe the CS constructs is something you would learn, that wisdom you would build, you know, as you learn.

00:05:41.080 --> 00:05:43.840
Some of you would pick it up and some of you are like, I really need to know.

00:05:43.840 --> 00:05:45.100
I've been doing this for a year now.

00:05:45.100 --> 00:05:46.380
I need to know how this thing works.

00:05:46.380 --> 00:05:47.880
And you'll dive into it, right?

00:05:47.880 --> 00:05:52.200
But when you're motivated and you're kind of, you have that experience already, yeah.

00:05:52.380 --> 00:05:55.760
Yes, it's like when you have to solve the problem, certain problems.

00:05:55.760 --> 00:06:04.020
And then maybe at that point, I don't know, if you're writing a bunch of SQL and you're building a lot of data structure, maybe you need to understand like data modeling construct.

00:06:04.020 --> 00:06:08.960
And that's a good time to go and understand the history of the different approaches to data modeling.

00:06:08.960 --> 00:06:11.940
But maybe you don't start from the theory, right?

00:06:11.940 --> 00:06:12.300
Yeah.

00:06:12.640 --> 00:06:20.300
But yeah, so like going back to your question, so I, so then I joined a, so I was a web developer kind of building internal apps for like a year.

00:06:20.300 --> 00:06:33.780
And then very quickly I got into data and got into using, building data warehouse at Ubisoft and then using the business intelligence kind of tool set toolkit to build all sorts of reports, dashboards, kind of self-service things.

00:06:33.780 --> 00:06:35.780
So people could consume data.

00:06:35.780 --> 00:06:37.540
So very quickly got into that.

00:06:37.540 --> 00:06:41.140
And it's a little bit later that I learned, I started doing more scripting.

00:06:41.220 --> 00:06:54.920
So when I joined Yahoo in 2007, I believe that was like the birth of Hadoop and Yahoo had some, some Perl, you know, so I learned to script a little bit and kind of interpreted languages more.

00:06:54.920 --> 00:07:00.100
And then by the time I think I started, I started building more, more, more website for personal projects.

00:07:00.100 --> 00:07:01.800
So I learned a bunch of Python there.

00:07:01.800 --> 00:07:03.320
I did a lot with Django.

00:07:03.320 --> 00:07:08.800
And by the time I, by the time I joined Facebook in 2012, I knew Python very well.

00:07:09.200 --> 00:07:18.820
And then that became kind of my, my main kind of my main language, you know, and that's really, you know, what we use internally for a lot of things at Facebook.

00:07:18.820 --> 00:07:25.940
And that just became like more and more of the established kind of language for everything data related right around that time.

00:07:25.940 --> 00:07:27.380
What a cool set of experience.

00:07:27.380 --> 00:07:30.320
You know, you were at, was it Lyft, Airbnb?

00:07:30.320 --> 00:07:31.120
Yeah.

00:07:31.120 --> 00:07:31.640
Facebook.

00:07:31.640 --> 00:07:32.060
Yeah.

00:07:32.060 --> 00:07:32.560
Facebook.

00:07:32.560 --> 00:07:33.080
Yeah.

00:07:33.080 --> 00:07:34.240
A lot, a lot of.

00:07:34.240 --> 00:07:34.720
And Ubisoft.

00:07:34.720 --> 00:07:35.120
Yeah.

00:07:35.500 --> 00:07:37.220
So Ubisoft is interesting.

00:07:37.220 --> 00:07:38.660
They're a Canadian company, right?

00:07:38.660 --> 00:07:39.980
They are a French company.

00:07:39.980 --> 00:07:47.620
So their headquarters is in Montreuil, like very, like next to Paris, or I think they were actually like they're from Bretagne, so Brittany somewhere.

00:07:47.620 --> 00:07:47.980
And.

00:07:47.980 --> 00:07:48.720
Okay.

00:07:48.720 --> 00:07:50.060
They're a French company.

00:07:50.060 --> 00:07:52.040
They have a huge studio in Montreal, though.

00:07:52.040 --> 00:07:55.960
There's like amazing tax breaks in Quebec and Canada.

00:07:55.960 --> 00:08:01.120
And they decided to build like one of the biggest, if not the biggest video game studio in the world in Montreal.

00:08:01.120 --> 00:08:02.540
So that's where I started my career.

00:08:02.540 --> 00:08:02.900
Yeah.

00:08:02.900 --> 00:08:09.380
Well, the reason I'm bringing it up is I want to ask you about what it's like working at a game company versus a more traditional.

00:08:09.380 --> 00:08:12.660
I don't know if you call Yahoo traditional, but like standard.

00:08:12.660 --> 00:08:15.560
You know, a lot of people dream of being at these game companies.

00:08:15.560 --> 00:08:17.400
And that's even maybe why they got into programming.

00:08:17.400 --> 00:08:18.400
And I don't know.

00:08:18.400 --> 00:08:20.140
Tell us what your experience was like there.

00:08:20.140 --> 00:08:20.580
Yeah.

00:08:20.580 --> 00:08:22.000
So it's a dated experience, right?

00:08:22.000 --> 00:08:23.020
I don't know what it is.

00:08:23.020 --> 00:08:26.140
Like I left Ubisoft in 2007.

00:08:26.140 --> 00:08:28.060
So it's like a pretty dated.

00:08:28.060 --> 00:08:29.260
It's 15 years ago.

00:08:29.260 --> 00:08:31.800
I can say about like what it was like at the time.

00:08:31.800 --> 00:08:33.680
It's a mix of like super fun.

00:08:33.680 --> 00:08:36.060
It was like super young, a bit bro-y too.

00:08:36.060 --> 00:08:38.480
And a lot of ways in a very masculine environment.

00:08:38.480 --> 00:08:43.560
Also, like, you know, some of it is because it's, you know, 15, 15 to 20 years ago.

00:08:43.560 --> 00:08:45.640
I think it was a slightly different world.

00:08:45.640 --> 00:08:51.360
And a lot of things that were maybe dubious back then are definitely not okay anymore.

00:08:51.700 --> 00:08:53.100
I mean, I think there's that culture.

00:08:53.100 --> 00:08:55.960
People talk about, I think Electronic Arts has been famous for that.

00:08:55.960 --> 00:09:00.820
And a lot of the big video game companies is having like these work environments that were

00:09:00.820 --> 00:09:03.920
really like, you know, dubious in some ways.

00:09:03.920 --> 00:09:07.700
But I think Ubisoft was a great place to be, I think, at the time.

00:09:07.700 --> 00:09:13.000
And I think like maybe one of the better ones kind of bringing it back to where it should be

00:09:13.000 --> 00:09:14.780
and ahead of its time, perhaps.

00:09:14.780 --> 00:09:19.300
But my experience at Ubisoft was so interesting because it's difficult for me to talk about

00:09:19.300 --> 00:09:23.960
what is Ubisoft because I worked at three different studios, Montreal, Paris.

00:09:23.960 --> 00:09:29.660
I was in Montreal for about a year until I was in Ubisoft Paris for about three years.

00:09:29.660 --> 00:09:33.560
And then Ubisoft San Francisco for another three years.

00:09:33.560 --> 00:09:36.940
So the three different offices were vastly different.

00:09:36.940 --> 00:09:41.900
And I think, you know, the things that kind of plagued the video game business are like long hours,

00:09:42.180 --> 00:09:46.900
kind of low pay, I think that just like grinding people out sort of thing.

00:09:46.900 --> 00:09:47.160
Yeah.

00:09:47.160 --> 00:09:47.560
Yeah.

00:09:47.560 --> 00:09:51.680
Kind of like there's never enough, a lot of crunch time all the time.

00:09:51.680 --> 00:09:55.140
And then kind of a great place maybe to start your career.

00:09:55.140 --> 00:09:58.280
But then as people mature, they tend to go other.

00:09:58.280 --> 00:10:02.160
So at least I think it has changed so much.

00:10:02.160 --> 00:10:04.120
The whole world culture has changed a lot.

00:10:04.520 --> 00:10:05.000
Yeah.

00:10:05.000 --> 00:10:10.840
As you have relationships and families and you want to see them, things like that.

00:10:10.840 --> 00:10:11.120
Yeah.

00:10:11.120 --> 00:10:11.540
Yeah.

00:10:11.540 --> 00:10:12.540
That might evolve.

00:10:12.540 --> 00:10:12.940
Yeah.

00:10:12.940 --> 00:10:13.320
Seriously.

00:10:13.320 --> 00:10:17.640
You're going to age out of working all the time or just by necessity too.

00:10:17.640 --> 00:10:18.460
Yeah, exactly.

00:10:18.460 --> 00:10:24.100
Let's kick off our conversation focusing on data exploration, I think.

00:10:24.240 --> 00:10:29.700
So when I think about data exploration, not from a developer or data science, but in like

00:10:29.700 --> 00:10:36.420
the super broad sense, I don't know what comes to mind for you, but Excel, I feel like most

00:10:36.420 --> 00:10:37.640
people are like, I've got some data.

00:10:37.640 --> 00:10:42.240
I need to maybe think about it a little bit more analytically than just a bunch of numbers.

00:10:42.240 --> 00:10:44.040
Let me throw it in Excel and see what I can do with it.

00:10:44.040 --> 00:10:44.320
Yeah.

00:10:44.320 --> 00:10:47.180
I think Excel is like a super open.

00:10:47.180 --> 00:10:52.260
If you think about like Excel as a playground or as a framework, you know, it's super open

00:10:52.260 --> 00:10:52.540
ended.

00:10:52.760 --> 00:10:56.200
You can do so much in there and there's not a lot of constraints, right?

00:10:56.200 --> 00:11:00.120
So the constraints that exist in an Excel file are the ones that you make for yourself.

00:11:00.120 --> 00:11:04.480
And then maybe one constraint is like, it used to be like, you know, I forgot what it was,

00:11:04.480 --> 00:11:08.260
but it's like, you know, 65,000 rows, you know, for a long time.

00:11:08.260 --> 00:11:13.400
And I think now there's no such limits anymore, but there's still like a limit of how much your

00:11:13.400 --> 00:11:17.440
laptop is going to be able to, you know, in terms of the size of a pivot table.

00:11:17.440 --> 00:11:22.460
And the past companies where I was at, there's no way you could bring the dimensionality

00:11:22.460 --> 00:11:24.680
and kind of the raw data that you need in Excel.

00:11:24.680 --> 00:11:27.440
So you need to kind of prepare an extract of the stuff you're going to play with.

00:11:27.440 --> 00:11:27.840
Yeah.

00:11:27.840 --> 00:11:28.940
First, you got to be in Excel.

00:11:28.940 --> 00:11:33.320
And then there's like things that, you know, BI historically has not been really good at

00:11:33.320 --> 00:11:37.420
is what-if analysis, creating different scenarios, doing forecasting.

00:11:37.420 --> 00:11:43.560
So I think like that is an area where spreadsheets dominate and will keep dominating, right?

00:11:43.600 --> 00:11:48.720
If you want to tie certain things to variables and change the numbers and see how other like,

00:11:48.720 --> 00:11:50.140
you know, charts and models are.

00:11:50.140 --> 00:11:53.820
So modeling kind of is a really good case, I think, for Excel.

00:11:53.820 --> 00:11:58.780
Then, you know, the downside is like, oh, how do you collaborate on these things?

00:11:58.780 --> 00:12:03.120
And like the versioning is kind of a mess where you end up with like another file, SharePoint.

00:12:03.120 --> 00:12:07.340
Even the files are binary, so you can't, you can't diff them or anything easily, right?

00:12:07.340 --> 00:12:07.780
Oh yeah.

00:12:07.780 --> 00:12:09.040
They're not in source control.

00:12:09.340 --> 00:12:13.760
And then you don't know, like there's no introspection as to like how things got there.

00:12:13.760 --> 00:12:18.900
There are a mix of like data that are from a source and then kind of made up stuff sometimes.

00:12:18.900 --> 00:12:19.920
Like, I'm going to tweak this.

00:12:19.920 --> 00:12:20.720
I'm going to change that.

00:12:20.720 --> 00:12:25.380
So you don't know what is the list of all the changes that were applied to the source data.

00:12:25.380 --> 00:12:25.780
Yeah.

00:12:25.780 --> 00:12:26.160
Yeah.

00:12:26.160 --> 00:12:26.720
So I don't know.

00:12:26.720 --> 00:12:29.820
I think it's a good tool, but it's definitely incomplete, right?

00:12:29.820 --> 00:12:32.420
It's part of what the impact will always be, you know?

00:12:32.420 --> 00:12:33.880
I don't bring it up as a recommendation.

00:12:33.880 --> 00:12:36.780
I bring it up as I feel like a lot of people are starting here.

00:12:36.940 --> 00:12:41.420
And so like, how can, how can we look around and see maybe what is a better option out there,

00:12:41.420 --> 00:12:41.700
you know?

00:12:41.700 --> 00:12:42.040
Yeah.

00:12:42.040 --> 00:12:46.740
And it's, I think if you've used Excel a lot in your organization or personally, I think

00:12:46.740 --> 00:12:51.500
people discover kind of hit their head on the limitations and the problems that come with

00:12:51.500 --> 00:12:52.980
such an open framework.

00:12:54.660 --> 00:12:57.440
This portion of Talk Python To Me is brought to you by Sentry.

00:12:57.440 --> 00:13:00.580
You know Sentry as a longtime sponsor of this podcast.

00:13:00.580 --> 00:13:04.240
They offer great error monitoring software that I've told you about many times.

00:13:04.240 --> 00:13:07.100
It's even software that we use on our own web apps.

00:13:07.100 --> 00:13:10.700
But this time I want to tell you about a fun conference they have coming up.

00:13:10.700 --> 00:13:13.940
Sentry is hosting DEX, Sort the Madness.

00:13:13.940 --> 00:13:19.500
The conference for every developer to join as we investigate the movement and trends for a

00:13:19.500 --> 00:13:21.640
better and more reliable developer experience.

00:13:22.180 --> 00:13:23.840
What is this madness, you ask?

00:13:23.840 --> 00:13:28.040
It's the never-ending need to deploy stable code quickly.

00:13:28.040 --> 00:13:33.520
Come to DEX to engage with developers who will share their epic fails and their glorious saves.

00:13:33.520 --> 00:13:37.660
Sentry can't fix the madness, but they can start sorting through it with you.

00:13:37.660 --> 00:13:44.940
Register today to join in San Francisco or attend virtually on September 28th at talkpython.fm

00:13:44.940 --> 00:13:45.960
slash DEX.

00:13:45.960 --> 00:13:48.740
That's talkpython.fm/DEX.

00:13:48.740 --> 00:13:50.320
The link is in your show notes.

00:13:50.320 --> 00:13:53.620
Thank you to Sentry for supporting Talk Python To Me.

00:13:55.520 --> 00:13:59.640
Out in the audience, Ollie says, my local data extraction people default to Excel and they

00:13:59.640 --> 00:14:02.600
seem limited by the number of sheets available in a workbook.

00:14:02.600 --> 00:14:03.080
Yeah.

00:14:03.080 --> 00:14:09.560
Well, I guess now that it's not the number of lines in a file, I guess the number of sheets.

00:14:09.560 --> 00:14:10.040
That's right.

00:14:10.040 --> 00:14:15.360
So, you know, sort of stepping up a level from this, I feel like maybe heading down a more

00:14:15.360 --> 00:14:16.200
structured way.

00:14:16.200 --> 00:14:20.140
Like, one of the problems with Excel is how do I talk to databases and APIs?

00:14:20.360 --> 00:14:23.680
Like, how do I bring in other more live data is really, really limited.

00:14:23.680 --> 00:14:26.080
I know there's like BI stuff, but not really.

00:14:26.080 --> 00:14:27.460
Sort of the next step up.

00:14:27.460 --> 00:14:28.040
What do you think?

00:14:28.040 --> 00:14:30.520
Is that, is that Jupyter or like, where's the next level here?

00:14:30.520 --> 00:14:30.940
I don't know.

00:14:30.940 --> 00:14:31.240
Of course.

00:14:31.240 --> 00:14:36.340
We're talking about consumption now in some ways, but I feel like in a lot of ways we

00:14:36.340 --> 00:14:38.760
should be talking about data engineering too.

00:14:38.760 --> 00:14:40.280
So where is your data, right?

00:14:40.280 --> 00:14:41.320
Is the first question.

00:14:41.320 --> 00:14:46.280
So first your data is not, or maybe some data lives in Excel, but that's not where your data

00:14:46.280 --> 00:14:49.580
lives nowadays in the SaaS application you use, right?

00:14:49.580 --> 00:14:55.220
So the modern, like just even any startup or company uses hundreds of SaaS applications.

00:14:55.220 --> 00:14:55.800
Yeah.

00:14:55.800 --> 00:15:02.100
CRMs, applicant tracking systems, you know, GitHub and just a million different data sources.

00:15:02.100 --> 00:15:08.060
And it feels like one of the first thing you need to do is to bring that data together,

00:15:08.060 --> 00:15:08.560
right?

00:15:08.560 --> 00:15:14.660
Like in a central place or into some sort of like inside, you know, either data marts

00:15:14.660 --> 00:15:15.360
or data warehouse.

00:15:15.360 --> 00:15:20.340
I think like an early construct that you need as an organization because data is most useful

00:15:20.340 --> 00:15:24.240
when it's put alongside the other data you have in your organization.

00:15:24.240 --> 00:15:29.100
It does make sense to hoard all this data and bring it all to a central place.

00:15:29.100 --> 00:15:33.680
If you want to do consumption, otherwise consumption is going to be kind of a stitching story, right?

00:15:33.680 --> 00:15:37.560
So let's say you're in Excel or you're in a, you're in the local database or whatever,

00:15:37.560 --> 00:15:38.780
whatever it might be.

00:15:38.780 --> 00:15:43.860
The first thing you have to do is bring the things that are related in one place.

00:15:43.860 --> 00:15:47.500
So you can do that visualization consumption analysis, right?

00:15:47.500 --> 00:15:52.880
How do you join on a thing that's partly in an API and partly in Airtable or something?

00:15:52.880 --> 00:15:53.420
Right.

00:15:53.420 --> 00:15:56.620
Even if you, let's say we take a notebook, so super open-ended, right?

00:15:56.620 --> 00:15:57.260
What is a notebook?

00:15:57.260 --> 00:16:01.460
It's just like, you know, it's a script with REPL and, you know, where you can run, you

00:16:01.460 --> 00:16:06.840
know, chunks of the script sequentially and you have a persistent kernel or interpreter

00:16:06.840 --> 00:16:10.320
kind of supporting what you're doing at any point in time.

00:16:10.320 --> 00:16:15.160
But the first thing, if you don't have a data warehouse or your data all in one place, you're

00:16:15.160 --> 00:16:17.860
going to try to do some data engineering is probably the first thing you're going to do

00:16:17.860 --> 00:16:21.160
within your notebook is to say, how do I get the data that I need?

00:16:21.160 --> 00:16:27.480
The source or sources that are interesting to me and the notebook, you know, will enable

00:16:27.480 --> 00:16:33.380
you for sure to do this, but then is it, you know, can other people build on top of the

00:16:33.380 --> 00:16:34.480
work that you did in a notebook?

00:16:34.480 --> 00:16:38.000
Probably not or not as easily as you'd want them to.

00:16:38.000 --> 00:16:42.700
So I think the data warehousing kind of approach of saying like, hey, let's bring data that we

00:16:42.700 --> 00:16:47.500
need in our organization to a central place and try to stitch it together there so it can

00:16:47.500 --> 00:16:49.680
then best be used for consumption.

00:16:49.680 --> 00:16:54.940
And analysis is still a very important step in the process.

00:16:55.180 --> 00:17:00.180
Sure. I totally agree. And, you know, Jupyter gets Jupyter and JupyterLab gets a lot of the

00:17:00.180 --> 00:17:06.320
mindshare, but there are many, many choices. I interviewed Sam Lau and he did a research project

00:17:06.320 --> 00:17:11.700
where they categorized over 60 different notebook environments where Jupyter was one of them.

00:17:11.700 --> 00:17:17.300
It's just, it's off the hook. So there's a lot of choices out there and so on, but let's focus on

00:17:17.300 --> 00:17:17.900
Superset.

00:17:17.900 --> 00:17:21.860
I'd love to talk about outputs. It's like, why do we need 60 different notebooks? I feel like I

00:17:21.860 --> 00:17:26.180
missed a step of like the evolution of notebooks. I'm very familiar with Jupyter,

00:17:26.180 --> 00:17:33.180
deployed JupyterHub at Airbnb a while ago, but then, you know, followed Hex a little bit.

00:17:33.180 --> 00:17:39.160
That's one of the players in the space. Also followed. So at Lyft, we kind of built our own little

00:17:39.160 --> 00:17:46.140
notebook service, right? So we had a Kubernetes cluster. We kind of say like, I want this Docker

00:17:46.140 --> 00:17:51.860
image base for my notebook. You'd pick like, I want the AI and ML package, or I want, you know,

00:17:51.860 --> 00:17:56.620
basically what's the base for your notebook. And then you could pick some hardware, like I need GPUs

00:17:56.620 --> 00:18:01.100
or I need a big machine or small machine. And then we'd spin off these environments for people, but

00:18:01.100 --> 00:18:05.340
try to understand like, what are the different, like, why is there 60 notebooks? And what's,

00:18:05.340 --> 00:18:09.600
what are the different flavors? Or like, how do they all differentiate from each other? Is this dubious

00:18:09.600 --> 00:18:14.080
question? It was crazy. I was kind of blown away by this. And if you look, they have a,

00:18:14.080 --> 00:18:19.120
it seems like it always differs on some axis. Like, well, we want more collaboration like Google Docs,

00:18:19.120 --> 00:18:24.940
or we want it to run in a different place, like Pyodide. We want that to run in the front end,

00:18:24.940 --> 00:18:30.540
rather than, you know, with like some sort of Python in the browser. And like, there's just all these crazy

00:18:30.540 --> 00:18:35.180
variations. So I think there's a lot, I just kind of only highlight that to point out, like,

00:18:35.460 --> 00:18:40.140
it's not just Jupyter. There's like a ton of these things where, where Jupyter is the main

00:18:40.140 --> 00:18:46.120
environment that kind of lives in a web browser where people go and explore data. And I feel like

00:18:46.120 --> 00:18:51.780
Superset, it's a pretty modern, interesting player in that space of many choices.

00:18:51.780 --> 00:18:57.680
Yeah. Happy to talk about Superset too, and trying to introduce it in the context of what we're talking

00:18:57.680 --> 00:19:02.100
about before. Yeah. But think about Superset, right? Tell us about Superset.

00:19:02.100 --> 00:19:10.520
So Superset is essentially very much like a data exploration, dashboarding, visualization tool that's very much

00:19:10.520 --> 00:19:18.600
like catering to organization, right? So we, Superset solves like challenges or the problem space of data

00:19:18.600 --> 00:19:25.380
consumption for entire teams. So we're not necessarily focused on people who know Python or people who are,

00:19:25.380 --> 00:19:30.260
you know, data scientists or data analysts or data engineers, like we very much cater to the entire team.

00:19:30.260 --> 00:19:36.580
And the idea there is to have a single place to explore data, visualize it, interact with it,

00:19:36.580 --> 00:19:43.280
share, create dashboard. And then we have a SQL IDE on top of that too. I think like on the GitHub page,

00:19:43.280 --> 00:19:47.580
I don't know here if we have good screenshots too, I think an image is worth a thousand words. And I know

00:19:47.580 --> 00:19:53.640
not everyone is like looking at what we're looking at, but here we have the drag and drop kind of

00:19:53.640 --> 00:19:58.680
explore. I think the screenshots a little bit dated there might be a little bit more recent on a

00:19:58.680 --> 00:20:03.480
GitHub on the GitHub page too, where you can see like, and we have this drag and drop interface,

00:20:03.480 --> 00:20:09.000
very similar to what people are familiar with in, in business intelligence, right? Like you, where you

00:20:09.000 --> 00:20:14.280
have access to your data set and you drag and drop your metrics and dimensions and pick your visualization

00:20:14.280 --> 00:20:20.040
type, get to the exact chart that you want. And you can assemble these charts into interactive dashboards

00:20:20.040 --> 00:20:26.280
with like, you know, dynamic filtering on the dashboard and expose that to, to, to business users,

00:20:26.280 --> 00:20:30.920
right? So they can explore on their own, they can create their own dashboard. They can answer their

00:20:30.920 --> 00:20:32.920
own questions. Yeah. This sort of thing.

00:20:32.920 --> 00:20:38.840
It lives in a really, really interesting space. And that's why I brought up Excel as well as because

00:20:38.840 --> 00:20:44.520
Excel is not meant for programmers, but it's meant for people who are trying to do serious stuff with

00:20:44.520 --> 00:20:49.000
it. They kind of, well, maybe they'll write equals and they'll find a formula they can put in there or,

00:20:49.000 --> 00:20:53.320
you know, they'll do like a VLOOKUP or they'll, they're kind of trying to go more than just like,

00:20:53.320 --> 00:20:59.720
I need a grid of stuff. And while Jupyter and those things are awesome, Superset feels like it caters a

00:20:59.720 --> 00:21:05.640
little bit more to like a power user type of person that has Python extension capabilities,

00:21:05.640 --> 00:21:09.480
but you don't have to start as a Python developer to get into it. Is that right?

00:21:09.480 --> 00:21:14.920
No, actually not. Right. Like, so the premise is like, you don't need to have any Python skills.

00:21:14.920 --> 00:21:19.640
The skills that may help if you want to go deeper inside Superset is, you know, knowing some SQL,

00:21:19.640 --> 00:21:24.360
knowing SQL, it's not a requirement. So think about like, if you think about Tableau,

00:21:24.360 --> 00:21:29.560
people familiar with Tableau or Looker, right? That's really the space that we're in. So it's,

00:21:29.560 --> 00:21:34.280
platformy in a sense that, okay, you access your database connection,

00:21:34.280 --> 00:21:39.720
you interact with datasets, but then, you know, think about the experience of a consumer,

00:21:39.720 --> 00:21:45.000
someone just consuming a dashboard. You just, you open a dashboard, it's a collection of charts,

00:21:45.000 --> 00:21:51.720
maybe it's titled like financial forecast for, you know, 2023. And you really need very little

00:21:51.720 --> 00:21:57.240
technical skills to use it; you need business knowledge mainly to consume a dashboard. These

00:21:57.240 --> 00:22:01.960
dashboards are interactive. So that means you'll be able to apply a filter on a specific quarter,

00:22:01.960 --> 00:22:07.320
a specific, like, customer type or market, right? And then interact with the dashboard in that way.

00:22:07.320 --> 00:22:13.000
But primarily, the dashboard interface caters to the business user or anyone that

00:22:13.000 --> 00:22:13.960
is trying to understand.

00:22:13.960 --> 00:22:18.680
I see almost like a more of a BI type of a user person rather than...

00:22:18.680 --> 00:22:26.360
It is a BI tool. Superset is a BI tool, to be clear. It's a BI tool that maybe is modern in

00:22:26.360 --> 00:22:32.120
many ways. So if you want to go deeper, say in the Explorer, and I don't

00:22:32.120 --> 00:22:37.160
know if you can, if you can click on the upper left on the Explorer. So here for context, we're looking

00:22:37.160 --> 00:22:41.880
at more of the drag and drop place in Superset where you pick metrics and dimensions and

00:22:41.880 --> 00:22:47.560
visualization type you want to look at. So it's your typical kind of Tableau-like interface. And here you

00:22:47.560 --> 00:22:53.320
can essentially just drag and drop, but if you do know SQL, you're able to create your own metrics and

00:22:53.320 --> 00:22:55.880
express them as SQL expressions, for instance.

00:22:55.880 --> 00:22:56.280
Right.

00:22:56.280 --> 00:22:57.560
Calculated metrics.

00:22:57.560 --> 00:23:02.520
Exactly. You can have computed columns and aggregation and stuff like that. Right.

00:23:02.520 --> 00:23:08.840
Exactly. Yeah. So you'll define metrics as SQL aggregate expressions. So sum of this

00:23:08.840 --> 00:23:15.160
divided by the count distinct of that, and it has to be a valid SQL expression. But yeah, so for people

00:23:15.160 --> 00:23:19.960
who are a little bit more technical, maybe understand the data better and a little bit of knowledge of

00:23:19.960 --> 00:23:25.720
SQL, they can, they can, they don't have to, but they can use SQL as part of a

00:23:26.040 --> 00:23:31.640
that exploration experience. So you, for instance, if you pick a filter here, you'll be able to pick

00:23:31.640 --> 00:23:37.880
a column, an operator, like customer ID in, and then go pick the customer IDs in a

00:23:37.880 --> 00:23:43.960
GUI type setting. But you can also go to the little SQL editor in the filter popover and write a more

00:23:43.960 --> 00:23:48.840
complex SQL expression if you want to. So we wanted to not necessarily bury SQL as we feel like more and

00:23:48.840 --> 00:23:54.360
more people are learning SQL. It's becoming like the lingua franca of data. We feel like there's going to be,

00:23:54.360 --> 00:23:58.520
you know, a certain percentage of the workforce in the next decade that's going to become

00:23:58.520 --> 00:24:03.720
more data literate. And that's in part by learning SQL and understanding, you know, understanding

00:24:03.720 --> 00:24:11.240
data sets, data structures, and what the data sets in their particular organizations are and are made of.
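
NOTE
Editor's sketch of the idea described above: a metric is an aggregate SQL expression, a dimension is a column to group by, and a filter can be a free-form SQL predicate. It uses Python's built-in sqlite3 module as a stand-in database. The table, the column names, and the way the query string is assembled are assumptions made for illustration; they are not Superset's actual query builder.
```python
import sqlite3
# Hypothetical table and column names, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 100.0), (1, "EU", 50.0), (2, "EU", 30.0), (3, "US", 80.0), (4, "US", 10.0)],
)
# A metric is an aggregate SQL expression: here, revenue per distinct customer.
metric = "SUM(revenue) / COUNT(DISTINCT customer_id)"
# A filter can be a column/operator/value choice or a more complex SQL predicate.
custom_filter = "customer_id IN (1, 2, 3) AND revenue >= 20"
# The dimension (region) becomes the GROUP BY column.
query = (
    f"SELECT region, {metric} AS revenue_per_customer "
    f"FROM orders WHERE {custom_filter} GROUP BY region ORDER BY region"
)
rows = conn.execute(query).fetchall()
print(rows)
```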

00:24:11.240 --> 00:24:17.560
Right. And by using SQL, it means you can connect to different data sources and you can connect to live

00:24:17.560 --> 00:24:18.520
data, right?

00:24:18.520 --> 00:24:19.160
That's right.

00:24:19.160 --> 00:24:22.120
You don't have to do some kind of export or whatever. You just connect to Postgres or you

00:24:22.120 --> 00:24:24.040
connect to whatever and then go from there.

00:24:24.040 --> 00:24:29.640
Exactly. Yeah. So the, you know, the way things work in Superset. So you go and create your database

00:24:29.640 --> 00:24:36.920
connection or connections to whatever, you know, SQL speaking databases you use as a data warehouse,

00:24:36.920 --> 00:24:41.400
as a data store. Things that are really popular right now are, you know, the big cloud data warehouses,

00:24:41.400 --> 00:24:47.400
Snowflake, BigQuery, but there's still a lot of Postgres and MySQL, even for analytical use cases,

00:24:47.400 --> 00:24:53.320
right? And people, so you connect to that database and then you go and you have different ways to get

00:24:53.320 --> 00:24:59.320
started. One is to go and start exploring the tables that exist already, tables or views,

00:24:59.320 --> 00:25:04.600
or you have this SQL IDE that you're kind of pointing to now, so it's possible for you to go and,

00:25:04.600 --> 00:25:10.040
you know, step down to that level that's more interacting at the SQL level. And here you can

00:25:10.040 --> 00:25:15.080
also create data sets, right? And create what we call virtual data sets that are essentially views for

00:25:15.080 --> 00:25:21.560
people who are familiar with the database construct of a view. And that allows people to go and explore

00:25:21.560 --> 00:25:27.160
that data set or virtual data set, assemble dashboards, create visualizations, collaborate with others,

00:25:27.160 --> 00:25:31.800
right? Share links on Slack and, you know, annotate, add comments, that kind of thing.
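
NOTE
Editor's sketch of the "virtual data set is essentially a view" idea: a saved query that can be explored as if it were a table. A plain database view in Python's built-in sqlite3 module is used as a stand-in. The tables and names are hypothetical, and this does not claim to show how Superset stores virtual data sets internally.
```python
import sqlite3
conn = sqlite3.connect(":memory:")
# Hypothetical physical tables, invented for this illustration.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, segment TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, revenue REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "enterprise"), (2, "startup")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 500.0), (1, 250.0), (2, 40.0)])
# The view is a saved query that behaves like a table.
conn.execute(
    "CREATE VIEW orders_by_segment AS "
    "SELECT c.segment AS segment, o.revenue AS revenue "
    "FROM orders o JOIN customers c ON c.id = o.customer_id"
)
# A chart can then query the view without knowing about the join behind it.
totals = conn.execute(
    "SELECT segment, SUM(revenue) FROM orders_by_segment GROUP BY segment ORDER BY segment"
).fetchall()
print(totals)
```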

00:25:31.800 --> 00:25:35.800
Yeah. I want to dive into the data sources more, but I want to make sure that we highlight this for

00:25:35.800 --> 00:25:41.080
people listening who don't know about Superset. Two things, and you've hinted pretty strongly at one

00:25:41.080 --> 00:25:46.360
already. First of all, when I go to Excel, I don't see a fork me on GitHub. I mean, I'm looking,

00:25:46.360 --> 00:25:54.680
I don't see anywhere on this page that it says fork me on GitHub. Over on apache/superset on GitHub.

00:25:54.680 --> 00:26:01.400
Yeah, clearly right there you can. So this is one, it's open source and two, very popular. It's

00:26:01.400 --> 00:26:08.120
almost got 50,000 stars and 10,000 forks. Like that's Django, Flask level of popularity for people,

00:26:08.120 --> 00:26:13.240
keeping score, I guess. Yeah, that's right. Yeah. And depending on, you know, and stars are just like,

00:26:13.240 --> 00:26:19.320
some sort of proxy for, for hype or interest. Yeah. Right. And forks are, like, a good proxy for how

00:26:19.320 --> 00:26:23.880
many people have kind of, you know, wanted to play with the code, which is also a proxy for a different

00:26:23.880 --> 00:26:28.520
kind of hype and interest. But yeah, it's up there, you know, probably in the top 50 to 100 open

00:26:28.520 --> 00:26:33.720
source projects of all time in terms of like value delivered and just popularity.

00:26:33.720 --> 00:26:39.320
Yeah. Which is like way beyond what I expected, you know, in 2015 when I started abroad. Same with

00:26:39.320 --> 00:26:45.000
Apache Airflow. I also started Apache Airflow. That's also like very, very, very popular and used

00:26:45.000 --> 00:26:51.000
in like tens of thousands of organizations. I think it's similar. It speaks to like the scale and the,

00:26:51.000 --> 00:26:57.320
and just like how like the problem space is super validated. Like everyone needs to visualize data,

00:26:57.320 --> 00:27:02.840
explore data, create dashboard, you know, write SQL, see results, you know, visualize results.

00:27:02.840 --> 00:27:02.920
Yeah.

00:27:02.920 --> 00:27:08.520
So very popular, definitely the leading open source project in this space, you know, of,

00:27:08.520 --> 00:27:13.960
call it business intelligence data consumption. And it's a very mature project, right? So it's

00:27:13.960 --> 00:27:20.600
used by thousands of people at places like Airbnb, Microsoft, Tesla, people have forked the project or

00:27:20.600 --> 00:27:26.600
use it super heavily internally. That in the wild section that you're pointing to, which is kind of

00:27:26.600 --> 00:27:31.080
trying to list out the people who use the project is very limited kind of version, the tip of the

00:27:31.080 --> 00:27:35.240
iceberg type thing of the people who self-reported using the product.

00:27:35.240 --> 00:27:40.120
Yeah. So you have a link in the GitHub repo called in the wild, and it just lists out under these

00:27:40.120 --> 00:27:45.960
different verticals, you'll find these companies using them, which is, you know, on one hand,

00:27:45.960 --> 00:27:50.280
it doesn't matter if these other companies are using it or not. But then if you're trying to sell

00:27:50.280 --> 00:27:54.920
it to your organization or just trying to decide if you can trust it, like, well, if you know,

00:27:54.920 --> 00:27:59.400
you're in education and it works for Brilliant.org and it works for Udemy and the Wikimedia

00:27:59.400 --> 00:28:03.560
Foundation, maybe it'll work for you. You know, like that's a, it's a bit of validation, right?

00:28:03.560 --> 00:28:07.720
Yeah. And then, you know, especially looking at like, those are people that, you know, open

00:28:07.720 --> 00:28:13.000
a pull request to add their name to this like hidden file on the repo. All right. It shows how like

00:28:13.000 --> 00:28:18.200
two percent of the iceberg it is. But I think one thing I've been telling people on, in the context of

00:28:18.200 --> 00:28:23.240
this podcast is it makes sense. It's like, if you want to contribute to open source, there's a lot of ways you can

00:28:23.240 --> 00:28:28.280
contribute. And the obvious one is to, you know, use the software, open a pull request. But the less

00:28:28.280 --> 00:28:34.600
obvious one is to let the world know, like the very, the most basic and the very minimum, maybe when you

00:28:34.600 --> 00:28:41.080
use a, when your organization is getting significant value from an open source project, just to be public

00:28:41.080 --> 00:28:46.520
about it. Let the world know, you know, if you work at Uber and you get tons of value from, I don't know,

00:28:46.520 --> 00:28:52.440
Gatsby or like when, like whatever, let the world know that you do. And that's a vote of confidence.

00:28:52.440 --> 00:28:57.640
And it speaks to the scale of the community and, kind of, if it works for others, it probably,

00:28:57.640 --> 00:29:00.360
you know, the chances it's going to work for you are much greater.

00:29:00.360 --> 00:29:04.520
Yeah. Another thing that's interesting about the GitHub repo source code, really, I guess,

00:29:04.520 --> 00:29:08.440
is what I'm thinking of. Two things here. One is it's, it's super active, right? If you go in here

00:29:08.440 --> 00:29:12.840
and you look around, like sometimes you'll see, you know, last change two years ago or whatever,

00:29:12.840 --> 00:29:18.440
right. But oh yeah. Last change seven hours ago, a couple of days ago, two days ago, right?

00:29:18.440 --> 00:29:24.120
There's a lot of, a lot of activity here, right? Yeah. It's, it's super intense in terms of like

00:29:24.120 --> 00:29:28.120
how many people work on a project. There's like a contributors tab. You might be able to click on to

00:29:28.120 --> 00:29:34.600
on the right there, click contributors. So 832 people have contributed to date. And that's just looking at

00:29:34.600 --> 00:29:39.640
code contributions. Here it's possible to see the history of who's contributed. Something that's interesting

00:29:39.640 --> 00:29:46.280
is like, we distributed on PyPI, but, and the project was largely Python code. It looks like we

00:29:46.280 --> 00:29:50.600
have too much data and the GitHub UI is struggling to render. We're going to break GitHub. Sorry, GitHub.

00:29:50.600 --> 00:29:55.960
Yeah. We're breaking out GitHub right now because we have too many contributions. Here you can kind of

00:29:55.960 --> 00:30:02.120
see the scale of contribution. You can also see how I've been settling into my CEO role and less in the

00:30:02.120 --> 00:30:10.760
code. A bunch of people have contributed over time. But yeah, I was going to say we decided to

00:30:10.760 --> 00:30:17.320
distribute on PyPI originally and, you know, was largely a Python project from the get go,

00:30:17.320 --> 00:30:23.560
but more and more, if you look at the code distribution, a lot of the code is in TypeScript,

00:30:23.560 --> 00:30:29.160
JavaScript now because the nature of the project is so, such a front end project. And something that's

00:30:29.160 --> 00:30:36.520
interesting about open source is we have seen less like application GUI type app, like up the stack

00:30:36.520 --> 00:30:43.000
type projects really succeeding at scale. And superset is definitely one of those, like very much a

00:30:43.000 --> 00:30:49.880
front end application type product that's open source and then succeeding at a massive scale too,

00:30:49.880 --> 00:30:55.160
where typically in open source, we see libraries, we see backends and frameworks,

00:30:55.160 --> 00:31:01.080
right. Like being really massively successful. But you know, that was part of the reason that I

00:31:01.080 --> 00:31:06.520
really wanted to, I wanted to prove that Superset, and that, you know,

00:31:06.520 --> 00:31:11.720
open source can succeed up the stack too. And we've been working very, very actively on that in this,

00:31:11.720 --> 00:31:16.920
in this community. Yeah. It's a super good point because it's clearly open source has won on the

00:31:16.920 --> 00:31:23.640
frameworks in the libraries level, but there's fewer examples of it creating beautiful user interface

00:31:23.640 --> 00:31:29.160
experiences and types of applications. And yeah. And I have a pretty good theory on that too. Like,

00:31:29.160 --> 00:31:31.480
why is it the case that we've seen this? Why do you think?

00:31:31.480 --> 00:31:36.360
So I think like open source has been very much playground for engineers, right? Like the,

00:31:36.360 --> 00:31:43.000
the tool set and, you know, GitHub and Git and source control and the pull requests and issues,

00:31:43.000 --> 00:31:48.120
like all of these things have been historically the way that engineers build software. And it's

00:31:48.120 --> 00:31:53.480
been a little bit hostile to PMs and designers, not hostile as in, oh, you know, actively hostile,

00:31:53.480 --> 00:31:58.600
but just, it was not, not welcoming. Yeah. Or yeah. Not it's just built by engineers for engineers,

00:31:58.600 --> 00:32:02.360
like GitHub and Git was built by engineers for engineers. And we never really thought of like,

00:32:02.360 --> 00:32:09.640
how do we include product designers and product managers to the workflows there? And then the interest,

00:32:09.640 --> 00:32:15.400
I think a lot of engineers have like this great image of open source and see it as an outlet for

00:32:15.400 --> 00:32:20.440
their careers. And then they love the idea of working in the open that does not exist that drive of

00:32:20.440 --> 00:32:25.240
working in the open, you know, with the designers. So we've been thinking about how do we

00:32:25.240 --> 00:32:31.960
enlarge our community and open up our community to very much welcome PMs and product designers as part of

00:32:31.960 --> 00:32:38.600
this community. And it's been, I think we've made some headway. We should blog about how we did this

00:32:38.600 --> 00:32:44.920
in the Superset project, but we opened up and we created some processes where, where we also do design

00:32:44.920 --> 00:32:51.240
review. We do, you know, product reviews that our PMs get together with other people in the community to,

00:32:51.240 --> 00:32:54.680
to kind of design beyond, you know, technical solutions.

00:32:54.680 --> 00:32:59.720
Yeah. There's a ton of visualizations here for people who haven't seen it yet. Just visit the

00:32:59.720 --> 00:33:04.360
website and you'll see right away. It's primarily a visual tool, a tool for visualizing

00:33:04.360 --> 00:33:04.840
data, right?

00:33:04.840 --> 00:33:09.160
Yeah. So it is like a GUI tool in a lot of ways. But, but I think what's interesting too,

00:33:09.160 --> 00:33:13.160
it's a GUI tool first, right? So it's a BI tool in the sense that, you know,

00:33:13.160 --> 00:33:18.680
a lot of what you do is point and click and drag and drop and, you know, hit a save button. But because

00:33:18.680 --> 00:33:26.600
we're open source, we also have, we're pushing the APIs and SDKs very strongly too. So it's probably

00:33:26.600 --> 00:33:30.520
the most platformy BI tool around because of our open source.

00:33:30.520 --> 00:33:31.080
Oh yeah. That's cool.

00:33:31.080 --> 00:33:35.000
Because we started from the ground up. So say the visualization is a plugin system,

00:33:35.000 --> 00:33:41.160
so you can create your own visualizations and distribute them. The backend in Python is like,

00:33:41.160 --> 00:33:45.320
you know, the coverage of the API is like a hundred percent, right? It's like all over. So everything you

00:33:45.320 --> 00:33:47.240
can do in the GUI, you can do as code too.

00:33:47.240 --> 00:33:51.640
Okay. Yeah. Right now the audience is asking, does it expose an API to your data?

00:33:51.640 --> 00:33:52.440
You know, which is-

00:33:52.440 --> 00:33:56.600
Yes. And it should be in the docs, right? So if you go to, in the docs that,

00:33:56.600 --> 00:34:02.440
here somewhere in there, there should be, maybe it's API at the bottom there. I don't know how

00:34:02.440 --> 00:34:09.160
well documented it is here. It should be, it looks like it's not rendering right on like 480 by 320 pixel.

00:34:09.160 --> 00:34:09.720
Yeah.

00:34:09.720 --> 00:34:11.640
Here we go. How's that?

00:34:11.640 --> 00:34:15.560
Oh, there you go. So command minus to scale this, but yeah.

00:34:15.560 --> 00:34:15.960
Exactly.

00:34:15.960 --> 00:34:21.160
So very good API coverage and well-managed, you know, API behind the scenes.
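
NOTE
Editor's sketch of what "everything you can do in the GUI, you can do as code" might look like over HTTP. It only builds the requests with Python's standard library and does not send them. The base URL, the credentials, the token placeholder, and the helper names are hypothetical. The endpoint paths and login payload fields are assumptions based on Superset's published REST API; check them against the Swagger page of your own instance before relying on them.
```python
import json
import urllib.request
# Assumed base URL of a local Superset instance; adjust for a real deployment.
BASE_URL = "http://localhost:8088"
def build_login_request(username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a login request that would return an access token."""
    payload = {"username": username, "password": password, "provider": "db", "refresh": True}
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/security/login",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
def build_list_request(resource: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET for a resource such as 'dashboard'."""
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/{resource}/",
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )
login = build_login_request("admin", "admin")
listing = build_list_request("dashboard", "PLACEHOLDER_TOKEN")
print(login.get_method(), login.full_url)
print(listing.get_method(), listing.full_url)
```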

00:34:21.160 --> 00:34:26.040
Yeah. It looks like you even expose, directly, some of the OpenAPI/Swagger type of documentation,

00:34:26.040 --> 00:34:32.280
which you could maybe even auto-generate some stuff. Does it have like a library of a Python package that

00:34:32.280 --> 00:34:35.560
talks to the API, anything along those lines? Or is it just HTTP?

00:34:35.560 --> 00:34:39.480
I think it's OpenAPI and then Swagger.

00:34:39.480 --> 00:34:39.640
Yeah.

00:34:39.640 --> 00:34:44.280
Right. I think I set up the first version of that a long time ago, but yes, it's a self-documenting

00:34:44.280 --> 00:34:48.760
thing. So if you put the right decorators and the right docstrings, it self-documents.

00:34:48.760 --> 00:34:54.360
I think we do Marshmallow too, and other things to do like schema definition of what can come in and

00:34:54.360 --> 00:34:58.840
out. And that dictates, I think that's self-documenting too, in terms of like the input and expected

00:34:58.840 --> 00:35:00.440
output schemas too.

00:35:00.440 --> 00:35:00.920
Mm-hmm.

00:35:00.920 --> 00:35:01.640
Sounds very neat.

00:35:01.640 --> 00:35:01.720
Yeah.

00:35:01.720 --> 00:35:08.360
Or it could be, like, Python 3 type annotations too. I think it gets picked up properly,

00:35:08.360 --> 00:35:12.600
which is great. Beyond that, there's more like there's JavaScript stuff. There's a plugin. I think

00:35:12.600 --> 00:35:19.320
if you were to Google Superset plugin examples, you'll find all sorts of resources, maybe out of the-

00:35:19.320 --> 00:35:19.880
There you go.

00:35:19.880 --> 00:35:22.600
Oh, there's even a whole collection of them. Yeah. Look at that.

00:35:22.600 --> 00:35:24.120
Yeah. So it's managed in a different repo.

00:35:24.120 --> 00:35:28.840
I didn't Google, I Kagi'd it. I don't know what the word for googling with Kagi is, but there you go.

00:35:28.840 --> 00:35:34.200
Got it. Yeah. And then we have a good blog post on the Preset blog. If you go to preset.io/blog,

00:35:34.200 --> 00:35:39.320
we have like how to get started and write your first Superset plugin. That's a much more like

00:35:39.320 --> 00:35:44.840
JavaScript. That's a hundred percent, you know, TypeScript, JavaScript, front end code to build plugins.

00:35:44.840 --> 00:35:46.840
It has to be, right? Yeah.

00:35:46.840 --> 00:35:51.800
You don't want to be in the backend trying to figure out how to, you know, lay things out or use

00:35:51.800 --> 00:35:58.120
the Python library to do interactive visualization. It just doesn't work super well. So the plugin

00:35:58.120 --> 00:36:00.840
framework is all, it's all front end code.

00:36:00.840 --> 00:36:01.560
Yeah, makes sense.

00:36:01.560 --> 00:36:07.560
Beyond that, there's more API, then there's component libraries as part of Superset. And there's other

00:36:07.560 --> 00:36:13.640
SDKs and component libraries.

00:36:13.640 --> 00:36:18.840
Yeah. So the first thing I wanted to point out about the source code and the GitHub repo is just

00:36:18.840 --> 00:36:26.360
the popularity and all the contributors and whatnot there. The other is that this, while not necessarily

00:36:26.360 --> 00:36:33.560
made for Python people the way that Jupyter would be, is made for BI users, but it is open source and in Python,

00:36:33.560 --> 00:36:39.320
built on Flask and tools like that. Right. And you talk about the extensions on the backend and pieces

00:36:39.320 --> 00:36:44.840
along there. So maybe just talk about for people that want to dig in from a Python side, what can they find?

00:36:44.840 --> 00:36:49.240
Yeah, we could try to open the requirements folder. Because at this point, it's not even a requirements.txt

00:36:49.240 --> 00:36:50.840
file here.

00:36:50.840 --> 00:36:54.360
You have a whole project for setting it up. Okay.

00:36:54.360 --> 00:36:56.040
Our consider requirements.

00:36:56.040 --> 00:36:56.600
Yeah.

00:36:56.600 --> 00:37:00.280
Oh, you guys using pip-tools here? Nice.

00:37:00.280 --> 00:37:03.800
I believe it's pip-tools and pip-compile, you know.

00:37:03.800 --> 00:37:07.320
Yeah. I love working that way. That's my way these days. It's great.

00:37:07.320 --> 00:37:11.800
Yeah. Because we need to pin the versions. And so we have to, for people not familiar with it,

00:37:11.800 --> 00:37:16.760
you define a .in file that's like your version ranges, and then you can kind of pip-compile your

00:37:17.400 --> 00:37:23.000
version. And then that turns into like kind of frozen, like libraries, like specific numbers.

00:37:23.000 --> 00:37:28.440
So it's, you know, you can have like everything pinned out. We use so much stuff here and we use

00:37:28.440 --> 00:37:33.160
stuff that uses a lot of stuff. So if you import, you know, just Flask, like Flask itself is likely

00:37:33.160 --> 00:37:39.320
to import a bunch of things. So once you kind of recurse through that dependency tree and

00:37:39.320 --> 00:37:44.440
expand it, it's a massive dependency tree on the Python side. It's also a massive dependency

00:37:44.440 --> 00:37:50.360
tree on the JavaScript side. Oh yeah. Big application made out of, you know, hundreds

00:37:50.360 --> 00:37:57.000
of open source packages because we kind of need it all to build this full, like this, this application.

00:37:57.000 --> 00:38:01.080
So dependency management, it's a little bit of a struggle when you manage such a big piece

00:38:01.080 --> 00:38:05.240
of software that's connected to everything. Yeah. There's no joke. There's a lot of dependencies

00:38:05.240 --> 00:38:09.400
here. So, but there's ways you can run it without worrying too much about that, right?

00:38:09.400 --> 00:38:13.640
Yeah. I mean, you can definitely like just run the Docker container. You can pip install superset.

00:38:13.640 --> 00:38:19.160
There's, you know, somewhat straightforward way to set it up locally and get things going.

00:38:19.160 --> 00:38:24.520
Yeah. Yeah. It's, it's kind of interesting how like building application nowadays, if you think

00:38:24.520 --> 00:38:30.520
about the dependency tree that go behind any kind of solution or application, that's not just a library.

00:38:30.520 --> 00:38:33.960
Like library should, should have like very minimal requirements.

00:38:33.960 --> 00:38:38.760
Kind of dependency trees, right? This should be self-contained and kind of focused, I think. But

00:38:38.760 --> 00:38:44.040
here, I think for, to build such a large scale application, we just need to have a lot of

00:38:44.040 --> 00:38:48.440
the dependencies. And then these dependencies have often a fair amount of dependency. I'm

00:38:48.440 --> 00:38:52.120
surprised to see like, now we're looking at click for people, not necessarily looking,

00:38:52.120 --> 00:38:56.200
but just Click itself probably has a lot of, like, its own sub-packages now too.

00:38:56.200 --> 00:39:01.160
Exactly. And there's a lot of things that it wouldn't click to be into your, in your dependencies

00:39:01.160 --> 00:39:06.680
here. I guess there's a, we'll talk about running in just a minute. And there's a lot of architectural

00:39:06.680 --> 00:39:13.240
layers at play here. So you've got Superset, but you've also got Celery, you've got Redis,

00:39:13.240 --> 00:39:16.760
you've got some database layers. There's a lot of technologies that people would know working

00:39:16.760 --> 00:39:20.920
as a group that luckily Docker just takes care of for us. More like Docker Compose.

00:39:20.920 --> 00:39:27.960
Yeah. Docker, Docker Compose and, you know, Helm charts. I think we have like a Helm chart too.

00:39:27.960 --> 00:39:32.680
It was always important for me to keep it such that you can kind of just pip install superset

00:39:32.680 --> 00:39:38.840
and run a few commands and get it running locally. So you don't need to have, you know, Redis out of

00:39:38.840 --> 00:39:44.200
the box and Celery out of the box, similar to Airflow in that way, where I wanted to have like a very

00:39:44.200 --> 00:39:52.200
self-contained thing at first. But then if you want to run any modern web app, you know, that does serious

00:39:52.200 --> 00:39:57.880
kind of work and solve some real problem. It's likely that you need, okay, you need to have web servers and

00:39:57.880 --> 00:40:02.200
application server, but you need to have the whole front end stack, right? Like something like webpack

00:40:02.200 --> 00:40:07.000
and you probably have like front end infrastructure, just on the, like how you build your front end,

00:40:07.000 --> 00:40:12.040
it gets pretty complicated quickly. Then you probably need async workers. So also then you need something like

00:40:12.040 --> 00:40:18.120
Celery and something like Redis as a message queue to talk to the async workers. Then you probably

00:40:18.120 --> 00:40:23.320
want to start caching some things. So you need a caching back end for, for certain things. And then

00:40:23.320 --> 00:40:29.320
in open source, you probably need to support different databases. So some people

00:40:29.320 --> 00:40:36.040
might want to use MySQL as a backend or Postgres or some, some more other things. So then you need to

00:40:36.040 --> 00:40:41.960
like optionally support these things through abstraction layer. So it gets complicated really quickly.

00:40:41.960 --> 00:40:46.280
Yeah. It's really cool though, that you can just pip install it. And there's a more lightweight

00:40:46.280 --> 00:40:50.680
version without going through all the details. Let's talk about getting, getting going, get it

00:40:50.680 --> 00:40:55.960
running, exploring it a bit and hosting it. But before we do, I said like 15 minutes ago, two quick

00:40:55.960 --> 00:41:00.280
comments before we talk about databases, let's just talk about the database thing real quick here.

00:41:00.280 --> 00:41:00.520
Sure.

00:41:00.520 --> 00:41:06.440
Over here at the bottom, you've, you know, obviously where your data comes from, we opened this,

00:41:06.440 --> 00:41:11.000
you know, I pointed out that Excel is bad at getting data from different data sources. You know,

00:41:11.000 --> 00:41:16.920
people have operational data, they have data warehouses, they have data lakes, whatever you call

00:41:16.920 --> 00:41:17.400
them. Yeah.

00:41:17.400 --> 00:41:23.000
Things like this, right? So there's a lot of different places people are putting data. Maybe just touch a bit on

00:41:23.000 --> 00:41:28.680
on the database integration here. Yeah. And I think like in the context of this pod, this Python podcast,

00:41:28.680 --> 00:41:35.720
too. So for us, we use SQLAlchemy very heavily. So SQLAlchemy is a SQL toolkit first, and then an

00:41:35.720 --> 00:41:42.920
ORM built on top of it. And probably much more than that. But the way that we support first,

00:41:42.920 --> 00:41:49.480
I would say the metadata database for Superset, right? So in Superset, when you save dashboards,

00:41:49.480 --> 00:41:54.120
save visualizations, save queries, that goes to the metadata database. And we tend to recommend

00:41:54.120 --> 00:42:01.240
Postgres and MySQL as the backend for the app, just to keep the state of the app somewhere in a proper

00:42:01.240 --> 00:42:02.440
relational database. Sure.

00:42:02.440 --> 00:42:07.640
That's one. And then we connect to all these databases to do analytics on them, right? And that's

00:42:07.640 --> 00:42:12.760
what we're looking at here, the supported databases in the sense of like, what can we build charts off of?

00:42:12.760 --> 00:42:19.560
And what can we enable data exploration around? And then this is powered by SQLAlchemy. So that

00:42:19.560 --> 00:42:26.760
means that anything that has a DBAPI driver and a SQLAlchemy dialect, and then maybe that's an

00:42:26.760 --> 00:42:31.160
opportunity to talk a little bit more about the database abstraction in the Python world since we

00:42:31.160 --> 00:42:38.600
have a Python-centric audience. So DBAPI is a spec; it's one of the PEPs out there. I forgot the number of

00:42:38.600 --> 00:42:44.600
the DBAPI PEP, but it's like, you know, just a common interface for all the databases in Python.

00:42:44.600 --> 00:42:51.880
So that's called DBAPI. And then SQLAlchemy, the SQL toolkit knows how to speak certain dialects and

00:42:51.880 --> 00:42:58.760
builds an ORM on top of things. And this is PEP 249. So it came a little bit later in the story of

00:42:58.760 --> 00:43:02.680
Python. I don't know what's the number. What's the latest PEP number? They're pretty high these

00:43:02.680 --> 00:43:09.800
days, although they seem to be organized by concept. Let's see. We've got some in the 8000s here.

00:43:09.800 --> 00:43:13.640
Oh, I guess there's like some encoding in the number.

00:43:13.640 --> 00:43:18.760
There's some kind of grouping. Yeah, I'm not sure exactly what it is. But it's 3,000 for a specific

00:43:18.760 --> 00:43:19.640
thing in the 8,000.

00:43:19.640 --> 00:43:22.520
I think so. But yeah, don't hold me to it. Yeah, I think so.
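
NOTE
A minimal sketch of what the DBAPI (PEP 249) interface looks like in practice, using the
standard library's sqlite3 driver. This is not how Superset itself is wired up; the table
and rows are made-up illustration data. Any driver that follows the spec exposes the same
connect / cursor / execute / fetchall shape, which is what SQLAlchemy builds on.
```python
import sqlite3  # stdlib driver implementing the DBAPI (PEP 249) interface
conn = sqlite3.connect(":memory:")  # connect() is the PEP 249 entry point
cur = conn.cursor()
# Made-up illustration data, not anything shipped with Superset.
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",  # "?" is sqlite3's paramstyle; other drivers may differ
    [("east", 10.0), ("west", 25.0), ("east", 5.0)],
)
conn.commit()
cur.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC"
)
rows = cur.fetchall()  # list of tuples, one per result row
print(rows)
conn.close()
```
Swapping in another database would, under this assumption, mostly mean swapping the driver
module and the parameter style while the calling pattern stays the same.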

00:43:22.520 --> 00:43:29.640
Okay. Anyhow, what you need in order to basically for Superset to connect to any flavor of database is a

00:43:29.640 --> 00:43:35.800
viable DBAPI driver and, once that's built, a SQLAlchemy dialect. SQLAlchemy dialects are fairly

00:43:35.800 --> 00:43:41.160
easy. Like we've written a bunch of DBAPI drivers and SQLAlchemy dialects in the past. They're not that

00:43:41.160 --> 00:43:45.480
hard to implement. So that means it's pretty much anything that speaks SQL out there, you know,

00:43:45.480 --> 00:43:50.600
we can talk to essentially. Yeah. Yeah. So we've got the standard MySQL, Postgres,

00:43:50.600 --> 00:43:56.760
Microsoft SQL Server is probably a big one in the BI space because a lot of enterprises are backed by

00:43:56.760 --> 00:44:04.440
that. But it also has more unique ones like Snowflake and Druid and Google BigQuery and Firebird,

00:44:04.440 --> 00:44:08.200
a lot of different places that people can talk to. Yeah. When we see, you know,

00:44:08.200 --> 00:44:13.640
like the Superset community and the Preset customers. So I started a company three years ago that's

00:44:13.640 --> 00:44:17.800
essentially commercializing Apache Superset and offering a managed service so you don't have to

00:44:17.800 --> 00:44:22.120
run it. So we're on call. You're on call. There's a freemium tier, say, if you want to try Superset,

00:44:22.120 --> 00:44:26.360
you can pip install Superset and kind of struggle with Docker and all this stuff. Or you can try it

00:44:26.360 --> 00:44:31.480
directly at Preset so you can just like start for free and see if it works for you. Then you can kind of

00:44:31.480 --> 00:44:36.280
postpone the decision of like, do I want to run it on my own or do I want to use a managed service and kind

00:44:36.280 --> 00:44:42.120
of pay as I go instead. But yeah. So what we see in terms of what our customers use, so a lot of Snowflake,

00:44:42.120 --> 00:44:49.160
BigQuery, these cloud data warehouses are kind of a no-brainer nowadays. If you have a true analytical

00:44:49.160 --> 00:44:54.840
workload, just put all your data in Snowflake, BigQuery. And then there's still some Redshift and

00:44:54.840 --> 00:45:00.440
there's still like, you know, all sorts of like, you know, database engines for whatever historical reasons

00:45:00.440 --> 00:45:05.960
people have or they have constraints on them to run something, you know, on premise or in their cloud

00:45:05.960 --> 00:45:10.600
and then Redshift. Right. Absolutely. So because it's open source, they can go and host it to their heart's

00:45:10.600 --> 00:45:17.480
content or they can go SaaS style and work with you all. That's right. So for us, we do offer the

00:45:17.480 --> 00:45:22.760
managed service with, you know, the freemium and pay as you go, kind of per seat. So it's like 20 bucks per user

00:45:22.760 --> 00:45:29.400
per month. It's pretty, pretty straightforward and kind of easy to grow into and you pay as you go.

00:45:29.400 --> 00:45:34.520
Then we have something called the managed private clouds. If you do want to run a managed service inside

00:45:34.520 --> 00:45:38.760
your cloud because you don't want your data to leave. Maybe your data is not already in a cloud

00:45:38.760 --> 00:45:43.720
data warehouse. Maybe it's inside your VPC and you want to keep it there. So we offer a service. It's

00:45:43.720 --> 00:45:49.000
still a managed service with a centralized control plane, but that runs on your cloud. So we do offer

00:45:49.000 --> 00:45:53.960
this and then you're always free to like run on your own. Right. And there, the question is like,

00:45:53.960 --> 00:46:00.840
you know, you have to think about the math of like running a piece of open source software versus

00:46:00.840 --> 00:46:07.080
like running on your own versus like paying a vendor, like running Kafka or buying Confluent for events or

00:46:07.080 --> 00:46:12.760
running Spark or buying Databricks. It's whether you're interested in the bells and whistles that the vendor offers. And

00:46:12.760 --> 00:46:18.600
then the constraints you have around like quality of service, and think about total cost of ownership. So

00:46:18.600 --> 00:46:25.080
the reality is like running something like Superset at scale in your organization, if you want the latest,

00:46:25.080 --> 00:46:31.560
greatest, secure, kind of patched-up version of it, is that it's pretty expensive. The total cost of

00:46:31.560 --> 00:46:38.840
ownership of open source is fairly high. So often the vendors can do it at a better, better price and better quality.

00:46:38.840 --> 00:46:47.160
To patch Celery and Redis and Memcached and databases and your servers hosting them and keeping them all

00:46:47.160 --> 00:46:51.960
going. It's non-trivial. And then there's disaster recovery and failure. And you know, as soon as you

00:46:51.960 --> 00:46:56.440
were thinking, well, maybe we should hire somebody to do this, then all of a sudden a paid service starts

00:46:56.440 --> 00:47:02.040
to sound pretty appealing. Right? Oh yeah. I mean, when you think about like what it really takes to

00:47:02.040 --> 00:47:07.720
manage a piece of software or collection of pieces of software, like, you know, Superset and Kafka and

00:47:07.720 --> 00:47:12.760
Airflow and all these things, and you want it to be state of the art, you know, latest, greatest version

00:47:12.760 --> 00:47:18.520
and kind of secure and compliant, if compliance is a concern and all this stuff. Generally for,

00:47:18.520 --> 00:47:23.160
at least for smaller organizations, it makes tons of sense. Like, you know, who's the best

00:47:23.160 --> 00:47:28.040
people to run the software reliably? It's the people writing the software. Yeah. You know, even on things

00:47:28.040 --> 00:47:34.840
like, at Preset, we have a multi-tenant version of Superset that we run, where you can't really have that if you

00:47:34.840 --> 00:47:40.920
run it on your own. So that means how much we pay per cycle in terms of infrastructure costs

00:47:40.920 --> 00:47:45.560
is going to be much cheaper than what you can get running Superset on your own. Sure. Not every user

00:47:45.560 --> 00:47:51.720
is asking an active BI question all the time. So you got extra resources to share. And then you

00:47:51.720 --> 00:47:56.440
provision for peak. It's a little bit the same with infrastructure, right? Like you, if you run a

00:47:56.440 --> 00:48:01.640
database server on your own, you have to provision for peak access, where if there's a cloud service, then you

00:48:01.640 --> 00:48:06.600
don't have to; the cloud vendor has to provision for the total peak across all the

00:48:06.600 --> 00:48:11.560
customers. So yeah, there's tons of economies of scale there. And we pass that on to our customers.

00:48:11.560 --> 00:48:16.200
Cool. All right. Well, let's talk about maybe getting started and just the first touch type of

00:48:16.200 --> 00:48:23.240
experience before we run out of time here. You have a nice doc that says installing and using Superset.

00:48:23.240 --> 00:48:28.760
And I went for the easy way. So on my Mac, I have Docker for Mac already set up, which means I have

00:48:28.760 --> 00:48:36.280
Docker and Docker Compose. And so basically it's clone the repo, the Superset repo, go in there and

00:48:36.280 --> 00:48:42.440
then just run Docker compose pull and then Docker compose up on a certain definition file, configuration

00:48:42.440 --> 00:48:46.520
file. And then pray. There should be a comment that says pray and hope for the best.

00:48:47.160 --> 00:48:54.200
One thing that's really interesting is like, and I'm sure a lot of other like open source leaders can,

00:48:54.200 --> 00:49:02.360
kind of relate to that, is that no one agrees on the best way to run something for production use

00:49:02.360 --> 00:49:09.000
cases for sandbox use cases, and even in developer mode. Right. Like, so for me, I'm like, I freaking,

00:49:09.000 --> 00:49:15.880
I hate Docker and Docker Compose because I don't have enough control and I'll tend to just kind of run my own setup, my

00:49:15.880 --> 00:49:23.080
own environment. I run Tmux and I do my own builds and I prefer having more control instead of trying to understand

00:49:23.080 --> 00:49:29.080
that abstraction layer that Docker and Docker compose is. So there's an alternative, I think, documentation

00:49:29.080 --> 00:49:37.000
somewhere. And there's a big CONTRIBUTING.md that's more geared towards people asking, how do I run my setup if I want to

00:49:37.000 --> 00:49:43.240
actually develop on the tool? So somewhere in the Superset repo, there's a CONTRIBUTING.md file that says,

00:49:43.240 --> 00:49:50.360
here's if you want to develop with Docker, Docker compose, you do this. If you want to develop using more kind of a different,

00:49:50.360 --> 00:49:52.280
like more raw level, here's how to do things.

00:49:52.280 --> 00:49:53.560
Create your own virtual environment and go.

00:49:53.560 --> 00:50:00.760
Yeah, that's it. And some people use pyenv. Some people prefer using like virtualenv more directly.

00:50:00.760 --> 00:50:06.360
So it's really hard to come up with a prescribed way to do it with good documentation.

00:50:06.360 --> 00:50:10.600
But then, you know, half of the people are going to go their own way anyway.

00:50:10.600 --> 00:50:14.680
So, say, with Docker Compose here too, a lot of people prefer Helm charts for Kubernetes.

00:50:14.680 --> 00:50:22.880
So then we have Helm charts, we have a Docker Compose construct, then we do have other documentation as to how to do it.

00:50:22.880 --> 00:50:28.600
But it's been really difficult to have a very clear prescribed way to do it and then maintain the different ways

00:50:28.600 --> 00:50:30.600
that we can do it individually and keep them all working.

00:50:30.600 --> 00:50:40.520
Sure. So as much as I'm not a huge fan of developing code in Docker, I do think this is a nice way for a low effort, first touch experience.

00:50:40.520 --> 00:50:44.840
You're like, I just want to run it and log into the web app and see how it feels and play with it.

00:50:44.840 --> 00:50:50.040
And you get, you know, all the various moving parts; you get Celery and Redis and whatnot, which is pretty cool.

00:50:50.040 --> 00:50:55.600
That's also kind of a map of how to run it on your own, right? So maybe you're like, oh, I don't like Docker compose.

00:50:55.600 --> 00:51:02.080
I prefer, you know, my own version of something else, but I'm going to look at the Docker Compose file to see what it's doing and have the recipe.

00:51:02.080 --> 00:51:05.500
That recipe is still very useful for people doing it in different ways.

00:51:05.500 --> 00:51:10.480
Sure. Or just knowing, look, there has to be, or maybe it's good if there's a Redis server.

00:51:10.660 --> 00:51:15.180
Okay, well, I have Redis. Let me just set it up to connect to that one, for example, right?

00:51:15.180 --> 00:51:19.920
That's it. Yes. I'm just going to change that part of the recipe because I already have that ingredient, you know, run.

00:51:19.920 --> 00:51:25.660
Yeah, exactly. Exactly. When you run the Docker container, it says, you wait a moment.

00:51:25.660 --> 00:51:30.220
It says, everything worked. Go over to localhost:8088 and log in.

00:51:30.220 --> 00:51:34.280
The super secure default username and password are admin, admin.

00:51:34.280 --> 00:51:38.040
So you're going to change that, but, you know, it's an easy way to get in there.

00:51:38.120 --> 00:51:42.660
And what you get is you get some example dashboards and some example charts, right?

00:51:42.660 --> 00:51:47.400
You want to just maybe tell us about the things we find when we get here so people know how to go explore when they get started?

00:51:47.400 --> 00:51:52.520
Right. And you probably want to zoom out a little bit because like the rendering here, it's going to look a little bit better.

00:51:52.520 --> 00:51:56.000
It's kind of interesting too, because you don't, out of the box, you don't get the thumbnail backend.

00:51:56.000 --> 00:52:06.740
So you don't get the pretty thumbnails, you know, that we'll have at Preset or that you can set up if you spend a little bit more time on setting up your Celery backend and getting all the thumbnails to compute in the backend.

00:52:07.420 --> 00:52:13.960
Yeah. So what you get out of the box is a set of very small data sets and charts and dashboards built on top of that.

00:52:13.960 --> 00:52:15.780
You can navigate and play with.

00:52:15.780 --> 00:52:28.400
If you really want to get value and get a real POC, you probably want to connect to your real data warehouse, probably not out of your local, but get to a slightly more, or maybe you have a copy of your data warehouse or some data you want to play with.

00:52:28.480 --> 00:52:30.740
And you can connect here. If you were to look at.

00:52:30.740 --> 00:52:31.720
It's data sets, right?

00:52:32.020 --> 00:52:38.200
So data sets are like coming from your database connection. So somewhere in the upper right, you have settings.

00:52:38.200 --> 00:52:38.560
I see.

00:52:38.560 --> 00:52:47.300
Right. So you could connect database connections right here. So you could create a new database connection on the upper right. If you click, you'll see, you know, just a screen to connect to your database.

00:52:47.300 --> 00:52:52.960
So you pick the database you want to connect to, add your connection string, and then you can start playing with your own data.
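
NOTE
A rough illustration of the kind of connection string mentioned here, assuming the usual
SQLAlchemy-style URI layout (dialect+driver://user:password@host:port/database). The URI
below is entirely hypothetical; host, credentials, and database name are placeholders. The
sketch only takes the string apart with the standard library and does not connect anywhere.
```python
from urllib.parse import urlsplit
# Hypothetical URI; every component here is a placeholder.
uri = "postgresql+psycopg2://analyst:secret@warehouse.example.com:5432/sales"
parts = urlsplit(uri)
# The scheme carries the dialect and, optionally, the DBAPI driver after a "+".
dialect, _, driver = parts.scheme.partition("+")
summary = {
    "dialect": dialect,
    "driver": driver,
    "user": parts.username,
    "host": parts.hostname,
    "port": parts.port,
    "database": parts.path.lstrip("/"),
}
print(summary)
```
Reading it this way shows which two pieces have to exist on the Python side for a given
database: a dialect and a DBAPI driver.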

00:52:52.960 --> 00:52:56.840
If you don't want to play with your own data, you can play with the data we provide. It's fairly limited.

00:52:57.100 --> 00:53:07.920
We haven't spent a lot of cycles working on adding the latest, most fun data sets to play with, with the best dashboard examples, but it allows you to get started and get a sense for what Superset can do.

00:53:07.920 --> 00:53:15.800
Yeah. So we have a couple of major building blocks. We have dashboards, we have charts, we have data sets, and we have the SQL IDE thing.

00:53:15.800 --> 00:53:16.640
That's right.

00:53:16.640 --> 00:53:22.260
Here, we'll pull up a sales dashboard. Nothing screams BI more than a sales dashboard.

00:53:23.420 --> 00:53:32.020
That's right. We've started to have, you know, an example there, but it loads like a few bar charts and it's not like the best-designed dashboard. It shows what we support.

00:53:32.020 --> 00:53:35.080
But it looks good. Like there's a lot of, there's some beautiful stuff here.

00:53:35.080 --> 00:53:44.520
Yeah. You can do so much more. I feel like our examples are dated. You can do so much better with Superset. If you actually take a little bit more time, we should work on our examples as a community.

00:53:44.520 --> 00:53:56.600
You know, with like really compelling data sets to play with, but it gives a good overview. And here, click on the dot, dot, dot for any chart, so here for people who can't see the visuals, and click on edit chart.

00:53:56.900 --> 00:54:10.880
So that will send you to our Explorer. So we're in the dashboard. We're looking at a specific chart. Now we just move to our chart editor. That's very much like your exploration. So here you can click on a metric. You can drag and drop different metrics.

00:54:10.880 --> 00:54:15.520
Change my sum to max and see what happens. There we go. Look at that. Biggest sales instead of most.
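
NOTE
A small sketch of what switching the metric from SUM to MAX amounts to in SQL terms. This
is not the query Superset generates; the orders table, its columns, and its rows are
hypothetical, and sqlite3 stands in for a real warehouse.
```python
import sqlite3
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Made-up orders; table and column names are hypothetical.
cur.execute("CREATE TABLE orders (product_line TEXT, sales REAL)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("cars", 100.0), ("cars", 300.0), ("planes", 250.0)],
)
def by_product_line(aggregate):
    """Return {product_line: aggregate(sales)} for an allowed aggregate name."""
    if aggregate not in ("SUM", "MAX"):
        raise ValueError("unsupported aggregate")
    cur.execute(
        f"SELECT product_line, {aggregate}(sales) FROM orders "
        "GROUP BY product_line"
    )
    return dict(cur.fetchall())
totals = by_product_line("SUM")   # total sales per product line
biggest = by_product_line("MAX")  # single largest sale per product line
print(totals, biggest)
```
The grouping stays the same and only the aggregate function changes, which is why the
chart keeps its shape while the bar heights move.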

00:54:15.520 --> 00:54:23.180
Yeah. So you can update the charts and if you were to click on view all charts, I don't know if you see that somewhere at the top middle.

00:54:23.180 --> 00:54:30.940
There's all sorts of visualizations that are supported here. You get a big list of all the visualization plugins that ship with Superset today.

00:54:30.940 --> 00:54:36.900
So all your common charts, but also some geospatial stuff and some more advanced and complicated charts.

00:54:36.900 --> 00:54:38.260
Nice.

00:54:38.260 --> 00:54:44.800
All sorts of things. Yeah. So that's here. Maybe like just to do a little bit of the flow of the demo.

00:54:44.800 --> 00:54:51.640
Apologies to people not watching and just listening. So hit cancel and then click on the upper right dot, dot, dot.

00:54:51.640 --> 00:55:00.180
So not settings, but the dot, dot, dot here. So you can say view the query or run in SQL Lab, which will allow you to go kind of a step deeper.

00:55:00.180 --> 00:55:08.540
Where now you have the SQL that happened to be running behind this chart. Now you can alter it and, you know, push your own analysis.

00:55:08.540 --> 00:55:10.080
Oh yeah. That's cool.

00:55:10.080 --> 00:55:18.740
So we went from a dashboard to kind of your exploration session and into a SQL IDE. You can go deeper here and just like run your own analysis.

00:55:18.740 --> 00:55:20.580
It's a big playground for data, you know?

00:55:20.680 --> 00:55:26.340
Yeah. And you get a little, pull up your table or your SQLAlchemy model. Maybe it is. I'm not sure.

00:55:26.340 --> 00:55:33.420
I'd call it like a schema navigator. So in this case, it's very much like you're navigating your database, your database kind of objects. Right.

00:55:33.440 --> 00:55:45.380
So you can navigate your schemas and see the tables and the views. And then there's good autocompletes. It's very much an IDE. If you start typing, you know, it will autocomplete the table names and then the column names.

00:55:45.380 --> 00:55:55.240
Yeah. Super cool. You also get a query history. That seems nice for if you're like playing around, you're like, oh, five versions ago of typing in this, I had the picture I wanted, and now where'd it go, right?

00:55:55.240 --> 00:56:01.240
Yeah, totally. So I think that for the people who speak SQL, you know, they can go deeper and run more complex analysis.

00:56:01.240 --> 00:56:07.800
Sure. Yeah. Very neat. All right. Well, maybe let's close it out with a quick conversation on this and then I know we're out of time.

00:56:07.800 --> 00:56:16.800
I picked on Excel for having very poor source control options. What about, what's the story here about versioning and sharing and collaboration?

00:56:16.800 --> 00:56:22.600
There's this thing in BI called Headless BI. It's the ability to manage BI assets as code.

00:56:22.600 --> 00:56:32.760
At Preset, we have built a CLI on top of the Superset API. It allows you to import and export objects between the BI tool and the file system.

00:56:32.760 --> 00:56:40.660
So it's really easy to say, I want to store this dashboard or this set of dashboards or this set of objects and manage them as code.

00:56:40.660 --> 00:56:47.460
So there's a CLI that allows you to push and pull from the BI tool, from Superset, into Git and GitHub.

00:56:47.460 --> 00:56:53.220
All right. Let me see if I got this right. So I might create a folder that is a GitHub repo or a Git repo rather.

00:56:53.220 --> 00:56:58.640
Then I would export all my stuff, commit that, and then I would just like write over it and keep committing.

00:56:58.640 --> 00:57:05.960
Those would sort of track my changes. And if I ever need to, I can reinstantiate or rehydrate that thing out of the file set into Superset.

00:57:06.060 --> 00:57:22.760
Yeah. So there's more to it than that. And I'm going to try to explain it well. But once you say hit the eject button, which would be exporting the BI assets as code, then you get a collection of YAML files that represents your charts, your data sets and your database connection definitions.

00:57:22.760 --> 00:57:34.480
Right. So your dashboard then is represented as code. When you push things. Well, so first you can templatize that YAML. So you can use Jinja, which is a great Python package to template files.

00:57:34.480 --> 00:57:47.300
So you can inject some templates into your BI tool for if you were to say broadcast this object to multiple Superset instances or to say I'm going to do permutations or variations on a theme.

00:57:47.300 --> 00:57:48.860
You can do that through templating.
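
NOTE
A toy sketch of the "variations on a theme" idea: render one templated asset definition
once per target. The conversation names Jinja; to keep this self-contained it swaps in the
standard library's string.Template instead, so the placeholder syntax differs from Jinja's.
The YAML keys, environment names, and database names are all hypothetical and do not
reflect the real export format.
```python
from string import Template  # stdlib stand-in for Jinja in this sketch
# Hypothetical asset layout; real exported YAML will have different keys.
chart_template = Template(
    "slice_name: Sales by region ($env)\n"
    "viz_type: bar\n"
    "database: $database\n"
)
# One entry per instance the asset should be pushed to (names are made up).
targets = {"staging": "warehouse_staging", "prod": "warehouse_prod"}
rendered = {
    env: chart_template.substitute(env=env, database=db)
    for env, db in targets.items()
}
print(rendered["prod"])
```
Each rendered string could then be written to its own file and committed, so the template
stays the single source of truth.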

00:57:48.860 --> 00:57:49.240
Okay.

00:57:49.360 --> 00:58:02.020
And that's through the Preset CLI and you can push. As you push, then there's a flag. I believe the flag is on by default where it will prevent people from updating the object in the GUI saying this object is managed as code.

00:58:02.020 --> 00:58:12.840
The source code lives here for reference. So you can click and go see the code on GitHub. But then you can't save it because it's essentially read only and managed by source control.

00:58:13.160 --> 00:58:26.200
I think in the future, we're looking to have a companion for each Superset workspace on Preset to be able to have the full history over time of what has changed. So you can go and restore assets, you know, as they were a while ago.

00:58:26.200 --> 00:58:35.800
There's always someone that's going to delete something or delete a dashboard or change it in ways that are destructive and people want to roll back. So it's possible to do that.

00:58:35.800 --> 00:58:36.020
Sure.

00:58:36.020 --> 00:58:54.060
It makes a lot of sense to have some kind of source control story. But at the same time, because it's kind of a SaaS thing, either self-hosted, your little baby SaaS or at Preset, it's kind of a shared asset that doesn't need to be synced and pushed and pulled and cloned as much to allow people to work on it. Right.

00:58:54.060 --> 00:59:04.880
Yeah. So there's different things. I think the Google Docs approach, which is to keep a GUI revision history and being able to see who changed what when is also valuable.

00:59:04.880 --> 00:59:12.820
And for sure, we're going to see that in the future of Superset too, being able to say, I want to look at the history of this dashboard from a GUI perspective.

00:59:13.460 --> 00:59:20.380
So that's something that has been requested and we'll have in the future. So call it your Google Docs kind of GUI.

00:59:20.380 --> 00:59:21.340
Yeah, that's true.

00:59:21.340 --> 00:59:34.940
Managing assets as code is a different use case, right? If you have an embedded dashboard or if you publish a certain dashboard as part of your application, that's more the rigor. Like, I want to have it in source control. I want a version. I want it, you know.

00:59:34.940 --> 00:59:35.140
Sure.

00:59:35.140 --> 00:59:44.320
It's kind of like having a DevOps team versus, you know, someone keeps the server running. Right. Like there's different levels of maturity around different things and companies.

00:59:44.320 --> 00:59:58.120
Yeah, people want that flexibility too. Like infrastructure as code, for instance, is great. But that doesn't mean that everything should, like if you and I want to go and create an AWS account and spin up some resources, maybe we don't need to start with, you know, a Terraform script.

00:59:58.120 --> 00:59:58.780
Yes, exactly.

00:59:58.780 --> 01:00:10.920
And then you can generate the code later. Maybe you can say like, hey, AWS, can I generate the Terraform code of all the stuff that I've done in the past three days so that, you know, GUI to code can flow.

01:00:10.920 --> 01:00:27.340
And then, yeah, the other way of like, you know, code to GUI. But yeah, it's important for these sorts of tools, managing critical assets, to have these workflows like GUI to code, code to GUI, and be able to have the flexibility and best of both worlds as you go up in your maturity lifecycle.

01:00:27.340 --> 01:00:29.460
Yeah. And your need for rigor.

01:00:29.460 --> 01:00:43.440
Makes a lot of sense to me. All right. Well, we are well out of time here, Max. So, you know, congrats again on creating such a cool project. And I guess with Airflow as well, it's not even the first one. So very popular. It seems like it's definitely taken off. It's great.

01:00:43.440 --> 01:00:59.440
Yeah, it's been super exciting, like way beyond my expectation. And I think really often like the original creators get too much kind of recognition and reward compared to like the rest of the community, right? So what it takes for something like Superset to exist is it takes 800 people contributing and it takes an entire Slack community. And really often we give a lot of credit to the person who created the thing. But you should look at like how bad Superset was.

01:00:59.440 --> 01:01:05.560
When I was the first person, when I was the only person working on it. And when it really took off was when we saw like a set of really good contributors coming in and pushing it to the next level.

01:01:05.560 --> 01:01:18.960
Sure.

01:01:18.960 --> 01:01:20.440
That's been rewarding.

01:01:20.440 --> 01:01:27.000
That's awesome. I definitely got some people excited about it. One of the comments in the audience is this project has me stupid excited, which is lovely.

01:01:27.000 --> 01:01:39.800
Love to see that excitement, right? Like a lot of this validation comes through usage and value and people getting excited, contributing and more like just using here. Like we'd love to see people just say like, hey, we're using this. We're getting tons of value.

01:01:39.800 --> 01:01:41.640
Yeah. Go to the in the wild.

01:01:41.640 --> 01:01:42.120
Yeah.

01:01:42.120 --> 01:01:43.180
Put your stamp on it, right?

01:01:43.180 --> 01:01:56.640
I get like the communistic, like, you know, like together we can build better things than vendors on their own can. It's just like, open source is a better way not only to collaborate and build software, it's a better way to discover software, adopt software.

01:01:56.940 --> 01:01:57.140
Yeah.

01:01:57.140 --> 01:01:59.820
And just like get to solutions, you know.

01:01:59.820 --> 01:02:05.300
Very cool. All right. Well, before you get out of here, final two quick questions. If you're going to write some code, what editor do you use?

01:02:05.300 --> 01:02:26.560
I feel, I'm still a Vim person. I feel like I need to modernize. I'm not like, oh, Vim is better than all the IDEs. It's just muscle memory at this point. I was just very command line and very kind of into Vim and like my specific kind of tune-up for Vim. And it's not because I think it's better. It's just like habit, you know.

01:02:26.880 --> 01:02:38.940
Cool. There's a funny guy who I think he's German. He does this YouTube series, making fun of different programming languages and communities. And one is this guy. He talks about how he fought in the Vim-Emacs wars. Yeah, it's pretty good.

01:02:40.360 --> 01:02:43.060
All right. So you're on the Vim side. Fantastic. And then.

01:02:43.060 --> 01:02:48.700
And I'm not on a side, too. It's like, that's what I use. But at the same time, like encourage people to find something that works for them.

01:02:48.700 --> 01:02:48.800
Sure.

01:02:48.800 --> 01:03:09.760
Then I talked about the power of muscle memory, right? Like once you really know your tool set and the shortcuts, it's like your computer becomes an extension of your brain and your muscles. And there's just a beauty in that. So it's good to have a tool that enables you and to have that self-training. Like I'm going to train my muscle memory so I can do the things that I do all the time without thinking.

01:03:09.760 --> 01:03:17.040
Right. You think I want this to happen and then it happens and you don't have to be conscious of it happening, right? In your editor, like that's the way.

01:03:17.040 --> 01:03:25.600
Clicking around, I'm like, I'll do the sequence of like six clicks to do this thing. Photoshop all the time. Why can't you just do like command shift R plus V2?

01:03:25.600 --> 01:03:26.320
Yes. Exactly.

01:03:26.320 --> 01:03:29.000
You choose a sequence, it just happens magically.

01:03:29.160 --> 01:03:34.840
Yeah, absolutely. And then a notable PyPI package or other library you're a fan of, where you're like, this is awesome, people should know about it.

01:03:34.840 --> 01:03:50.360
Yeah. So I wanted to talk about, so I live, you know, we use SQL very heavily, you know, as you saw. So if you're a data practitioner, you write a lot of SQL. I spend quite a bit of time writing tons of SQL in Airflow, a little bit in dbt too, more recently.

01:03:50.600 --> 01:03:58.160
There's this like SQL linter that came out. It's called SQLFluff. It's been around for a little while. So check out SQLFluff on PyPI.

01:03:58.160 --> 01:03:58.560
Right.

01:03:58.560 --> 01:04:15.500
There it is. And it's a very configurable SQL linter and fixer. So, you know, like we all love like PEP 8 and things like Black that are very deterministic and opinionated. I think we're not there in the SQL world. People have not agreed on a PEP 8 equivalent for SQL.

01:04:15.500 --> 01:04:26.660
So this is like highly configurable. So you can agree with your team on the set of like linting rules for your repo and then it can fix a lot of stuff for you. So I think it helps.

01:04:26.660 --> 01:04:35.620
If we're going to manage mountains of SQL, I don't like SQL that much, but it seems like this generation of data teams is going to rely a lot on a lot of SQL.

01:04:35.620 --> 01:04:40.640
Then having a linter, you know, helps make that a little bit more bearable.

01:04:40.640 --> 01:04:45.380
Excellent. sqlfluff.com. Very cool. All right. Final call to action. People are excited.

01:04:45.380 --> 01:04:48.360
They're excited about Superset, they want to get started, they want to play with it. What do you tell them?

01:04:48.360 --> 01:04:53.040
Pip install Superset. I mean, come to the GitHub repo. Check out superset.apache.org.

01:04:53.040 --> 01:04:59.500
We haven't talked about the Apache Software Foundation too, but, you know, we're supported by the Apache Software Foundation in many ways.

01:04:59.500 --> 01:05:02.120
And then you should be able to find tons of resources.

01:05:02.120 --> 01:05:10.280
It is a little bit harder to get started than other things because it's such a broad piece of software that's like very, very layered.

01:05:10.380 --> 01:05:16.980
We have a Slack. So I think there is a type of issue that's probably called like starter issues.

01:05:16.980 --> 01:05:21.080
I forgot the exact name of it. So, and then we have a Slack to get involved.

01:05:21.080 --> 01:05:24.540
And I believe in Slack, there's a way to kind of introduce yourself.

01:05:24.540 --> 01:05:28.420
And there's a bunch of channels that are more like, how do I get started? How do I contribute?

01:05:28.920 --> 01:05:33.660
So there should be outlets for anyone who wants to get involved to get connected.

01:05:33.660 --> 01:05:38.680
If you fail at doing that, like you can probably reach out to me directly on Twitter or elsewhere.

01:05:38.680 --> 01:05:40.460
And I might be able to give you some pointers.

01:05:40.460 --> 01:05:44.580
Yeah, there's a few people committing to the project. So there's got to be a lot out there.

01:05:44.580 --> 01:05:53.300
That's like a thing though, like when the project gets bigger and there's more contributors, that doesn't mean it's necessarily more welcoming and easier to get into.

01:05:53.300 --> 01:06:03.420
There's more people, but sometimes there's not as clear of a, if you don't have a BDFL, sometimes it's a little bit harder to talk to a single person and get the exact pointer that you need.

01:06:03.420 --> 01:06:10.720
So I would say like, just get on the Slack, talk to a few people, find, think about how you want to get involved too, and be clear about your intentions.

01:06:10.720 --> 01:06:14.320
And then we'll be able to connect you with the right place, the right person.

01:06:14.320 --> 01:06:18.280
Fantastic. All right. Max, thank you for being here. Thanks for creating this cool project.

01:06:18.280 --> 01:06:20.280
Looks like tons of people are getting value from it.

01:06:20.280 --> 01:06:28.800
Yeah. Thank you for having me on the show too. And I'm going to go and look back at the episodes and kind of, you know, I'm always looking for good content too, and keeping in touch with the Python community too.

01:06:28.800 --> 01:06:31.140
So I'm going to go and dig in your archives there.

01:06:31.140 --> 01:06:31.580
Right on.

01:06:31.580 --> 01:06:33.480
And listen to a bunch of episodes.

01:06:33.480 --> 01:06:39.300
We have seven years, almost every single week. So there's a bunch of episodes back there. So yeah. Thanks so much.

01:06:39.300 --> 01:06:40.320
Yeah. See you later.

01:06:40.320 --> 01:06:40.920
Thank you.

01:06:40.920 --> 01:06:41.540
Take care.

01:06:41.540 --> 01:06:41.720
Bye.

01:06:42.280 --> 01:06:45.420
This has been another episode of Talk Python To Me.

01:06:45.420 --> 01:06:50.260
Thank you to our sponsors. Be sure to check out what they're offering. It really helps support the show.

01:06:50.260 --> 01:07:00.940
Join Sentry at their conference, DEX: Sort the Madness, the conference for every developer to join as they investigate the movement and trends for better and more reliable developer experiences.

01:07:00.940 --> 01:07:04.880
Save your seat now at talkpython.fm/Dex.

01:07:06.100 --> 01:07:11.380
Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python.

01:07:11.380 --> 01:07:16.560
Our content ranges from true beginners to deeply advanced topics like memory and async.

01:07:16.560 --> 01:07:22.140
And best of all, there's not a subscription in sight. Check it out for yourself at training.talkpython.fm.

01:07:22.140 --> 01:07:28.100
Be sure to subscribe to the show, open your favorite podcast app, and search for Python. We should be right at the top.

01:07:28.100 --> 01:07:37.480
You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:07:37.480 --> 01:07:40.900
We're live streaming most of our recordings these days.

01:07:40.900 --> 01:07:48.680
If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:07:48.680 --> 01:07:53.040
This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it.

01:07:53.040 --> 01:07:54.960
Now get out there and write some Python code.

01:07:54.960 --> 01:08:15.840
I'll see you next time.

