#382: Apache Superset: Modern Data Exploration Platform Transcript
00:00 When you think data exploration using Python, Jupyter Notebooks likely come to mind.
00:04 They are excellent for those of us who gravitate towards Python.
00:08 But what about your everyday power user?
00:10 Think of that person who is really good at Excel but has never written a line of code.
00:14 They can still harness the power of modern Python using a cool application called Superset.
00:20 This open-source Python-based web app is all about connecting to live data
00:25 and creating charts and dashboards based on it using only UI tools.
00:28 It's super popular too, with almost 50,000 GitHub stars.
00:32 Its creator, Max Bushman, is here to introduce it to us all.
00:36 This is Talk Python to Me, episode 382, recorded September 19th, 2022.
00:41 Welcome to Talk Python to Me, a weekly podcast on Python.
00:57 This is your host, Michael Kennedy.
00:59 Follow me on Twitter where I'm @mkennedy, and keep up with the show and listen to past episodes at talkpython.fm.
01:04 And follow the show on Twitter via at Talk Python.
01:08 We've started streaming most of our episodes live on YouTube.
01:11 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.
01:20 Did you know one of the best ways to support the show is by taking one or more of our courses?
01:24 In fact, we have one of the largest libraries of Python courses out there with over 240 hours of videos.
01:32 Before we get to the conversation, I want to quickly let you know that we just released three new ones.
01:37 Django Getting Started, Getting Started with pytest, and Python Data Visualization.
01:42 All three are excellent courses, and their landing pages each have a video introducing the course.
01:47 Visit talkpython.fm and click on Courses in the nav bar to learn more.
01:52 Thank you for making Talk Python training part of your career journey.
01:56 Now on to the show.
01:57 Max, welcome to Talk Python to Me.
02:00 Well, thank you.
02:00 Exciting to be here.
02:01 Love to talk about Apache Superset, so hit me up.
02:05 Yeah, it's quite the thing that you've created, and it looks like it's really going strong.
02:10 So we're going to talk about tools for data exploration in general, and then we'll dive in and focus on Superset, which is what you've created.
02:18 So I'm really excited to do that.
02:19 Excited to do it, too.
02:20 I've been kind of baiting and kind of swimming in this old world of data, data orchestration, exploration, visualization for the past 20 years or so.
02:29 So that's been really my focus, so I should have a lot to say about everything related to this space.
02:35 Yeah, fantastic.
02:36 And you've got a lot of experience at many of the big tech companies that people would think of as having lots of interesting data to look at.
02:43 So we can dive into that just a bit at the beginning here.
02:45 But before we get to any of those things, let me kick it off with a beginning question.
02:49 How did you get into programming and Python and all these things?
02:52 Oh, goodness.
02:52 Yes, I did a decline of an associate degree back around like in the late 90s.
02:59 So that kind of, you know, says about how long I've been doing this.
03:03 But I never finished it, too.
03:04 So I never finished and got my actual diploma for it, too.
03:08 So I got into an internship to join a company called Ubisoft.
03:12 So it's a video game company.
03:13 It's one of the major video game companies out there.
03:16 And I went on to my first internship and never looked back and never finished a program.
03:20 So that's where my career started.
03:22 That's awesome.
03:23 This program was like very, it's called a Technic Online Filmatics.
03:27 I'm from Quebec City originally.
03:28 So I grew up speaking French and I was a program in French.
03:31 And it's a technical program.
03:33 The goal of the program was to send, you know, technicians, as they call them, people that are very technical, really focused.
03:40 And then give them the skills that they need to be effective joining companies.
03:44 So some programming, some data modeling, a little bit of SQL.
03:49 And then really, like, the skills that you need to get started and start coding.
03:54 Not necessarily thinking about, like, computer science and, like, data structures.
03:57 Like, much more, like, what do you need to get started?
04:01 Let me interrupt you for just a second there.
04:02 And we can maybe talk just a bit about this.
04:05 I feel like a lot of people looking in from the outside feel like, oh, I need a computer science degree in order to do X, Y, or Z.
04:12 Whatever it is, you know, create APIs, create a business, do data science or whatever.
04:16 And so much of the focus of CS degree seems to be on algorithms, on operating systems.
04:23 And while those are really good to know, they're not necessarily the skills you sit down and go, let me remember my algorithms things.
04:30 Like, you just call a function on a data structure.
04:33 Or let me remember my operating system stuff.
04:35 Like, you just run the code.
04:36 I mean, it's helpful to have that.
04:37 But I don't feel like it's that necessary.
04:39 I don't want people out there listening to think, oh, I've got to go get a CS degree or I'm not going to do anything, right?
04:44 Yeah, I think, you know, the boot camp, I've been told, like, flip that upside down to say, like, oh, no, all you need is the technical skills to get started to, I don't know, build an app, you know?
04:54 And then you don't need those fundamentals or maybe the premise that you need them later.
04:59 I think there needs to be some balance there, you know?
05:02 So the CS approach, like, let's start with the foundation and how we got here.
05:06 And then, you know, the rest should follow.
05:08 I don't think that's right.
05:10 So to me, you don't really have that curiosity about how you got there until you've been a practitioner.
05:16 So to me, I'm like, hey, teach people the skills they need to be successful and useful to employers.
05:22 It seems like the way that university in general or education should be oriented towards.
05:27 Let's teach people the skills they need to be to contribute effectively to the market.
05:32 And then I think maybe the CS constructs is something you would learn, that wisdom you would build, you know, as you learn.
05:41 Some of you would pick it up and some of you are like, I really need to know.
05:43 I've been doing this for a year now.
05:45 I need to know how this thing works.
05:46 And you'll dive into it, right?
05:47 But when you're motivated and you're kind of, you have that experience already, yeah.
05:52 Yes, it's like when you have to solve the problem, certain problems.
05:55 And then maybe at that point, I don't know, if you're writing a bunch of SQL and you're building a lot of data structure, maybe you need to understand like data modeling construct.
06:04 And that's a good time to go and understand the history of the different approaches to data modeling.
06:08 But maybe you don't start from the theory, right?
06:11 Yeah.
06:12 But yeah, so like going back to your question, so I, so then I joined a, so I was a web developer kind of building internal apps for like a year.
06:20 And then very quickly I got into data and got into using, building data warehouse at Ubisoft and then using the business intelligence kind of tool set toolkit to build all sorts of ports, dashboard kind of self-service things.
06:33 So people could consume data.
06:35 So very quickly got into that.
06:37 And it's a little bit later that I learned, I started doing more scripting.
06:41 So when I joined Yahoo in 2007, I believe that was like the birth of Hadoop and Yahoo had some, some pearl, you know, so I learned a script a little bit and kind of interpreted languages more.
06:54 And then by the time I think I started, I started building more, more, more website for personal projects.
07:00 So learn a bunch of Python there.
07:01 I did a lot with Django.
07:03 And by the time I, by the time I joined Facebook in 2012, I knew Python very well.
07:09 And then that became kind of my, my main kind of my main language, you know, and that's really, you know, what we use internally for a lot of things at Facebook.
07:18 And that just became like more and more of the established kind of language for everything data related right around that time.
07:25 What a cool set of experience.
07:27 You know, you were at, was it Lyft, Airbnb?
07:30 Yeah.
07:31 Facebook.
07:31 Yeah.
07:32 Facebook.
07:32 Yeah.
07:33 A lot, a lot of.
07:34 And Ubisoft.
07:34 Yeah.
07:35 So Ubisoft is interesting.
07:37 They're a Canadian company, right?
07:38 They are a French company.
07:39 So their headquarters is in Montreux, like very, like next to Paris, or I think they were actually like they're from Bretagne, so Brittany somewhere.
07:47 And.
07:47 Okay.
07:48 They're a French company.
07:50 They have a huge studio in Montreal, though.
07:52 There's like amazing tax breaks in Quebec and Canada.
07:55 And they decided to build like one of the biggest, if not the biggest video game studio in the world in Montreal.
08:01 So that's where I started my career.
08:02 Yeah.
08:02 Well, the reason I'm bringing it up is I want to ask you about what it's like working at a game company versus a more traditional.
08:09 I don't know if you call Yahoo traditional, but like standard.
08:12 You know, a lot of people dream of being at these game companies.
08:15 And that's even maybe why they got into programming.
08:17 And I don't know.
08:18 Tell us what your experience was like there.
08:20 Yeah.
08:20 So it's a dated experience, right?
08:22 I don't know what it is.
08:23 Like I left Ubisoft in 2007.
08:26 So it's like a pretty dated.
08:28 It's 15 years ago.
08:29 I can say about like what it was like at the time.
08:31 It's a mix of like super fun.
08:33 It was like super young, a bit bro-y too.
08:36 And a lot of ways in a very masculine environment.
08:38 Also, like, you know, some of it is because it's, you know, 15, 15 to 20 years ago.
08:43 I think it was a slightly different world.
08:45 And a lot of things that were maybe dubious back then are definitely not okay anymore.
08:51 I mean, I think there's that culture.
08:53 People talk about, I think Electronic Arts has been famous for that.
08:55 And a lot of the big video game companies is having like these work environments that were
09:00 really like, you know, dubious in some ways.
09:03 But I think Ubisoft was a great place to be, I think, at the time.
09:07 And I think like maybe one of the better ones kind of bringing it back to where it should be
09:13 and ahead of its time, perhaps.
09:14 But my experience at Ubisoft was so interesting because it's difficult for me to talk about
09:19 what is Ubisoft because I work at three different studios, Montreal, Paris.
09:23 I was in Montreal for about a year until I was in Ubisoft Paris for about three years.
09:29 And then Ubisoft San Francisco for another three years.
09:33 So the three different offices were vastly different.
09:36 And I think, you know, the things that kind of plagued the video game business are like long hours,
09:42 kind of low pay, I think that just like grinding people out sort of thing.
09:46 Yeah.
09:47 Yeah.
09:47 Kind of like there's never enough, a lot of crunch time all the time.
09:51 And then kind of a great place maybe to start your career.
09:55 But then as people mature, they tend to go other.
09:58 So at least I think it has changed so much.
10:02 The whole world culture has changed a lot.
10:04 Yeah.
10:05 As you have relationships and families and you want to see them, things like that.
10:10 Yeah.
10:11 Yeah.
10:11 That might evolve.
10:12 Yeah.
10:12 Seriously.
10:13 You're going to age out of working out all the time or just by necessity too.
10:17 Yeah, exactly.
10:18 Let's kick off our conversation focusing on data exploration, I think.
10:24 So when I think about data exploration, not from a developer or data science, but in like
10:29 the super broad sense, I don't know what comes to mind for you, but Excel, I feel like most
10:36 people are like, I've got some data.
10:37 I need to maybe think about it a little bit more analytically than just a bunch of numbers.
10:42 Let me throw it in Excel and see what I can do with it.
10:44 Yeah.
10:44 I think Excel is like a super open.
10:47 If you think about like Excel as a playground or as a framework, you know, it's super open
10:52 ended.
10:52 You can do so much in there and there's not a lot of constraints, right?
10:56 So the constraint that exists in an Excel file, the ones that you make for yourself.
11:00 And then maybe one constraint is like, it used to be like, you know, I forgot what it was,
11:04 but it's like, you know, 65,000 rows, you know, for a long time.
11:08 And I think now there's no such limits anymore, but there's still like a limit of how much your
11:13 laptop is going to be able to, you know, in terms of the size of a pivot table.
11:17 And the past companies where I was at, there's no way you could bring the dimensionality
11:22 and kind of the raw data that you need in Excel.
11:24 So you need to kind of prepare and extract of the stuff you're going to play with.
11:27 Yeah.
11:27 First, you got to be in Excel.
11:28 And then there's like things that, you know, BI historically has not been really good at
11:33 is a what if analysis, creating different scenarios, doing forecasting.
11:37 So I think like that is an area where our spreadsheet dominate will keep dominating, right?
11:43 If you want to tie certain things to variables and change the numbers and see how other like,
11:48 you know, charts and models are.
11:50 So modeling kind of is a really good case, I think, for Excel.
11:53 Then, you know, the downside is like, oh, how do you collaborate on these things?
11:58 And like the version is kind of a mess where you end up with like another file, SharePoint.
12:03 Even the files are binary, so you can't, you can't diff them or anything easily, right?
12:07 Oh yeah.
12:07 They're not in source control.
12:09 And then you don't know, like there's no introspection as to like how things got there.
12:13 There are a mix of like data that are from a source and then kind of made up stuff sometimes.
12:18 Like, I'm going to tweak this.
12:19 I'm going to change that.
12:20 So you don't know what is the list of all the changes that were applied to the source data.
12:25 Yeah.
12:25 Yeah.
12:26 So I don't know.
12:26 I think it's a good tool, but it's definitely incomplete, right?
12:29 It's part of what the impact will always be, you know?
12:32 I don't bring it up as a recommendation.
12:33 I bring it up as I feel like a lot of people are starting here.
12:36 And so like, how can, how can we look around and see maybe what is a better option out there,
12:41 you know?
12:41 Yeah.
12:42 And it's, I think if you've used Excel a lot in your organization or personally, I think
12:46 people discover kind of hit their head on the limitations and the problems that come with
12:51 such an open framework.
12:54 This portion of Talk Python To Me is brought to you by Sentry.
12:57 You know, Sentry as a longtime sponsor of this podcast.
13:00 They offer great error monitoring software that I've told you about many times.
13:04 It's even software that we use on our own web apps.
13:07 But this time I want to tell you about a fun conference they have coming up.
13:10 Sentry is hosting DEX, Sort the Madness.
13:13 The conference for every developer to join as we investigate the movement and trends for a
13:19 better and more reliable developer experience.
13:22 What is this madness, you ask?
13:23 It's the never-ending need to deploy stable code quickly.
13:28 Come to DEX to engage with developers who will share their epic fails and their glorious saves.
13:33 Sentry can't fix the madness, but they can start sorting through it with you.
13:37 Register today to join in San Francisco or attend virtually on September 28th at talkpython.fm
13:44 slash DEX.
13:45 That's talkpython.fm/DEX.
13:48 The link is in your show notes.
13:50 Thank you to Sentry for supporting Talk Python to me.
13:55 Out in the audience, Ollie says, my local data extraction people default to Excel and they
13:59 seem limited by the number of sheets available in a workbook.
14:02 Yeah.
14:03 Well, I guess now that it's not the number of lines in a file, I guess the number of sheets.
14:09 That's right.
14:10 So, you know, sort of stepping up a level from this, I feel like maybe heading down a more
14:15 structured way.
14:16 Like, one of the problems with Excel is how do I talk to databases and APIs?
14:20 Like, how do I bring in other more live data is really, really limited.
14:23 I know there's like BI stuff, but not really.
14:26 Sort of the next step up.
14:27 What do you think?
14:28 Is that, is that Jupyter or like, where's the next level here?
14:30 I don't know.
14:30 Of course.
14:31 We're talking about consumption now in some ways, but I feel like in a lot of ways we
14:36 should be talking about data engineering too.
14:38 So where is your data, right?
14:40 Is the first question.
14:41 So first your data is not, or maybe some data lives in Excel, but that's not where your data
14:46 lives nowadays in the SaaS application you use, right?
14:49 So the modern, like just even any startup or company uses hundreds of SaaS applications.
14:55 Yeah.
14:55 CRMs, applicant tracking systems, you know, GitHub and just a million different data sources.
15:02 And it feels like one of the first thing you need to do is to bring that data together,
15:08 right?
15:08 Like in a central place or into some sort of like inside, you know, either data marts
15:14 or data warehouse.
15:15 I think like an early construct that you need as an organization because data is most useful
15:20 when it's put alongside the other data you have in your organization.
15:24 It does make sense to hoard all this data and bring it all to a central place.
15:29 If you want to do consumption, otherwise consumption is going to be kind of a stitching story, right?
15:33 So let's say you're in Excel or you're in a, you're in the local database or whatever,
15:37 whatever it might be.
15:38 The first thing you have to do is bring the things that are related in one place.
15:43 So you can do that visualization consumption analysis, right?
15:47 How do you join on a thing that's partly in an API and partly an air table or something?
15:52 Right.
15:53 Even if you, let's say we take a notebook, so super open-ended, right?
15:56 What is a notebook?
15:57 It's just like, you know, it's a script with REPL and, you know, where you can run, you
16:01 know, chunks of the script sequentially and you have a persistent kernel or interpreter
16:06 kind of supporting what you're doing at any point in time.
16:10 But the first thing, if you don't have a data warehouse or your data all in one place, you're
16:15 going to try to do some data engineering is probably the first thing you're going to do
16:17 within your notebook is to say, how do I get the data that I need?
16:21 The source or sources that are interesting to me and the notebook, you know, will enable
16:27 you for sure to do this, but then is it, you know, can other people build on top of the
16:33 work that you did in a notebook?
16:34 Probably not or not as easily as you'd want them to.
16:38 So I think the data warehousing kind of approach of saying like, hey, let's bring data that we
16:42 need in our organization to a central place and try to stitch it together there so it can
16:47 then best be used for consumption.
16:49 And analysis is still a very important step in the process.
16:55 Sure. I totally agree. And, you know, Jupyter gets Jupyter and JupyterLab gets a lot of the
17:00 mindshare, but there are many, many choices. I interviewed Sam Lau and he did a research project
17:06 where they categorized over 60 different notebook environments where Jupyter was one of them.
17:11 It's just, it's off the hook. So there's a lot of choices out there and so on, but let's focus on
17:17 superset.
17:17 I'd love to talk about outputs. It's like, why do we need to set 60 different notebooks? I feel like I
17:21 missed a step of like the evolution of notebooks. I'm very familiar with Jupyter,
17:26 deploy Jupyter hub at Airbnb a while ago, but then, you know, followed Hex a little bit.
17:33 That's one of the players in space. Also followed. So at Lyft, we kind of built our own little
17:39 notebook service, right? So we had a Kubernetes cluster. We kind of say like, I want this Docker
17:46 image base for my notebook. You'd pick like, I want the AI and now package, or I want, you know,
17:51 basically what's the base for your notebook. And then you could pick some hardware, like I need GPUs
17:56 or I need a big machine or small machine. And then we'd spin off these environments for people, but
18:01 try to understand like, what are the different, like, why is there 60 notebooks? And what's,
18:05 what are the different flavors? Or like, how do they all differentiate from each other? Is this dubious
18:09 question? It was crazy. I was kind of blown away by this. And if you look, they have a,
18:14 it seems like it always differs on some axis. Like, well, we want more collaboration like Google Docs,
18:19 or we want it to run into a different place, like PyIodip. We want that to run in the front end,
18:24 rather than, you know, with like some sort of Python in the browser. And like, there's just all these crazy
18:30 variations. So I think there's a lot, I just kind of only highlight that to point out, like,
18:35 it's not just Jupyter. There's like a ton of these things where, where Jupyter is the main
18:40 environment that kind of lives in a web browser where people go and explore data. And I feel like
18:46 Superset, it's a pretty modern, interesting player in that space of many choices.
18:51 Yeah. Happy to talk about Superset too, and trying to introduce it in the context of what we're talking
18:57 about before. Yeah. But think about Superset, right? Tell us about Superset.
19:02 So Superset is essentially very much like a data exploration, dashboarding, visualization tool that's very much
19:10 like catering to organization, right? So we, Superset solves like challenges or the problem space of data
19:18 consumption for entire teams. So we're not necessarily focused on people who know Python or people who are,
19:25 you know, data scientists or data analysts or data engineers, like we very much cater to the entire team.
19:30 And the idea there is to have a single place to explore data, visualize it, interact with it,
19:36 share, create dashboard. And then we have a SQL IDE on top of that too. I think like on the GitHub page,
19:43 I don't know here if we have good screenshots to, I think an image is worth a thousand word. And I know
19:47 not everyone is like looking at what we're looking at, but here we have the drag and drop kind of
19:53 explore. I think the, the screenshots a little bit dated there might be a little bit more recent on a
19:58 GitHub on the GitHub page too, where you can see like, and we have this drag and drop interface,
20:03 very similar to what people are familiar with in, in business intelligence, right? Like you, where you
20:09 have access to your data set and you drag and drop your metrics and dimensions and pick your visualization
20:14 type, get to the exact chart that you want. And you can assemble these charts into interactive dashboards
20:20 with like, you know, dynamic filtering on the dashboard and expose that to, to, to business users,
20:26 right? So they can explore on their own, they can create their own dashboard. They can answer their
20:30 own questions. Yeah. This sort of thing.
20:32 It lives in a really, really interesting space. And that's why I brought up Excel as well as because
20:38 Excel is not meant for programmers, but it's meant for people who are trying to do serious stuff with
20:44 it. They kind of, well, maybe the right equals and they'll find a formula they can put in there or,
20:49 you know, they'll do like a V lookup or they'll, they're kind of trying to go more than just like,
20:53 I need a grid of stuff. And while Jupyter and those things are awesome, superset feels like it caters a
20:59 little bit more to like a power user type of person that has Python extension capabilities,
21:05 but you don't have to start as a Python developer to get into it. Is that right?
21:09 No, actually not. Right. Like, so the premise is like, you don't need to have any Python skills.
21:14 The skills that may help if you want to go deeper inside superset is, you know, knowing some SQL,
21:19 knowing SQL, it's not a requirement. So think about like, if you think about Tableau,
21:24 people familiar with Tableau or Looker, right? That's really the space that we're in. So it's,
21:29 platforming in a sense that, okay, you can access, you can, you access your database connection,
21:34 you interact with datasets, but then, you know, think about the experience of a consumer,
21:39 someone just consuming a dashboard. You just, you open a dashboard, it's a collection of chart,
21:45 maybe it's titled like financial forecast for, you know, 2023. And you really need that little
21:51 technical skills to, to use to, you need business knowledge mainly to consume a dashboard. These
21:57 dashboards are interactive. So that means you'll be able to apply a filter on a specific quarter,
22:01 a specific like customer type of market, right. And then, interact with the dashboard in that way.
22:07 But primarily like the dashboard interface caters to the, the bit like the business user or anyone that
22:13 is trying to understand.
22:13 I see almost like a more of a BI type of a user person rather than...
22:18 It is BI tool. Super set is a BI tool to be, to be there. It's a BI tool that maybe is modern in
22:26 many ways and assumes. So if you want to get the way you get deeper, say in the Explorer, and I don't
22:32 know if you can, if you can click on the upper left on the Explorer. So here for context, we're looking
22:37 at more of the drag and drop place, and super set where you pick metrics and dimension and
22:41 visualization type you want to look at. So it's your typical kind of tableau like interface. And here you
22:47 can essentially just drag and drop, but if you don't do no SQL, you're able to create your own metrics and
22:53 express them as SQL expressions, for instance.
22:55 Right.
22:55 Right.
22:56 Right.
22:56 Calculated metrics.
22:57 Exactly. You can have computed columns and aggregation and stuff like that. Right.
23:02 Exactly. Yeah. So you'll define metrics as aggregate, as SQL aggregatable expression. So sum of this
23:08 divided by the count, this thing of that, and it has to be a valid SQL expression. But yeah, so for people
23:15 who are a little bit more technical, maybe understand the data better and a little bit of knowledge of
23:19 SQL, they can, they can, they don't have to, but they can use SQL as part of a
23:26 that exploration experience. So you, for instance, if you pick a filter here, you'll be able to pick
23:31 a column, an operator, like customer ID in, and then go pick the customer IDs in a
23:37 GUI type setting. But you can also go to the little SQL editor and a filter pop over and then write a more
23:43 complex SQL expression if you want to. So we wanted to not necessarily bury SQL as we feel like more and
23:48 more people are learning SQL. It's becoming like the lingua franqua of data. We feel like there's going to be,
23:54 you know, a certain percentage of the workforce in the next decade that's going to become
23:58 more data literate. And that's in part by learning SQL and understanding, you know, understanding
24:03 data set, data structures, and what data sets in their particular organizations are and are made out of.
24:11 Right. And by using SQL, it means you can connect to different data sources and you can connect to live
24:17 data, right?
24:18 That's right.
24:19 You don't have to do some kind of export or whatever. You just connect to Postgres or you
24:22 connect to whatever and then go from there.
24:24 Exactly. Yeah. So the, you know, the way things work in superset. So you go and create your database
24:29 connection or connections to whatever, you know, SQL speaking databases you use as a data warehouse,
24:36 as a data store. Things that are really popular right now are, you know, the big cloud data warehouse,
24:41 Snowflake, BigQuery, but there's still a lot of Postgres and MySQL, even for analytical use cases,
24:47 right? And people, so you connect to that database and then you go and you have different ways to get
24:53 started. One is to go and start exploring the tables that exist already, tables or views,
24:59 or you have this SQL IDE that you're kind of pointing to now, so it's possible for you to go and,
25:04 you know, step down to that level that's more interacting at the SQL level. And here you can
25:10 also create data sets, right? And create what we call virtual data sets that are essentially views for
25:15 people are familiar with the database construct of a view. And that allows people to go and explore
25:21 that data set of virtual data set, assemble dashboard, create visualization, collaborate with others,
25:27 right? Share links on Slack and, you know, annotate, add comments, that kind of thing.
25:31 Yeah. I want to dive into the data sources more, but I want to make sure that we highlight this for
25:35 people listening who don't know about superset. Two things, and you've hinted pretty strongly at one
25:41 already. First of all, when I go to Excel, I don't see a fork me on GitHub. I mean, I'm looking,
25:46 I don't see anywhere on this page that it says fork me on GitHub. Over on Apache slash superset on GitHub.
25:54 Yeah, clearly right there you can. So this is one, it's open source and two, very popular. It's
26:01 almost got 50,000 stars and 10,000 forks. Like that's Django flask level of popularity for people,
26:08 keeping score, I guess. Yeah, that's right. Yeah. And depending on, you know, and stars are just like,
26:13 some sort of proxy for, for hype or interest. Yeah. Right. And fork or like, it's good proxy for how
26:19 many people have kind of, you know, wanted to play with a code, which is also a proxy for a different
26:23 kind of hype and interest. But yeah, it's up there, you know, probably in the top 50 to 100
26:28 source projects of all time in terms of like value delivered and just popular.
26:33 Yeah. Which is like way beyond what I expected, you know, in 2015 when I started abroad. Same with
26:39 Apache Airflows. I also started Apache Airflow. That's also like very, very, very popular and used
26:45 in like tens of thousands of organizations. I think it's similar. It speaks to like the, the scale and the,
26:51 and just like how like the problem space is super validated. Like everyone needs to visualize data,
26:57 explore data, create dashboard, you know, write SQL, see results, you know, visualize results.
27:02 Yeah.
27:02 So very popular, definitely the leading open source project in this space, you know, of,
27:08 call it business intelligence data consumption. And it's a very mature project, right? So it's
27:13 used by thousands of people at places like Airbnb, Microsoft, Tesla, people have forked the project or
27:20 use it super heavily internally. That in the wild section that you're pointing to, which is kind of
27:26 trying to list out the people who use the project is very limited kind of version, the tip of the
27:31 iceberg type thing of the people who self-reported using the product.
27:35 Yeah. So you have a link in the GitHub repo called in the wild, and it just lists out under these
27:40 different verticals, you'll find these companies using them, which is, you know, on one hand,
27:45 it doesn't matter if these other companies are using it or not. But then if you're trying to sell
27:50 it to your organization or just trying to decide if you can trust it, like, well, if you know,
27:54 you're in education and it works for brilliant.org and it works for you to me and the Wikimedia
27:59 Foundation, maybe it'll work for you. You know, like that's a, it's a bit of validation, right?
28:03 Yeah. And then, you know, especially looking at like, those are people that, you know, open
28:07 a pull request to add their name to this like hidden file on the repo. All right. It shows how like
28:13 two percent of the iceberg it is. But I think one thing I've been telling people on, in the context of
28:18 this podcast is it makes sense. It's like, if you want to contribute to open source, there's a lot of ways you can
28:23 contribute. And the obvious one is to, you know, use the software, open a pull request. But the less
28:28 obvious one is to let the world know, like the very, the most basic and the very minimum, maybe when you
28:34 use a, when your organization is getting significant value from an open source project, just to be public
28:41 about it. Let the world know, you know, if you work at Uber and you get tons of value from, I don't know,
28:46 Gatsby or like when, like whatever, let the world know that you do. And that's a vote of confidence.
28:52 And it speaks to the scale of the community and kind of to work for other others. It probably,
28:57 you know, the chances it's going to work for you are much greater.
29:00 Yeah. Another thing that's interesting about the GitHub repo source code, really, I guess,
29:04 is what I'm thinking of. Two things here. One is it's, it's super active, right? If you go in here
29:08 and you look around, like sometimes you'll see, you know, last change two years ago or whatever,
29:12 right. But oh yeah. Last change seven hours ago, a couple of days ago, two days ago, right?
29:18 There's a lot of, a lot of activity here, right? Yeah. It's, it's super intense in terms of like
29:24 how many people work on a project. There's like a contributors tab. You might be able to click on to
29:28 on the right there, click contributors. So 832 people have contributed today. And that's just looking at
29:34 code contributions. Here's possible to see the history of who's contributed. Something that's interesting
29:39 is like, we distributed on PyPI, but, and the project was largely Python code. It looks like we
29:46 have too much data and the GitHub UI is struggling to render. We're going to break GitHub. Sorry, GitHub.
29:50 Yeah. We're breaking out GitHub right now because we have too many contributions. Here you can kind of
29:55 see the scale contribution. You can also see how I've been selling into my CEO role and less in the
30:02 code. A bunch of people have contributed over time. But yeah, I was going to say we decided to
30:10 distribute on PyPI originally and, you know, was largely a Python project from the get go,
30:17 like more and more. I feel like at the code distribution, a lot of the code is in TypeScript,
30:23 JavaScript now because the nature of the project is so, such a front end project. And something that's
30:29 interesting about open source is we have seen less like application GUI type app, like up the stack
30:36 type projects really succeeding at scale. And superset is definitely one of those, like very much a
30:43 front end application type product that's open source and then succeeding at a massive scale too,
30:49 where typically in open source, we see libraries, we see backends and frameworks,
30:55 right. Like being really massively successful. But you know, that was part of the reason that I
31:01 really wanted to, I wanted to prove that superset that that superset and, and that, you know,
31:06 open source can succeed up the stack too. And we've been working very, very actively on that in this,
31:11 in this community. Yeah. It's a super good point because it's clearly open source has won on the
31:16 frameworks in the libraries level, but there's fewer examples of it creating beautiful user interface
31:23 experiences and types of applications. And yeah. And I pretty good theory on that too. Like,
31:29 why is it the case that we've seen this? Why do you think?
31:31 So I think like open source has been very much playground for engineers, right? Like the,
31:36 the tool set and, you know, GitHub and Git and source control on the pull requests and issues,
31:43 like all of these things have been historically the way that engineers build software. And it's
31:48 been a little bit hostile to PMs and designers, not hostile and like, oh, you know, actively hostile,
31:53 but just, it was not, not welcoming. Yeah. Or yeah. Not it's just built by engineers for engineers,
31:58 like GitHub and Git was built by engineers for engineers. And we never really thought of like,
32:02 how do we include product designers and product managers to the workflows there? And then the interest,
32:09 I think a lot of engineers have like this great image of open source and see it as an outlet for
32:15 their careers. And then they love the idea of working in the open that does not exist that drive of
32:20 working in the open, you know, with the designers. So we've been thinking about how do we create and
32:25 large our community and open up our community to very much welcome PMs and product designers as part of
32:31 this community. And it's been, I think we've made some headways. We should blog about how we did this
32:38 in the superset project, but we opened up and we created some processes where, where we also do design
32:44 review. We do, you know, product reviews that our PMs get together with other people in the community to,
32:51 to kind of design beyond, you know, technical solutions.
32:54 Yeah. There's a ton of visualizations here for people who haven't seen it yet. Just visit the
32:59 website and you'll see right away. There's a primarily a visual tool, the tool for visualizing
33:04 data, right?
33:04 Yeah. So it is like a GUI tool in a lot of ways. But, but I think what's interesting too,
33:09 it's a GUI tool first, right? So it's a BI tool in the sense that, you know,
33:13 a lot of what you do is point and click and drag and drop and, you know, hit a save button. But because
33:18 we're open source, we also have, we're pushing the APIs and SDKs very strongly too. So it's probably
33:26 the most platformy BI tool around because of our open source.
33:30 Oh yeah. That's cool.
33:31 Because we started from the ground up. So say the visualization is a plugin system,
33:35 so you can create your own visualizations and distribute them. The backend and Python is like,
33:41 you know, the coverage of the API is like a hundred percent, right? It's like all over. So everything you
33:45 can do in the GUI, you can do as code too.
33:47 Okay. Yeah. Right now the audience is asking, does it expose an API to your data?
33:51 You know, which is-
33:52 Yes. And it should be in the docs, right? So if you go to, in the docs that,
33:56 here somewhere in there, there should be, maybe it's API at the bottom there. I don't know how
34:02 well documented it is here. It should be, it looks like it's not rendering right on like 480 by 320 pixel.
34:09 Yeah.
34:09 Here we go. How's that?
34:11 Oh, there you go. So command minus to scale this, but yeah.
34:15 Exactly.
34:15 So very good API coverage and well-managed, you know, API behind the scenes.
34:21 Yeah. It looks like you even expose some directly, some of the open API Swagger type of documentation,
34:26 which you could maybe even auto-generate some stuff. Does it have like a library of a Python package that
34:32 talks to the API, anything along those lines? Or is it just HTTP?
34:35 I think it's a open API and then Swagger.
34:39 Yeah.
34:39 Right. I think I set up the first version of that a long time ago, but yes, it's a self-documenting
34:44 thing. So if you put the right decorators and the right doc strings and it's self-document.
34:48 I think we do Marshmallow too, and other things to do like schema definition of what can come in and
34:54 out. And that dictates, I think that's self-documenting too, in terms of like the input and expected
34:58 output schemas too.
35:00 Mm-hmm.
35:00 Sound very neat.
35:01 Yeah.
35:01 Or it could be like Python, Python three, like type annotations too. I think it gets picked up properly,
35:08 which is great. Beyond that, there's more like there's JavaScript stuff. There's a plugin. I think
35:12 if you were to Google superset plugin examples, you'll find all sorts of resources, maybe out of the-
35:19 There you go.
35:19 Oh, there's even a whole collection of them. Yeah. Look at that.
35:22 Yeah. So it's managed a different reef.
35:24 I didn't Google, I kagied it. I don't know what the word googling with kaggy is, but there you go.
35:28 Got it. Yeah. And then we have a good blog post on the preset blogs. If you go preset io/blog,
35:34 we have like how to get started and write your first superset plugin. That's a much more like
35:39 JavaScript. That's a hundred percent, you know, TypeScript, JavaScript, front end code to build plugins.
35:44 It has to be, right? Yeah.
35:46 You don't want to be in the backend trying to figure out how to, you know, lay things out or use
35:51 the Python library to do interactive visualization. It just doesn't work super well. So the plugin
35:58 framework is all, it's all front end code.
36:00 Yeah, makes sense.
36:01 Beyond that, there's more API than there's component libraries as part of superset. And there's other,
36:07 SDKs and component libraries.
36:13 Yeah. So the first thing I wanted to point out about the source code and the GitHub repo is just
36:18 the popularity and all the contributors and whatnot there. The other is that this, while not necessarily
36:26 made for Python people, the way that Jupyter would be made for BI users, but it is open source and in Python,
36:33 built on Flask and tools like that. Right. And you talk about the extensions on the backend and pieces
36:39 along there. So maybe just talk about for people that want to dig in from a Python side, what can they find?
36:44 Yeah, we could try to open the requirements folder. Because at this point, it's not even a requirements
36:49 at TXT file here.
36:50 You have a whole project for setting it up. Okay.
36:54 Our consider requirements.
36:56 Yeah.
36:56 Yeah.
36:56 Oh, you guys using pip-tools here? Nice.
37:00 I believe it's pip-tools and a pip compile, you know.
37:03 Yeah. I love working that way. That's my way these days. It's great.
37:07 Yeah. Because we need to pin the versions. And so we have to, for people to know with it,
37:11 you define an in file that's like your version ranges, and then you can kind of pick, compile your
37:17 version. And then that turns into like kind of frozen, like libraries, like specific numbers.
37:23 So it's, you know, you can have like everything pinned out. We use so much stuff here and we use
37:28 stuff that uses a lot of stuff. So if you import, you know, just Flask, like Flask itself is likely
37:33 to import a bunch of things. So once you kind of on, you kind of recurse to that dependency tree and
37:39 expand it, it's a massive dependency tree on the Python side. It's also a massive dependency
37:44 tree on the JavaScript side. Oh yeah. Big application made out of, you know, hundreds
37:50 of open source packages because we kind of need it all to build this full, like this, this application.
37:57 So dependency management, it's a little bit of a struggle when you manage such a big piece
38:01 of software that's connected to everything. Yeah. There's no joke. There's a lot of dependencies
38:05 here. So, but there's ways you can run it without worrying too much about that, right?
38:09 Yeah. I mean, you can definitely like just run the Docker container. You can tip install superset.
38:13 There's, you know, somewhat straightforward way to set it up locally and get things going.
38:19 Yeah. Yeah. It's, it's kind of interesting how like building application nowadays, if you think
38:24 about the dependency tree that go behind any kind of solution or application, that's not just a library.
38:30 Like library should, should have like very minimal requirements.
38:33 Kind of dependency trees, right? This should be self-contained and kind of focused, I think. But
38:38 here, I think for, to build such a large scale application, we just need to have a lot of
38:44 the dependencies. And then these dependencies have often a fair amount of dependency. I'm
38:48 surprised to see like, now we're looking at click for people, not necessarily looking,
38:52 but just click itself probably as a lot of like its own, like sub packages now too.
38:56 Exactly. And there's a lot of things that it wouldn't click to be into your, in your dependencies
39:01 here. I guess there's a, we'll talk about running in just a minute. And there's a lot of architectural
39:06 layers at play here. So you've got superset, but you've also got celery, you've got Redis,
39:13 you've got some database layers. There's a lot of technologies that people would know working
39:16 as a group that luckily Docker just takes care of for us. More like Docker compose.
39:20 Yeah. Docker, Docker compose and, you know, help charts. I think we have like a helm chart too.
39:27 It was always important for me to keep it such that you can kind of just pick install superset
39:32 and run a few commands and get it running locally. So you don't need to have, you know, Redis out of
39:38 the box and celery out of the box, similar to airflow and that way where I wanted to have like a very
39:44 self-contained thing at first. But then if you want to run any modern web app, you know, that does serious
39:52 kind of work and solve some real problem. It's likely that you need, okay, you need to have web servers and
39:57 application server, but you need to have the whole front end stack, right? Like something like when pack
40:02 and you probably have like front end infrastructure, just on the, like how you build your front end,
40:07 it gets pretty complicated quickly. Then you probably need async workers. So also then you need something like
40:12 Maccelery and something like Redis to as a message queue to talk to the async workers. Then you probably
40:18 want to start caching some things. So you need a caching back end for, for certain things. And then
40:23 you need to support an open source. You probably need to support different databases. So some people
40:29 might want to use MySQL as a backend or Postgres or some, some more other things. So then you need to
40:36 like optionally support these things through abstraction layer. So it gets complicated really quickly.
40:41 Yeah. I was really cool though, that you can just pip install it. And there's a more lightweight
40:46 version without going through all the details. Let's talk about getting, getting going, get it
40:50 running, exploring it a bit and hosting it. But before we do, I said like 15 minutes ago, two quick
40:55 comments before we talk about databases, let's just talk about the database thing real quick here.
41:00 Sure.
41:00 Over here at the bottom, you've, you know, obviously where your data comes from, we opened this,
41:06 you know, I'm pointed out that Excel is bad at getting data from different data sources. You know,
41:11 people have operational data, they have data warehouses, they have data lakes, whatever you call
41:16 them. Yeah.
41:17 Things like this, right? So there's a lot of different places people are putting data. Maybe just touch a bit on
41:23 on the database integration here. Yeah. And I think like in the context of this pod, this Python podcast,
41:28 too. So for us, we use SQLAlchemy very heavily. So SQLAlchemy is a SQL toolkit first, and then an
41:35 ORM built on top of it. And probably much more than that. But the way that we support first,
41:42 I would say the metadata database for superset, right? So in superset, when you save dashboards,
41:49 save visualization, save queries, that goes to metadata database. And we tend to recommend
41:54 Postgres and MySQL as the backend for the app, just to keep the state of the app somewhere in a proper
42:01 relational database. Sure.
42:02 That's one. And then we connect to all these databases to do analytics on them, right? And that's
42:07 what we're looking at here, the supported databases in the sense of like, what can we build charts off of?
42:12 And what can we enable data exploration around of? And then this is powered by SQLAlchemy. So that
42:19 means that anything that has a DBAPI driver and a SQLAlchemy dialect, and then maybe that's an
42:26 opportunity to talk a little bit more about the database abstraction and the Python world since we
42:31 have a Python centric audience. So DBAPI is a spec is one of the peps out there. I forgot the number of
42:38 the DBAPI peps, but I was like, you know, just a common interface for all the databases and Python.
42:44 So that's called DBAPI. And then SQLAlchemy, the SQL toolkit knows how to speak certain dialects and
42:51 builds an RM on top of things. And this is a PEP 249. So it came a little bit later in the story of
42:58 Python. I don't know what's the number. What's the latest PEP number? They're pretty high these
43:02 days, although they seem to be organized by my concept. Let's see. We've got some of the 8000 here.
43:09 Oh, I guess there's there's like some encoding in the number.
43:13 There's some kind of grouping. Yeah, I'm not sure exactly what it is. But it's 3,000 for a specific
43:18 thing in the 8,000.
43:19 I think so. But yeah, don't hold me to it. Yeah, I think so.
43:22 Okay. Anyhow, what you need in order to basically for Superset to connect to any flavor of database is a
43:29 viable DBAPI driver. And once that's built, SQLAlchemy dialects, SQLAlchemy dialects are fairly
43:35 easy. Like we've written in a bunch of API drivers and SQLAlchemy dialects in the past. They're not that
43:41 hard to implement. So that means it's pretty much anything that speaks SQL out there, you know,
43:45 we can talk to essentially. Yeah. Yeah. So we've got the standard MySQL, Postgres,
43:50 Microsoft SQL Server is probably a big one in the BI space because a lot of enterprises are back with
43:56 that. But it also has more unique ones like Snowflake and Druid and Google BigQuery and Firebird,
44:04 a lot of different places that people can talk to. Yeah. When we see, you know,
44:08 like the Superset community and the preset customers. So I started a company three years ago that's
44:13 essentially commercializing Apache Superset and offering a managed service so you don't have to
44:17 run it. So we're on call. You're on call. There's a freemium to say if you want to try Superset,
44:22 you can pip install Superset and kind of struggle with Docker and all this stuff. Or you can try it
44:26 directly at preset so you can just like start for free and see if it works for you. Then you can kind of
44:31 postpone the decision of like, do I want to run it on my own or do I want to use a managed service and kind
44:36 of pay as I go instead. But yeah. So what we see in terms of what our customers use, so a lot of Snowflake,
44:42 a big query, these cloud data warehouses can a no brainer nowadays. If you have a true analytical
44:49 workload, just put all your data and Snowflake, big query. And then there's still some Redshift and
44:54 there's still like, you know, all sorts of like, you know, database engines for whatever circle reasons
45:00 people have or they have constraints of them to run something, you know, on premise or in their cloud
45:05 and then Redshift. Right. Absolutely. So because it's open source, they can go and host it to their heart's
45:10 content or they can go SaaS style and work with you all. That's right. So for us, we do offer the
45:17 managed service as you know, the freemium and pay as you go can proceed. So it's like 20 bucks per user
45:22 for months. It's pretty, pretty straightforward and kind of easy to grow into and you pay as you go.
45:29 Then we have something called the managed private clouds. If you do want to run a managed service inside
45:34 your cloud because you don't want your data to leave. Maybe your data is not already in a cloud
45:38 data warehouse. Maybe it's inside your VPC and you want to keep it there. So we offer a service. It's
45:43 still a managed service with a centralized control plane, but that runs on your cloud. So we do offer
45:49 this and then you're always free to like run on your own. Right. Like to and there, the question is like,
45:53 can you know, you have to think the math of like running a piece of open source software versus
46:00 like running on your own versus like paying a vendor, like running Kafka or buying Confluent for events or
46:07 running Spark or Databricks is whether you're interested in the bells and whistle that the vendor uses. And
46:12 and then the constraint you have around like quality of service and think about total cost of ownership. So
46:18 the reality is like running something like superset at scale in your organization, if you want the latest,
46:25 greatest, secure, kind of patched up version of it is that it's pretty expensive to the total cost of
46:31 ownership of open source is fairly high. So often the vendors can do it at a better, better price and better quality.
46:38 To patch Celery and Redis and Memcash and databases and your servers hosting them and keeping them all
46:47 going. It's non-trivial. And then there's disaster recovery and failure. And you know, as soon as you
46:51 were thinking, well, maybe we should hire somebody to do this, then all of a sudden a paid service starts
46:56 to sound pretty appealing. Right? Oh yeah. I mean, when you think about like what it really takes to
47:02 manage a piece of software or collection of pieces of software, like, you know, superset and Kafka and
47:07 airflow and all these things, and you want it to be state of the art, you know, latest, greatest version
47:12 and kind of secured compliance, if compliance is a concern and all this stuff. Generally for,
47:18 for at least for smaller organizations, it makes tons of sense to like, you know, who's the best
47:23 people to run the software reliably is the people writing the software. Yeah. You know, even on things
47:28 like I preset, we have a multi-tenant version of superset that we run where you can't really have that if you
47:34 run out on your own. So that means we, how much we pay per cycle in terms of infrastructure costs
47:40 is going to be much cheaper than what you can get to running superset on your own. Sure. Not every user
47:45 is asking a, an active BI question all the time. So you got extra resources to share. And then you
47:51 provision for peak. It's a little bit the same with infrastructure, right? Like you, if you run a
47:56 database server on your own, you have to provision for peak access, where if there's a cloud service, then you
48:01 you have to provision the cloud vendor as the provision for the total peak across all the
48:06 customers. So yeah, there's tons of economies of scale there. And we passed that on to our customers.
48:11 Cool. All right. Well, let's talk about maybe getting started and just the first touch type of
48:16 experience before we run out of time here. You have a nice doc that says installing and using superset.
48:23 And I went for the easy way. So on my Mac, I have Docker for Mac already set up, which means I have
48:28 Docker and Docker compose. And so basically that's clone the repo, the superset repo, go in there and
48:36 then just run Docker compose pull and then Docker compose up on a certain definition file, configuration
48:42 file. And then pray that there should be a comment that says pray and hope for the best.
48:47 One thing that's really interesting, and I'm sure a lot of other open source leaders can
48:54 kind of relate to this, is that no one agrees on the best way to run something: for production use
49:02 cases, for sandbox use cases, and even in developer mode. Right. Like, so for me, I freaking
49:09 hate Docker and Docker Compose because I don't have enough control, and I'll tend to just kind of run my
49:15 own environment. I run tmux and I do my own builds and I prefer having more control, instead of trying to understand
49:23 that abstraction layer that Docker and Docker Compose is. So there's alternative documentation, I think,
49:29 somewhere. And there's a big CONTRIBUTING.md that's more geared towards people like, how do I run my setup if I want to
49:37 actually develop on the tool? So somewhere on the Superset repo, there's a CONTRIBUTING.md file that says,
49:43 here's if you want to develop with Docker, Docker Compose, you do this. If you want to develop using a different,
49:50 like more raw-level approach, how do you...
49:52 Create your own virtual environment and go.
49:53 Yeah, that's it. If you want to... and some people use pyenv. Some people prefer using virtualenv more directly.
50:00 So it's really hard to come up with... we've got a prescribed way to do it with good documentation.
50:06 But then, you know, half of the people are going to go their own way anyway.
50:10 Same with Docker Compose here too: a lot of people prefer Helm charts for Kubernetes.
50:14 So then we have Helm charts, we have a Docker Compose construct, then we do have other documentation as to how to do it.
50:22 But it's been really difficult to have a very clear prescribed way to do it and then maintain the different ways
50:28 that we can do it individually and keep them all working.
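For the more raw-level path Max is describing, a minimal sketch of a virtual environment setup might look like the following. The PyPI package is apache-superset, and the bootstrap commands are taken from the Superset installation docs, but verify them against the version you install; this is a local sandbox, not a production setup.

    python -m venv venv
    source venv/bin/activate
    pip install apache-superset
    # Initialize the metadata database and create an admin user
    superset db upgrade
    superset fab create-admin
    superset init
    # Run the development server on port 8088
    superset run -p 8088 --with-threads --reload --debugger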
50:30 Sure. So as much as I'm not a huge fan of developing code in Docker, I do think this is a nice way for a low effort, first touch experience.
50:40 You're like, I just want to run it and log into the web app and see how it feels and play with it.
50:44 And you get, you know, all the various moving parts; you get Celery and Redis and whatnot, which is pretty cool.
50:50 That's also kind of a map of how to run it on your own, right? So maybe you're like, oh, I don't like Docker compose.
50:55 I prefer, you know, my own version of something else, but I'm going to look at the Docker Compose file to see what's in play and have the recipe.
51:02 That recipe is still very useful for people in different ways.
51:05 Sure. Or just knowing, look, there has to be, or maybe it's good if there's a Redis server.
51:10 Okay, well, I have Redis. Let me just set it up to connect to that one, for example, right?
51:15 That's it. Yes. I'm just going to change that part of the recipe because I already have that ingredient, you know, running.
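If you do want to use the Compose file as that recipe, Docker can list the ingredients for you; this assumes the same compose file name as in the quickstart above.

    # List the services (web app, workers, Redis, metadata database, and so on) that make up the stack
    docker compose -f docker-compose-non-dev.yml config --services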
51:19 Yeah, exactly. Exactly. When you run the Docker containers, you wait a moment,
51:25 and it says everything worked: go over to localhost:8088 and log in.
51:30 The super secure default username and password is admin, admin.
51:34 So you're going to change that, but, you know, it's an easy way to get in there.
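If the instance is going to stick around after that first login, it's worth replacing the admin/admin account. On a non-Docker install, the Superset CLI can create an admin with your own credentials; the values below are placeholders.

    superset fab create-admin \
      --username yourname \
      --firstname Your --lastname Name \
      --email you@example.com \
      --password 'something-better-than-admin'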
51:38 And what you get is you get some example dashboards and some example charts, right?
51:42 You want to just maybe tell us about the things we find when we get here so people know how to go explore when they get started?
51:47 Right. And you probably want to zoom out a little bit because like the rendering here, it's going to look a little bit better.
51:52 It's kind of interesting too, because out of the box, you don't get the thumbnail backend.
51:56 So you don't get the pretty thumbnails, you know, that we'll have at Preset, or that you can set up if you spend a little bit more time setting up your Celery backend and getting all the thumbnails to compute in the backend.
52:07 Yeah. So what you get out of the box is a set of very small data sets, and charts and dashboards built on top of that.
52:13 You can navigate and play with.
52:15 If you really want to get value and do a real POC, you probably want to connect to your real data warehouse, probably not off your local machine, but get to something slightly more real. Or maybe you have a copy of your data warehouse or some data you want to play with.
52:28 And you can connect here. If you were to look at.
52:30 It's data sets, right?
52:32 So data sets are like coming from your database connection. So somewhere in the upper right, you have settings.
52:38 I see.
52:38 Right. So you can add database connections right here. So you could create a new database connection on the upper right. If you click, you'll see, you know, just a screen to connect to your database.
52:47 So you pick the database you want to connect to, add your connection string, and then you can start playing with your own data.
52:52 If you don't want to play with your own data, you can play with the data we provide. It's fairly limited.
52:57 We haven't spent a lot of cycles working on adding the latest, most fun data sets to play with, with the best dashboard examples, but it allows you to get started and get a sense for what Superset can do.
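The connection string Max mentions is a standard SQLAlchemy URI, so the exact format depends on the database and driver. A few illustrative examples; hosts, users, and passwords are placeholders, and you need the matching driver installed in the Superset environment.

    # PostgreSQL, with a driver such as psycopg2
    pip install psycopg2-binary
    # URI: postgresql+psycopg2://analytics_user:secret@warehouse-host:5432/analytics

    # MySQL, with a driver such as PyMySQL
    pip install pymysql
    # URI: mysql+pymysql://analytics_user:secret@warehouse-host:3306/analytics

    # SQLite, handy for a quick local experiment
    # URI: sqlite:////absolute/path/to/example.db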
53:07 Yeah. So we have a couple of major building blocks. We have dashboards, we have charts, we have data sets, and we have the SQL IDE thing.
53:15 That's right.
53:16 I'll pull up a, here, we'll pull up a sales dashboard. Nothing screams BI more than sales dashboard.
53:23 That's right. We've started to have, you know, an example there, but it loads like a few bar charts and it's not like the best-designed dashboard. It shows what we support.
53:32 But it looks good. Like there's a lot of, there's some beautiful stuff here.
53:35 Yeah. You can do so much more. I feel like our examples are dated. You can do so much better with Superset if you actually take a little bit more time. We should work on our examples as a community,
53:44 you know, with like really compelling data sets to play with. But it gives a good overview. And here, if you click on the dot, dot, dot for any chart. So here, for people who can't see the visuals, we hover on the chart and click on Edit Chart.
53:56 So that will send you to our Explore view. So we were in the dashboard, we were looking at a specific chart. Now we just moved to our chart editor. That's very much like your exploration session. So here you can click on a metric. You can drag and drop different metrics.
54:10 Change my sum to max and see what happens. There we go. Look at that. Biggest sales instead of most.
54:15 Yeah. So you can update the charts. And if you were to click on view all charts, I don't know if you see that somewhere at the top middle somewhere.
54:23 There's all sorts of visualizations that are supported here. You get a big list of all the visualization plugins that ship with superset today.
54:30 So all your common charts, but also some geospatial stuff and some more advanced and complicated charts.
54:36 Nice.
54:38 All sorts of things. Yeah. So that's here. Maybe like just to do a little bit of the flow of the demo.
54:44 Apologies to people not watching and just listening to it. So hit cancel and then click on the upper right dot, dot, dot.
54:51 So not settings, but the dot, dot, dot here. So you can say view query or run in SQL Lab, which will allow you to go kind of a step deeper,
55:00 where now you see the SQL that happens to be running behind this chart. Now you can alter it and, you know, push your own analysis.
55:08 Oh yeah. That's cool.
55:10 So we went from a dashboard to kind of your exploration session and into a SQL IDE. You can go deeper here and just like run your own analysis.
55:18 It's a big playground for data, you know?
55:20 Yeah. And you get a little, pull up your table or your SQLAlchemy model, maybe it is, I'm not sure.
55:26 Call it like a schema navigator. So in this case, it's very much like you're navigating your database, your database kind of objects, right?
55:33 So you can navigate your schemas and see the tables and the views. And then there's good autocompletes. It's very much an IDE. If you start typing, you know, it will autocomplete the table names and then the column names.
55:45 Yeah. Super cool. You also get a query history. That seems nice for if you're playing around and you're like, oh, five versions ago of typing this in, I had the picture I wanted, and now where'd it go, right?
55:55 Yeah, totally. So I think that for the people who speak SQL, you know, they can go deeper and run more complex analysis.
56:01 Sure. Yeah. Very neat. All right. Well, maybe let's close it out with a quick conversation on this and then I know we're out of time.
56:07 I picked on Excel for having very poor source control options. What about, what's the story here about versioning and sharing and collaboration?
56:16 There's this thing in BI called Headless BI. It's the ability to manage BI assets as code.
56:22 At Preset, we have built a CLI on top of the Superset API. It allows you to import and export objects from the BI tool to the file system.
56:32 So it's really easy to say, I want to store this dashboard, or this set of dashboards, or this set of objects, and manage them as code.
56:40 So there's a CLI that allows you to push and pull from the BI tool, from Superset, into Git and GitHub.
56:47 All right. Let me see if I got this right. So I might create a folder and init it as a GitHub repo, or a Git repo rather.
56:53 Then I would export all my stuff, commit that, and then I would just like write over it and keep committing.
56:58 Those would sort of track my changes. And if I ever need to, I can reinstantiate or rehydrate that thing out of the file set into superset.
57:06 Yeah. So there's more to it than that, and I'm going to try to explain it well. But once you, say, hit the eject button, which would be exporting the BI assets as code, then you get a collection of YAML files that represent your charts, your data sets, and your database connection definitions.
57:22 Right. So your dashboard then is represented as code. When you push things... well, so first, you can templatize things in the YAML. So you can use Jinja, which is a great Python package to template files.
57:34 So you can inject some templates into your BI tool, for if you were to, say, broadcast this object to multiple Superset instances, or to say, I'm going to do permutations or variations on a theme.
57:47 You can do that through templating.
57:48 Okay.
57:49 And that's through the preset CLI and you can push. As you push, then there's a flag. I believe the flag is on by default where it will prevent people from updating the object in the GUI saying this object is managed as code.
58:02 The source code lives here for reference. So you can click and go see the code on GitHub. But then you can't save it because it's essentially read only and managed by source control.
58:13 I think in the future, we're looking to have a companion for each Superset workspace on Preset, to be able to have the full history over time of what has changed, so you can go and restore assets, you know, as they were a while ago.
58:26 There's always someone that's going to delete something or delete a dashboard or change it in ways that are destructive, and people want to roll back. So it's possible to do that.
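A rough sketch of the assets-as-code loop described here. The Git part is the standard workflow; the export and sync commands vary by tool and version, so the command names below are assumptions to be checked against superset --help and the preset-cli docs rather than exact invocations.

    # A repo to hold the exported YAML assets
    git init bi-assets && cd bi-assets

    # Export dashboards, charts, datasets, and connections as YAML
    # (command name and flags are an assumption; newer versions export a zip bundle)
    superset export-dashboards -f assets.zip && unzip -o assets.zip -d .

    # Track the snapshot, optionally after adding Jinja placeholders to the YAML
    git add .
    git commit -m "Snapshot sales dashboard as code"

    # Push the assets back into a Superset or Preset workspace
    # (again an assumption; see the preset-cli docs for the real command)
    # preset-cli superset sync native .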
58:35 Sure.
58:36 It makes a lot of sense to have some kind of source control story. But at the same time, because it's kind of a SaaS thing, either self-hosted, your little baby SaaS, or at Preset, it's kind of a shared asset that doesn't need to be synced and pushed and pulled and cloned as much to allow people to work on it. Right.
58:54 Yeah. So there's different things. I think the Google Docs approach, which is to keep a GUI revision history and be able to see who changed what and when, is also valuable.
59:04 And sure, we're going to see that in the future of Superset too, being able to say, I want to look at the history of this dashboard from a GUI perspective.
59:13 So that's something that has been requested and we'll have in the future. So call it your Google Docs kind of GUI.
59:20 Yeah, that's true.
59:21 Managing assets as code is a different use case, right? If you have an embedded dashboard, or if you publish a certain dashboard as part of your application, that's where you want more rigor. Like, I want to have it in source control. I want a version. I want it, you know.
59:34 Sure.
59:35 It's kind of like having a DevOps team versus, you know, someone who keeps the server running. Right. Like there's different levels of maturity around different things at companies.
59:44 Yeah, people want that flexibility too. Like infrastructure as code, for instance, is great. But that doesn't mean that everything should... like if you and I want to go and create an AWS account and spin up some resources, maybe we don't need to start with, you know, a Terraform script.
59:58 Yes, exactly.
59:58 And then you can generate the code later. Maybe you can say like, hey, AWS, can I generate the Terraform code of all the stuff that I've done in the past three days so that, you know, GUI to code can flow.
01:00:10 And then, yeah, the other way of like, you know, code to GUI. But yeah, it's important for this sort of tools, managing critical assets to have these workflows like GUI to code, code to GUI, and be able to have the flexibility and best of both worlds as you go up in your maturity lifecycle.
01:00:27 Yeah. And your need for rigor.
01:00:29 Makes a lot of sense to me. All right. Well, we are well out of time here, Max. So, you know, congrats again on creating such a cool project. And I guess with Airflow as well, it's not even the first one. So very popular. It seems like it's definitely taken off. It's great.
01:00:43 Yeah, it's been super exciting, like way beyond my expectations. And I think really often the original creators get too much kind of recognition and reward compared to, like, the rest of the community, right? So what it takes for something like Superset to exist is it takes 800 people contributing, and it takes an entire Slack community. And really often we give a lot of credit to the person who created the thing. But you should look at, like, how bad Superset was
01:00:59 when I was the first person, when I was the only person working on it. And where it really took off is when we saw a set of really good contributors coming in and pushing it to the next level.
01:01:05 Sure.
01:01:18 That's been rewarding.
01:01:20 That's awesome. I definitely got some people excited about it. One of the comments in the audience is this project has me stupid excited, which is lovely.
01:01:27 Love to see that excitement, right? Like a lot of this validation comes through usage and value and people getting excited, contributing, and, more like, just using it. Here, like, we'd love to see people just say, hey, we're using this. We're getting tons of value.
01:01:39 Yeah. Go to the "in the wild" page.
01:01:41 Yeah.
01:01:42 Put your stamp on it, right?
01:01:43 I like the communistic aspect, like, you know, like together we can build better things than vendors on their own can. It's just, like, open source is not only a better way to collaborate and build software, it's a better way to discover software, adopt software.
01:01:56 Yeah.
01:01:57 And just like get to solutions, you know.
01:01:59 Very cool. All right. Well, before you get out of here, final two quick questions. If you're going to write some code, what editor do you use?
01:02:05 I'm still a Vim person. I feel like I need to modernize. I'm not like, oh, Vim is better than all the IDEs. It's just muscle memory at this point. I was just very command line and very kind of into Vim, and like my specific kind of tune-up for Vim. And it's not because I think it's better. It's just, like, habit, you know.
01:02:26 Cool. There's a funny guy, I think he's German, who does this YouTube series making fun of different programming languages and communities. And in one of them, he talks about how he fought in the Vim-Emacs wars. Yeah, it's pretty good.
01:02:40 All right. So you're on the Vim side. Fantastic. And then...
01:02:43 And I'm not really on a side, either. It's like, that's what I use. But at the same time, I'd encourage people to find something that works for them.
01:02:48 Sure.
01:02:48 Then I'd talk about the power of muscle memory, right? Like once you really know your tool set and the shortcuts, it's like your computer becomes an extension of your brain and your muscles. And there's just a beauty in that. So it's good to have a tool that enables you and to have that self-training. Like, I'm going to train my muscle memory so I can do the things that I do all the time without thinking.
01:03:09 Right. You think, I want this to happen, and then it happens, and you don't have to be conscious of it happening, right? In your editor, like, that's the way.
01:03:17 Clicking around, I'm like, I'll do this sequence of like six clicks to do this thing in Photoshop all the time. Why can't you just do like Command-Shift-R plus V, too?
01:03:25 Yes. Exactly.
01:03:26 You choose a sequence, it just happens magically.
01:03:29 Yeah, absolutely. And then, a notable PyPI package or other library you're a fan of, like, this is awesome, people should know about it?
01:03:34 Yeah. So I wanted to talk about... so, you know, we use SQL very heavily, as you saw. So if you're a data practitioner, you write a lot of SQL. I spend quite a bit of time writing tons of SQL in Airflow, a little bit in dbt too, more recently.
01:03:50 There's this SQL linter that came out. It's called SQLFluff. It's been around for a little while now. So check out SQLFluff on PyPI.
01:03:58 Right.
01:03:58 There it is. And it's a very configurable SQL linter and fixer. So, you know, we all love PEP 8 and things like Black that are very deterministic and opinionated. I think we're not there yet in the SQL world. People have not agreed on our PEP 8 equivalent for SQL.
01:04:15 So this is like highly configurable. So you can agree with your team on the set of like linting rules for your repo and then it can fix a lot of stuff for you. So I think it helps.
01:04:26 If we're going to manage mountains of SQL... I don't like SQL that much, but it seems like this generation of data teams is going to rely on a lot of SQL.
01:04:35 Then having a linter, you know, helps make that a little bit more bearable.
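A quick sketch of what that looks like in practice. The lint and fix subcommands and the --dialect flag come from the SQLFluff docs; the file name and dialect here are just examples.

    pip install sqlfluff
    # Report style violations in a query, using the generic ANSI dialect
    sqlfluff lint my_query.sql --dialect ansi
    # Apply the automatic fixes it knows how to make
    sqlfluff fix my_query.sql --dialect ansi

Teams typically pin their agreed rules in a .sqlfluff config file at the repo root so everyone lints the same way.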
01:04:40 Excellent. sqlfluff.com. Very cool. All right. Final call to action. People are excited.
01:04:45 If you're excited about Superset, you want to get started, you want to play with it, what do you tell them?
01:04:48 pip install Superset. I mean, come to the GitHub repo. Check out superset.apache.org.
01:04:53 We haven't talked about the Apache Software Foundation too, but, you know, we're supported by the Apache Software Foundation in many ways.
01:04:59 And then you should be able to find tons of resources.
01:05:02 It is a little bit harder to get started with than other things, because it's such a broad piece of software that's very, very layered.
01:05:10 We have a Slack. So I think there is a type of issue that's probably called like starter issues.
01:05:16 I forgot the exact name of it. So, and then we have a Slack to get involved.
01:05:21 And I believe in Slack, there's a way to kind of introduce yourself.
01:05:24 And there's a bunch of channels that are more like, how do I get started? How do I contribute?
01:05:28 So there should be outlets for anyone who wants to get involved to get connected.
01:05:33 If you fail at doing that, like you can probably reach out to me directly on Twitter or elsewhere.
01:05:38 And I might be able to give you some pointers.
01:05:40 Yeah, there are a few people committing to the project. So there's got to be a lot of help out there.
01:05:44 That's like a thing though, like when the project gets bigger and there's more contributors, that doesn't mean it's necessarily more welcoming and easier to get into.
01:05:53 There's more people, but sometimes there's not as clear of a... if you don't have a BDFL, sometimes it's a little bit harder to talk to a single person and get the exact pointer that you need.
01:06:03 So I would say like, just get on the Slack, talk to a few people, find, think about how you want to get involved too, and be clear about your intentions.
01:06:10 And then we'll be able to connect you to the right place, the right person.
01:06:14 Fantastic. All right. Max, thank you for being here. Thanks for creating this cool project.
01:06:18 Looks like tons of people are getting value from it.
01:06:20 Yeah. Thank you for having me on the show too. And I'm going to go and look back at the episodes and kind of, you know, I'm always looking for good content too, and keeping in touch with the Python community too.
01:06:28 So I'm going to go and dig in your archives there.
01:06:31 Right on.
01:06:31 And listen to a bunch of episodes.
01:06:33 We have seven years, almost every single week. So there's a bunch of episodes back there. So yeah. Thanks so much.
01:06:39 Yeah. See you later.
01:06:40 Thank you.
01:06:40 Take care.
01:06:41 Bye.
01:06:42 This has been another episode of Talk Python to Me.
01:06:45 Thank you to our sponsors. Be sure to check out what they're offering. It really helps support the show.
01:06:50 Join Sentry at their conference, Dex, Sort the Madness, the conference for every developer to join as they investigate the movement and trends for better and more reliable developer experiences.
01:07:00 Save your seat now at talkpython.fm/Dex.
01:07:06 Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python.
01:07:11 Our content ranges from true beginners to deeply advanced topics like memory and async.
01:07:16 And best of all, there's not a subscription in sight. Check it out for yourself at training.talkpython.fm.
01:07:22 Be sure to subscribe to the show, open your favorite podcast app, and search for Python. We should be right at the top.
01:07:28 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.
01:07:37 We're live streaming most of our recordings these days.
01:07:40 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.
01:07:48 This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it.
01:07:53 Now get out there and write some Python code.
01:07:54 I'll see you next time.