Data Science from the Command Line
On this episode, you'll meet Jeroen Janssens. He wrote the book Data Science at the Command Line, which covers a bunch of fun and useful small utilities that make your life simpler and that you can run immediately in the terminal. For example, you can query a CSV file with SQL right from the command line.
Episode Deep Dive
Guest Introduction and Background
Jeroen Janssens is a seasoned data scientist and the author of Data Science at the Command Line. Jeroen has a broad background in data, having worked with many programming languages including Python, R, and more. He’s deeply involved in the data science community, often speaking about empowering developers to leverage the terminal for efficient data workflows.
What to Know If You're New to Python
Below are a few tips to help you dive into this episode's discussion about using Python on the command line. These points will clarify the basics so you can focus on how Python interacts with tools mentioned throughout the show.
- Python Environments: Know how to create or activate a virtual environment (e.g. `venv`) so that installing Python packages (like CSVKit) won’t interfere with your system Python.
- CLI Arguments: Many Python scripts accept parameters via the command line (e.g. `sys.argv`). Familiarize yourself with passing arguments rather than hardcoding values.
- Bash vs Zsh: Mac users may see Zsh or Bash as their default shell. Just note they’re both shells, and Python commands or scripts usually work the same way under either.
- Subprocess Module: You may hear about automating tasks by calling other commands in Python. This is typically done using `subprocess.run(...)`.
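To make the last two tips concrete, here is a minimal sketch of reading a command-line argument and calling another program with `subprocess.run(...)`. The function names and the greeting are made up for illustration, and the "other program" is simply a second Python interpreter so the example does not depend on any particular shell tool being installed.

```python
import subprocess
import sys


def greeting_from_args(argv):
    """Build a greeting from command-line style arguments.

    argv[0] is the script name, so real arguments start at index 1.
    """
    name = argv[1] if len(argv) > 1 else "world"
    return f"Hello, {name}!"


def run_python_snippet(snippet):
    """Run a snippet in a separate Python process and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],  # the interpreter running this code
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


# In a real script you would pass sys.argv instead of a hand-written list.
print(greeting_from_args(["greet.py", "Jeroen"]))
print(run_python_snippet("print(6 * 7)"))
```

Passing the command as a list (rather than one string) avoids shell quoting issues, which is usually the safer default.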
Key Points and Takeaways
- Why the Command Line for Data Science?
The command line is often seen as old-school, but it shines as a “super glue” for automating data pipelines and orchestrating small yet powerful utilities. By chaining commands (via pipes) or using scripting languages like Python, you can perform data science tasks quickly without heavy frameworks.
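As a rough illustration of how a Python script can act as one stage in a pipe, the sketch below reads lines from an input stream and writes the matching ones to an output stream. The function names and sample data are hypothetical, and in-memory streams stand in for a real pipe so the snippet runs on its own.

```python
import io
import sys


def keep_matching_lines(lines, needle):
    """Yield only the lines that contain `needle`, one at a time."""
    for line in lines:
        if needle in line:
            yield line


def run_filter(needle, instream=sys.stdin, outstream=sys.stdout):
    """Stream matching lines from instream to outstream, like a pipe stage."""
    for line in keep_matching_lines(instream, needle):
        outstream.write(line)


# Demonstration: in-memory streams stand in for stdin and stdout.
fake_stdin = io.StringIO("id,city\n1,Amsterdam\n2,Portland\n")
fake_stdout = io.StringIO()
run_filter("Portland", fake_stdin, fake_stdout)
print(fake_stdout.getvalue(), end="")  # prints: 2,Portland
```

Because the filter processes one line at a time, it never needs to hold the whole input in memory, which is what lets small tools like this be chained over large files.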
- Shell Customization and Productivity
Simple tweaks—like switching from the default Terminal on macOS to iTerm2 and from Bash to Zsh or Fish—can transform your productivity. The conversation highlighted the importance of short aliases, color prompts, and commands like McFly for better command history.
- Tools and Resources:
- Oh My Zsh
- FASD for quick directory and file navigation
- Oh My Posh (especially for Windows + PowerShell)
- Creating Python-powered CLI Tools
Turning Python code into CLI executables can simplify repetitive tasks or data transformations. You just add a “shebang” (`#!`) line at the top and make the script executable with `chmod +x myscript.py`. Coupling that with argument parsers like Click or Typer leads to maintainable command-line utilities.
- CSVKit and Data Wrangling
While Python’s pandas library is powerful, you can handle many CSV tasks with command-line utilities. The show featured CSVKit, which includes tools like `csvcut`, `csvsql`, and `csvstat` that understand the structure of CSVs.
- Tools and Resources:
- CSVKit GitHub Repo
- XSV (fast CSV tool in Rust)
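The two ideas above, a Python script with a shebang and a CSV-aware command-line tool, can be combined in a small sketch. This is not how CSVKit's `csvcut` is implemented. It is a hypothetical `pickcols.py` that keeps only the named columns, and it uses the standard library's `argparse` as a stand-in for Click or Typer so that nothing needs to be installed. The sample rows are made up.

```python
#!/usr/bin/env python3
"""pickcols.py: a hypothetical, minimal csvcut-style column selector."""
import argparse
import csv
import io
import sys


def pick_columns(instream, outstream, columns):
    """Copy only the named columns from a CSV stream, keeping a header row."""
    reader = csv.DictReader(instream)
    writer = csv.DictWriter(
        outstream,
        fieldnames=columns,
        extrasaction="ignore",  # silently drop columns that were not requested
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


def parse_args(argv):
    """Return the requested column names as a list."""
    parser = argparse.ArgumentParser(description="Keep selected CSV columns.")
    parser.add_argument("-c", "--columns", required=True,
                        help="comma-separated column names, e.g. name,city")
    return parser.parse_args(argv).columns.split(",")


def main(argv=None, instream=sys.stdin, outstream=sys.stdout):
    pick_columns(instream, outstream, parse_args(argv))


# Demonstration with an in-memory CSV instead of a real pipe.
sample = io.StringIO("name,city,age\nAda,London,36\nLinus,Helsinki,28\n")
result = io.StringIO()
main(["--columns", "name,age"], sample, result)
print(result.getvalue(), end="")
```

In a standalone script, the demonstration lines would be replaced by a plain call to `main()` under an `if __name__ == "__main__":` guard. After `chmod +x pickcols.py`, the shebang lets the shell run it directly, reading CSV on standard input.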
- Querying CSV with SQL
One standout CSVKit tool is `csvsql`, which runs SQL queries directly on CSV data. This technique is helpful if you’re comfortable with SQL but don’t want to spin up a full database or open Python for quick data exploration.
- Tools and Resources:
- `csvsql` in CSVKit
- SQLite Docs (conceptually similar approach)
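For readers curious what "SQL over a CSV" can look like under the hood, here is a conceptual sketch using only Python's built-in `csv` and `sqlite3` modules. It is an assumption-laden illustration of the general idea, not a description of how `csvsql` works internally. The `query_csv` helper, the table name, and the sample figures are all invented for the example.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, table, sql):
    """Load CSV text into an in-memory SQLite table and run one SQL query."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    conn = sqlite3.connect(":memory:")
    try:
        # Quote identifiers so headers with spaces still work. Columns are
        # left untyped, so values stay text unless the query casts them.
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Made-up sample data, purely for illustration.
sample = "city,population\nAmsterdam,900000\nPortland,650000\nUtrecht,360000\n"
big_cities = query_csv(
    sample,
    "cities",
    "SELECT city FROM cities "
    "WHERE CAST(population AS INTEGER) > 500000 ORDER BY city",
)
print(big_cities)  # [('Amsterdam',), ('Portland',)]
```

The explicit `CAST` is needed here because this sketch does no type inference. A dedicated tool can save you that step.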
- Parallelizing Tasks with GNU Parallel
Python’s GIL limits multithreading for CPU-bound work, but spinning up multiple processes with tools like GNU Parallel is a straightforward workaround. You can process large sets of files or data chunks in parallel by running Python scripts or other shell commands as separate jobs.
- Tools and Resources:
- GNU Parallel
- Docker (to isolate workloads)
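GNU Parallel itself is a shell tool, but the underlying idea of running several independent processes at once can be sketched in Python too. The example below is a swapped-in technique, not GNU Parallel: a thread pool launches one child Python process per chunk, and because the threads only wait on those processes, the GIL is not the bottleneck. The function names, the word-counting job, and the default of four jobs are assumptions for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# The "job" each child process runs: count the words arriving on stdin.
WORD_COUNT_SNIPPET = "import sys; print(len(sys.stdin.read().split()))"


def count_words(text):
    """Count words in `text` using a separate Python process."""
    result = subprocess.run(
        [sys.executable, "-c", WORD_COUNT_SNIPPET],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


def parallel_word_counts(chunks, jobs=4):
    """Run up to `jobs` child processes at once; results keep input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(count_words, chunks))


print(parallel_word_counts(["a b c", "d e", ""]))  # [3, 2, 0]
```

Starting a process has a cost, so this pattern pays off when each chunk involves real work rather than a few words.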
- Subprocess and Polyglot Data Science
The conversation covered the `subprocess` module for calling out to other utilities from Python. You can mix languages, such as Ruby or R-based command-line programs, so long as they accept text in and out, bridging everything with Python for orchestration.
- Tools and Resources:
- subprocess.run(...) Docs
- explain shell for clarifying commands
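A small sketch of that orchestration idea: Python feeds text into one tool, then hands that tool's output to the next. The two "tools" here are stand-ins written as one-line Python programs so the example is self-contained. Any program that reads standard input and writes standard output, in any language, could take their place. The names `SHOUT`, `NUMBER`, `run_tool`, and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# Stand-in tools. Each is a separate process that only sees text on stdin
# and produces text on stdout.
SHOUT = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
NUMBER = [sys.executable, "-c",
          "import sys\n"
          "for i, line in enumerate(sys.stdin, 1): print(i, line, end='')"]


def run_tool(command, text):
    """Send `text` to a command's stdin and return what it writes to stdout."""
    completed = subprocess.run(
        command, input=text, capture_output=True, text=True, check=True
    )
    return completed.stdout


def run_pipeline(text):
    """Roughly what `shout | number` would do in a shell."""
    return run_tool(NUMBER, run_tool(SHOUT, text))


print(run_pipeline("alpha\nbeta\n"), end="")
```

Unlike a real shell pipe, this version waits for the first tool to finish before starting the second. That keeps the code simple, at the cost of holding the intermediate text in memory.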
- Jupyter Meets the Terminal
Jupyter notebooks can call shell commands by prefixing them with `!`. Meanwhile, `%%bash` cells allow multi-line shell scripts. It’s a nifty way to merge interactive Python data analysis with classic terminal commands.
- Visualizing on the Command Line
While not always the first choice, the terminal can display charts if you integrate the right tools. Jeroen mentioned using R's `ggplot2` with special backends, or bridging Python’s plotting libraries via small wrappers for quick visual checks.
- Tools and Resources:
- ggplot2 in R
- Plotnine (Python port of ggplot2)
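For a sense of what a "quick visual check" in the terminal can mean at its simplest, here is a plain-text bar chart in pure Python. This is not the ggplot2 or Plotnine approach mentioned in the episode, just a hypothetical stand-in that needs no plotting library. The function name and the counts are made up.

```python
def text_bar_chart(counts, width=20):
    """Return a horizontal bar chart as text, scaled to `width` characters."""
    if not counts:
        return ""
    largest = max(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    for label, value in counts.items():
        # Scale each bar relative to the largest value.
        bar_length = round(width * value / largest) if largest else 0
        lines.append(f"{label:<{label_width}} | {'#' * bar_length} {value}")
    return "\n".join(lines)


# Made-up counts, purely for illustration.
chart = text_bar_chart({"csv": 40, "json": 20, "xml": 4}, width=20)
print(chart)
```

Because the chart is ordinary text, it can be piped, logged, or diffed like any other command output.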
- Docker for Experimentation
Docker containers are an easy way to isolate your environment when you’re experimenting with new command-line tools. You can spin up containers preloaded with your data science tools, ensuring you don’t accidentally break your main setup.
Interesting Quotes and Stories
“I still find it an interesting juxtaposition of these two terms, data science and the command line.” – Jeroen
“It can be very efficient to just whip up a command on the command line using a couple of tools if it solves the job.” – Jeroen
Key Definitions and Terms
- Shebang (`#!`): A special line at the top of a script indicating which interpreter (e.g., `/usr/bin/env python3`) should execute the file.
- POSIX: A family of standards that define interoperability between Unix-like operating systems, shaping many shell behaviors.
- Alias: A short command (e.g., `alias ll='ls -la'`) that saves keystrokes in the shell.
- Pipe (`|`): A command-line feature that feeds the output of one command into the input of another.
- GNU Parallel: A command-line tool that executes jobs in parallel using multiple CPU cores or remote machines.
- CSVKit: A collection of utilities for cleaning, transforming, and analyzing CSV data from the terminal.
Learning Resources
If you want to go deeper into Python, data science, or bridging the gap between spreadsheets and Pythonic workflows, consider these courses from Talk Python Training:
- Python for Absolute Beginners – Ideal if you’re brand new to programming in Python.
- Move from Excel to Python with Pandas – Learn to handle CSV files and data workflows with Python instead of spreadsheets.
- Data Science Jumpstart with 10 Projects – A great primer on essential data science concepts in Python.
Overall Takeaway
Embracing the terminal can significantly boost your workflow for data analysis and beyond. By combining quick command-line utilities with Python’s flexibility, you can handle data manipulation, visualization, and automation in a more streamlined and composable way. Whether you’re just getting started with Python or you’re looking to sharpen your command-line chops, remembering that “the shell doesn’t care which language you use” opens up enormous possibilities for creativity and efficiency.
Links from the show
Jeroen on LinkedIn: linkedin.com
Jeroen's cohort-based course, Embrace the Command Line. Listeners can use coupon code TALKPYTHON20 for a 20% discount: maven.com
Data Science at the Command Line Book: datascienceatthecommandline.com
McFly Shell History Tool: github.com
Explain Shell: explainshell.com
CSVKit: csvkit.readthedocs.io
sql2csv: csvkit.readthedocs.io
pipx: github.com
PyProject.toml to add entry points: github.com
rich-cli: github.com
Typer: typer.tiangolo.com
FasD: github.com
Nerd Fonts: nerdfonts.com
Xonsh: xon.sh
iTerm: iterm2.com
Windows Terminal: microsoft.com
ohmyposh: ohmyposh.dev
ohmyzsh: ohmyz.sh
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm