
Data Science from the Command Line

Episode #392, published Fri, Dec 2, 2022, recorded Mon, Nov 28, 2022

When you think of data science, Jupyter notebooks and their associated tools probably come to mind. But I want to broaden your toolset a bit and encourage you to look at other tools that are literally at your fingertips: the terminal and shell command-line tools.

On this episode, you'll meet Jeroen Janssens. He wrote the book Data Science at the Command Line, and he shares a bunch of fun and useful small utilities that you can run immediately in the terminal to make your life simpler. For example, you can query a CSV file with SQL right from the command line.

Watch this episode on YouTube
Watch the live stream version

Episode Deep Dive

Guest Introduction and Background

Jeroen Janssens is a seasoned data scientist and the author of Data Science at the Command Line. Jeroen has a broad background in data, having worked with many programming languages including Python, R, and more. He’s deeply involved in the data science community, often speaking about empowering developers to leverage the terminal for efficient data workflows.

What to Know If You're New to Python

Below are a few tips to help you dive into this episode's discussion about using Python on the command line. These points will clarify the basics so you can focus on how Python interacts with tools mentioned throughout the show.

  • Python Environments: Know how to create or activate a virtual environment (e.g. venv) so that installing Python packages (like CSVKit) won’t interfere with your system Python.
  • CLI Arguments: Many Python scripts accept parameters via the command line (e.g. sys.argv). Familiarize yourself with passing arguments rather than hardcoding values.
  • Bash vs Zsh: Mac users may see Zsh or Bash as their default shell. Just note they’re both shells, and Python commands or scripts usually work the same way under either.
  • Subprocess Module: You may hear about automating tasks by calling other commands in Python. This is typically done using subprocess.run(...).
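
The last two tips above can be sketched in a few lines. This is a minimal, self-contained example, not from the episode: the `greet` function and script name are hypothetical, and the subprocess call runs the current Python interpreter itself so it works on any platform.

```python
import subprocess
import sys

def greet(argv):
    # argv[0] is the script name; real arguments start at index 1.
    name = argv[1] if len(argv) > 1 else "world"
    return f"Hello, {name}!"

def python_version():
    # Call another command from Python with subprocess.run. We invoke
    # the current interpreter so the example runs anywhere.
    result = subprocess.run(
        [sys.executable, "--version"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() or result.stderr.strip()

# In a real script you would call: print(greet(sys.argv))
```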

Key Points and Takeaways

  1. Why the Command Line for Data Science? The command line is often seen as old-school, but it shines as a “super glue” for automating data pipelines and orchestrating small yet powerful utilities. By chaining commands (via pipes) or using scripting languages like Python, you can perform data science tasks quickly without heavy frameworks.
  2. Shell Customization and Productivity Simple tweaks, like switching from the default Terminal on macOS to iTerm2 and from Bash to Zsh or Fish, can transform your productivity. The conversation highlighted the importance of short aliases, colorful prompts, and tools like McFly for better command-history search.
  3. Creating Python-powered CLI Tools Turning Python code into CLI executables can simplify repetitive tasks or data transformations. You just add a “shebang” (#!) at the top and make it executable with chmod +x myscript.py. Coupling that with argument parsers like Click or Typer leads to maintainable command-line utilities.
  4. CSVKit and Data Wrangling While Python’s pandas library is powerful, you can handle many CSV tasks with command-line utilities. The show featured CSVKit, which includes tools like csvcut, csvsql, and csvstat that understand the structure of CSVs.
  5. Querying CSV with SQL One standout CSVKit tool is csvsql, which runs SQL queries directly on CSV data. This technique is helpful if you’re comfortable with SQL but don’t want to spin up a full database or open Python for quick data exploration.
  6. Parallelizing Tasks with GNU Parallel Python's GIL limits CPU-bound multithreading, but spinning up multiple processes with tools like GNU Parallel is a straightforward workaround. You can process large sets of files or data chunks in parallel by chaining Python scripts or other shell commands.
  7. Subprocess and Polyglot Data Science The conversation covered the subprocess module for calling out to other utilities from Python. You can mix languages—like Ruby or R-based command-line programs—so long as they accept text in and out, bridging everything with Python for orchestration.
  8. Jupyter Meets the Terminal Jupyter notebooks can call shell commands by prefixing them with !. Meanwhile, %%bash cells allow multi-line shell scripts. It’s a nifty way to merge interactive Python data analysis with classic terminal commands.
  9. Visualizing on the Command Line While not always the first choice, the terminal can display charts if you integrate the right tools. Jeroen mentioned using R's ggplot2 with special backends, or bridging Python’s plotting libraries via small wrappers for quick visual checks.
  10. Docker for Experimentation Docker containers are an easy way to isolate your environment when you’re experimenting with new command-line tools. You can spin up containers preloaded with your data science tools, ensuring you don’t accidentally break your main setup.
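
The shebang-plus-argument-parser pattern from point 3 can be sketched with the stdlib's argparse (Click and Typer, mentioned on the show, offer richer versions). The `mean.py` filename and `--digits` flag are illustrative, not from the episode:

```python
#!/usr/bin/env python3
# Hypothetical mean.py; make it executable with: chmod +x mean.py
import argparse
import sys

def mean(numbers):
    """Average a list of floats."""
    return sum(numbers) / len(numbers)

def run(argv, lines):
    """Parse CLI flags, then average the numeric lines."""
    parser = argparse.ArgumentParser(
        description="Average numbers read one per line")
    parser.add_argument("--digits", type=int, default=2,
                        help="decimal places in the result")
    args = parser.parse_args(argv)
    numbers = [float(s) for s in lines if s.strip()]
    return round(mean(numbers), args.digits)

# As an executable script this would end with:
#   print(run(sys.argv[1:], sys.stdin))
# so you could invoke it as: ./mean.py --digits 3 < data.txt
```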
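
To see the idea behind csvsql from point 5, here is a rough stdlib-only sketch (csv plus sqlite3) of querying CSV data with SQL. This is not csvkit's actual implementation, and the sample data is made up:

```python
import csv
import io
import sqlite3

# A tiny inline CSV standing in for a file on disk.
CSV_DATA = """name,dept,salary
Ada,eng,120
Grace,eng,140
Edsger,research,110
"""

def query_csv(csv_text, sql):
    """Load CSV text into an in-memory SQLite table named 'data', run sql."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    con = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in header)
    con.execute(f"CREATE TABLE data ({cols})")
    placeholders = ", ".join("?" for _ in header)
    con.executemany(f"INSERT INTO data VALUES ({placeholders})", body)
    return con.execute(sql).fetchall()

result = query_csv(
    CSV_DATA,
    "SELECT dept, COUNT(*) FROM data GROUP BY dept ORDER BY dept",
)
```

csvsql does the equivalent (with type inference and more) in one shell command, no Python session required.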
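
GNU Parallel fans work out to multiple processes from the shell; inside Python, concurrent.futures gives a comparable multi-process fan-out that sidesteps the GIL. A minimal sketch with a made-up CPU-bound job:

```python
from concurrent.futures import ProcessPoolExecutor

def crunch(n):
    # Stand-in for a CPU-bound job you might otherwise run via GNU Parallel.
    return sum(i * i for i in range(n))

def crunch_all(sizes, workers=4):
    # Each task runs in its own process, so the GIL is not a bottleneck.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(crunch, sizes))
```

The shell equivalent would be something like piping a list of inputs into `parallel python crunch.py`, one process per input.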

Interesting Quotes and Stories

“I still find it an interesting juxtaposition of these two terms, data science and the command line.” – Jeroen

“It can be very efficient to just whip up a command on the command line using a couple of tools if it solves the job.” – Jeroen

Key Definitions and Terms

  • Shebang (#!): A special line at the top of a script indicating which interpreter (e.g., /usr/bin/env python3) should execute the file.
  • POSIX: A family of standards that define interoperability between Unix-like operating systems, shaping many shell behaviors.
  • Alias: A short command (e.g., alias ll='ls -la') that saves keystrokes in the shell.
  • Pipe (|): A command-line feature that feeds the output of one command into the input of another.
  • GNU Parallel: A command-line tool that executes jobs in parallel using multiple CPU cores or remote machines.
  • CSVKit: A collection of utilities for cleaning, transforming, and analyzing CSV data from the terminal.
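
The pipe entry above can be reproduced from Python with subprocess, feeding one process's stdout into another's stdin. Both commands here are small Python one-liners (a producer and a sort-like consumer) purely so the example runs on any platform:

```python
import subprocess
import sys

# Emulate `producer | sort`: connect one process's stdout to another's stdin.
producer = subprocess.Popen(
    [sys.executable, "-c", "print('cherry'); print('apple'); print('banana')"],
    stdout=subprocess.PIPE,
)
# A tiny stand-in for `sort`, written in Python for portability.
consumer = subprocess.Popen(
    [sys.executable, "-c", "import sys; print(''.join(sorted(sys.stdin)), end='')"],
    stdin=producer.stdout,
    stdout=subprocess.PIPE,
    text=True,
)
producer.stdout.close()  # let the producer see a closed pipe if the consumer exits early
output, _ = consumer.communicate()
```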

Learning Resources

If you want to go deeper into Python, data science, or bridging the gap between spreadsheets and Pythonic workflows, consider the courses from Talk Python Training.

Overall Takeaway

Embracing the terminal can significantly boost your workflow for data analysis and beyond. By combining quick command-line utilities with Python’s flexibility, you can handle data manipulation, visualization, and automation in a more streamlined and composable way. Whether you’re just getting started with Python or you’re looking to sharpen your command-line chops, remembering that “the shell doesn’t care which language you use” opens up enormous possibilities for creativity and efficiency.

Links from the show

Jeroen's Website: jeroenjanssens.com
Jeroen on LinkedIn: linkedin.com
Jeroen's cohort-based course, Embrace the Command Line. Listeners can use coupon code TALKPYTHON20 for a 20% discount: maven.com

Data Science at the Command Line Book: datascienceatthecommandline.com
McFly Shell History Tool: github.com
Explain Shell: explainshell.com
CSVKit: csvkit.readthedocs.io
sql2csv: csvkit.readthedocs.io
pipx: github.com
PyProject.toml to add entry points: github.com
rich-cli: github.com
Typer: typer.tiangolo.com
Fasd: github.com
Nerd Fonts: nerdfonts.com
Xonsh: xon.sh
iTerm: iterm2.com
Windows Terminal: microsoft.com
Oh My Posh: ohmyposh.dev
Oh My Zsh: ohmyz.sh
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
