Paths into a data science career

Episode #139, published Wed, Nov 22, 2017, recorded Tue, Nov 7, 2017

Data science is one of the fastest growing segments of software development. It takes a slightly different set of skills than your average full-stack development job. This means there's a big opportunity to get into data science. But how do you get into the industry?

That's what Hugo Bowne-Anderson is here to tell us all about.

Episode Deep Dive

Guest introduction and background

Hugo Bowne-Anderson is a data scientist who has worked in cell biology and biophysics research, later transitioning to a core data science role. During his time as a postdoc in Germany, he became interested in programming for data analysis and statistics, gradually focusing on Python and R. Hugo eventually joined DataCamp, where he worked on designing data science curricula, building and collaborating on dozens of Python courses, and later took on a role doing data science advocacy, content creation, and podcasting. In this episode, he shares perspectives on how aspiring data scientists can build the right skills, portfolio, and professional network to enter the field successfully.

What to Know If You're New to Python

If you are just starting your Python journey, here are some pointers from the conversation:

Familiarize yourself with Python's basic syntax, including functions, loops, and data structures.
Explore Jupyter Notebooks or similar interactive environments to visualize code results immediately.
Focus on simple data projects (such as analyzing your own data) to solidify core Python skills before adding more advanced concepts like machine learning.
If you need a structured introduction to Python, consider Python for Absolute Beginners: A course designed to teach programming fundamentals from the ground up.
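As a rough illustration of the pointers above, here is a minimal sketch of a "simple data project" that uses only functions, loops, and built-in data structures. The `coding_log` data and both function names are made up for this example; they do not come from the episode.

```python
from collections import defaultdict

# Hypothetical personal data: (weekday, minutes spent coding).
coding_log = [
    ("Mon", 30),
    ("Tue", 45),
    ("Mon", 60),
    ("Wed", 20),
    ("Tue", 15),
]


def total_minutes_per_day(log):
    """Sum the minutes recorded for each weekday."""
    totals = defaultdict(int)
    for day, minutes in log:
        totals[day] += minutes
    return dict(totals)


def busiest_day(totals):
    """Return the weekday with the largest total."""
    return max(totals, key=totals.get)


totals = total_minutes_per_day(coding_log)
for day, minutes in totals.items():
    print(f"{day}: {minutes} min")
print("Busiest day:", busiest_day(totals))
```

Running this in a Jupyter Notebook cell shows the results immediately, which is one reason interactive environments are suggested for beginners.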

Key points and takeaways

Building a Public Profile and Portfolio: Demonstrating your data science skills is essential. Sharing real projects on GitHub, writing blog posts, and working on public datasets can provide tangible evidence of your abilities. This helps potential employers see your coding style, data analysis approach, and how you solve problems.
- GitHub: github.com
- Kaggle Competitions: kaggle.com
Engage with the Community and Conferences: Attending meetups, hackathons, and conferences like ODSC (Open Data Science Conference) can expand your network and expose you to new ideas. Hiring managers often attend these events, making them prime opportunities to discuss open positions and industry needs. Being part of open source sprints at PyCon or other gatherings allows you to contribute to major Python projects.
- ODSC: odsc.com
- PyCon: us.pycon.org
- Local meetups: meetup.com
Combining Domain Expertise with Programming: Hugo emphasized how blending domain knowledge (like biology, finance, or marketing) with data science skills creates powerful career opportunities. Companies seek candidates who understand their problem space as well as how to leverage data tools. If you have specialized knowledge, be it astronomy or urban planning, pairing that with Python-based data science can set you apart.
- AngelList (for startups): angel.co
- LinkedIn (for job discovery): linkedin.com
Core Python Data Science Stack: You don’t need to be a software engineer, but you should know the core libraries: pandas for data manipulation, scikit-learn for machine learning, NumPy for numerical computing, and tools like Matplotlib or Seaborn for visualization. A broad grasp of these fundamentals will cover many industry needs.
- pandas: pandas.pydata.org
- scikit-learn: scikit-learn.org
- NumPy: numpy.org
- Seaborn: seaborn.pydata.org
Focus on Practical Statistics: Rather than diving too deep into purely theoretical math, practical approaches, like calculating means, variances, simple hypothesis tests, correlation, and bootstrapping, are central to day-to-day data work. Many data scientists spend most of their time cleaning data and applying straightforward statistical methods before attempting complex models.
- statsmodels: www.statsmodels.org
- PyMC3: docs.pymc.io
Keep Machine Learning in Perspective: Machine learning and deep learning are important but not the only aspects of data science. Many impactful analyses can happen with data cleaning, descriptive analysis, and classical statistical approaches. Know your fundamentals and apply more advanced techniques where they provide real value.
- scikit-learn: scikit-learn.org
- TensorFlow (mentioned generally in ML contexts, not specifically in the episode's transcript)
Iterative Learning through Projects: The best way to learn data science technologies, whether it's Bash for server tasks or advanced libraries, is to tie them to a real project. By solving actual problems, you build skill and context faster than by studying tools in isolation. Start small and expand your knowledge as needs arise.
Writing and Communication Skills: Being able to explain and visualize your analysis is crucial. Tools like Jupyter Notebooks allow you to blend code, outputs, and commentary to tell a compelling data story. Sharing your notebooks (and any helpful write-ups) can also attract recruiters and collaborators.
- Jupyter Notebook: jupyter.org
- JupyterLab: github.com/jupyterlab/jupyterlab
Use a Consistent Workflow and Version Control: Even basic Git usage can make you more productive and help potential employers trust your code quality. Following style guides like PEP 8, refactoring your scripts into functions, and systematically organizing your work will save you headaches and impress recruiters.
Don’t Fear Impostor Syndrome: Impostor syndrome affects everyone from newcomers to core project maintainers. Remember that data science involves continuous learning and pivoting as technologies evolve. Acknowledging what you don’t know and focusing on growth is a strength, not a weakness.
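
To make the practical statistics and workflow points above concrete, here is a small sketch using only Python's standard library. It cleans a handful of records, then computes a mean, a sample variance, and a Pearson correlation. The `raw_rows` data and the helper names `drop_incomplete` and `pearson_r` are assumptions made for illustration; in real work you would more likely reach for pandas or NumPy for the same steps.

```python
import statistics

# Hypothetical raw measurements; None marks a missing value.
raw_rows = [
    {"hours_studied": 1.0, "score": 52.0},
    {"hours_studied": 2.0, "score": 58.0},
    {"hours_studied": None, "score": 61.0},
    {"hours_studied": 3.0, "score": 65.0},
    {"hours_studied": 4.0, "score": None},
    {"hours_studied": 5.0, "score": 79.0},
]


def drop_incomplete(rows):
    """Keep only rows where every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Cleaning comes first: drop rows with missing values.
clean_rows = drop_incomplete(raw_rows)
hours = [row["hours_studied"] for row in clean_rows]
scores = [row["score"] for row in clean_rows]

print("rows kept:", len(clean_rows))
print("mean score:", statistics.mean(scores))
print("sample variance:", statistics.variance(scores))
print("correlation:", round(pearson_r(hours, scores), 3))
```

Splitting the work into small named functions, rather than one long script, is the kind of refactoring the workflow point describes: each step can be read, tested, and reused on its own.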

Interesting quotes and stories

"If everyone I hired knew the ins and outs of support vector machines, that would be a horrible team." -- Hugo referencing the importance of diversity in data science skill sets.

"A picture is worth a thousand lines of code." -- Hugo emphasizing the power of data visualization and telling stories with plots.

Key definitions and terms

Bootstrapping: A method in statistics that resamples a given dataset repeatedly (with replacement) to estimate the variability or distribution of a particular statistic (e.g., the mean).
PEP 8: Python Enhancement Proposal 8, the style guide for Python code, whose conventions make code more readable and consistent.
Version Control (Git): A system that records changes to files over time, allowing you to revert, branch, merge, and collaborate effectively.
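
The bootstrapping definition above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not a method described in the episode: the `sample` values are invented, the function names are hypothetical, and the interval uses the simple percentile approach.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Return the means of n_resamples resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    size = len(data)
    return [
        statistics.mean(rng.choices(data, k=size))
        for _ in range(n_resamples)
    ]


def percentile_interval(values, lower=2.5, upper=97.5):
    """Approximate percentile interval from a list of bootstrap statistics."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lo = ordered[round(lower / 100 * last)]
    hi = ordered[round(upper / 100 * last)]
    return lo, hi


# Hypothetical sample, e.g. measured response times in milliseconds.
sample = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 9.5, 12.7, 11.0, 10.6]

means = bootstrap_means(sample)
low, high = percentile_interval(means)
print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"95% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

`random.choices` draws with replacement, which is what makes each resample a bootstrap sample rather than a shuffle of the original data. The spread of the resulting means gives a sense of how much the sample mean could vary.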

Learning resources

Here are some recommended courses from Talk Python Training if you want to deepen your skills around data science and Python:

Data Science Jumpstart with 10 Projects: Perfect for building real, hands-on experience across essential data analysis, cleaning, and visualization tasks.
Python Data Visualization: Learn to create compelling, clear visuals that help you and others understand complex data.
Getting Started with NLP and spaCy: If you want to explore natural language processing and expand your data science skill set.
Move from Excel to Python with Pandas: Especially relevant for those who have used Excel for data tasks and are ready to transition into scalable Python workflows.

Overall takeaway

Data science offers countless opportunities, from basic data cleaning to advanced machine learning. Aspiring data scientists who build a public profile, participate in real-world projects, and develop solid fundamentals in statistics and Python libraries will stand out in a competitive market. Most importantly, stay curious, be authentic, engage in the community, and let your passion for solving data-driven problems shine through.

Links from the show

Hugo on Twitter: @hugobowne
DataCamp: datacamp.com
DataFramed (Hugo's podcast): datacamp.com

Conferences to check out
All PyData conferences: pydata.org (forgot to mention these)
ODSC: odsc.com
Aggregated lists (see what interests you): E.g. kdnuggets.com/meetings

Select Data Science blogs/online resources
DataCamp community: datacamp.com/community/tutorials
KDnuggets: kdnuggets.com
O'Reilly data science blog: oreilly.com/topics/data
Fast Forward Labs blog: blog.fastforwardlabs.com
Episode #139 deep-dive: talkpython.fm/139
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript

00:00 Data science is one of the fastest growing segments of software development,

00:03 but it takes a slightly different set of skills than your average full-stack development job.

00:07 This means there's a big opportunity to get into data science, but how do you do it? How do you get

00:13 into the industry? Well, that's what Hugo Bowne-Anderson is here to tell us all about.

00:18 This is Talk Python To Me, episode 139, recorded November 7th, 2017.

00:23 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem,

00:43 and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm at

00:48 mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on

00:53 Twitter via at Talk Python. Talk Python To Me is partially supported by our training courses.

00:58 Have you tried to learn Python but got stuck or lost focus? We know how it feels to try and jam

01:04 fact after fact, loop construct after ternary expression, into your head. At best, it's boring.

01:10 At worst, it can turn you off programming altogether. That's why we built our course,

01:15 Python Jumpstart by Building 10 Apps. This course guides you through carefully planned applications.

01:20 It starts simple but progresses to quite real apps. Best of all, you won't be learning dry

01:26 facts. You'll be learning like the pros do by building real applications and learning in

01:31 context. If you want to start building with Python, just visit talkpython.fm/course to get

01:37 started. Hugo, welcome to Talk Python.

01:40 Thanks, Michael. Great to be on the show.

01:42 It's fabulous to have you here. I think it's time that we do a dive into how people become

01:49 data scientists and how they get into data science. And really, I've done a couple of shows on becoming

01:55 programmers, but that's not exactly the same thing as becoming a data scientist, in a sense. So,

01:59 I'm super excited to talk to you about all these different paths into data science and how people

02:04 can kind of level up in that space.

02:06 I'm really excited also because I've been thinking about this a lot lately, of course.

02:11 Yeah, of course. It's a really important topic. I mean, data science, I really attribute Python's

02:16 meteoric growth over the last three years to data science. I know it's growing in other areas and

02:21 it's playing important parts all over. But data science, the rise of Python in data science and the

02:27 rise of Python becoming more popular, those two graphs seem like the same.

02:31 Absolutely. And you can see the Python community embracing this as well. I mean, I was at PyCon in

02:36 Portland where you are this year and we had two keynotes, although many keynotes, but Jake

02:41 VanderPlas and Kathryn Huff. Seeing, you know, such data science luminaries and thought leaders being

02:47 invited to something like PyCon to give keynotes more and more is really exciting.

02:53 Yeah. And Jake VanderPlas's keynote especially struck a nerve with me because it really opened

02:59 my eyes to, you know, basically his message was this Python ecosystem is a mosaic and there's many

03:06 different ways in which people are using Python and basically many different things that Python means

03:12 to different people. And the way that maybe a web developer working on a large scale web app works

03:18 is really different than the way a data scientist exploring astronomical data would work.

03:23 But these are both super valid and reinforcing ways to work. And I really liked his message.

03:28 For sure.

03:28 Yeah. Awesome. Okay. So before we get into paths into data science, let's get your story. How did you get

03:35 into programming in Python?

03:36 Okay. So at grad school, I'm from Australia. I went to the University of New South Wales for grad school.

03:42 I did pure math there and I did a bit of MATLAB. I'd done some Maple as an undergrad, but all of this was

03:48 relatively minimal. When I started my postdoc, I moved away from pure math and went into more applied math that I'd done as part of my undergrad.

03:58 I was working in cell biology, in fact, in Germany, and I was working in biophysics. So thinking about kind of the physical,

04:06 mechanical principles of how cells grow, reproduce, that type of stuff.

04:11 It sounds really interesting.

04:12 It was incredible. And I was actually working in an institute of maybe 400 cell biologists and theorists dedicated. So it wasn't on a university campus. It was an incredible environment.

04:24 But I was hired ostensibly to do mathematical modeling. But the biologists I work with kept on coming and asking me the same questions with respect to data analysis, statistical inference, this type of stuff, which I didn't know a great deal about at the time.

04:39 But with my quantitative training, I really could pick it up on the fly. And so I started working more on the data analysis and statistical inference in conjunction with the modeling.

04:49 Of course, to do that today, you need to be able to program. Right. Because data sets are so large. I mean, you can't do it with pen and paper like they used to.

05:00 So I started learning Python and R to do this. I learned through online courses and a lot of web resources.

05:10 And in fact, you know, the open source communities in Python and R are really embracing. So any questions I had, I could pick up on the fly.

05:19 Yeah, that's really cool. And I definitely see this working with scientists or in these types of areas. Very important.

05:27 How much programming did the biologists do? Like, did they program in MATLAB? Were they just Excel people?

05:35 Like, how much were they kind of taking care of themselves? And how much were you solving their like science problems?

05:41 So the answer then, which this was now, well, seven years ago, I didn't quite realize that.

05:46 The answer then was MATLAB. Grad students would come to me and say, I need to learn how to do this image analysis in MATLAB.

05:58 Or how do we estimate these statistical parameters? How do I get the mean out of this data set using MATLAB?

06:05 Something I saw

06:07 maybe three or four years ago was a conversion in which more and more students came and started asking to learn Python and R.

06:17 Biology is R a lot of the time, but more and more it's Python in physics and R in biology.

06:24 And I think people just started seeing the value.

06:27 Also, I think there's a challenge: MATLAB is incredible for a number of things.

06:34 But part of their business means they're embedded in institutions.

06:39 And it's really tough for institutions to break away.

06:42 It's generational in a lot of ways, actually, because, you know, for the guys at the top, MATLAB worked for them in a number of respects.

06:50 But seeing kind of the resurgence of these open source libraries for academic research is really exciting.

06:56 It's super exciting. And to be fair, you know, the world looked really different 25 years ago in scientific computing than it does now.

07:05 Like open source was not so much of a thing.

07:07 It was, you know, your alternative was probably like C++ or something. Right.

07:11 Absolutely. I mean, maybe Fortran. Right.

07:13 But it was not they were not great tools.

07:16 So it was a super clear choice to choose MATLAB.

07:19 But you're right now. It's the senior professors that have been doing MATLAB for 30 years and all their work is in MATLAB.

07:24 You know, there's probably a tendency to just kind of stick with that.

07:27 Your students come on to help you and it's like, hey, you've got to learn MATLAB because that's where I work.

07:31 Right. Something like this. That's correct.

07:35 But I do think there's some really interesting,

07:37 you know, growth around sort of trying to displace MATLAB.

07:43 I mean, there's SageMath also from Seattle, you know, a similar place to where Jake VanderPlas and the eScience Institute are up there.

07:52 But yeah, I think it's it's really powerful to see people coming in learning Python.

07:57 I think one of the major advantages that people get is you can take that into general industry when you're done.

08:04 Like if you go and study applied math, but then you actually don't become a professor.

08:07 What are you going to do? Right.

08:09 Well, knowing Python opens a lot more doors for you than knowing MATLAB.

08:16 Yeah. And you can collaborate with anyone around the world as well.

08:19 For somebody to read and execute your code doesn't require them to have a proprietary license.

08:24 Yeah, that's a really good point.

08:26 I think, you know, the cost of these proprietary systems like MATLAB or Maple, like you mentioned, are really problematic.

08:32 Right. Like I did some work with Wavelet decomposition.

08:37 And that was like a two thousand dollar add on to MATLAB.

08:40 I mean, that's a crazy amount of money for one license.

08:43 Absolutely. Right.

08:44 That's probably like pip install something now.

08:46 That's right.

08:47 And so I mentioned Katy Huff's keynote at PyCon.

08:51 She in this keynote did a wonderful thing where she laid out a series of points of what scientific research and scientific methodology has been historically and needs to be and demonstrated that open source communities actually are far better at all these scientific principles than most other communities that have existed since the ancient Greeks.

09:11 So things such as version control, reproducibility, absolutely open code bases, this type of stuff is exactly what science needs now, particularly as, you know, we all hear this buzz term, the reproducibility crisis.

09:24 It's incredibly important that all of our tools and techniques are open.

09:27 Yeah, that's a really interesting point that open source kind of is very much in the same has the same zen as the principles of scientific research and exploration.

09:36 Right. Yeah.

09:37 Yeah, that's awesome.

09:38 So that was how you got into Python and programming.

09:41 How about now?

09:42 What do you do today?

09:42 So I work for a startup called DataCamp and we do online education for data science.

09:49 So we have an in-browser platform and now we have a mobile app, actually, where people can come and learn and practice and apply data science.

09:57 Until recently, so I've recently changed positions in the company.

10:02 Until recently, I was working on curriculum building.

10:04 So when I joined the company, we had two Python courses.

10:07 And over the past year and a half, I built it out with colleagues and external instructors to around 30 courses.

10:13 So that job was really ideation, what courses will look like, high level view of a curriculum, figuring out what data sets to use in courses, techniques to teach, whether it's scikit-learn, pandas, how to approach these APIs.

10:28 We've taught them with wonderful people such as Andreas Müller of scikit-learn, great courses with the people at Anaconda. I spent my days writing code and explanations for courses, marketing material, being on calls and on GitHub with instructors, which was so much fun.

10:44 I think this is, you know, I actually did basically the same job, but for a different focus curriculum.

10:51 But for a long time, I was sort of head of curriculum at a company called DevelopMentor, which just got acquired, and I don't do that anymore.

10:59 But it was really fun to sort of look broadly at a technology, think about how do people get started?

11:07 How do they become experts?

11:08 What are the important parts?

11:09 And really try to piece that together as like a jigsaw puzzle.

11:12 It's a fun job.

11:13 It's super fun.

11:13 And particularly, just trying to match up all those different parts of curriculum building.

11:18 So just making sure that, you know, five days in a row, I'm not in the weeds of figuring out which data sets to use.

11:26 So mixing up high level curriculum building with being in the weeds.

11:29 Yeah, that's awesome.

11:31 It's definitely a social side of programming.

11:33 For sure.

11:34 And now I've transitioned to a job working as a data scientist, a data science advocate and evangelist for DataCamp.

11:42 So I'm doing data science on a daily basis, writing articles about data science for our community, pedagogical articles, technical blog posts, topical.

11:52 So currently I'm writing and developing an analysis of the Me Too movement on Twitter.

11:59 So seeing how that is developed using the Twitter API and Python and a package called Tweepy.

12:04 Yeah, so doing data science.

12:06 For example, the Twitter analysis I just spoke about.

12:10 I'm also doing data science on student data to see how we can cluster users and cluster our students and see best learning techniques for students.

12:20 So I think that's a lot of things.

12:45 a student imports pandas as pd.

12:48 What are the top three mistakes they make straight away?

12:52 And these types of things will be of interest to us, to our students, and also to the open

12:56 source community at large.

12:58 Yeah, I think that's really interesting.

13:00 These ideas of helping people with that first step.

13:04 Because a lot of times getting into a new technology or a new library, it's those first steps that

13:09 are the hardest to take.

13:11 Absolutely.

13:12 Nice.

13:12 And are you also doing a podcast?

13:14 Exactly.

13:15 I'm currently developing a podcast called DataFramed, which is about data science,

13:20 about what data scientists do on a daily basis, and about the societal impact of data science,

13:25 which is really exciting.

13:27 Because I think it can be of great value to our students, and it can be of great value to

13:32 a lot of working data scientists.

13:33 So for example, data scientists working at Uber really may have no idea what data scientists

13:38 working at Netflix or data scientists working in astronomy do on a daily basis.

13:43 Because it's a term that encompasses so much of what working professionals do.

13:48 I think it's really exciting for me and will be for the community as well.

13:52 Yeah, I think that's awesome to shine a light on these different areas.

13:55 I mean, the stuff that you're doing at Uber, like you said, is really, really different than maybe

14:00 trying, if you're working at, say, a police department, trying to understand how police violence or violence

14:10 against police happens.

14:11 And these are really, really different.

14:12 But maybe there's lessons to be learned from one to be applied to the other.

14:16 Absolutely.

14:16 And all the way from that to city transit data, how I live in New York, and New York

14:22 transit has a huge API, the MTA, where you can go and access data with respect to how the subway

14:31 works and how decisions are implemented around that.

14:33 Yeah, that's really interesting.

14:35 I bet there's some awesome data science stories to be told out of the public transit of these major cities.

14:41 Absolutely.

14:41 There's a blog called iQuantNY, which is all about getting access to public New York data and seeing what you can find in it.

14:51 So that's a great blog to check out.

14:53 Yeah, there's probably a lot of data science going on there in New York around the stock market as well.

14:58 Yeah, absolutely.

15:00 Oh, man.

15:00 Of course, they don't share that so much.

15:03 Yeah.

15:03 Once you find something that works, they keep that quiet, right?

15:06 Yeah, that's it.

15:07 The other thing I've been doing recently is these Facebook Live Code Along sessions.

15:13 So I'm a huge fan of live coding.

15:17 I know it's probably slightly masochistic, but I don't have a huge problem.

15:21 One of my favorite things about live coding, one of the most valuable moments for me and people

15:25 coding along is when I make a mistake that I can't figure out and I need to go and use a search engine

15:32 and go to Stack Overflow.

15:33 And people see me kind of figure it out in real time.

15:36 And actually, I think Jake VanderPlas one-upped that one.

15:40 So where he was doing a live coding session and found a bug in scikit-learn and went in

15:47 and issued a PR in the coding session that fixed the bug.

15:52 And that's online somewhere.

15:54 That is awesome.

15:55 Yeah.

15:56 But the Facebook stuff's great because Facebook's really pushing their live sessions at the moment.

16:01 So everyone who follows us, we've got now, I think, 330,000-odd followers on Facebook.

16:09 When I start a live code along session, they all get notified and a whole bunch of them jump on and

16:14 interact and it's super fun.

16:16 And they can comment just below the video.

16:18 I've got a colleague there who filters the questions and I answer some of them.

16:22 And that's also a lot of fun.

16:23 That's a really interesting trend.

16:25 I haven't seen a whole lot of that previously, but I did just talk about on Python Bytes last week,

16:32 this AI framework that you can basically plug your AI into almost any game and then teach it to play that game.

16:43 And the guy who runs that, he actually has a Twitch channel.

16:47 And some of his Twitch code along building up these AIs and teaching them to play various games sessions are like six hours long.

16:55 That's really cool.

16:55 And it's, it was really, I'd never really watched one of those.

16:58 It's really quite interesting, actually.

16:59 Yeah.

16:59 It's a whole different world, isn't it?

17:02 Yeah, it sure is.

17:03 I mean, a lot of the stuff that's online is really polished or somewhat polished,

17:07 but it's at least intended to be polished and like packaged up for like, here's a 20 minute little thing.

17:11 Whereas, you know, those are more like, let's just explore this until we have the answer.

17:15 That's cool.

17:16 Yeah, exactly.

17:17 Nice.

17:18 All right.

17:18 So careers into data science, career paths into data science.

17:22 Let's, let's talk about those.

17:23 I think we'll, in this conversation, move to maybe more technical, more specifically data sciencey material.

17:31 But the first thing I wanted to state very passionately is that, as with anything, but perhaps more so in data science, be active, be curious, and be part of a community.

17:42 There are lots of budding data scientists, aspiring data scientists, working data scientists, hiring managers out there.

17:50 And getting in touch with them and putting yourself out there is incredibly important.

17:56 So to that end, I'd really suggest starting on some basic data science projects, if this is your first foray into it.

18:03 And we can talk about what that could look like in a second and create a public profile.

18:07 Get yourself a GitHub account to do that.

18:10 Maybe have a little blog where every now and then you post some analysis you've done.

18:14 Even if it's a basic exploratory data analysis, that's great.

18:20 And put some words in there, put some images and figure out how to communicate around this.

18:26 Go to conferences.

18:27 Go to meetups and talk to people.

18:30 Hackathons are also fantastic.

18:31 Yeah.

18:32 Hackathons, that's definitely a nice way to meet people who are, you know, more than just sitting next to you at a presentation.

18:40 But actually, you know, you're kind of working together a little bit.

18:42 That's cool.

18:43 Yeah.

18:44 I think I definitely encourage people to create some kind of blog or write stuff.

18:49 I think that that's really valuable.

18:50 And it doesn't have to be, you don't have to wait until you're an expert for sure at like something.

18:55 It could be you're solving some problem and you couldn't find a way online how to solve that particular problem.

19:01 So, you know, blog about that, right?

19:03 Talk about what you tried, what didn't work.

19:05 There's a lot of people who would be interested in following along this I'm getting started sort of story.

19:11 Absolutely.

19:12 And also, do a bit of self-promotion or marketing.

19:16 I'm not suggesting like, you know, get your paid ads on Facebook.

19:19 But, you know, if somebody asks a question on Reddit or Quora or Stack Overflow and you think your response may be helpful, get out there and put it there.

19:30 There are also a number of blogs that have really wide distribution where you can write analyses for them as well.

19:37 I mean, DataCamp, we've got a community section where we solicit external contributions.

19:43 The Open Data Science Conference, ODSC, has a blog where they do the same.

19:47 So, once you feel a bit more comfortable with your material, definitely put it out there.

19:54 And I know that that can be difficult as well.

19:57 So, there's certainly a bit of a loss of ego that needs to occur in this scenario.

20:03 But just remember, there's a lot of interesting stuff going on out there and you can be part of it.

20:08 Absolutely. And, well, I think it's really, really important on how you frame what you've created and presented.

20:13 If you say, I'm the expert in this thing and then you're not, well, then, you know, people may find that out and that's going to go badly.

20:18 But if you're really upfront, like, look, I'm really just getting started, everyone.

20:21 But here is something that I couldn't find any help with.

20:23 And here's what I figured out and I thought it was awesome.

20:25 Like, nobody's going to knock you for that, right?

20:28 Absolutely.

20:28 Well, except for maybe on Reddit, they might send something angry.

20:32 But you've got to have to take it.

20:35 There'll always be at least one troll, right?

20:36 That's right. That's right.

20:38 But, you know, it's totally, totally worth it.

20:40 This portion of Talk Python To Me has been brought to you by Rollbar.

20:44 One of the frustrating things about being a developer is dealing with errors.

20:48 Relying on users to report errors, digging through log files, trying to debug issues,

20:53 or getting millions of alerts just flooding your inbox and ruining your day.

20:57 With Rollbar's full-stack error monitoring, you get the context, insight,

21:01 and control you need to find and fix bugs faster.

21:03 Adding Rollbar to your Python app is as easy as pip install rollbar.

21:07 You can start tracking production errors and deployments in eight minutes or less.

21:12 Are you considering self-hosting tools for security or compliance reasons?

21:16 Then you should really check out Rollbar's Compliance SaaS option.

21:19 Get advanced security features and meet compliance without the hassle of self-hosting,

21:24 including HIPAA, ISO 27001, Privacy Shield, and more.

21:29 They'd love to give you a demo.

21:30 Give Rollbar a try today.

21:32 Go to talkpython.fm/Rollbar and check them out.

21:36 The other thing you brought up is a GitHub repo or a GitHub profile.

21:41 And I think that that's super important.

21:42 One of the things about GitHub is you can't fake the commit history over time very easily,

21:49 right?

21:49 Like if you say, I've been doing this for three years, but your GitHub repo only has like

21:53 a week of activity, that's not a great sign.

21:56 So if you're planning this, like, you know, do that stuff early so that it can, you know,

22:02 sort of create this history that is proof of what you've been doing.

22:06 Absolutely.

22:07 And something I did when I was starting up my profile on GitHub, I had a sticker, a sticky,

22:13 like a literal sticky on my computer screen, not the app stickies.

22:17 But I had a sticky that said commit to GitHub today.

22:20 Just like I didn't actually do it every day.

22:22 But you can actually see before I joined DataCamp that there was a lot of public activity I was

22:29 doing.

22:29 One, because I really enjoyed it.

22:31 But two, because I made an active decision to put myself out there.

22:35 Yeah, it makes a huge, huge difference.

22:37 So conferences, what are some of the data science conferences that people should go to?

22:42 I really like the Open Data Science Conference, ODSC, which I mentioned earlier.

22:48 And in fact, that's where I met the DataCamp people.

22:51 And I had two, well, three, let's say two and a half job offers from going

22:57 to ODSC.

22:58 I also think not only conferences, but meetups are incredibly useful.

23:06 I know it depends which city you're in.

23:08 But in New York, there are a lot of interesting meetups.

23:11 A lot of people go there after work because they love data science.

23:15 And even more so, you have hiring managers and recruiters get up.

23:19 Literally, the organizers at these meetups say at the start or the end, hey, anyone who has

23:23 a job, stand up and tell us what it is.

23:25 Yeah, I see that definitely in the Python programming meetups as well.

23:28 And I agree.

23:29 That's a great way to get connected with your local people, not just people in the industry,

23:36 but people down the street, right?

23:38 Absolutely.

23:39 And the great thing about data science recruiters and data science people in HR and managers

23:45 is that there are a significant number of jobs out there.

23:49 So they're really interested in the conversation.

23:52 As someone approaching data science at the moment, you are in a relative position of power.

23:58 I mean, it is competitive.

23:59 But compared to the recruiters, they'll be definitely up for a conversation in a way that they wouldn't

24:04 in other industries currently.

24:06 And I remember I had a conversation with this great guy from Goldman Sachs where I just asked

24:11 him up front, you know, what are mistakes that you've had people make in interviews that

24:16 I should not make?

24:18 And he gave me lots of great feedback.

24:19 One example, he said, if you don't know something, just admit you don't know it and say that's

24:24 a gap and I'm looking forward to filling that.

24:26 He had one guy he asked what the bias-variance trade-off meant.

24:30 And it was on a call and he heard the guy start typing and then answered the question.

24:35 Okay.

24:35 Yeah.

24:36 Yeah.

24:37 Pro tip, use a touch pad, some kind of touch device if you're going to Google during your

24:43 interview.

24:44 Exactly.

24:45 The other thing, when you go to conferences and hackathons and this type of stuff, conferences

24:50 are also great because they have sprints.

24:53 A lot of the big, you know, packages, whether it be scikit-learn, pandas, gensim, Project

24:59 Jupyter, which we'll talk about later on, I think.

25:02 They have sprints when the conference ends where you can go and help contribute to the

25:06 project.

25:07 The communities are super open.

25:09 You can start.

25:10 They actually encourage you to start by just helping out with documentation, which is a

25:15 huge bottleneck at a certain point in open source software development.

25:19 So you can actually be an active member of these development communities immediately without

25:24 being like, oh, I don't know how to, you know, define this class correctly.

25:30 Yeah.

25:30 Well, I think another huge benefit of that is if you do want to have your public profile

25:36 have, you know, PRs against say pandas or scikit-learn or something like that, those are

25:43 mature, polished libraries that are hard to just get into yourself.

25:47 But if you go to a sprint and sit down with somebody who's an expert and you guys do it

25:50 together, well, that's a pretty quick way to get up to speed to where you can start

25:55 doing those things if you want.

25:56 That's one of your paths you want to follow.

25:58 And you're right.

25:59 You're there at these sprints and you're able, you know, best case scenario to be pair

26:02 programming with core developers on pandas or scikit-learn or NumPy, right?

26:09 That's crazy.

26:10 Yeah.

26:11 And so when you go to that job interview and they say, well, how does it really work inside

26:14 pandas when you do this?

26:15 Like, which would be better?

26:17 Should I do this or that?

26:17 You're like, well, internally it does this.

26:19 And so here's why you do that.

26:20 Like, that's an incredible answer.

26:21 And you could totally get those kinds of insights from these sprints.

26:25 I agree.

26:25 Absolutely.

26:25 And what that also demonstrates is that you're entrepreneurial, which I think, you know,

26:30 a lot of people are looking for these days, someone who will, you know, take responsibility

26:34 and run with it.

26:36 Yeah.

26:36 That puts you in a pretty thin group already, which is great.

26:40 Yeah.

26:42 You also said that reading blogs and things like that is pretty helpful.

26:47 Absolutely.

26:48 Read as widely as possible.

26:50 I think reading blogs, getting on newsletters, following people on Twitter is one of my greatest,

26:57 greatest resources.

26:58 So we've chatted about Jake VanderPlas.

27:01 On the R side, you have Mara Averick, Hadley Wickham, Hilary Mason's great, Dave Robinson on the R side.

27:07 I follow all these people, so you may as well follow me, Hugo Bowne, because I retweet a lot of this stuff.

27:12 Catches the important retweets, right?

27:14 As well.

27:15 We're really arcing up on the DataCamp community at the moment.

27:21 And as I said, ODSC has a fantastic blog.

27:24 Python Weekly.

27:26 There are so many different places.

27:28 And I'll include a significant number of links in the notes of this podcast on this stuff as well.

27:35 Yeah.

27:36 I find that Twitter is super, super valuable.

27:38 I also find Reddit, actually, if you don't mind, a few angry comments every now and then.

27:43 Certainly, the Reddit community is great and really smart.

27:47 So you can drop in on the data science one or the Python one and pick up a lot there.

27:53 Yeah.

27:54 Cool.

27:55 And so this kind of sets the stage for you to be prepared to get a job, to make the connections to get a job.

28:02 But eventually, probably most people's goals are to go get some kind of working data science job, right?

28:10 Yes.

28:10 So you already brought up recruiters.

28:13 And I think that's certainly one of the possibilities.

28:16 Probably one of the least effective ways to get a job is to just go to the career page and just apply by filling out the online form.

28:26 You know, like a recruiter can help you get inside.

28:30 If you have a friend that you know works at that place, ask for an introduction, right?

28:35 I think most jobs that are really great jobs start looking for someone to fill it by saying, all right, team, who knows somebody who would be awesome for this job?

28:47 Anyone?

28:47 And then it becomes this open search, right?

28:50 So how do you get inside this first round before it becomes just posted on the career page?

28:56 I actually think hackathons are a great way to do that because you actually start coding with people there, do a bit of pair programming, and you get to meet people there.

29:07 When there are jobs going around, there are a lot of working data scientists from all levels at these hackathons.

29:14 I also think more specific online platforms, AngelList, if you want to work with startups, there's a lot of stuff happening there.

29:25 And LinkedIn, in North America anyway, making your LinkedIn profile as attractive as possible will definitely help.

29:34 And you'll get inbound mail coming as opposed to needing to go to the apply page.

29:39 Yeah, and you're in a much better place when people are reaching out to you rather than the other way around, for sure.

29:43 I think that's totally right.

29:45 Yeah.

29:46 The other, this is general advice to anyone applying for a job, and maybe everyone knows this, but when I heard it a few years ago, it blew my mind.

29:53 And if you're applying for a job and sending a cover letter, use the same font and the same colors as that company's website.

30:00 How interesting.

30:01 Yeah, that's pretty easy to do, right?

30:03 Yeah, exactly.

30:04 And generally, they love it.

30:06 We got one recently at DataCamp.

30:07 We were like, wow, this looks really nice.

30:09 And then we're like, wait a second.

30:10 Oh, they've done that.

30:11 And when we realized that they'd done that, it was even stronger.

30:14 Yeah.

30:15 Well, I think, you know, I did, I was on the receiving end of people applying for jobs for quite a while.

30:21 Yeah.

30:22 To me, when I saw something come in and it was just a standard resume or like, here, I'm applying for this job.

30:30 Here's my info.

30:31 If it wasn't, I think your company is amazing because, and I want to work with you to do X, like that went straight in the trash.

30:39 Like if there was not something about the job, the place, the, you know, if it was just like, here's a copy of my Word document.

30:48 It was like, well, here's a copy of my recycle bin.

30:51 Next.

30:51 Exactly.

30:52 And it's the same when recruiters reach out to you.

30:54 I mean, you know, I get recruiter mail on LinkedIn, which is like, your skill set matches our company.

30:59 And I'm like, come on.

31:00 Right?

31:00 Yeah, exactly.

31:01 It's not even, hey, you've done this cool stuff in Python and whatever it may be.

31:05 But this actually speaks to something else, which is making it particular to the company and also making it particular to yourself.

31:12 So being yourself when doing data science or trying to build your portfolio is incredibly important.

31:18 I think playing to your own strengths, a lot of aspiring data scientists feel they need to be a data science unicorn so that they can, you know, they can do the data munging, data collection, data manipulation, machine learning, statistical inference, Bayesian methods, data visualization, data, you know, like this is crazy.

31:38 Right?

31:39 And when you're trying to teach people data science and they feel that that's, that's totally overwhelming.

31:44 I'm actually overwhelmed by that, that sentence I just, just stated.

31:47 Sounds like a PhD in math plus programming, right?

31:50 Exactly right.

31:51 And you don't need to be an expert at machine learning algorithms, for example, to be an effective data scientist.

31:54 That will make you some sort of effective data scientist.

31:59 But playing to your own strengths and realizing that data scientists work in teams.

32:03 So I've worked on a course recently with an educator and data scientist, Sergey Fogelson, who manages a four-person data science team at Viacom here at Times Square.

32:16 And I was chatting with him about his team.

32:19 And he said, if everyone he hired, like knew the ins and outs of support vector machines, that would be a horrible team.

32:25 He's got one person who is great at statistical data visualization.

32:31 He has one person who's a data engineer and fantastic at that.

32:34 He has one person who does the machine learning stuff and also has a background in math and physics.

32:40 So he can actually explain the ins and outs of these algorithms to, to the rest of the team.

32:44 I actually forget what the fourth, fourth person does.

32:48 But that, that speaks to the fact that managers are aware that when they hire in teams, they're going to hire people with, with different strengths.

32:55 And for that reason, I'd suggest to anyone entering data science to do things that interest you, have a play around.

33:02 Like when developing your portfolio, you'll see, you've got to do different steps in the data science pipeline, figure out what you enjoy the most, and then apply for those jobs as well.

33:11 Yeah, I totally agree with you.

33:13 And I think one of the underlying things you're touching on here is authenticity.

33:17 Because if you feel like someone is reaching out to you and they're being super authentic, like you said earlier about that, well, you know, I honestly have no idea what that term means, but I'm super excited to learn it if it's important.

33:27 Like I would love, like I'm not against, you know, not against it.

33:30 I just don't know every single little detail about this.

33:33 I think when people are hiring, if you see the enthusiasm, you see some real problem solving skills and some authenticity, it really goes a long ways.

33:43 Yeah.

33:43 And being able to adapt, pivot and learn as well.

33:47 So being able to say, hey, this is what I've learned in the past year.

33:51 I have no idea what that means, Mrs. Hiring Manager, but I'm willing to learn that is incredibly important in this space.

33:59 Because in all honesty, in five years, it might not be Python with the, you know, Julia may come up.

34:04 R may really, really blast in again.

34:07 So the ability to learn and relearn, I think, is incredibly important and demonstrating that.

34:14 Yeah, absolutely.

34:15 Because at a minimum, you have to learn the details and the ins and outs of like that actual problem set and that industry that maybe you don't have.

34:22 So another thing you touched on was do what interests you.

34:27 Because then you have the enthusiasm and that really is super powerful as well.

34:31 And I'm a big fan of combining what you're interested in or what you have expertise in plus programming, plus data science.

34:40 And I think it really gives you like this superpower.

34:43 Like you talked about this cell biology project that you had.

34:47 Like they were probably like, you know, go to Hugo.

34:50 He can solve the problem because he both owns, he controls the magic of programming and he can do this biology stuff.

34:56 And so there's this, this really unique set of skills.

35:01 Like you don't, you don't go from like a million data scientists and how do you differentiate yourself from them?

35:06 You're like, I'm, I'm the data scientist that also understands wind power like nobody else.

35:12 So if I'm trying to apply to like a renewable energy company, like, well, that's a clear win, right?

35:16 For sure.

35:18 And I definitely think you've got to be doing something you're interested in.

35:22 I think a lot of people may say, I'm going to do a Kaggle competition because that's what people do.

35:27 I think Kaggle competitions are great, but choose one that you're super interested in.

35:31 If, if you're interested in flight patterns in North America, do a Kaggle competition about, you know, how often flights are delayed, which airlines, which cities, that type of stuff.

35:39 If you're a movie buff, jump into the MovieLens data set and try to develop a basic recommender system, a recommendation engine.

35:48 If you're into Yelp reviews, or, okay, if you hate Yelp reviews that don't give you enough information, try to learn a bit of natural language processing or natural language

36:00 understanding by segmenting or filtering or clustering these Yelp reviews.

36:05 So doing things that interest you is incredibly powerful when developing your data science portfolio.

36:10 But also it makes sense, right?

36:12 In the sense that if someone's talking to you about something that they don't really care about, you're not that affected.

36:18 Whereas we've all, we all love listening to people who are passionate about something, right?

36:22 So that's, that's very powerful.

36:24 Another approach, I actually had this conversation with a data scientist and statistician

36:30 in the R ecosystem, Mine Çetinkaya-Rundel.

36:36 I'm sorry if I got that pronunciation wrong, but we were discussing this and she said, yeah, do stuff that interests you or stuff that you have to do.

36:44 And I said, what do you mean?

36:45 And she said, well, let's say you're doing, you're trying to learn data science and you're doing your budgets, your monthly family budgets in Excel.

36:52 Try to do that in R or in Python.

36:53 Try to develop a minimal dashboard and see how that goes.

36:58 If you wear a Fitbit, you know, get your Fitbit data out of CSVs and have a look at your own sleeping patterns and your own heart rate data and accelerometer data and that type of stuff.

37:08 And write something on your blog or on GitHub about that.
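
As a rough illustration of the budget idea, here is a minimal sketch using only the Python standard library. The rows, the column names (date, category, amount), and the idea of exporting the spreadsheet to CSV are assumptions made up for this example, not something described in the episode.

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for a spreadsheet exported as CSV.
exported = """date,category,amount
2017-10-03,groceries,82.40
2017-10-12,rent,1200.00
2017-10-19,groceries,64.10
2017-11-02,rent,1200.00
2017-11-09,groceries,91.25
"""

totals = defaultdict(float)  # (month, category) -> total spent
for row in csv.DictReader(io.StringIO(exported)):
    month = row["date"][:7]  # keep only "YYYY-MM"
    totals[(month, row["category"])] += float(row["amount"])

# A very small text "dashboard": one line per month and category.
for (month, category), amount in sorted(totals.items()):
    print(f"{month}  {category:<10} {amount:>9.2f}")
```

For a real file, `io.StringIO(exported)` would be replaced by an opened file; the same loop works for exported Fitbit data if the column names are adjusted.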

37:13 Right.

37:13 I think even, you know, companies get created out of those types of activities, right?

37:18 You're like, you know, I really wish I could do this thing better for myself.

37:21 And I'm like, wait a minute.

37:21 This seems like everybody must have this problem.

37:23 And this is a cool solution.

37:24 What can I do with that?

37:25 Right?

37:26 Exactly.

37:26 This portion of Talk Python To Me was brought to you by GoCD.

37:31 GoCD is an on-premise, open-source, continuous delivery tool to help you get better visibility into and control of your team's deployments.

37:39 With GoCD's comprehensive pipeline modeling, you can model complex workflows for multiple teams with ease.

37:46 And GoCD's value stream map lets you track changes from commit to deploy at a glance.

37:52 Say goodbye to deployment panic and hello to consistent, predictable deliveries.

37:56 We all know that continuous integration is super important to the code quality of your applications.

38:00 Choose the open-source local CI server, GoCD.

38:04 Learn more at talkpython.fm/gocd.

38:08 That's talkpython.fm/gocd.

38:11 I love that you spoke to this idea of creating superpowers by combining two or more areas of expertise because I think that will also help differentiate you.

38:21 You know, a lot of people are out there trying to get data science jobs.

38:24 But if you're data science plus, you differentiate yourself from everyone else who's speaking about data science.

38:31 So if you're interested in data science plus analyzing genomic data or data science plus analyzing, as we discussed, Yelp reviews, that type of stuff will help differentiate you from the masses.

38:42 Yeah, absolutely.

38:43 If I was on the hiring side and I saw this is a person who is a proper data scientist, but they also know my industry, like that goes right at the top.

38:51 That's great.

38:52 Exactly.

38:52 So let's talk about programming skills a little bit.

38:55 Love to.

38:56 Yeah.

38:56 So I'm familiar with the programming skills you need to be a web developer, but how about data scientists?

39:02 Like what do you think people should really focus on there?

39:05 Currently, I would learn at least one technology really well by applying it to projects, the types of projects we've just discussed.

39:13 I think the two most applicable technologies right now are Python and R.

39:18 So if you learn one of them really well by applying it to projects, and I'm not necessarily saying go and learn all the ins and outs of object-oriented programming in Python.

39:30 But the type of stuff you pick up when doing a project of, you know, analyzing social media trends using Twitter, you'll gain so much knowledge doing that.

39:40 I'd also suggest learning a bit about others to be able to speak the language.

39:44 So if you choose Python, I'd then learn a bit of R and not necessarily as much as you know in Python, but being able to speak that language will really help you in whatever roles you enter in the future.

39:57 Yeah, certainly having these multiple languages as your skill set to understand like, well, maybe over in R, there's this really cool way to do this one thing that's not so easy in Python.

40:07 And that can help you think of different ways to solve the problem, or maybe it's just not so obvious in Python how to do it, right?

40:13 So that can definitely open your mind to like different avenues of solving these problems.

40:17 And you maybe can grab a library that's important over there, port it over to Python and use it if you'd rather, right?

40:22 For sure.

40:23 And I think one great example of this is, so I use Python substantially more than I use R these days.

40:30 One case in which maybe I'll jump into R is doing some, you know, very basic exploratory data analysis and filtering and that type of stuff.

40:38 Because all these new tidyverse tools developed by Hadley Wickham, among other people, are incredibly useful for kind of rapid iteration of exploratory data analysis in a way that the more Pythonic tools perhaps are not.

40:55 Sure. Yeah, that's a good example.

40:57 So what do you think about, I'm not sure what the proper way to say, like sort of software engineering type of skills, like refactoring, design patterns, those kinds of ideas.

41:09 Like how important is that kind of stuff versus a good exploratory, just we're just going to find the answer.

41:15 We're just going to like rummage through this data type of programming.

41:18 That's an incredibly important question that I don't have a concrete answer to yet.

41:23 But I think what people need to do is, I mean, you don't want to go down the hole of becoming a developer.

41:31 You're trying to do data science.

41:33 I didn't actually mean it's a hole that you enter when you're becoming a developer.

41:36 But you don't want to go down the hole of, you know, developing software engineering best practices and only focusing on that.

41:42 But you do need basic programming best practices.

41:45 So the first things are, you know, having a style guide, in Python, PEP 8 all the way, commenting your code, using version control, having a workflow.

41:55 And maybe you don't have this at the very start, but do exploratory data analysis and write exploratory code while it's working for you.

42:03 But when you start tripping over it, when it starts to become more inefficient, then perhaps start to refactor your code.

42:08 Have some, you know, put your functions in modules, in .py files, for example.

42:15 Have an editor that you use or notebooks.
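
To make the refactoring step concrete, here is a small sketch of the kind of helper that might move out of a notebook and into a .py file. The module name cleaning.py and both function names are hypothetical, chosen only for illustration.

```python
# Contents of a hypothetical helpers module, e.g. cleaning.py.


def standardize_column_name(name):
    """Turn a messy header like ' Heart Rate (bpm) ' into 'heart_rate_bpm'."""
    # Replace every non-alphanumeric character with a space, then join the words.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name)
    return "_".join(cleaned.lower().split())


def standardize_columns(names):
    """Apply standardize_column_name to every header in a list."""
    return [standardize_column_name(n) for n in names]


# In the notebook or script you would then write:
#     from cleaning import standardize_columns
print(standardize_columns([" Heart Rate (bpm) ", "Sleep-Hours", "date"]))
```

Once the function lives in its own file with a docstring, it can be reused across analyses and put under version control like any other code.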

42:20 Yeah.

42:20 One of the areas where I see this kind of stuff become really important is people can do super important work, especially if they're coming more from the science side towards the programming rather than from the software side towards the data.

42:32 They're really good at writing scripts that will answer their problem, but they're not super reusable.

42:40 Right.

42:40 They're kind of just like it goes through the steps that I need to solve my problem rather than here's the thing I could make an open source project.

42:45 And imagine if pandas was just crammed inside of some other application in a way that wasn't able to become this amazing thing.

42:53 Right.

42:53 Exactly.

42:54 And that's a huge bottleneck for working scientists.

42:58 I mean, the type of code I saw, I don't want to be too hard on the biologists, but the type of code I saw was really like we had to go through it in serious detail to figure out what was happening in there, even when it was published.

43:14 And of course, remember that you're writing code for other people to read.

43:18 But more importantly, you're writing code for future you to read.

43:21 Yes.

43:21 So be good to future you.

43:24 Yeah.

43:25 I often have this thought of like, if I do this, my future self will thank me in programming, but also just in like making coffee before I go to bed.

43:32 Right.

43:33 Get ready to press the button.

43:34 That's it.

43:35 And I also think there are a few other technologies which we've spoken to in some sense.

43:40 Git is incredibly useful.

43:41 There can be a slightly steep learning curve before you see the value there.

43:47 But I do think version control is incredibly necessary for data science moving forward.

43:55 Learning Bash, a bit of shell, is really useful.

43:58 If you're in a job and you need to spin up an AWS instance, you'll need to know a bit of that stuff.

44:05 I don't necessarily, you know, say, spend weeks or months using it.

44:09 And I know all of this can be quite overwhelming, all these different tools.

44:13 But if you know a bit of each, you'll be in good stead for getting into data science.

44:19 Yeah, what's worked for me a lot in these things is like, it's not like, well, I want to know Bash and Linux.

44:23 So I'm just going to like study them to death.

44:25 It's like, I have this problem I need to solve with Linux.

44:28 Let me learn enough to solve that problem.

44:31 And then you just keep doing this.

44:33 Like you build up enough to like kind of hit most of the important areas anyway.

44:37 Exactly.

44:38 And once again, you're speaking to doing projects, right?

44:41 Like having some particular project which you can do and learn tools around it.

44:46 And as we've discussed, putting that on your blog, having, you know, a blog post, how I use Linux to solve this part of this problem.

44:54 And if someone asks you about it, you can say, yeah, I know this and that about it.

44:58 And you can check out more content on my blog.

45:01 I think that's incredibly useful.

45:02 Or on my GitHub, right?

45:03 Yeah, yeah, absolutely.

45:04 It's super important.

45:05 So we talked about the programming stuff, kind of low level.

45:10 What are the core skills?

45:11 I mean, do I need to go and do I need a math degree to be a data scientist?

45:15 Do I need to be a scientist, a programmer?

45:17 Like what are the core skills?

45:18 So you definitely don't need a math degree to be an effective data scientist.

45:23 I do think, though, if you learn a bit along the way, let's say you're totally not into matrices and linear algebra and all of that jazz.

45:32 That's cool.

45:34 But if you do learn a bit along the way and try to not be scared of it, you know, you'll become probably a bit more effective.

45:41 So I'd suggest you to try and ease yourself into that stuff.

45:44 But the more important initial skills are being able to explore data, being able to read in a data set using pandas, for example, or data.table in R, and check it out.

45:57 Look at some figures.

45:58 Look at some summary statistics.

46:01 That type of stuff.
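
A first look at a data set along these lines might be sketched as follows with pandas. The tiny CSV is made up for illustration, and the column names are assumptions; with a real data set you would pass a file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# A tiny made-up CSV standing in for a real file on disk.
csv_text = """name,city,height_cm
Ana,Berlin,165
Ben,New York,180
Cy,Berlin,172
Dee,New York,168
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first few rows
print(df.dtypes)    # column types as pandas inferred them

summary = df["height_cm"].describe()  # count, mean, std, min, quartiles, max
print(summary)

# A first group-wise look: average height per city.
print(df.groupby("city")["height_cm"].mean())
```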

46:02 Very related to this is data cleaning and data manipulation.

46:05 Anyone who's, you know, there's the saying that 80% of my job is cleaning data and manipulating it.

46:12 And it's a joke because it's more like 95% of most people's jobs.

46:15 And I think this is incredibly important.
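
As a hedged sketch of what that cleaning work often looks like in pandas, the example below tidies inconsistent text, coerces a non-numeric entry to a missing value, and drops missing and duplicate rows. The data and column names are invented for illustration.

```python
import io

import pandas as pd

# Made-up messy data: stray spaces, inconsistent case, a non-numeric value, a duplicate.
raw = """city,delay_minutes
 New York ,12
new york,unknown
Boston,7
Boston,7
"""

df = pd.read_csv(io.StringIO(raw))

# Normalize the text column: trim whitespace and use consistent capitalization.
df["city"] = df["city"].str.strip().str.title()

# Convert to numbers; anything that cannot be parsed becomes NaN.
df["delay_minutes"] = pd.to_numeric(df["delay_minutes"], errors="coerce")

# Drop rows with missing delays, then exact duplicate rows.
clean = df.dropna(subset=["delay_minutes"]).drop_duplicates().reset_index(drop=True)
print(clean)
```

Whether to drop, fill, or flag missing values depends on the question being asked; dropping is used here only to keep the sketch short.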

46:19 Statistics, I think, is really essential in data science.

46:25 But I need to be careful there because when I say statistics, I don't mean the central limit theorem.

46:31 I'm talking about applied statistics or practical statistics.

46:34 And actually, when I was wrapping up my postdoc, I was asked the same question so many times by students that I started running workshops in R and Python called An Introduction to Practical Statistics, where we'd take their data sets and see how we can find out stuff in them from Python and R.

46:50 So what I'm talking about there is, you know, how to compute the mean and standard deviation, how to do basic statistical modeling, fitting polynomials, that type of stuff.

47:02 Right. And answering, is this a trend, you know, are these correlated or not?

47:07 Things like that, right?

47:08 Exactly. And thinking about how that then translates into my initial question as well.

47:14 It's not only, you know, does this look linear?

47:17 It's what are the implications of this?

47:19 What can I tell to someone who doesn't know something about the Pearson correlation coefficient?

47:23 How can I explain this in human terms to a manager, for example?
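
The practical statistics mentioned here (mean, standard deviation, a polynomial fit, and the Pearson correlation coefficient) can be sketched with NumPy as below. The paired measurements are made up for illustration, and a straight line (a degree-1 polynomial) is assumed as the model.

```python
import numpy as np

# Made-up paired measurements for illustration: hours studied vs. exam score.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
score = np.array([52.0, 55.0, 61.0, 64.0, 70.0, 71.0, 78.0, 81.0])

mean_score = score.mean()
std_score = score.std(ddof=1)  # ddof=1 gives the sample standard deviation

# Fit a degree-1 polynomial (a straight line); raise deg to fit curves.
slope, intercept = np.polyfit(hours, score, deg=1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(hours, score)[0, 1]

print(f"mean = {mean_score:.1f}, standard deviation = {std_score:.1f}")
print(f"fitted line: score = {slope:.2f} * hours + {intercept:.2f}")
print(f"Pearson r = {r:.3f}")

# The "human terms" version of the same result.
print(f"Each extra hour goes with roughly {slope:.1f} more points in this sample.")
```

The last line is the part a manager is likely to care about: a slope stated in the units of the problem, rather than a coefficient.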

47:28 Bootstrapping is an incredibly useful technique in statistics that I think everyone should know.

47:33 I might try to explain very briefly what bootstrapping is.

47:37 Yeah, yeah. Go for it, because I'm not entirely sure what it is myself.

47:40 So, and it means something different in the world you're from as well, I think.

47:43 Yes. There's two meanings of bootstrap that I know of already.

47:46 Neither of them are what I'm thinking of.

47:48 So I don't think what you're thinking of.

47:50 Think about this. You've got some data set, people's heights in a certain population.

47:54 And you have the average. So this is the average height of this data set.

47:59 But you know that, let's say you only have 10 data points or 20 data points.

48:02 You know that this won't actually be the average height of the entire population, right?

48:08 So what you do is, so the average height you've got has some sort of error bars associated with it.

48:14 And what you want to do is estimate those error bars.

48:17 And so what you do is, you resample from the sample you have.

48:22 So if you have 20 data points, you can resample 20 with replacement to get a slightly different average.

48:27 You can do that 100, 1000 times.

48:30 And then you get some sort of distribution of potential means or potential averages.

48:35 So that will tell you, that's the bootstrap of the average.

48:37 That will tell you kind of the spread of possible averages in the total population.

48:41 But the great thing is that this isn't just, this doesn't just apply to averages or means.

48:48 You know, you can do this with any statistic under certain scenarios.

48:50 And it gives you a pretty good idea of what you're looking at statistically.
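
The resampling procedure described above can be sketched with the Python standard library alone. The 20 heights are made up for illustration, and the percentile interval is one common, simple way to read error bars off the bootstrap distribution; the function names are hypothetical.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return the means of n_resamples resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, lower=2.5, upper=97.5):
    """Rough percentile interval read off the sorted bootstrap values."""
    ordered = sorted(values)
    lo_idx = int(len(ordered) * lower / 100)
    hi_idx = min(int(len(ordered) * upper / 100), len(ordered) - 1)
    return ordered[lo_idx], ordered[hi_idx]


# Made-up heights in centimetres, for illustration only (20 data points).
heights = [162, 175, 168, 181, 159, 170, 177, 165, 172, 169,
           183, 158, 174, 166, 171, 179, 163, 176, 167, 173]

means = bootstrap_means(heights)
low, high = percentile_interval(means)

print(f"sample mean: {statistics.mean(heights):.1f} cm")
print(f"approximate 95% interval for the mean: {low:.1f} to {high:.1f} cm")
```

Swapping `statistics.mean` for `statistics.median` or another statistic gives the bootstrap distribution of that statistic instead, which is the generality mentioned above.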

48:54 That's really cool. It's like meta statistics, like statistics about statistics.

48:57 Exactly.

48:59 And the great thing is, once you have that distribution of means, you can visualize it, right?

49:03 So you get a distribution, you can have a look at it.

49:06 And that speaks to the next core skill that I think everyone, if you're not going to be a specialist in data visualization, that's fine.

49:12 But as a working data scientist, you'll be asked time and time again to explain your results.

49:16 And a picture is worth a thousand lines of code.

49:20 So I think that's incredibly important to become adept at data visualization.

49:25 I think the fifth point, which is a term on everyone's tongue, is machine learning, the related deep learning.

49:33 I think machine learning is incredibly important for working data scientists.

49:38 But I don't want aspiring data scientists or software engineers who are trying to enter the data science space to fall into the trap of thinking, if I can machine learn, in inverted commas, you know, that makes me a data scientist.

49:56 And I'd suggest that definitely learn a bit about deep learning, but don't get sucked in or too sucked in unless you want that to be your focus.

50:04 And then really do it, right?

50:06 Yeah, it's definitely one of the most mysterious and sort of new buzzy parts of data science.

50:13 Exactly.

50:13 And the way it's related to this, you know, kind of re-burgeoning concept of artificial intelligence is fascinating.

50:20 But it's, you know, there's also a potential for a bubble.

50:24 I don't want to be too harsh on it because it's incredibly important.

50:27 And the effects on, you know, on society and the way we live will be huge.

50:32 But we need to be careful as well.

50:33 Yeah, well, I think probably the danger is that it can become the hammer where everything becomes a nail to hit it with.

50:39 There was this funny image I retweeted on Twitter yesterday.

50:42 I don't know where it came from originally, but there's this huge bulldozer thing.

50:46 Instead of having like a big scoop on the end of its arm, it had like a little regular person-sized shovel.

50:51 And it was like digging with it.

50:53 And the quote was something like, you know, machine learning solution when all you really needed was a few if statements.

50:59 It's something like that.

51:00 That's fantastic.

51:02 Yeah, and I do see that possibly being a danger, right?

51:05 It's not the only way to solve problems.

51:06 But the problems that they can solve are like they were unsolvable before.

51:10 So it really does have the possibility to open new doors.

51:14 All right.

51:14 But it's not the only tool for it.

51:16 Yeah.

51:17 I mean, you know, the pendulum swings both ways.

51:20 And part of the reason it's really buzzy now is because it has been incredibly effective, as we've seen.

51:24 Yeah.

51:25 And these companies are saying, hey, we have tons of data and we don't fully understand it.

51:28 Could this maybe be our magic silver bullet to unlock something we didn't know about?

51:33 Yeah.

51:33 And also you said story.

51:35 Yeah.

51:35 Storytelling, right?

51:37 Storytelling is incredibly important.

51:38 And I think, you know, even when you're writing a chunk of code, you're telling a story to future you or someone else who's reading it and trying to interpret it.

51:48 But when developing a data science project, you're introducing them to a data set.

51:53 You're showing them exploratory data analysis.

51:55 You're potentially showing them some statistical inference, machine learning pipelines.

51:59 So being able to explain in a variety of terms what your data science story is, is incredibly important.

52:08 And to give takeaways at the end, to give an introduction, this type of stuff.

52:11 So considering it a story and also thinking who your target audience is.

52:15 If you want to, you know, write a blog post which a hiring manager can understand, that's one thing.

52:22 But if you want to write a blog post which someone who's, you know, very well versed in machine learning can understand, they're very different things.

52:29 So just kind of think about that.

52:31 Practice that.

52:32 And read what other people do as well.

52:35 There's a website.

52:36 I can't remember what it is.

52:37 But it's called something like, you know, 100 Interesting Jupyter Notebooks in Data Science.

52:43 Yeah, I think I've seen that.

52:44 That's really cool.

52:45 Yeah.

52:45 Yeah, that definitely is a great place.

52:48 I think Jupyter Notebooks really are powerful and they've brought storytelling to code in a way that just wasn't there before.

52:55 Absolutely.

52:56 And the idea of being able to interactively write your code and see output straight away below the cell you've written in is really strong.

53:04 And this was actually one of Jake VanderPlas's points, right, in his PyCon keynote where, you know, someone said to him,

53:09 Oh, I can't believe you use Jupyter.

53:11 It's so slow and beefy.

53:14 And he was like, Oh, I never thought about that.

53:16 But that doesn't affect my workflow.

53:19 It's really about, you know, speed of development for me, not speed of execution, I think, was his term.

53:26 And that he can go in there.

53:27 And we all can write some code, see the output, get some cool visualizations, move on, write some markdown in there in order to have some text and tell that story.

53:37 Now, one of the greatest things, of course, and it has been the case for some time, is that GitHub renders Jupyter Notebooks as well.

53:44 So you can just give someone a link to your Jupyter Notebook on GitHub and they can go and check it out immediately without even needing to clone the repository.

53:53 Oh, I didn't know that.

53:53 That's awesome.

53:54 Yeah.

53:55 Very cool.

53:55 So I guess we're kind of getting near the end.

53:58 Probably we've got to wrap it up a bit.

54:00 But one of the final things we should focus on is, you know, it's a time of unparalleled information and learning resources.

54:10 I mean, 20 years ago, it was get a book or get a degree.

54:13 There's a whole lot more than that now, right?

54:16 Absolutely.

54:16 So you guys at DataCamp already have, you have a ton of courses for data scientists.

54:21 Yeah.

54:22 So I definitely think one way to keep up to date with what's happening in the field is online education.

54:27 And there are lots of platforms for this which offer different things.

54:30 So I think Coursera and edX, you know, opened the world of online education, not only in data science and programming,

54:37 but everything from, you know, the humanities to space exploration to politics, you know, and it's an incredible platform.

54:45 Oh, sorry.

54:46 Both of them are.

54:48 Yeah.

54:48 What we do at DataCamp is we're building a vertical platform for people to learn data science.

54:55 And what we offer really, one of our major value propositions is it's more personalized in the sense that you get a shell and you get to write a script in the course.

55:07 And you get automated personalized feedback.

55:10 So let's say I try to import Pandas and then read in a CSV, but I pass the wrong argument to it or the wrong separator or something like that.

55:19 DataCamp will say, hey, you passed in this argument.

55:22 Why don't you try doing this instead in order to import it, read the CSV correctly?
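The kind of mistake described here, reading a CSV with the wrong separator, is easy to reproduce. The following is a small sketch assuming pandas is installed; the data is made up and held in memory instead of a real file, and it does not show how DataCamp's feedback works.

```python
import io

import pandas as pd

# Made-up, semicolon-separated data standing in for a real CSV file.
raw = "name;height_cm;city\nAda;170;London\nGrace;165;New York\n"

# The default separator is a comma, so every row lands in one single column.
wrong = pd.read_csv(io.StringIO(raw))

# Passing the separator the data actually uses yields three proper columns.
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

Checking `shape` and the column names right after reading a file is a cheap way to catch this class of error early.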

55:29 So we have a mixture of videos and interactive coding sessions.

55:32 There are lots of other great places.

55:34 Kevin Markham has his Data School, which is great for Pythonic data science.

55:39 Yeah, Kevin Markham.

55:40 Yeah, Kevin Markham is doing really awesome stuff.

55:41 Shout out to Kevin.

55:42 I was just talking to him yesterday, actually.

55:44 And he and I have done a little bit together.

55:46 He's got some really cool stuff for data science and Python for sure.

55:51 Absolutely.

55:52 And of course, your courses, your Talk Python courses for pure Python.

55:56 Everyone should do this.

55:56 Well, thank you very much.

55:57 I appreciate the shout out.

55:58 That's awesome.

55:59 Of course.

56:00 All right.

56:00 Well, hopefully people who are getting started in data science, or the programmers who want

56:05 to move into data science.

56:06 Hopefully this has been really helpful.

56:07 I think there's a pretty concrete roadmap of steps that you can take to get there.

56:13 So thanks for laying that out for us.

56:14 Absolutely.

56:16 And thanks for coming up with this idea for us to have this chat as well.

56:19 Yeah, it's been really cool.

56:20 Yeah, it's super fun.

56:21 I think everyone's going to enjoy it, I think.

56:22 All right.

56:23 So before I let you get out of here, though, you've got two questions to answer.

56:27 First of all, if you're going to write some code, namely Python code, really, what editor

56:30 do you open up?

56:31 When I use an editor, which I do for scripting, I'll use Atom.

56:34 But as we've said, for most data science, I do it in Jupyter Notebooks.

56:38 I love Jupyter Notebooks.

56:39 Also, I'd recommend very soon, or even now, people checking out JupyterLab.

56:45 I don't know JupyterLab.

56:46 Tell us about it.

56:47 JupyterLab's amazing.

56:49 It's really a modular infrastructure for data science and scientific computing.

56:56 So you open up your JupyterLab kernel, and you can have a Jupyter Notebook in there.

57:03 You can have a terminal in there.

57:04 You can have a markdown file, which you see rendered immediately.

57:08 You can even have notebooks.

57:10 You and I can open Jupyter Notebooks in our respective JupyterLab environments and collaborate

57:15 on them in real time.

57:17 And you can paste code into the chat that then I can paste into my notebook.

57:22 So it's really kind of a new modular infrastructure.

57:24 That's awesome.

57:25 It's like social Jupyter.

57:27 Yeah, absolutely.

57:29 That sounds great.

57:29 So that's super exciting.

57:31 And the development around that is really strong.

57:35 Nice.

57:36 Okay.

57:36 So notable PyPI package.

57:39 Okay.

57:39 So there are so many.

57:40 It's so hard.

57:42 It's like 120,000 almost now.

57:44 It's insane.

57:45 I actually, I'll mention one that I recently discovered, and I've only played around with,

57:49 but it seems super cool.

57:49 It's called Newspaper.

57:51 And I've been thinking about it a bit recently.

57:52 I spent a lot of my time trying to scrape HTML and prettify it.

57:58 So, and for that, I use, generally, I use requests and BeautifulSoup, which those are huge.

58:03 But that isn't, those aren't the ones I'm talking about at the moment.

58:05 This is called Newspaper.

58:06 And it's a really simple API for scraping articles and curating them and doing natural

58:10 language processing.

58:11 So you can, you know, get in touch with the New York Times or whatever it may be.

58:17 Scrape the article really easily.

58:18 There are some natural language processing methods, title methods, text methods, that type of stuff,

58:23 where it'll, you know, I think the method, I probably won't get this right, but it's something

58:27 like NLP method, and it spits out keywords and topics and that type of stuff.

58:32 So I've only played around with it.

58:33 Yeah, it's an incredible library.

58:35 Yeah, I just discovered it recently as well.

58:37 And basically, the idea is, instead of combining requests, plus BeautifulSoup, and then you

58:43 have to, you get the text and the semantic markup, and you got to do whatever you're going to

58:46 do.

58:46 It's like, you can just point it at an article and say, who was the author?

58:50 When was this published?

58:51 Exactly.

58:52 What are the keywords?

58:53 And you can point it at the homepage, like the homepage of the New York Times to say,

58:57 what are the articles on this page?
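The workflow described above might look roughly like the sketch below. It assumes the `newspaper3k` distribution (imported as `newspaper`) is installed, that the NLTK tokenizer data its `nlp()` step relies on is available, and that the machine has network access. The function names `describe_article` and `list_article_urls` are hypothetical wrappers introduced here for illustration.

```python
def describe_article(url):
    """Download one article and return a few fields newspaper extracts."""
    # Imported lazily so this file still loads if newspaper3k is missing.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the HTML
    article.parse()     # extract title, authors, publish date, body text
    article.nlp()       # keyword and summary extraction (needs NLTK data)
    return {
        "title": article.title,
        "authors": article.authors,
        "published": article.publish_date,
        "keywords": article.keywords,
        "summary": article.summary,
    }


def list_article_urls(homepage_url):
    """Return the URLs of articles newspaper finds on a news site's front page."""
    import newspaper

    # memoize_articles=False asks for all articles, not only unseen ones.
    source = newspaper.build(homepage_url, memoize_articles=False)
    return [article.url for article in source.articles]
```

Calling `describe_article` with the URL of a news story, or `list_article_urls` with a site's homepage, replaces the manual requests plus BeautifulSoup steps mentioned in the conversation. Results depend on the site's markup, so fields such as authors or publish date can come back empty.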

58:58 It's crazy.

58:59 It's awesome.

58:59 And it deals with date times in a really intuitive, nice way, which date times are the bane of

59:06 my existence a lot of the time.

59:07 Why are date times so hard?

59:09 They are, though.

59:10 It really is tricky.

59:11 So I think James Gleick has this thing where it's an article about how there should just be

59:16 one time zone.

59:17 I'm not going to go into that, but I'm just putting that out there.

59:20 It's not obvious who would be the center of that time zone, but...

59:23 Yeah, that's a big debate there, right?

59:25 But I wake up at two in the afternoon, and then I get up, right?

59:29 Like, that would totally simplify things.

59:31 His argument is that time zone is a historical artifact that we need to get rid of.

59:35 But that's my notable PyPI package.

59:38 I just wanted to give a few shout outs to a bunch of others from the data science Python

59:43 stack.

59:43 And this list is by no means exhaustive, but I use Pandas, scikit-learn, NumPy is huge.

59:50 Matplotlib, Seaborn, Altair, and Bokeh are all great for data viz.

59:54 Dask for distributed computing.

59:56 PyMC3, statsmodels.

59:58 These are all really interesting and core elements of the data science Python stack that I use and

01:00:06 love.

01:00:06 Yeah, those are all very, very good ones.

01:00:08 So awesome.

01:00:09 Yeah, newspaper.

01:00:10 Lots of fun with that one.

01:00:12 All right.

01:00:12 So here you go.

01:00:14 Final call to action.

01:00:14 People, they want to get into data science.

01:00:16 What do you say?

01:00:17 Get out there and do things.

01:00:19 Play to your own strengths.

01:00:20 Be brave.

01:00:21 And something we haven't really chatted about, realize that imposter syndrome is a real thing

01:00:27 for everybody.

01:00:27 So at the inaugural JupyterCon this year, Fernando Perez, the creator of IPython, for real, the

01:00:35 creator of IPython and co-leader of Project Jupyter, encouraged everyone to realize that

01:00:40 everyone has imposter syndrome and that he himself has imposter syndrome.

01:00:44 So anytime you think you're an imposter, remember that Fernando Perez feels the same way.

01:00:49 He's out there changing the world and so can you, right?

01:00:51 Exactly.

01:00:52 Awesome.

01:00:53 That's it.

01:00:53 All right.

01:00:53 Well, great to talk with you.

01:00:54 And thanks for coming on the show.

01:00:56 Such a pleasure.

01:00:57 Thank you.

01:00:59 This has been another episode of Talk Python To Me.

01:01:02 Today's guest has been Hugo Bowne-Anderson.

01:01:04 And this episode has been brought to you by Rollbar and GoCD.

01:01:08 Rollbar takes the pain out of errors.

01:01:11 They give you the context and insight you need to quickly locate and fix errors that might have

01:01:16 gone unnoticed until your users complain, of course.

01:01:19 As Talk Python To Me listeners, track a ridiculous number of errors for free at

01:01:23 rollbar.com slash Talk Python To Me.

01:01:26 GoCD is the on-premise, open-source, continuous delivery server.

01:01:30 Want to improve your deployment workflow but keep your code and builds in-house?

01:01:34 Check out GoCD at talkpython.fm/gocd and take control over your process.

01:01:41 Are you or a colleague trying to learn Python?

01:01:43 Have you tried books and videos that just left you bored by covering topics point by point?

01:01:48 Well, check out my online course, Python Jumpstart by Building 10 Apps, at

01:01:52 talkpython.fm/course to experience a more engaging way to learn Python.

01:01:56 And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic.

01:02:04 Be sure to subscribe to the show.

01:02:06 Open your favorite podcatcher and search for Python.

01:02:09 We should be right at the top.

01:02:10 You can also find the iTunes feed at /itunes, Google Play feed at /play, and direct RSS feed at /rss

01:02:18 on talkpython.fm.

01:02:19 This is your host, Michael Kennedy.

01:02:21 Thanks so much for listening.

01:02:22 I really appreciate it.

01:02:23 Now get out there and write some Python code.

01:02:25 I'll see you next time.

01:02:26 Bye.
