Capturing human moments with AI and Python

Episode #135, published Fri, Oct 27, 2017, recorded Wed, Oct 25, 2017

We all have smartphones these days. And we take them with us everywhere we go. How much could you infer about a person (their stage in life, their driving style, their work / life balance) based on just a phone's motion and GPS data?

With the right mix of analytics and machine learning, turns out you can learn a lot about a person. Are they a dog-owning workaholic? Or an early rising parent of young children?

This week you'll meet Vincent Spruyt, the chief data scientist at Sentiance, a company building an SDK to answer these exact questions. You'll learn how they are using Python to make this happen and how they think this data could be used for the greater good.

Episode Deep Dive

Guest Introduction and Background

Vincent Spruyt is the Chief Data Scientist at Sentiance, an AI-focused company building products that analyze sensor data to infer human behavior and context. With a background in computer vision and machine learning, Vincent leads a team of data scientists and engineers who use Python, signal processing, and a broad range of AI techniques to turn raw smartphone sensor data into meaningful human insights.

What to Know If You're New to Python

Here are a few items to help you get the most out of this discussion:

You will hear terms like random forests, convolutional neural networks, and LSTMs; all are popular approaches within Python’s machine learning ecosystem.
Tools mentioned include scikit-learn, TensorFlow, Keras, and the IPython/Jupyter notebook environment for experimentation.
Familiarity with Python’s packaging and microservice deployment stories (e.g., Docker, devpi, AWS) will help you follow the architectural pieces.

Key Points and Takeaways

Inferring Human Behavior from Sensor Data
- Vincent and his team at Sentiance focus on translating smartphone sensor data, such as accelerometer, gyroscope, and GPS, into rich behavioral insights. Their work ranges from detecting simple events (walking vs. driving) to understanding deeper lifestyle patterns (e.g., “workaholic” or “morning person”).
- By training and deploying machine learning models in Python, Sentiance can interpret an individual’s routines, preferences, and likely future actions in real time.
- Links / Tools:
  - Sentiance
  - Python.org
Transport Mode Classification: From Random Forests to Convolutional Neural Nets
- Early on, Sentiance used random forests and decision trees for classifying walking, biking, driving, and other transport modes by segmenting the sensor signal into small chunks.
- Over time, they migrated to more advanced models such as boosted trees (XGBoost) and eventually convolutional neural networks (CNNs) to reduce manual feature engineering and improve accuracy.
Moments and Segments: LSTMs for Deeper Temporal Understanding
- Beyond identifying individual events (e.g., a five-minute walk), Sentiance groups these into “moments” to understand context and intent (e.g., a commute vs. an errand).
- They use LSTMs (Long Short-Term Memory networks) to capture extended temporal dependencies, letting them predict the next activity with higher accuracy and explain why a person is doing what they’re doing.
- Links / Tools:
  - Recurrent Neural Networks / LSTMs
  - Keras
Semi-Supervised Learning for Advanced User Profiling
- Sentiance tackles cases where labeled data is scarce (e.g., identifying whether someone is a parent or a pet owner). They employ semi-supervised learning and unsupervised representation learning to cluster similar users in a learned feature space.
- A small set of labeled examples (such as confirmed parents) helps classify nearby user representations, drastically reducing the need for manual labeling.
- Links / Tools:
  - Triplet loss / representation learning (general info)
Privacy and GDPR: Using Data for the Greater Good
- The team is acutely aware of potential privacy concerns in collecting and analyzing location and sensor data. They adhere to GDPR by siloing data and requiring explicit user consent.
- Sentiance aims to partner with services that offer genuine value, like health and insurance benefits, so users get more personalized and contextually aware experiences rather than being bombarded by ads.
- Links / Tools:
  - General Data Protection Regulation (GDPR)
Technical Stack and Architecture: Microservices, Docker, AWS
- Sentiance’s machine learning pipelines run on AWS, with models packaged as microservices in Docker containers. This modular approach lets them scale individual components independently.
- They consciously avoid heavy lock-in with specific cloud services, ensuring they can migrate if needed.
- Links / Tools:
  - Docker
  - AWS
devpi and Dependency Management
- The team uses devpi, a self-hosted PyPI server, to resolve Python dependency challenges and maintain consistent package versions between development and production.
- CI pipelines (e.g., Jenkins) build wheels for both Mac and Linux, storing them in devpi so that pip install consistently pulls the correct builds.
- Links / Tools:
  - devpi on PyPI
  - Jenkins
Pragmatic Use of Deep Learning
- They advise caution in jumping straight to deep learning for every problem. Sometimes simple business rules or more classic machine learning (like linear classifiers or random forests) are all that’s needed.
- Complex neural networks are employed where they offer a clear advantage, particularly for tasks involving high-dimensional sensor data or sequential modeling.
- Links / Tools:
  - scikit-learn (classic ML algorithms)
  - PyTorch (mentioned as another popular framework)
Use Cases: From Ridesharing Drivers to Healthcare
- Sentiance partners with a major ridesharing service to analyze driver safety via phone-sensor data (harsh turns, braking). They also collaborate with health-oriented apps to contextualize arrhythmias and lifestyle patterns.
- Combining heart rhythm data with daily routines can help doctors pinpoint triggers, like high-stress commutes or poor sleep patterns.
- Links / Tools:
  - FibriCheck (similar example mentioned)
  - Medup (behavior-based medication reminders)
Team Structure, Model Lifecycle, and Versioning

- They organize into cross-functional squads mixing data scientists, data engineers, and mobile developers, each focused on specific features (e.g., signal processing vs. lifestyle modeling).
- Version control and continuous integration become complex because retraining lower-layer models can break higher-layer tasks. They mitigate this by careful microservice separation or “mapping” layers to keep feature representations consistent.

Interesting Quotes and Stories

“Talking about AI taking over the world is a bit like talking about overpopulation on Mars.” -- Referencing Andrew Ng’s humorous take on AI existential worries.

“If you can solve a problem by a simple business rule, then that’s the way you should go.” -- On being pragmatic when applying deep learning versus simpler methods.

“We do a lot of data augmentation, adding noise, additive and multiplicative, to make our classifier generalize and not learn phone-specific noise.” -- Highlighting Sentiance’s robust approach to sensor data processing.

Key Definitions and Terms

Random Forests: Ensembles of decision trees that reduce overfitting and improve prediction accuracy.
Convolutional Neural Network (CNN): A deep learning architecture, typically used for image or time-series data, that automatically extracts hierarchical features.
LSTM (Long Short-Term Memory): A type of recurrent neural network (RNN) well-suited for capturing long-range temporal dependencies.
Hidden Markov Model (HMM): A statistical model that infers hidden states from observed sequential data. Often used for smoothing or labeling time-series segments.
Semi-Supervised Learning: A technique that leverages both labeled and unlabeled data to improve model performance, particularly when labels are expensive or scarce.
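
The semi-supervised idea above, where a handful of labeled users (such as confirmed parents) is used to classify nearby user representations, can be sketched in a few lines. This is only an illustration under stated assumptions: the 2-D embeddings, the user ids, and the nearest-neighbor vote are all hypothetical stand-ins, not Sentiance's actual representation or classifier.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def propagate_labels(embeddings, labeled, k=3):
    """Give each unlabeled user the majority label of its k nearest labeled users."""
    predictions = {}
    for user, vector in embeddings.items():
        if user in labeled:
            predictions[user] = labeled[user]
            continue
        nearest = sorted(
            labeled, key=lambda other: euclidean(vector, embeddings[other])
        )[:k]
        votes = Counter(labeled[n] for n in nearest)
        predictions[user] = votes.most_common(1)[0][0]
    return predictions


# Hypothetical learned embeddings: similar users sit close together.
embeddings = {
    "u1": (0.9, 1.0), "u2": (1.1, 0.8), "u3": (1.0, 1.2), "u7": (0.8, 0.9),
    "u4": (-1.0, -0.9), "u5": (-1.2, -1.1), "u6": (-0.8, -1.0), "u8": (-0.9, -1.2),
}
# Only four of the eight users carry a (hypothetical) confirmed label.
labeled = {"u1": "parent", "u2": "parent", "u4": "not_parent", "u5": "not_parent"}

predictions = propagate_labels(embeddings, labeled, k=3)
```

The labeling effort stays small because the learned feature space does most of the work: once similar users cluster together, a few confirmed examples are enough to label their neighbors.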

Learning Resources

If you’re ready to dive deeper into Python or strengthen your foundations, here are some curated courses:

Python for Absolute Beginners: Perfect if you're brand-new to Python and want a solid, practical introduction.
Data Science Jumpstart with 10 Projects: Ideal for hands-on experience applying data-driven methods to real-world scenarios.
Build An Audio AI App: Explore creating an AI-powered application with modern Python libraries and an emphasis on production deployments.

Overall Takeaway

This episode reveals how a combination of AI, sensor data, and Python can unlock a remarkably detailed picture of human routines, all while confronting the technical and ethical challenges inherent in gathering such personal insights. By emphasizing transparency, user consent, and tangible benefits (e.g., safer driving, improved health), Vincent and the Sentiance team illustrate how machine learning can improve lives. They demonstrate the power of a pragmatic approach: using everything from classic ML to deep neural networks, deploying carefully on modern cloud infrastructure, and keeping a sharp focus on generating genuine, user-centric value.

Links from the show

Vincent on Twitter: @vincent_spruyt
Sentiance blog: sentiance.com/blog
Job openings: sentiance.com/jobs
The demo app, Journeys: sentiance.com/demo
Explanation video on Sentiance: youtube.com/watch?v=9WHhGycwmew
And another video: youtube.com/watch?v=emTkGgQ-ejI
MIT Innovators under 35 Europe award: innovatorsunder35.com/innovator/vincent-spruyt
Vincent's personal blog: visiondummy.com

Libraries
TensorFlow: tensorflow.org
Keras: keras.io
XGBoost: github.com/dmlc/xgboost
CNN Networks: wikipedia.org/wiki/Convolutional_neural_network
LSTM Networks: wikipedia.org/wiki/Long_short-term_memory
devpi: devpi.net/docs/devpi
PyCharm: jetbrains.com/pycharm
Python-Flamegraph: github.com/evanhempel/python-flamegraph
Episode #135 deep-dive: talkpython.fm/135
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript

00:00 We all have smartphones these days, and we take them with us everywhere we go.

00:03 How much could you infer about a person, their stage in life, their driving style,

00:08 their work-life balance, based on just a phone's motion and GPS data?

00:13 With the right mix of analytics and machine learning, it turns out you can learn a lot

00:17 about a person. Are they a dog-owning workaholic or an early rising parent of young children?

00:23 This week you'll meet Vincent Spruyt, who is the chief data scientist at Sentiance,

00:28 a company building an SDK to answer these exact questions. You'll learn how they're using Python

00:34 to make this happen and how they think this data could be used for the greater good.

00:37 This is Talk Python To Me, episode 135, recorded October 25, 2017.

00:56 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the

01:02 ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter,

01:07 where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm,

01:12 and follow the show on Twitter via @talkpython. This episode is brought to you by Datadog and

01:18 GoCD. Please check out what they're offering during their segments. It really helps support

01:21 the show. Vincent, welcome to Talk Python.

01:24 Thank you.

01:24 It's great to have you here. You guys have a super cool platform. You're doing some seriously

01:29 deep learning and AI and machine learning, and I think everyone's going to get a pretty cool look

01:35 at what you guys are doing and how you're doing it. There's a bunch of cool algorithms going on here.

01:39 But before we get into all that detail, let's talk about how you got into programming and Python,

01:44 things like that.

01:45 Yeah, sure. Cool. It started when I was about 14, and I started hacking around with some web design.

01:50 Remember those days where everyone used those marquee banners? There was no CSS and stuff like

01:55 that.

01:56 Like the blink tag. Yeah, yeah, that was wonderful. Those were good days.

01:58 Exactly. So over the years, I went more into network security and hacking, started a company

02:03 when I was around 18, and converted more to languages like Java and C++ throughout my PhD.

02:09 What was the company you started when you were 18?

02:11 It was a network security company, so quite different from what I'm doing today. It was

02:14 a lot about the... You know, back then, the internet was more or less the Wild West. Everything

02:20 was wide open, so interesting times.

02:22 Yeah, I remember back then that things like Windows XP was the most popular operating system,

02:28 and it had no firewall at all.

02:31 Yeah, yeah, exactly.

02:32 Right?

02:32 Exactly.

02:33 Exactly.

02:33 No, it was right on the internet. It was really bad. Yeah, cool. Okay, so yeah, those were interesting

02:38 Wild West days for sure. And then you said you moved on to...

02:41 C++ and Java?

02:42 Yeah, so during my PhD, I was mostly working on computer vision, machine learning, so that

02:48 was heavily focused on real-time processing, so most of the work was done in C++. And then

02:54 I joined Sentiance about three years ago to start out a data science team. So back then, Sentiance

02:59 was quite small. I think we were with five people. And we had to choose a main programming language

03:04 for the machine learning stuff, the data science part. And so I guess, coming a bit from an academic

03:09 background, I was looking for a language that both had the ease of use that MATLAB or

03:15 R has, but on the other hand is also a language that allows us to go to production and build

03:21 scalable systems. That's how we came up with Python. And so since then, Python has been

03:25 my go-to.

03:26 Yeah, that's really great. Three years ago, I think it was probably on the border of whether

03:30 Python was going to be one of the really dominant machine learning languages. I mean, now it's

03:37 like really a clear choice. But three years ago, it was just starting to be a clear choice,

03:42 right? What other things did you consider? And why, in the end, did you choose Python?

03:46 The thing is that obviously you don't want to use MATLAB in a production system. So then I guess the

03:53 choice was whether or not we would want to use a programming language like Java for the data science

03:59 work. And there has been some discussion around that because the data engineering team here at

04:04 Sentiance does most of the stuff in Java, you know, making sure everything is scalable, all the

04:09 infrastructure stuff is Java. But for us, we noticed that although we love Java, doing rapid prototyping,

04:16 quickly coming up and testing your models, it's just much easier in Python. And actually,

04:20 although there is a lot of discussion on whether or not Python is a production-friendly language,

04:25 if you look at, for example, YouTube, they also use mainly Python for their whole platform. So

04:30 it's quite powerful.

04:31 Yeah, it's incredibly powerful. You look at some of the people or some of the companies doing

04:36 really interesting things. YouTube is a great example. YouTube handles like a million requests per

04:41 second. So that's a pretty insane level of web traffic right there. And yeah, there's some other

04:48 really cool use cases like that as well. So I guess the main takeaway is sort of the combination of

04:55 it's quick and easy, but also you can go fully to production, like all the way to like real scalable

05:01 levels of running and production, right?

05:04 Yeah, indeed. And of course, the machine learning community is growing these days. And I mean,

05:09 the amount of libraries and support you have for machine learning in Python is just huge compared to

05:14 almost any other language.

05:15 Yeah, absolutely. So what do you do at Sentiance now?

05:18 So I'm the chief data scientist at Sentiance. So basically, my job is mostly focused on the research

05:24 part and on building the algorithms together with the team, of course. So Sentiance is an AI company

05:30 first. So we are currently with about 50 people, and about 40 of them are actually technical, half of

05:36 them data engineering, and half of them data science. And with data science, we mostly mean, you know,

05:41 machine learning, signal processing, building the actual algorithms there.

05:45 Okay, that sounds, that sounds pretty awesome, like a pretty fun job to be part of, I'm sure.

05:50 So I guess, you know, probably a lot of the listeners don't know about Sentiance, maybe give us

05:55 like a high level idea of what you guys do. I mean, you have this specific SDK that people can plug into

06:03 their mobile apps. And then you look at the behaviors and stuff. Tell us what the big idea is.

06:10 Well, the idea is that these days, I mean, everyone has a smartphone, to that extent that the

06:15 smartphone is almost like an extension of your body, you continuously use it. And every smartphone

06:20 is packed with sensors, you have accelerometer sensors measuring every small vibration of the

06:24 phone, you have the gyroscope, of course, you have the location subsystem. So what we do is we have an

06:29 SDK that plugs in into the app of our customers, which are companies. And once the SDK is in the app,

06:34 we start logging all that sensor data, accelerometer, gyroscope, location, we send that to our backend,

06:38 running on Amazon Cloud. And there we have a bunch of machine learning algorithms that extract behavioral

06:43 intelligence from it. So we learn about your behavior. What is your home? What is your work

06:46 location? What do you do every day? Why do you do it? Can we predict your future behavior, etc?

06:50 Wow. Okay. So I totally agree that, you know, it's a crazy world where we always have our phones with us.

06:59 Like I would rather accidentally leave my keys at home and leave my house unlocked and leave my house

07:04 without my phone. You know what I mean?

07:06 So if you can take this information about what people are doing just ambiently and do more with it, that sounds

07:12 pretty cool. So you take all the sensor data, orientation, accelerometer, location, things like

07:19 that. And you say you extract intelligence from it. So I watched there's a video on your homepage,

07:25 I'll link to it in the show notes where you have these different layers, right? Like one layer comes in and

07:31 just takes in this ambient information. One tries to understand what you're doing. And then the other,

07:39 what's the final one to try to create like these moments or something? Tell us about that.

07:43 We start with a low level sensor data. And based on that, we do what we call event detection. So we

07:49 have a whole bunch of classifiers that take in the sensor data and for example, try to classify your mode

07:55 of transport. Based on the vibrations of the phone, you can figure out is this person walking,

07:59 biking in a train, on a subway, in a car, or for example, given your location, your current location,

08:05 we want to figure out where are you? I mean, are you visiting the bus stop that is five meters from

08:10 your location? Or maybe if you're there in the middle of the night for five hours in a row,

08:14 you're not visiting the bus stop, you're actually in the bar that is like 50 meters further, right?

08:18 And so we end up with this whole timeline of human behavior. But of course, then you know what the

08:22 user is doing, but not why he's doing it. So we feed this event timeline into our deep learning

08:28 based prediction model, we can predict what you're going to do next. And that allows us to explain why

08:32 you're doing something, you know, is this are you in a car because it's your commute or because it's

08:36 your shopping routine? Is this a business trip? Is it a leisure trip, etc. And then finally, indeed, we

08:41 aggregate all that data, all those timelines over weeks. So we can come up with with more like a profile.

08:47 Are you a shopaholic, a workaholic? Are you an aggressive driver? Do you have children? And that data we then

08:52 expose back to our customers.

08:53 That's pretty wild. It sounds really challenging, because like you said, there could be a bar and

08:58 then right outside the bar could be, you could be sitting like right in the front, you know, with

09:03 like maybe on a beautiful day, like the windows are open or something. And then right next to that is a

09:07 bus stop. And so determining whether you're trying to go on the bus, or you're trying to relax after

09:13 work. That sounds like a real interesting challenge.

09:15 Yeah, exactly. Even more, I mean, you can be in a bar because you work there, you can be in a bar because

09:20 you just like to go for a drink, you can be there because you live in the apartment on top of it. So

09:24 trying to figure out why a user is doing something is pretty cool.

09:28 Yeah, that's definitely taking it to another level. So what is your day to day look like? Do you do a

09:36 lot of research? What kind of tools do you do to come up with some of these models? Do you write

09:40 production code? What are you doing?

09:42 It's a mix of both research and actually writing production code indeed. So most, most of the time when we start

09:47 on a project, there is a research phase. I mean, there's a lot of reading papers, experimenting, usually just in

09:52 IPython notebooks, you create your models, you validate them. But then at some point, indeed, we have to convert

09:58 this code into something that is production ready. So then it actually comes or boils down to cleaning up the code,

10:04 creating a nice object oriented framework. You need testing, regression testing, you know, performance profiling,

10:10 make sure that everything is scalable, and then encapsulate it into an API so that we can deploy

10:16 this as a microservice. And finally, try to get a microservice running in a Docker image. And that is

10:22 actually then when the data engineering team comes in. So the data engineers take this Docker image,

10:26 refine it, basically, they build a base Docker image. So we just customize it a little bit for our specific

10:31 projects. And they help us to make sure that we can deploy it in a scalable manner.

10:35 That's cool. And do you guys deploy it to your own system? Or do you deploy,

10:38 I forgot what AWS's container service is called. But do you deploy it to the container service? Or do

10:44 you have more control over where it lives and runs?

10:46 It's our own system. So we try, I mean, we love AWS, but we try to be not too dependent on their

10:53 specific services, mainly because we need to be able to move to different cloud infrastructures if needed.

10:58 Isn't that an interesting struggle? Like, there's so many features in AWS and Azure, those are the two that

11:05 have like, a ridiculous number of things that the cloud can do. But the more that you like, put those

11:12 hooks into your system and those dependencies, like the more you're locked in, I mean, people used to talk

11:18 about lock in like with Windows or with Mac or with iOS or whatever. But like, the cloud lock in is a whole

11:23 another level if you go all in, right? Yeah, exactly. And of course, that's their business model. I mean,

11:27 they try to get you to use those specific services.

11:31 Yeah. Yeah, that's cool. Okay, so go with your own container service. That makes a lot of sense.

11:37 And for the tooling for R&D stuff, this is like IPython notebooks. And what are the notable packages and

11:45 libraries you're using there? It depends on the project, I guess. Obviously, we just use a lot of typical

11:50 libraries like scikit-learn for most of the modeling. When we talk about deep learning,

11:54 it's usually TensorFlow and some Keras. For performance and memory profiling, we create

12:00 things like flame graphs. For unit testing, we use pytest or nosetests. Yeah, nice. Are you using

12:05 like GPU clusters in AWS or is this pure CPU-based? For deployment at inference time, it's usually CPU-based.

12:13 For research, I mean, like training the models, we indeed use GPU machines because it can take a few

12:18 weeks before a model is trained. So that really speeds it up. Yeah. And how fast is it if you use

12:23 GPUs? Also weeks. And if it's not GPUs, it's even worse. Yeah, exactly. So the last model I trained a

12:28 few weeks ago, I started using, I started out using an AWS machine that was just a CPU machine with 36

12:35 cores. And eventually, I moved to a GPU machine where I trained about two weeks. And the training there,

12:42 I mean, the loss went down about 10 times as fast as when I was using the CPU machines.

12:46 Wow, that's really awesome. Yeah. And for performance and stuff, you just get that

12:51 basically dialed in, I guess. There's two parts, right? There's training and then there's answering

12:56 the question, the inference bits. How do you sort of balance those things? Like obviously,

13:00 you're just going to train as much data as you need. But then what do you do for performance at

13:06 that point? I mean, once you have the model built and you kind of have the library, you're using

13:10 TensorFlow or whatever, you kind of like how much flexibility do you have to make it go faster?

13:14 Well, that's a good question. And usually, it's really just trying to balance performance and

13:20 cost, I guess. I mean, you can have a model with 3 million parameters and have a gain of 1% in

13:26 accuracy compared to a model with 300,000 parameters, right? While the latter, of course, is much cheaper

13:31 to be used in production. So a lot of the time, it's just balancing out those two.
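
To make the size trade-off above concrete, here is a small arithmetic sketch that counts the parameters of two fully connected networks. The layer widths are hypothetical, chosen only so the totals land near the 3 million and 300,000 figures mentioned; nothing here reflects the actual architectures discussed in the episode.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        (n_in + 1) * n_out  # n_in weights and one bias per output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical shapes: 128 input features, 6 output classes.
large = dense_param_count([128, 1500, 1500, 300, 6])  # about 2.9 million
small = dense_param_count([128, 400, 400, 200, 6])    # about 0.29 million

ratio = large / small  # roughly 10x more multiply-adds per prediction
```

Since each weight costs roughly one multiply-add per prediction, the larger network is about ten times more expensive to serve, which is the cost being weighed against a 1% gain in accuracy.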

13:36 Okay, that's interesting. Do you sometimes build like super detailed models in the research phase and go,

13:42 I think we can take it down to 100,000 or whatever levels and then run that in production and get good

13:49 enough answers? Yeah, we do. Especially, of course, if you, some of our models started out on the cloud,

13:54 and then at some point, we realized, you know, we could actually deploy those on the mobile phone

13:59 itself. So then it's very important to reduce the number of parameters as much as possible without

14:03 losing too much accuracy. So yeah, that's really cool. And that's definitely a trend in the space to

14:09 not have these tremendously powerful cloud infrastructures, but to push it to the edge, right?

14:14 To the devices.

14:15 Yeah, indeed, indeed. Especially with, you know, Apple in the new iPhone has this A11 chip that is like a

14:23 dedicated coprocessor for these kind of things. And at the same time, Google with the Pixel 2, they have this

14:28 separate chip specifically for like image processing, computer vision. So more and more phones will have

14:34 coprocessors that allow us to do edge computing without draining the battery too much.

14:38 Yeah, that's more or less like running on GPUs, right? Like these specialized hardware are way more efficient and

14:44 quick. So more reasonable to run it on these wimpy devices.

14:47 Exactly.

14:47 Yeah, yeah. Cool. So you said that you guys were about 50 people. What is the and if I remember the breakdown, right?

14:54 20 or so data scientists, 20 or so on the sort of data software side of things.

15:01 What's the team structure? Like how do you guys work together and things like that?

15:05 We have a model that is loosely based on the Spotify model. We work in a kind of a

15:10 matrix structure where horizontally, we have a set of functional teams. So there is a data science team, there is a data

15:15 engineering team, then there is a mobile SDK team, and then there is a solutions team. But we quickly realized as those

15:21 teams grow bigger and bigger, that it's very difficult to isolate. You don't want to be isolated in your team, you want

15:27 to work together with people from different backgrounds. So that's why vertically over those

15:32 teams, we define cross functional teams that we call squads. So a cross functional team has quite a

15:37 specific focus, it's kind of a mini startup. And it consists of a few data scientists, a few data

15:43 engineers, a few mobile guys, they build stuff from concept to actually bringing it into

15:48 production.

15:49 That sounds really like a cool way to work, actually. So there's some new major feature or new

15:56 library you guys want to build and you put together these cross functional teams to build it, huh?

16:00 Yeah, and usually those cross functional teams, or those squads, are long-lived. So it's not like they

16:05 are created and then disbanded quickly, because of course, we continuously try to improve

16:11 our products. So we have like a mob sense, what we call a mob sense squad, the squad that focuses on

16:15 everything around signal processing, and you know, the deep learning directly on the sensor data,

16:19 then we have like a lifestyle squad that focuses more on the moments and the segments. So there we use more

16:23 like NLP related techniques. And of course, we try to move people around. I mean, you don't stay in a

16:29 single squad forever.

16:29 Of course. That sounds really cool. So let's dig into the three layers of your SDK, the event

16:36 acquisition, the moments and segments, you call them, right? So there's some pretty interesting

16:42 algorithms and libraries you're using. So the first level is this idea of events. And the basic

16:50 question you're trying to answer is what is the user doing? So maybe we could talk about some of the

16:55 some of the algorithms and techniques you're using to determine like, are they driving? Are they walking?

17:00 Are they at a bar? Whatever.

17:02 Yeah, yeah, cool. Well, so transport mode detection itself is a cool problem. You know, both

17:08 iOS and Android already have what they call motion activities. So they give you an idea already about

17:13 transport mode, but it's quite limited. I think they support walking, biking, vehicle and idle,

17:19 something like that. Also, the accuracies are quite low, usually. So indeed, we had to build our

17:23 own model to get better accuracies, and especially to extend the number of transports we support like

17:28 bus and subway and running and stuff like that.

17:31 Sure. Can you still leverage these like motion chips at a lower level and not just ask like,

17:37 what are they doing? But you know, give me the actual measurements that you were going to use to

17:41 make that assessment?

17:42 Currently, we get 25 hertz accelerometer and gyroscope data from the phone. And based on that data,

17:47 well, first, of course, there is some pre processing, some signal processing, you have

17:50 to interpolate the samples because they don't come in on a regular grid, let's say,

17:56 you have to do some bandpass filtering to remove the high frequency components that usually contain

18:01 a lot of noise. And then after that, when you have like a signal that is more or less clean,

18:05 what we do then is a lot of data augmentation, we add some noise, you know, additive noise,

18:09 multiplicative noise. And that is mainly because every phone has different noise characteristics. And we

18:14 don't want our machine learning models to learn to recognize specific phones. So to undo those noise

18:20 characteristics, we basically deliberately add noise to our data, so that the classifiers learn to

18:25 generalize. And what we then do is we feed that sensor data. Well, maybe it's interesting to have a look at

18:31 the evolution. So today we use a neural net, a convnet, but we started out in a completely different

18:35 way. In the beginning, we actually chopped our sensor stream into pieces of several seconds. For

18:41 those pieces, or segments, of sensor data, we did a lot of manual feature engineering, like some Fourier

18:46 coefficients, like frequency domain features, time domain features. Those were fed into a random forest

18:52 back then. And the random forest then outputs class probabilities.
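The preprocessing and augmentation steps described here can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not Sentiance's actual pipeline: the function names `resample_linear` and `augment`, the noise levels, and the choice of simple linear interpolation are all hypothetical, and the filtering step is left out.

```python
import random
from bisect import bisect_right


def resample_linear(times, values, rate_hz=25.0):
    """Linearly interpolate irregularly timed samples onto a regular grid."""
    # Number of grid points that fit in the recording (small epsilon for rounding).
    n_out = int((times[-1] - times[0]) * rate_hz + 1e-9) + 1
    out = []
    for k in range(n_out):
        t = times[0] + k / rate_hz
        j = bisect_right(times, t)  # first original sample strictly after t
        if j >= len(times):
            out.append(values[-1])
            continue
        t0, t1 = times[j - 1], times[j]
        w = (t - t0) / (t1 - t0)
        out.append(values[j - 1] + w * (values[j] - values[j - 1]))
    return out


def augment(signal, add_sigma=0.05, mult_sigma=0.05, rng=None):
    """Return a noisy copy of the signal (multiplicative plus additive noise)."""
    rng = rng or random.Random()
    return [
        x * (1.0 + rng.gauss(0.0, mult_sigma)) + rng.gauss(0.0, add_sigma)
        for x in signal
    ]


# Hypothetical accelerometer samples with uneven timestamps (seconds).
times = [0.0, 0.1, 0.25, 0.5]
values = [0.0, 1.0, 2.5, 5.0]
regular = resample_linear(times, values, rate_hz=4.0)
noisy = augment(regular, rng=random.Random(42))
```

Each training pass could draw a fresh noisy copy of the same window, so the classifier sees the motion pattern under many different noise conditions rather than one phone's noise signature.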

18:54 Maybe quickly define what a random forest is for people.

18:57 Yeah, yeah. So a random forest is basically an ensemble of decision trees. So,

19:02 one of the simplest classifiers is a decision tree, where it's just like a binary search tree where

19:07 you say, okay, if this feature is higher than a certain value, go to the left node, otherwise go

19:12 to the right node, and you go through the whole tree until you have a decision on what is the transport

19:16 mode. The problem is that a decision tree is not very powerful. It quickly overfits to your data. So what

19:21 you can do is just build, you know, a thousand decision trees, all a little bit different, all on

19:26 different subsets of your data and your features, and then you end up with a random forest. So it's kind of

19:31 averaging out all those predictions.
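The idea of many slightly different trees voting together can be shown with a toy example. This is a hedged sketch, not a real random forest implementation: each "tree" is only a one-split decision stump, and the names `fit_stump`, `fit_forest`, `predict`, and the two made-up features are assumptions for illustration. In practice a library such as scikit-learn would be used.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a one-split tree: pick the threshold with the fewest training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Hypothetical features per window: (mean speed, acceleration variance).
rows = [(0.1, 1.0), (0.2, 1.2), (0.3, 0.9), (2.0, 9.0), (2.2, 8.5), (1.9, 9.5)]
labels = ["walking", "walking", "walking", "car", "car", "car"]
forest = fit_forest(rows, labels)
```

An individual stump trained on an unlucky sample can be wrong, but the vote across the whole ensemble smooths that out, which is the averaging effect described above.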

19:33 Right. So it kind of somewhat combats against the overfitting problem you would run into.

19:37 Exactly. Exactly.

19:38 Okay.

19:39 The thing, of course, is by chopping up the sensor stream into pieces of several seconds,

19:43 you completely lose the temporal dependencies. It could be that one piece is correctly classified

19:48 as car, and the next piece maybe is incorrectly classified as walking. So you still want to do some

19:53 temporal smoothing. So what we did back then is we fed that information, those segments,

19:57 or actually the class probabilities, into a hidden Markov model. And a hidden Markov model,

20:01 is able to learn short-term temporal dependencies and kind of smooth out the end results. So that

20:06 was our first version of the transport classifier. And then over the past three years, we went through

20:11 several iterations. So the random forest was replaced by boosted trees, XGBoost, which is used a lot,

20:16 for example, these days in the Kaggle competitions you read about. And now recently, we figured out that

20:22 actually just using a convolutional neural net with one-dimensional convolutions, because of course you don't have

20:28 images. It allows us to not only get an improved accuracy, but also come up with much smaller models

20:34 that more easily fit in memory compared to this huge random forest.
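The temporal smoothing step from that first version can be illustrated with a small Viterbi decoder over per-segment class probabilities. This is a simplified sketch under assumptions, not the model described in the episode: the transition probabilities are fixed by hand through a hypothetical `stay_prob` parameter instead of being learned, the classifier outputs are used directly as emission scores, and at least two classes are assumed.

```python
import math


def viterbi_smooth(segment_probs, stay_prob=0.9):
    """Return the most likely label sequence for a list of {label: prob} dicts."""
    labels = list(segment_probs[0])
    switch_prob = (1.0 - stay_prob) / (len(labels) - 1)
    eps = 1e-12  # avoids log(0) when a class probability is exactly zero

    def log_trans(prev, cur):
        return math.log(stay_prob if prev == cur else switch_prob)

    # score[label] = best log-score of any path that ends in `label`
    score = {s: math.log(segment_probs[0][s] + eps) for s in labels}
    backpointers = []
    for probs in segment_probs[1:]:
        new_score, pointers = {}, {}
        for cur in labels:
            best_prev = max(labels, key=lambda prev: score[prev] + log_trans(prev, cur))
            new_score[cur] = (
                score[best_prev] + log_trans(best_prev, cur) + math.log(probs[cur] + eps)
            )
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Walk the back-pointers from the best final state to recover the path.
    last = max(labels, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    return path[::-1]


# Hypothetical classifier output: the third segment looks like walking on its own.
segments = [
    {"car": 0.9, "walking": 0.1},
    {"car": 0.8, "walking": 0.2},
    {"car": 0.4, "walking": 0.6},
    {"car": 0.9, "walking": 0.1},
    {"car": 0.85, "walking": 0.15},
]
smoothed = viterbi_smooth(segments)
```

Because switching modes is penalized, the isolated walking segment in the middle of a car trip is relabeled as car, which is the kind of correction the smoothing step is meant to provide.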

20:37 Okay. Yeah, that sounds really interesting. Thanks for sharing the evolution. I think that's pretty cool.

20:42 And so you've got all these events. Okay. User's driving, user's at work, user's walking, user's at a restaurant, user's walking, user's at work, things like that.

20:54 And then you try to create what you guys call moments, which is why are they doing this? Like, why are they walking? Oh, they're walking to lunch, things like that, right? So maybe talk about the analysis that you guys do there.

21:05 Well, similarly, there was an evolution on that level too. So the main idea is that if you can predict what users will be doing next, you can use that indeed to explain why he's doing what he's doing. So if, for example, the user is predicted to go to work, then the fact that he's in a car means he's in a commute.

21:21 While if he's predicted to go to a shop, the fact that he's in a car means he's probably in a shopping routine, right?

21:26 So the first step is to teach a model to be able to predict your next event. And there we started out with a Markov chain-like approach.

21:33 A Markov chain basically just tries to learn transition probabilities. It learns to predict the probability of your next event being event A, given your previous events.

21:47 So it learns very short-term dependencies. We quickly saw, though, that those short-term dependencies were not able to model complex human behavior.

21:55 So, especially if you include features like time of day, it worked in simple cases like going to work and going back home.

22:03 But what if suddenly, you know, you wake up an hour later than normally and your whole day shifts a little bit,

22:09 then suddenly the Markov chain model completely kind of blacks out.
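A first-order Markov chain of this kind amounts to counting transitions between consecutive events. The sketch below is illustrative only: the function names `fit_markov_chain` and `predict_next` and the event timeline are made up, and no time-of-day features are included.

```python
from collections import Counter, defaultdict


def fit_markov_chain(timeline):
    """Estimate P(next event | current event) from one user's event timeline."""
    counts = defaultdict(Counter)
    for current, nxt in zip(timeline, timeline[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }


def predict_next(model, current):
    """Return the most probable next event, or None for an unseen event."""
    options = model.get(current)
    if not options:
        return None
    return max(options, key=options.get)


# Hypothetical timeline for a single user.
timeline = [
    "home", "car", "work", "car", "home", "car", "work",
    "walk", "lunch", "walk", "work", "car", "home",
]
model = fit_markov_chain(timeline)
```

In this toy timeline, `model["car"]` gives work and home equal probability, because the chain only looks at the current event and cannot tell a morning commute from an evening one. That is the limitation of short-term dependencies mentioned above.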

22:12 Hey, everyone. This is Michael. Let me tell you about Datadog. They're sponsoring this episode.

22:17 Performance and bottlenecks don't exist just in your application code.

22:21 Modern applications are systems built upon systems.

22:24 And Datadog lets you view the system as a whole.

22:27 Let's say you have a Python web app running Flask.

22:29 It's built upon MongoDB and hosted and scaled out on a set of Ubuntu servers running Nginx and uWSGI.

22:35 Add Datadog and you can view and monitor and even get alerts across all of these systems.

22:40 Datadog has a great getting started tutorial that takes just a few moments.

22:44 And if you complete it, they'll send you a sweet Datadog t-shirt for free.

22:47 Don't hesitate. Visit talkpython.fm/Datadog and see what you've been missing.

22:52 That's talkpython.fm/Datadog.

22:56 What we use today there is, again, deep learning.

22:59 We use an LSTM.

23:00 What's an LSTM? Long short-term memory?

23:03 Yeah, exactly. Exactly.

23:04 So an LSTM is a recurrent neural network that is able to...

23:08 So when you think about deep learning, convolutional neural nets, for example,

23:13 they are deep because you have a lot of layers.

23:15 LSTMs, or recurrent neural networks, they are deep not because you have, per se, a lot of layers,

23:21 but because they learn deep in the time dimension.

23:24 They learn a lot of temporal dependencies.

23:25 As opposed to a Markov chain where you only have a dependency on the previous event,

23:30 an LSTM can depend on 20, 30, 50 events back in the past, right?

23:35 It can say, okay, you know, 50 events ago, the user was in a car, and 20 events ago, he was at a shop.

23:40 And given all that behavior, the user is probably going to do this next.

23:44 That's cool, yeah. The longer you go without shopping, the more likely you are to shop.

23:49 And things like that for groceries, right?

23:51 Yeah, indeed, indeed.

23:52 And the cool thing is also that the Markov chain model, by nature, had to be trained specifically for each user, separately, on the user's data.

24:00 While the LSTM, we trained differently.

24:02 We trained one global LSTM, feeding it thousands and thousands of different timelines of different users.

24:08 The LSTM thereby kind of learned about general human behavior, and it learned what events from the past it has to pay attention to to predict something in the future.

24:20 And then, for a specific user that was never seen in a training set before,

24:24 we don't have to fine-tune or retrain the LSTM.

24:26 We just feed the past three weeks of events into the LSTM, and the LSTM already learned during the training phase how it should use that past to predict the next event.

24:36 So the nice thing is that you don't...

24:37 I mean, if you have a million users on your platform, you cannot have a million deep learning models, right, that you have to retrain every second.

24:43 Yeah, of course.

24:44 And how do you mark them, right?

24:47 Like, all this stuff is happening without necessarily going, yes, I'm shopping now.

24:53 Yes, I'm doing this.

24:54 And then eventually it learns, okay, you're shopping, so I know what that means, right?

24:57 This is sort of all inference-based.

24:59 It's a combination.

25:00 So on the event level, on the lowest level, we do have a lot of labeled data.

25:04 So we spend a lot of time with customers or even, I mean, we paid a lot of students to go out on the road, take trains and trams and buses,

25:13 and label the data.

25:15 And then internally we built some tools to clean up the data and make the labels more accurate.

25:19 So we do have labeled data on that level.

25:21 Of course, when we go more to moments and segments, it becomes very difficult to get your hands on labeled data indeed.

25:27 So there we focus a lot on semi-supervised learning and things like transfer learning.

25:32 For example, using a triplet loss function, we can learn this kind of high-dimensional feature space

25:38 in which two users with similar behavior are close to each other and two users with different behavior are far from each other.

25:44 And then in that feature space, you can build very simple classifiers using limited labeled data

25:49 to actually come up with user segments.

25:51 So that's kind of a transfer learning approach that allows us to cope with limited amounts of labeled data.
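The triplet loss mentioned here can be written down in a few lines. This is a generic sketch of the standard formulation, not the training code discussed in the episode: the Euclidean distance, the `margin` value, and the tiny two-dimensional vectors are assumptions chosen for readability.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings: anchor and positive are users with similar behavior.
anchor = (0.0, 0.0)
similar_user = (0.0, 1.0)
different_user = (3.0, 4.0)

good = triplet_loss(anchor, similar_user, different_user)  # already well separated
bad = triplet_loss(anchor, different_user, similar_user)   # roles swapped, large loss
```

Minimizing this loss over many triplets pulls similar users together and pushes dissimilar users apart, which produces the kind of feature space described above.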

25:56 Sure.

25:56 Oh, okay.

25:57 That sounds like it's working out really well.

25:58 I've definitely been part of projects where it's like, all right, we're going to hire 100 students to do this for an hour.

26:05 Yeah.

26:06 And, you know, sometimes that's just what you got to do, right?

26:09 Yeah, exactly.

26:09 That's how we got started.

26:10 Yep.

26:12 But you can't pay a million students.

26:14 Well, not much anyway.

26:15 So, all right.

26:17 So, that's moments.

26:18 And you have your LSTM deep learning model there.

26:21 And then the final, like, the real end goal is to, I guess moments is already probably an end goal.

26:26 Like, they're at this store because... Also, you want to classify people into groups, right?

26:31 What type of driver are they?

26:32 Do they work a lot?

26:33 Are they parents?

26:34 Are they teachers?

26:36 Why are they at the school?

26:37 Are they at the school?

26:37 Because they're teaching there?

26:39 Because they're a student?

26:39 Because they're a parent dropping off a kid?

26:41 Things like that, right?

26:42 So, tell us about the algorithms and stuff in segments.

26:45 That's a bit of what I was talking about earlier.

26:47 So, this feature space.

26:48 The thing with segments is, of course, some segments can be business rules, right?

26:53 I mean, you're a workaholic if you work more than a specific number of hours, right?

26:56 Right.

26:57 But some segments, like, are you a parent, for example?

26:59 That is less obvious.

27:00 Being a parent definitely influences your behavior.

27:04 I became a parent six months ago, and I'm a completely different person.

27:06 But how do you capture that behavior?

27:10 I mean, you cannot put that in a business rule, right?

27:13 Yeah.

27:13 So, tell us how you can determine someone is a parent, for example.

27:17 That's pretty interesting.

27:17 So, what we did there is we used deep learning to analyze, to compare, actually, the behavior of different people and to learn a feature representation,

27:27 like a feature vector consisting of 50 floating point numbers, where each dimension, each floating point number, encodes a different characteristic of the person.

27:36 Maybe the first number encodes your demographics.

27:39 Maybe the second number encodes how many times you go to sport.

27:44 The difference with traditional machine learning is that, in this case, we didn't manually define the meaning, semantic meaning of each of those 50 numbers.

27:52 Instead, we let our neural network figure out by itself which dimensions it should learn to capture human behavior.

28:00 And once you have that, you can actually take this timeline of events and encode the whole event timeline into 50 floating point numbers.

28:08 And then you have a rather small feature space with only 50 features on which you can easily build, you know, even linear classifiers, very simple classifiers,

28:18 using limited amounts of labeled data, people, for example, from which we know they are parents.

28:23 And it generalizes extremely well because your feature space is so expressive and because the feature space was learned using unsupervised learning.

28:30 So we can use all the data we gathered in the past to learn that representation.

28:34 Okay.

28:35 So you have these 50 classifiers or points that are sort of grouping people together.

28:42 How did you determine, like, this grouping means it's a parent?

28:46 Did you find some people you knew were parents and say, oh, they also have this feature that must mean they're a parent?

28:51 Or how did you, like, assign values to that?

28:53 Yeah, exactly.

28:54 So it's kind of the feature space just allows us to do user similarity modeling.

28:58 And then, indeed, we do still need labeled data, just not a whole lot of it.

29:02 We can ask 100 users to install our demo app, walk around with the SDK for a few weeks, and tell us whether or not they are a parent.

29:11 And then in this feature space, if we look at those people, well, other parents will be very close to them.

29:18 And that's how we can then build a classifier to detect parents.
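Once users live in such a feature space, a very simple classifier on a handful of labeled users can be enough. The sketch below uses a k-nearest-neighbors vote as one possible simple classifier; it is an assumption for illustration, not necessarily what Sentiance uses. The vectors are two-dimensional stand-ins for the 50-number representation, and the labels are made up.

```python
import math
from collections import Counter


def knn_predict(labeled, query, k=3):
    """Label a user by majority vote among the k closest labeled embeddings."""
    nearest = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical embeddings for users who told us whether they are parents.
labeled_users = [
    ((0.9, 0.1), "parent"),
    ((1.0, 0.2), "parent"),
    ((0.8, 0.0), "parent"),
    ((-0.9, 0.5), "not_parent"),
    ((-1.0, 0.4), "not_parent"),
    ((-0.8, 0.6), "not_parent"),
]

new_user = (0.85, 0.1)
segment = knn_predict(labeled_users, new_user)
```

The heavy lifting is done by the learned representation; the classifier itself only needs to look at which labeled users are nearby.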

29:22 That's pretty awesome.

29:23 Do you feel like there are pieces that are missing?

29:26 Like, there's dimensions of human behavior that are not captured?

29:31 Probably.

29:31 Up till now, it works.

29:33 It's also all very, very new to us, because we started out also here with mostly business rules on top of our event sequences.

29:40 So most of the machine learning in the past was on the bottom layer, on the event layer.

29:43 It's only since recent times that we're also doing this unsupervised and semi-supervised learning on the segments and the moments.

29:51 But yeah, probably the difficulty, indeed, if you use representation learning, is that it's very difficult to control which dimensions the deep learning thinks are important to capture human behavior.

30:00 So I can imagine that not everything is captured there.

30:05 But in the end, you can easily solve that by fixing some of the bottom layers of the pre-trained network and then training it a little bit more on a smaller set of labeled data, fine-tuning the upper layers.

30:17 And that way, it still is able to learn those things.

30:20 Yeah.

30:20 Okay.

30:21 Very interesting.

30:21 The dependency that you're talking about here sounds like it could be really tricky.

30:25 Like, suppose you guys redesign your transportation mode detection.

30:32 And it turns out, some of the time you thought people were walking, they're just in traffic, but really slow traffic, or something like this, right?

30:40 Instead of every day taking a walk down I-5, the interstate highway, they're actually just driving in really bad traffic.

30:47 And that probably has knock-on effects for moments, which has knock-on effects for segments.

30:51 So how much of, like, if this training of networks takes weeks, potentially, how bad is it if, you know, you change the bottom layer?

30:59 That's indeed a very real problem we encountered.

31:02 Especially now that we are more and more using, you know, representation learning to learn these feature spaces on the bottom layer.

31:09 Indeed, if you retrain one of the models, the resulting feature space could have a completely different meaning, which means that all the consuming models that follow in the cascade would have to be retrained.

31:19 And, of course, you don't want that.

31:20 We solve this, I guess, in different ways.

31:22 On the one hand, there is a decision you have to make between deploying a trained model as a microservice that is then consumed by other models in the cascade, versus actually just using the pre-trained model and fine-tuning it in a model that consumes that information.

31:39 If you do it the first way, if you put it in a microservice, then, indeed, if you retrain the first model, you have to retrain the second.

31:45 But if you do it in the other way, if you just combine both machine learning models eventually into one model, then, of course, you don't have this dependency.

31:53 So in this sense, we actually try to just use pre-trained models and fine-tune them and embed them into the next model as much as possible, and only go to a microservice if there is good reason for it.

32:06 If your model, for some reason, needs, for example, huge amounts of memory or a large amount of SQL queries to a database or something like that, then that is a good reason to actually put it in a microservice.

32:15 That's one way we try to solve it.

32:17 Another way, that's something we're still working on, we don't have it today, is that we're trying to create a model that basically learns a locally linear mapping from your previous feature space to the new feature space after retraining, or actually the other way around.

32:33 So if you retrain a model, but you put a mapping layer after it, then that mapping layer can actually make sure that even if the model is retrained, the new feature space is mapped to the same semantics as the old feature space.

32:44 I see.

32:45 So the inputs to the next level model basically are literally transformed to look the same as they would have before.

32:52 Exactly, exactly.

32:53 Okay.

32:54 Yeah, that sounds like a pretty interesting set of challenges and some good solutions.

32:58 But yeah, definitely it seems like that's something that's always going to be a bit of a tension.

33:04 Yeah, exactly.

33:05 And we're also, I mean, we're continuously trying to figure it out ourselves.

33:08 I mean, there is also, of course, versioning, because if you deploy a new model, there is at least some period in which you're going to have to run both the old model and the new model in parallel,

33:17 because not all the consumers will be updated at the same time.

33:20 It's getting complicated quickly, but luckily we have an awesome data engineering team there to help us solve all that.

33:27 Yeah, that's cool.

33:28 So one of the things I was wondering as I was looking through all this, like there's a lot of statistics and statistical inference and understanding these models.

33:37 So somebody who works on your team as a data scientist, like what's their general skill set?

33:42 Like how much programmer versus how much statistician versus some other skill I'm not thinking of?

33:48 It's a bit mixed.

33:49 So in general, we say that everyone at Sentiance is a software engineer.

33:52 So that means that every data scientist has to have good software engineering skills, not just some, you know, some MATLAB scripting experience or something.

34:01 You have to be a software engineer.

34:02 That's for sure.

34:02 And then, of course, most people, we put a lot of emphasis on the machine learning background.

34:07 So most people in the data science team, they either have a, you know, PhD in machine learning or computer vision or something, or they have a background in physics or mathematics.

34:16 They need an analytical mindset, let's say.

34:19 And then finally, there is the signal processing, which is kind of a specific field.

34:24 So people coming from robotics or from speech recognition or also image processing often have a good signal processing background.

34:30 Yeah, it's quite challenging to find people that combine all three of them.

34:35 Yeah, I'm sure.

34:36 Definitely sounds fun in terms of projects you get to work on.

34:39 You guys don't build apps, right?

34:41 You basically provide this SDK or this API to customers who themselves build apps, right?

34:48 Well, actually, we just hired a designer, but ourselves, we're not the best in creating very fancy apps or something.

34:56 We are really a tech company.

34:58 And indeed, we have an API through which we expose all this information back to our customers.

35:03 But the customer still needs a tech team.

35:05 They need data scientists or developers to be able to do stuff with it.

35:11 This portion of Talk Python is brought to you by GoCD.

35:14 GoCD is an on-premise, open-source, continuous delivery tool to help you get better visibility into and control of your team's deployments.

35:23 With GoCD's comprehensive pipeline modeling, you can model complex workflows for multiple teams with ease.

35:30 And GoCD's value stream map lets you track changes from commit to deploy at a glance.

35:36 Say goodbye to deployment panic and hello to consistent, predictable deliveries.

35:39 We all know that continuous integration is super important to the code quality of your applications.

35:44 Choose the open-source local CI server, GoCD.

35:47 Learn more at talkpython.fm/gocd.

35:51 That's talkpython.fm/gocd.

35:55 All this sounds so cool and powerful and useful.

36:00 But at the same time, it also feels like it could be a little bit invasive into people's lives and into their privacy.

36:07 So what's the story around trying to strike that balance?

36:10 That's a question we get a lot.

36:11 And indeed, it is a balance we have to maintain.

36:14 There is a lot of information you can extract from the sensor data.

36:18 I mean, even your personality and your mood; that's something that's on our roadmap, something we're looking into, not something we per se have today completely.

36:25 But your personality influences how you behave and how you behave influences the motion of your phone.

36:31 So there is a lot of stuff you can do with it.

36:33 And indeed, then privacy becomes an important question for us.

36:37 On the one hand, there is the GDPR, so the recent European privacy legislation.

36:42 If you look at the GDPR, Sentiance is a data processor, not a data owner.

36:47 Which actually means that, compared to, let's say, Facebook or Google or something, we never claim that the data is ours.

36:54 The data is still owned by the customer, which means we cannot combine the data with other data.

36:59 We cannot sell the data.

37:00 And the data is siloed.

37:01 That's one thing.

37:02 On the other hand, and probably much more important, is that we explicitly force our customers to ask their users for consent.

37:10 So it cannot be that they use our SDK and put something in a small privacy statement, you know, hidden in the app or something.

37:17 Our customers really need to be very upfront with their end users, tell them what kind of data they gather, why they do it.

37:23 And as long as those customers provide enough value to the end users, that works.

37:27 I mean, it won't work for, let's say, advertising solutions.

37:30 Nobody wants to give consent to gather all this data to have better advertisements.

37:34 But it does work for, let's say, health and lifestyle coaching.

37:37 I mean, if we can help you live a healthier life, if we can contextualize your heart problems, or maybe even for insurance, if we can model your driving behavior.

37:46 And by doing so, reduce the amount of money you have to pay to an insurance company.

37:51 Well, there is enough added value for most users to actually give that consent.

37:55 Yeah, that's a good point.

37:56 Yeah, I guess it's all about the trade-off for the benefit, right?

38:01 Like you said, no one's going to go, I would love to see better banner ads in my, like, Candy Crush app or whatever.

38:08 Yeah, indeed.

38:09 The tagline of Sentiance, or at least what our CEO often says is, we want to make sure that AI improves people's lives.

38:17 And it might sound a bit cheesy, but imagine indeed a world where you don't have to adapt to all your surroundings.

38:24 But instead, you know, your phone knows who you are, knows what you feel, knows what you want.

38:28 And the whole world adapts to you, not to spam you or manipulate you, but just to make your life easier and healthier and improve your quality.

38:36 That's the promise, right?

38:37 Exactly.

38:38 So, you talked about being an SDK.

38:41 Can you give us some examples of some of the apps that are using you guys as a service?

38:45 Yeah, sure.

38:46 There are different components, of course, in what we do.

38:48 One of the components I quickly mentioned before is driving behavior, where we, in a detailed manner, model, you know, how aggressively you drive, how do you take your turns, what is your driver DNA?

38:58 And that is currently being used by, I'm not allowed to name them, but let's say by the biggest ride-hailing company in the world to actually model the safety of their drivers.

39:08 So, not the passengers, but the drivers themselves, so that they can coach them and make sure that the riders are safe when they take the cab.

39:16 Another example is, for example, one of the biggest brand loyalty companies in the UK.

39:21 So, they have a huge user base of users that installed their app because they want to get the latest coupons and that kind of stuff.

39:30 And so, there they use our SDK to just personalize their communication with their users and to make sure that, you know, the user is not spammed with information that they don't care about.

39:40 But instead, it's a very personalized communication and increases engagement.

39:44 Right. So, maybe if you could tell them, like, this person is a parent versus this person is a workaholic, or if they're both, you know, that they might treat them differently, right?

39:54 Yeah, exactly. Exactly.

39:55 I mean, if you know that someone is sportive, you can propose more interesting stuff than if you know, okay, this person never sports and indeed is a workaholic or something like that.

40:03 And I guess maybe the most interesting use cases, to me at least, are in health and insurance.

40:08 So, in health, for example, we work with Samsung, who is also one of our main investors, on detection of heart arrhythmia.

40:14 So, problems with, I mean, if you have heart fibrillations and you want to contextualize that, you want to know why it happens, when it happens, and you want to expose that to your doctor.

40:23 So, your doctor can say, okay, we see that if you work late, and I see that you're a workaholic in general, and if you eat a lot of fast food that week, that is a time when you usually have your heart problems.

40:33 So, that's one of the use cases.

40:35 How does it know?

40:36 Do you have to, like, do you have a different device that detects the arrhythmia and, like, flags it in time, and then you can overlay it on your timeline or something like that?

40:44 Yeah, exactly. Exactly.

40:44 Okay.

40:45 Yeah, and you said you're also working with another company doing something similar?

40:48 Yeah, so there is, in the health space, we work with some smaller companies also from Europe and Belgium and the Netherlands more specifically.

40:54 So, there is FibriCheck, for example.

40:57 FibriCheck, they have a mobile app where you can put your finger on your camera, and the app will use the camera and the flashlight to extract your heart rhythm from the blood flow to your fingertip.

41:08 And so, there also, they use our SDK to contextualize that, to predict when you're probably going to have heart problems, why it's happening, and to expose this to a doctor.

41:18 And then there is another example, MedApp.

41:20 MedApp is a company in the Netherlands.

41:22 They have an app for care adherence, medical adherence.

41:26 So, a lot of people have to take a lot of pills, and a lot of people actually forget to take their pills.

41:30 And that's a huge problem.

41:32 So, what they did is they developed an app that reminds users to take their pills on time.

41:37 But, of course, if you just get such a reminder, you know, an alarm on your phone right before you have to go to work, or maybe even when you're in the car,

41:45 then you just snooze the alarm or dismiss it, and you forget about it altogether.

41:49 So, they use our SDK, again, to tailor those alarms, to make them context-aware and remind users at the right time.

41:56 Right.

41:57 Like, if you're driving, it makes no sense to remind you, so wait until you get to work.

42:00 Exactly.

42:00 Or wait until you return home if it knows you're coming home or something like that.

42:03 Yeah, indeed.

42:03 Or if we predict you will probably be leaving for work in 10 minutes, then this is the time to remind you, and don't wait 10 minutes.

42:10 Yeah, that'd be even better.

42:11 Cool.

42:12 So, those sound all really interesting.

42:13 We talked a lot about your architecture already, actually, but there's a few things that we haven't touched on that I think are worth covering.

42:19 One of the things you guys use is something called devpi.

42:23 It's sort of an alternative local PyPI.

42:26 Tell people what devpi is, and how is it helping you guys?

42:29 The problem we had with Python and PyPI as a package server is that you quickly end up in kind of a dependency hell.

42:37 You develop your project, put it in a repo, you have a setup.py to easily install it, and as a requirement, you list, let's say, NumPy version X, but also you list package Y as a dependency, but package Y actually depends on NumPy version Z.

42:54 So, you have this whole conflicting set of dependencies which quickly becomes very difficult to manage.

43:01 And how we actually did versioning of our own packages, so our repositories, is in the past we specified a version attribute in the setup.py and used git tags on Bitbucket, so on our git repositories.

43:15 And those tags also contain a version number, and that kind of allowed us to pull the correct version and try to get everything installed as it should be.

43:24 But then you have to make sure that you maintain a setup.py.

43:27 Don't forget to increase the version number there.

43:30 Make sure the tags are in sync, and it really becomes messy quickly.

43:33 So, how we solve this is, indeed, we use devpi these days.

43:36 We have our own package server, our Jenkins server.

43:39 So, Jenkins basically is our build server.

43:41 Everything gets built into packages automatically there.

43:44 Jenkins builds wheels from our internal repositories, builds those wheels both for Mac, for the developers, and for Linux, for actual production.

43:53 And it stores that with a version number.

43:55 And then if we do pip install something, then first our devpi is consulted.

44:00 It fetches the correct version, the package with the correct version and correct dependencies.

44:04 And only if it cannot find it there, it goes further to PyPI.

44:08 Yeah, that's really cool to be able to control it like that.

44:11 Do you distribute your own packages for use on other projects within your own devpi?

44:18 Everything we build is contained in like a functional repo, let's say, with an API.

44:24 And then we always have a microservice wrapper repo that just uses that functional repo as a dependency.

44:31 So, the functional part is always built by Jenkins, put into devpi, and can then be used by different other projects as a dependency.

44:39 Yeah, okay.

44:39 That sounds like a really good setup you guys have going there.

44:42 Another thing that you talked about is pragmatic use of deep learning.

44:48 Tell us what do you mean by pragmatic use.

44:51 What are some of your recommendations?

44:53 Deep learning is cool.

44:54 And especially if we hire new people and they hear that we do deep learning, we use it a lot actually.

44:59 They are eager to also start using it for the problems they start working on.

45:03 But of course, we have to be pragmatic indeed in the sense that a lot of problems just don't need deep learning.

45:09 For example, if I talked earlier about detecting what is your home location and what is your work location, you can solve that without deep learning.

45:17 You just do some feature engineering, gather a little training data, and train a linear support vector machine on top of it or something.

45:24 It's important, I think, to use deep learning if it really solves your problem, if it makes your product better.

45:31 But indeed, don't just follow the hype.

45:32 Don't just do it because it's a buzzword.

45:34 Yeah, exactly.

45:35 More VC money because of it.

45:37 Indeed.

45:38 Yeah, yeah, that's really cool.

45:39 Like sometimes just standard algorithms and, you know, if cases effectively, right, are all you really need sometimes.

45:47 It is true that these days with VCs and even for customers, for some reason, it sometimes almost sounds embarrassing if you have to tell them that for a part of your product, you use traditional machine learning.

45:59 It's like, why don't you use deep learning?

46:00 But it's a matter of cost.

46:02 It's a matter of accuracy and also maintainability.

46:05 I mean, if you have a very simple – actually, I think it goes even further than the simplest example I gave.

46:11 If you can solve a problem by a simple business rule, then that's the way you should go.

46:14 Right.

46:15 Absolutely.
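To make the "simple business rule" idea concrete, here is a sketch of a rule-based guess at home and work locations: home is wherever the phone is most often seen at night, work is wherever it is most often seen during office hours. This is not the approach Vincent describes for this problem (he mentions feature engineering plus a linear support vector machine); the hour windows, the `place_id` values, and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime

NIGHT_HOURS = set(range(0, 6))     # 00:00-05:59, assumed "at home"
OFFICE_HOURS = set(range(10, 16))  # 10:00-15:59, assumed "at work"


def most_common_place(visits, hours):
    """Return the place seen most often during the given hours, or None.

    visits: iterable of (datetime, place_id) pairs.
    """
    counts = Counter(place for ts, place in visits if ts.hour in hours)
    return counts.most_common(1)[0][0] if counts else None


def guess_home_and_work(visits):
    """Apply the two rules and return (home_place, work_place)."""
    visits = list(visits)  # allow generators; we iterate twice
    return (most_common_place(visits, NIGHT_HOURS),
            most_common_place(visits, OFFICE_HOURS))
```

A rule like this is cheap to run and easy to explain. It also ignores weekends, night shifts, and people who work from home, which is where a trained model may start to earn its cost.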

46:16 Let's take just a moment and, like, take a step sort of up this higher level, not anything that you guys are doing in general, not referring to your product or your mission, but just in general.

46:28 Like there's some people like Elon Musk, who I'm a big admirer of in general and others saying, like, we should be really worried about AI and machine learning.

46:38 And other people saying, no, this will make things lots better.

46:40 And you've given us some definite examples where it is going to be better for people, right, with, like, health, for example.

46:47 But where do you land in this debate?

46:49 And do we live in a simulation?

46:51 First, the extreme example or the extreme case that you sometimes read about, about, you know, AI taking over the world and stuff like that.

46:58 Well, an interesting quote there.

47:00 I think it's from Andrew Ng, so one of the big deep learning guys who used to be head of AI at Baidu.

47:07 And I think today he's still head of AI at Stanford.

47:09 He said at some point, you know, that the fact or talking about AI taking over the world is a little bit like talking about overpopulation on Mars.

47:16 It might happen at some point.

47:18 It probably will.

47:18 But there is still no clear path to it, right?

47:21 So that's one thing.

47:24 Of course, it is true that AI or machine learning, which I like more as a term, actually, is becoming very powerful.

47:32 And in that sense, like any powerful tool, it can be used for the good and the bad.

47:36 So I do agree that we need politics.

47:40 We need legislation to be ready for this.

47:42 We need to make sure that governments are limited in what they can do, you know, that they cannot force you to install an app that tracks your every movement so that they can then control your future or something like that.

47:54 So I do agree with Elon Musk on that point that it's time for government officials to take this serious and to work on the legal aspect there.

48:02 Sure.

48:03 I totally agree.

48:04 And there's other interesting knock-ons.

48:06 I think the EU is working on this.

48:08 You know, when we get to things like driverless cars, if the driverless car is in an accident and it turns out the driverless car was at fault, who is responsible and how do you address that?

48:20 And if it's pure, totally unsupervised learning that made the car drive, like how do you know why it crashed?

48:34 We're going to be able to do that.

48:36 So first, a whole transformation in the mobility sector, in the insurance sector has to happen so that cars are actually seen as a service where you insure a service.

48:57 It's a different mindset.

48:58 Yeah, and I think we're going to have to get used to pushing the benefits in an aggregated way instead of a specific individual's responsibility way.

49:06 For example, yeah, the self-driving car did something really bad and it crashed into some people on the sidewalk.

49:11 But if you look at it as a whole, half a million fewer people were killed in car accidents this year.

49:17 So this is a horrible news story and we're really, you know, it's really bad.

49:20 But taken as a whole, self-driving cars are doing better for people.

49:24 That's like a theory I'm imagining, right?

49:26 But I can see the world struggling with those sort of ethical trade-offs.

49:30 Yeah, indeed.

49:30 You know, it's a little bit, I know maybe it's not a very good comparison, but if you think about the Industrial Revolution, there was a lot of people that were so scared about all the millions of jobs that would be lost if cars would not be made by hand anymore, but if machines would be used, you know.

49:44 But in the end, when we look back, I do think that most people agree that the Industrial Revolution made our life healthier.

49:51 We live longer, made it easier, we're happier.

49:54 And the same thing is going to happen with the AI revolution.

49:57 Yeah, I think that in the long term that that's definitely true.

50:00 Like I definitely wouldn't want to live pre-Industrial Revolution myself.

50:03 I wouldn't trade my spot in it now.

50:05 All right.

50:06 Well, Vincent, I think we're going to have to probably leave it there for our topics.

50:10 But that was a super interesting look inside what you guys are doing with machine learning and things like that.

50:15 So let's get to the two questions.

50:19 First one, favorite Python editor.

50:21 What do you open up if you're going to write some Python code?

50:23 PyCharm, for sure.

50:25 I love the JetBrains product in general.

50:27 You know, DataGrip for database stuff, IntelliJ for Java, PyCharm for Python, yeah.

50:31 Yeah, awesome.

50:32 Yeah, me as well.

50:33 That's my favorite.

50:33 All right.

50:34 And notable PyPI package?

50:36 I've been thinking about this for a long time.

50:38 I think it would be Python Flame Graph just because it's a very cool way to do memory and performance profiling, create Flame Graph, see which methods in your code are the bottlenecks, and optimize them.

50:48 Yeah, I looked at that just a little bit.

50:49 And it looks like a very powerful way to quickly visualize where your performance problems are.

50:55 Yeah, exactly.

50:56 Exactly.

50:56 And to actually dig deeper into the stack of method calls and figure out what's happening.

51:00 Yeah, that's cool.

51:01 So I'll put a link to the GitHub repo for Python Flame Graph, which has a bunch of nice pictures.

51:05 Awesome.

51:06 Awesome.

51:06 Yeah, yeah.
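Flame graph tools generally work from "folded" stack samples: one line per distinct call stack, with frames joined by semicolons and followed by a sample count. The sketch below shows only that aggregation step on hand-written samples. It is not the API of the Python Flame Graph package mentioned above, and the function and sample names are made up for illustration.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks into folded-stack lines.

    samples: iterable of call stacks, each a list of function names,
    outermost caller first. Returns lines such as 'main;load;parse 2',
    sorted so the output is stable.
    """
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples, as a sampling profiler might collect them.
samples = [
    ["main", "load"],
    ["main", "load", "parse"],
    ["main", "load", "parse"],
    ["main", "train"],
]
```

In a rendered flame graph, the width of each box corresponds to these counts, so a stack that shows up in many samples appears as a wide bar and marks a likely bottleneck.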

51:07 So, all right, final call to action.

51:09 People are excited about deep learning.

51:11 They're excited about what you guys are doing.

51:13 What do they do to get started?

51:14 We're expanding.

51:15 We're continuously expanding.

51:16 This year, or coming year, we should grow from 50 to 80 people.

51:19 So we're always looking for passionate Python developers, machine learning guys.

51:23 Can they just reach out to you, like, on Twitter or something like that if they want to get more information?

51:26 Or how do they find out?

51:27 Yeah, yeah, definitely.

51:28 Just, you know, ping me on Twitter or LinkedIn or go through our website where there is a more official channel.

51:34 Maybe you can refer to the podcast so we have an idea where you're coming from.

51:38 Don't focus too much on the machine learning part either.

51:40 If you're very good at Python or very good at machine learning or signal processing, we should talk.

51:45 Okay, awesome.

51:46 And then you guys have an app.

51:47 Even though you said you don't build apps, you have an app.

51:49 What's the story with this app?

51:52 Yeah, it's a demo app because our product is quite technical.

51:55 And so, of course, customers ask, like, okay, how does it look like?

51:58 What can I do with it?

51:59 So, indeed, we built a small demo app.

52:00 It's called Journeys.

52:01 You can find it on our website or on the App Store.

52:03 Or Play Store.

52:04 And Journeys, basically, when you install it, after a while, it will learn about your behavior.

52:08 After one week or two weeks, you will see pop up a whole set of segments that we assign to you.

52:12 Your moments, your home and work detections.

52:14 So, yeah, check it out.

52:16 Download it.

52:16 It's pretty cool.

52:17 And there is also a way to give feedback.

52:19 So, if something is wrong and you decide to give feedback, well, that helps us to retrain our models.

52:24 Oh, beautiful.

52:24 All right.

52:25 Well, thanks for sharing your story and what you guys are up to.

52:26 It was great to chat with you.

52:27 Thanks a lot, Michael.

52:28 This has been another episode of Talk Python To Me.

52:32 Our guest has been Vincent Spruyt.

52:34 And this episode has been brought to you by Datadog and GoCD.

52:37 Datadog gives you visibility into the whole system running your code.

52:41 Visit talkpython.fm/datadog and see what you've been missing.

52:46 They'll even throw in a free t-shirt for doing the tutorial.

52:49 GoCD is the on-premise, open-source, continuous delivery server.

52:52 Want to improve your deployment workflow but keep your code and builds in-house?

52:57 Check out GoCD at talkpython.fm/gocd and take control over your process.

53:03 Are you or a colleague trying to learn Python?

53:05 Have you tried books and videos that just left you bored by covering topics point by point?

53:10 Well, check out my online course, Python Jumpstart by Building 10 Apps, at talkpython.fm/course

53:16 to experience a more engaging way to learn Python.

53:19 And if you're looking for something a little more advanced, try my Write Pythonic Code course

53:23 at talkpython.fm/pythonic.

53:26 Be sure to subscribe to the show.

53:29 Open your favorite podcatcher and search for Python.

53:31 We should be right at the top.

53:32 You can also find the iTunes feed at /itunes, Google Play feed at /play,

53:37 and direct RSS feed at /rss on talkpython.fm.

53:41 This is your host, Michael Kennedy.

53:43 Thanks so much for listening.

53:45 I really appreciate it.

53:46 Now get out there and write some Python code.

53:48 I'll see you next time.