
#257: Exploring the galaxy with the fastest supercomputer, Python, and radio astronomy Transcript

Recorded on Friday, Mar 27, 2020.

00:00 With radio astronomy, we can look across many light years of distance and see incredible details

00:04 such as the chemical makeup of a given region. Kevin Vinsen and Rodrigo Tobar from ICRAR are

00:11 using the world's fastest supercomputer, along with some sweet Python, to process the equivalent

00:16 of 1,600 hours of standard definition YouTube video per second. This is Talk Python to Me,

00:23 episode 257, recorded March 26, 2020.

00:28 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem,

00:46 and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy.

00:51 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter

00:57 via @talkpython. This episode is brought to you by Linode and Clubhouse. Please check out what

01:02 they're offering during their segments. It really helps support the show. Kevin Rodrigo, welcome to

01:08 Talk Python to Me. Thanks, Mike. Great to have you guys both on the show. This topic is something that

01:14 I'm both fascinated with and actually not very knowledgeable about, so it's awesome. And it has

01:18 a really cool bunch of Python going on as well. So I think we're going to have a lot of fun talking

01:23 about radio telescopes and just processing ridiculous amounts of data with it.

01:27 Yeah, yeah. I think we were talking with Kevin about this the other day that we kind of lose sight of how big these numbers are, because

01:34 if you are always within this realm, you don't realize that this is actually pretty big.

01:40 Yeah, that's a bit of an understatement. So we'll definitely dig into all the data

01:43 and everything that's going on. It's pretty impressive. It's certainly, well, I'll leave it until we get to it. But it's some crazy, crazy numbers that you

01:52 all are doing. But before we get into that stuff, let's just start briefly with how you each got

01:58 into programming Python. Maybe Kevin, you go first.

02:00 I did my degree in physics back in the UK and then sort of drifted around doing various languages.

02:07 I programmed in C, C++, Prolog, Lisp, Smalltalk. And from that, you can tell I'm quite old.

02:14 When I came to join ICRAR in 2009, Python was the de facto language for an awful lot of astronomy. So

02:24 I learned Python.

02:25 Yeah. What was that learning Python experience like?

02:27 Piece of cake, really. It's much easier than Java and C++ because the syntax is just so much cleaner.

02:34 Yeah, it is. And, you know, it sounds like you have experience with a lot of different languages,

02:39 right? Like Scheme, Lisp, Smalltalk, C++. There's a lot of different examples. And so,

02:45 you know, you come to Python and you're like, oh, this is a weird language. It doesn't have semicolons

02:49 or braces, it doesn't really lean on a lot of syntactical elements in there. But

02:54 somehow...

02:55 Where do I put my semi-colons?

02:56 Yeah, I do.

02:59 Learning not to type semicolons took a while.

03:01 Yeah, I do find it funny that you can still write them if you just feel the need, you know?

03:05 You're like, I just put them at the end. It's going to be okay.

03:07 Yeah, praise be to IntelliJ or PyCharm because it tells you, you don't really need that.

03:12 Yeah, exactly. If you really need that comfort blanket, I suspect you could turn off that code

03:18 inspection rule in PyCharm and just put the semicolons. But you may not be accepted by your

03:24 fellow Python programmers.

03:25 From the bone.

03:27 It's right. That's right. Rodrigo, how about you?

03:31 Well, since I was very little, I always liked computers. So I decided to go and study a computing degree

03:36 without really knowing exactly what computing was about, like about programming and all.

03:41 So at uni, I started learning some languages and I became involved with astronomy. We had a group of

03:48 students who were doing collaborations with some observatories. I'm originally from Chile. And in Chile,

03:53 there are many observatories because the conditions are so good for observing. So there was this group of

03:59 students doing collaborations with some observatories in Chile. That's how I got basically into the business.

04:04 So when I left uni, I moved into the European Southern Observatory headquarters in Germany. I worked over there for a couple of years and then moved here to ICRAR in Australia to continue working in astronomy. In Python in particular, I started doing some more Python down here in Australia. I had done a couple of basic scripts before, but nothing much to it. And I really got into the weeds now.

04:29 Because we were heavily using Python here.

04:31 Well, yeah, that sounds really fun. And certainly Chile is one of those places where astronomy is, especially radio astronomy, right? Is that where Contact was filmed?

04:42 No, Contact was filmed in the US, in New Mexico, close to Socorro, south of Albuquerque.

04:48 It was set, though, in that general area, right? It was definitely South America somewhere.

04:53 Maybe some parts of the movie.

04:55 Arecibo, maybe? I think so.

04:57 Oh, yeah.

04:57 There might be a shot or two in Arecibo. That's in Puerto Rico? I forget.

05:03 Okay. Yeah, yeah, yeah. Okay. Puerto Rico. Yeah. Okay. So it's not exactly the same one, but there's definitely with the mountains there, there's a bunch of observatories, right?

05:11 Yes, yes. It's very heavy on the optical side as well. So for optical and radio astronomy, you have a different set of requirements, if you want. For optical, you basically want like super, super clear skies.

05:21 Whereas with radio telescopes, you can have clouds and still observe. So Chile, in the north of Chile, there's a huge desert, which is very, very dry. That's perfect for optical astronomy.

05:33 Right, right, right, right. So yeah, I hadn't really thought about that. Of course, for optical stuff, the higher, the better, the clearer, the better.

05:42 What are the requirements for radio telescopes?

05:44 It depends on the frequency, the radio frequency that you're observing. If you are in the high frequencies, it's basically the amount of water in the air. It's called the PWV, the precipitable water vapour, I forget the exact term. And for the lower frequencies of radio astronomy, our foe is radio frequency interference, basically any device that is emitting radio frequency waves. So you want very isolated places for that.

06:14 Okay, I see. As we'll learn, you can measure things like water and stuff very far away with radio telescopes, right?

06:20 Yes, that's correct.

06:22 So I suspect having like water in the air is a problem. You don't want that. That's interesting.

06:27 I mean, it's how your microwave works, you know. A microwave agitates the water. It's the same sort of basic principle. In the millimeter band, that's why the ALMA telescope, which is in the Atacama Desert, as Rodrigo was saying, has to be so high.

06:43 So there is no moisture there. Whereas the stuff we tend to work on generally can be down at sea level, almost or a little bit higher.

06:52 Okay, interesting. And what do you guys do day to day? Are you both doing astronomy basically day to day? Or code for astronomy?

07:01 Yeah, code for astronomy, pretty much. I mean, most of my work is helping more hardcore astronomers do things faster.

07:09 So, for example, a group who were doing some optical work, it was taking 42 days to do something. They then passed it over to us and we got it down to 18 hours.

07:20 That's awesome. That means you can do so much more science, right?

07:24 Yeah. But it's a classic divide and conquer problem. Parallelize like mad.

07:29 Although most of our astronomy tasks are embarrassingly parallel.

07:33 We scatter and we don't really do a gather until the very end. That's it.

07:39 I see. So it's almost like you could almost do individual computation on a per pixel basis, maybe? Or the equivalent of a per pixel basis?

07:48 We tend to work in frequency channels more than pixels.

07:52 But yes, so we would just process one particular or one band of frequencies on one machine, another band on another and another on another.
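
A minimal sketch of that scatter/gather pattern: each frequency band is processed completely independently and results are only combined at the very end. The band list and the per-band reduction below are placeholders, not the real pipeline.

```python
# Embarrassingly parallel per-band processing: scatter the bands across
# workers, gather a single result at the end.
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def process_band(band_id: int) -> float:
    # Stand-in for the real per-band work (calibration, imaging, ...):
    # generate fake samples and return one summary statistic.
    rng = np.random.default_rng(band_id)
    samples = rng.normal(size=1_000_000)
    return float(samples.std())

if __name__ == "__main__":
    bands = range(16)  # one entry per frequency band / machine in the real case
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_band, bands))  # scatter
    print("gathered:", np.mean(results))               # single gather at the end
```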

08:01 Other work I do is quite a bit of machine learning work for detecting RFI, gravitational waves, doing corrections.

08:09 We're actually now moving some of our astronomy work into ocean wave investigations and trying to look at whether we can predict the swell heights.

08:19 So, you know, surfers know whether it's going to be a good day to go surfing.

08:23 Right. Okay.

08:24 Now, that would be a really unexpected consequence or outcome or capability from studying gravitational waves is better surf predictions.

08:34 They're just waves. It's a propagation speed that's different.

08:38 One's rather quick and one's not.

08:40 Yeah, I guess so.

08:42 Yeah, the whole gravitational wave detection stuff is some pretty cutting edge science and it's really interesting.

08:47 And it's cool that you're using machine learning to try to understand that.

08:51 We have a small group working on it.

08:53 We've got devices in the proper detectors.

08:57 So this is a very active area of research.

09:00 There's a lot of groups around the world working on this.

09:02 Yeah, I think it's kind of amazing.

09:04 There's a lot of stuff with gravity oriented things in astronomy right now.

09:08 We have the gravitational wave detection for the collisions of black holes.

09:13 We have the first picture of black holes in the last year and a half or so, whenever that was.

09:18 A lot going on around there.

09:19 Yeah.

09:20 And then, of course, the other thing is teaching.

09:23 Sure.

09:24 I guess if you're at a university, eventually, you might end up interacting with a student or two.

09:30 Very cool.

09:30 Unnecessary.

09:31 All right.

09:33 Rodrigo, what about you?

09:34 Well, kind of similar.

09:36 I'm a software person, right?

09:38 Who became involved in astronomy.

09:39 So I basically help astronomers to develop software in different languages for different purposes.

09:45 So not only for radio astronomy, but also for optical astronomy.

09:48 And also for, we have also a theoretical group.

09:52 So people who do simulations of galaxy formation and such.

09:56 So kind of all over the place.

09:58 And we, not only me, but all the people in the group we specialize kind of in this area of helping astronomers build the software, deploy it, optimize it, and so on.

10:08 How much do you end up helping them with standard software engineering things?

10:13 Like, hey, hey, I need to teach you source control.

10:16 This is Git and GitHub.

10:17 Let's spend an hour talking about that.

10:19 Or are they pretty much good to go?

10:21 Yeah, it depends on the generation, I would say.

10:24 So older generations are a bit harder to kind of, you know, move to that side.

10:28 But newer people, like younger people, then come with all those concepts already kind of built in, right?

10:35 They were born and GitHub was already there kind of thing.

10:38 So you don't have to push that far.

10:41 It's still mostly on the, maybe on the software design side of things.

10:45 You know, how you structure your software, how you tackle that particular problem, how you organize the code, how you optimize things for your particular architecture.

10:54 And so on.

10:55 Okay, cool.

10:56 And you're also working on this SKA construction, the Square Kilometer Array.

11:02 Yes, yes.

11:03 That's a whole topic.

11:03 I guess we'll talk more about it later.

11:06 But we are one of the main institutions that are working on the Square Kilometer Array project.

11:12 Yeah, so it's interesting.

11:13 I don't know if this works for light, but it does for radio, that if you put multiple detectors sort of densely, but not actually connected into one giant antenna or something, you can put them together like a bigger detector, right?

11:30 A bigger lens in the radio world.

11:33 So that's the idea, right?

11:34 Yes, and that's exactly the idea.

11:35 It's called interferometry.

11:36 You basically, if you have, say, three antennas, A, B, and C, what you do is you take measurements individually from A, from B, and C, and then you correlate every other pair.

11:48 So you correlate the signals from A and B, from B and C, and from A and C, and you do that with a correlator, which is the one that is doing all this mixing of signals, and out goes one correlated signal, which is as if you have one big antenna.

12:05 So that's what happens in radio interferometry.

12:08 I think, I'm not sure, but I think in optical you can also do interferometry, but I'm not sure how the mechanism works in that sense.
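
A toy illustration of that pairwise-correlation idea, assuming NumPy and synthetic antenna signals. A real correlator does this per frequency channel in dedicated hardware; this only shows the basic arithmetic for the three baselines A-B, A-C, and B-C.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n_samples = 4096
common = rng.normal(size=n_samples)            # signal from the sky, seen by all
signals = {
    name: common + 0.5 * rng.normal(size=n_samples)  # plus per-antenna noise
    for name in "ABC"
}

for a, b in combinations(signals, 2):
    # Zero-lag cross-correlation for this pair of antennas (one "baseline")
    corr = np.dot(signals[a], signals[b]) / n_samples
    print(f"baseline {a}-{b}: {corr:.3f}")
```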

12:16 Cool. So this SKA project, this is the Square Kilometer Array, which is an international project that you all are working on involving 13 countries that are, I guess, full members of the project and four others who are just participating, right?

12:31 Yeah, that's right.

12:32 The Square Kilometer part is the collecting area because we're starting to run out of adjectives.

12:38 There's things like the Very Large Array, there's the Extremely Large Telescope.

12:43 Where do we go?

12:46 Yes, exactly.

12:46 It actually tells you what the collecting area of the final system is going to be.

12:52 Now, we're going to be building this thing in two phases.

12:56 Phase one will only be 10% of the final telescope, which means, I mean, it's being built in two countries.

13:03 So the low-frequency component is coming to Australia, to Western Australia, and the mid-frequency is going into the Karoo in South Africa.

13:12 So there'll be 196 15-meter dishes in South Africa and 131,072 antennas in Western Australia.

13:24 There's a fair bit of kit going out there, with a cost of about 650 million euros.

13:31 650 million euros.

13:33 Just for the first one.

13:35 This is the first part, yeah.

13:37 I don't know.

13:37 131,000 antennas bringing in all this data.

13:42 Yeah, that is a huge amount of antennas.

13:46 And it's spread over a square kilometer.

13:48 Well, it's a huge amount of data.

13:49 Yeah.

13:50 It's about 550 gigabytes a second.

13:53 550 gigabytes a second.

13:55 I don't really have a great way to understand that number, honestly.

14:00 Like, you've got to think of, like, large cloud services, like YouTube or Netflix or something like that, right?

14:07 Yeah.

14:08 Just about 16,000 hours of standard definition YouTube every second.

14:13 Wow.

14:14 Yeah, or you can visualize it if you take your, you know, your hard drive, your 500 gigabyte hard drive, and you throw it, and you throw one of those every second, right?

14:22 That's basically it.

14:24 Yeah.

14:25 That's a lot of data.
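
Some back-of-envelope arithmetic on that 550 gigabytes a second figure; the per-day totals below are just that arithmetic, not quoted SKA numbers.

```python
# Rough scale of the correlator output rate mentioned above.
rate_gb_per_s = 550
seconds_per_day = 24 * 60 * 60

gb_per_day = rate_gb_per_s * seconds_per_day
print(f"{gb_per_day / 1e6:.1f} PB per day")              # ~47.5 PB/day
print(f"{gb_per_day / 500:,.0f} x 500 GB drives per day")  # ~95,000 drives/day
```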

14:27 Also, it takes a lot of power, right?

14:28 Yeah.

14:29 And that's one of the key things, because we would like to be as green as possible, but we've got a power cap on us at the moment of 5 megawatt.

14:38 Biggest, most powerful system on the planet at the moment is 13 megawatts.

14:43 So that's still a challenge we have to address.

14:46 Yeah.

14:47 You almost need your own power plant to power it.

14:50 Oh, well, up at the Murchison Radio-astronomy Observatory, we've got the CSIRO, who've got a couple of megawatts of solar up there already.

14:58 Yeah.

14:58 Okay.

14:59 Is it the blades that generate RFI, or is it the generators that generate the RFI?

15:04 It's the generators.

15:05 Yeah.

15:05 Yeah.

15:06 This portion of Talk Python to Me is brought to you by Linode.

15:11 Whether you're working on a personal project or managing your enterprise's infrastructure, Linode has the pricing, support, and scale that you need to take your project to the next level.

15:20 With 11 data centers worldwide, including their newest data center in Sydney, Australia, enterprise-grade hardware, S3-compatible storage, and the next-generation network, Linode delivers the performance that you expect at a price that you don't.

15:34 Get started on Linode today with a $20 credit, and you get access to native SSD storage, a 40-gigabit network, industry-leading processors, their revamped cloud manager, cloud.linode.com, root access to your server, along with their newest API, and a Python CLI.

15:51 Just visit talkpython.fm/linode.

16:21 I think if I could just throw those out there really quick, if you guys could just comment on them.

16:25 All right.

16:25 One thing, these are facts about the final system.

16:28 So this is where we will get to when we finish building it.

16:33 Right.

16:34 Okay.

16:34 So, yeah, you're only talking, we're only really working on stage one now, and there's going to be some beyond that, right?

16:40 Oh, yeah.

16:40 Yeah.

16:41 Awesome.

16:42 So the next one, the next amazing fact is that the SKA central computer will have the processing power of 100 million PCs.

16:51 Rodrigo, is that because there's a bunch of GPUs, or are there like really just a lot of CPUs in there?

16:56 It's both.

16:57 So the final design of the computer for the SKA is still not fully decided, but it's definitely going to be a mixture of both.

17:04 Right.

17:04 And, yeah, so all of this is based on the, we calculate how many computations we will need to do.

17:10 Therefore, that kind of gives you the size.

17:13 Wow.

17:13 Okay.

17:14 The next one is the dishes will produce 10 times as much data traffic as the global internet.

17:20 Yes.

17:21 That's crazy.

17:22 Yeah, yeah.

17:23 That's because you have so many dishes, right?

17:27 Right.

17:28 This is before it goes to the correlator, of course.

17:30 Like what comes off the correlator, as Kevin was saying, is about half a terabyte per second, which is obviously not what the global internet traffic is.

17:38 But what comes out of the individual antennas, yeah, it definitely is bigger than the internet traffic.

17:44 Wow.

17:44 And there's a bunch of fiber optic that brings it back to these correlators that then like process it and averaging.

17:49 Yeah, exactly.

17:51 Yeah, yeah.

17:52 Yeah, crazy.

17:53 And then I guess finally the aperture arrays could produce up to 100 times the global internet traffic.

17:59 So, yeah, there's, I think this is a pretty interesting one.

18:02 Honestly, the one that's most exciting to me is the one about the airport radar on a planet tens of light years away.

18:12 That's what everyone is waiting for.

18:14 There are only two planets that we know of at the moment that fit within that area that could potentially hold life, though.

18:23 So, I mean, our nearest neighbor, Alpha Centauri, Proxima Centauri, has got two planets around it.

18:30 But it's a red dwarf, which means that it's quite feisty, with lots of solar flares.

18:35 So life as we know it would probably struggle to evolve there.

18:41 You want a nice star like ours that's nice and sensible and doesn't throw a huge amount of rubbish at us.

18:49 Yeah, do you want to get like a cleansing radiation spray every 10 years or whatever?

18:53 Yeah, that's right.

18:54 It's not good for you.

18:56 Yeah, there's probably not enough sunscreen to like help you with that one.

19:01 I don't know.

19:03 Yeah, that's one of the sad things for me about all of this space stuff.

19:07 I really wish.

19:09 It's just so big that it's just really challenging to actually explore it, interact with it, measure it.

19:16 Like even if you do get measurements back, it's like, well, that was 100 years ago.

19:20 It would take another 100 years to, like, send them a message.

19:23 Well, it's only about four light years away.

19:26 I mean, last year, year before, I took a bunch of primary school students.

19:32 We talked the European Space Agency into lending us their dish at New Norcia here.

19:37 And we sent messages to Proxima Centauri.

19:39 All right, so we wait 4.2 years for it to get there, 4.2 years for them to decode it and send a reply back.

19:47 So in about six years' time, we'll know.

19:49 Yeah, that's not too bad, actually.

19:50 Not too bad.

19:51 Cool.

19:52 So I guess maybe the next thing that's interesting to dig into before we get like fully into the programming side of things is just like what kind of questions are you guys trying to answer?

20:02 I mean, it's super cool to have this giant radio telescope with 131,000 antennas together in this giant array.

20:10 But you get some measurements off of it.

20:12 Then what?

20:14 Well, one of the things that we've been joking about, but it's the cradle of life.

20:18 Are we alone?

20:19 One of the things a radio telescope can see is molecules in space.

20:24 We have water, hydrogen sulfide, ammonia, carbon monoxide.

20:29 But we can also see things like methanol, glycolaldehyde, which is a simple sugar, and aminoacetonitrile.

20:36 Now, if the sky is clear at night and you look up at the constellation Orion, there's a nebula in there that has these chemicals floating in it.

20:50 And that's precursor organic compounds to what basically we are.

20:55 Yeah.

20:56 Yeah, that's super cool.

20:57 And if those gases and small particles coalesce into planets, those planets are going to have those things.

21:04 Or asteroids that crash into planets, right?

21:06 Yeah.

21:06 I mean, the other thing we have to look for is things like galaxy evolution, testing cosmological models, looking for dark matter, dark energy, origins and evolutions of cosmic magnetism.

21:18 We really don't know much about that at the moment.

21:21 And one of the fun things is, in the epoch of re-ionization, I'll put my teeth back in, the epoch of re-ionization.

21:31 After the Big Bang, we had everything being highly ionized gas.

21:37 And then, after about 300,000 years, it became neutral and dark.

21:41 And then, slowly, as galaxies and quasars began to re-ionize things, stars started to appear, galaxies started to appear, until about a billion years later, when everything started to become transparent again.

21:55 So, we want to go back and have a look at that time.

21:59 And to do that, we need a huge collecting area, because radio photons are about 2 million times weaker than optical photons.

22:08 Right. So, you've got to have something incredibly sensitive to go far enough back in time to see that kind of stuff, right?

22:14 Yeah.

22:15 These measurements that allow you to see things like hydrogen and water and carbon monoxide and so on, each molecule has its own signature in the radio wave.

22:26 Yeah.

22:27 I mean, it's in the spectra.

22:31 So, we can look at the spectra and see, you know, there's a peak there.

22:34 There's a line there.

22:36 Well, that probably means it's being absorbed by something.

22:38 This is something emitting.

22:39 Right.

22:40 So, you know, it's just spectroscopy, which is used in optical, x-ray, ultraviolet, infrared, radio.

22:51 We all do it.

22:52 Okay.

22:52 Yeah.

22:53 So, it's like NMR, far away.

22:55 Yeah.

22:58 And then, of course, you know, we look at the redshift to see how far away things are, because, you know, space time expanded, the radio waves stretched.

23:06 Yeah.

23:06 I guess you've got to compensate as well.

23:08 We can then look for it and see how far away things are.
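
As a small worked example of reading distance off that stretching, here is the redshift calculation using the well-known 21 cm hydrogen line; the observed frequency below is invented for illustration.

```python
# A line with a known rest frequency shows up at a lower observed frequency;
# the ratio tells you how much the universe has expanded since emission.
REST_FREQ_MHZ = 1420.405751  # HI 21 cm line rest frequency

def redshift(observed_mhz: float) -> float:
    return REST_FREQ_MHZ / observed_mhz - 1.0

f_obs = 710.0  # MHz, hypothetical observed frequency
z = redshift(f_obs)
print(f"z = {z:.3f}")  # ~1.0: wavelengths have been stretched by a factor (1 + z)
```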

23:12 Wow.

23:13 It's really amazing that you can just send out radio waves and then get all this information.

23:16 Oh, no, we're not sending.

23:17 We're receiving.

23:18 Yeah.

23:18 Yes.

23:18 Okay.

23:19 Thank you.

23:19 But it's like that you can measure these radio waves.

23:22 Hmm.

23:22 And just, you can basically see it, right?

23:27 It's almost like as if you've got an optical telescope, but, you know, you're computing a visual representation for humans, right?

23:35 Yeah.

23:36 Except for it takes a lot longer.

23:39 Speaking of all the computation stuff, let's dig into it.

23:41 So, I know, Rodrigo, you're working a lot on this project.

23:44 And so, you guys have got your hands on this pretty serious computer, right?

23:49 Yes.

23:50 So, the work that we did last year was about running simulations of all of this, but not only, you know, one or two computers, but at very big scales.

24:03 Basically, to the biggest scale possible that we could achieve now and trying to come up with what the system will look like in 10 years when we actually have to run it at that scale, right?

24:15 So, we teamed up with the Oak Ridge National Labs in the US.

24:19 And they own, at the moment, the biggest, not the biggest, the fastest supercomputer in the world.

24:25 It's called Summit.

24:27 So, Summit has over 4,600 nodes.

24:30 And on each node, you find six GPUs, six V100 NVIDIA cards, plus something like 160 cores.

24:39 It's like each node on itself is a beast.

24:42 And you have 4,600 of them, right?

24:45 Right.

24:46 We teamed up with Oak Ridge.

24:48 We wanted to run a simulation on their computer.

24:51 But, of course, they also wanted something kind of back, right?

24:55 It wasn't a free lunch.

24:57 And we have been collaborating with them for a number of years.

24:59 And one area of collaboration that we have been working on is using their ADIOS2 library.

25:05 I can dig into that in a second, but it's basically an IO framework for large distributed programs using MPI.

25:13 So, that was the deal.

25:14 We got some time in Summit.

25:16 We have an ex-PhD student of ours who is working right now over there.

25:22 So, he was our main contact point.

25:24 His name is Jason1.

25:25 Yeah, we decided to run a couple of different experiments to test all these individual parts.

25:30 So, the first experiment was to simulate an actual observation of an SKA-like telescope.

25:38 And that was basically using the whole machine.

25:41 We used almost all the nodes and all the GPUs in all of these nodes to simulate what the correlator would produce when observing, right?

25:51 So, we're not simulating individual antennas.

25:53 Right.

25:53 We're just simulating the output of the correlator, right?

25:55 The observation that we decided to simulate is exactly the epoch of reionization, which is one of these big use cases of the SKA.

26:04 So, we decided to simulate that.

26:06 We simulated the output of correlator as if it was correlating as many antennas as the SKA.

26:13 The only aspect that we had to tune down a little bit is the number of frequency channels that we, double quotes, observe in our simulation.

26:21 In the SKA, you can observe up to 64,000 channels.

26:26 We simulated about 28,000, 29,000.

26:29 Basically, one channel per GPU.

26:31 Yeah.

26:32 So, you had 27,000 GPUs running.

26:36 Yes.

26:36 Yeah, that's right.

26:37 Full power for six hours to generate a portion.

26:41 No, sorry.

26:42 For three hours, simulating six hours of observation.

26:44 Exactly.

26:45 Yeah.

26:45 Yeah, yeah, yeah.

26:46 To generate all this data.

26:48 Talk about the computing, right?

26:49 It goes in and this data just comes screaming into this supercomputer and you have to distribute it out and basically do all this processing.

26:58 Is it like image processing or is it like time series processing?

27:02 What are you doing?

27:04 In this first experiment, first of all, we generate the data in the supercomputer, right?

27:09 So, we don't have to bring anything from any external source.

27:12 We just generate the data on the GPUs and then we stream it out of the GPUs into the CPUs on each individual node.

27:19 We did some data reduction.

27:22 We basically took data from different channels and averaged it together.

27:27 So, we did that at the local node level first.

27:32 You know, the six GPUs, we coalesced into a single output signal.

27:37 And then every six nodes, we did again another further reduction.

27:41 This is something that would be similar to what you would be doing at the SKA.

27:45 And that basically reduces the amount.

27:48 There are, I guess, scientific reasons for when you do and when you don't want to do this kind of averaging.

27:54 But for the epoch of reionization, it's certainly something that you would do.

27:58 So, we did this two-step averaging and then we brought the data immediately to disk.
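
A rough NumPy sketch of that two-step averaging: first across the channels handled by one node's six GPUs, then again across groups of six nodes. Array shapes and the averaging factors are illustrative only, not the real data layout.

```python
import numpy as np

n_nodes, gpus_per_node, samples = 36, 6, 1024
# Fake per-GPU channel data: (node, gpu/channel, samples)
data = np.random.default_rng(1).normal(size=(n_nodes, gpus_per_node, samples))

step1 = data.mean(axis=1)                                      # average the 6 GPU channels per node
step2 = step1.reshape(n_nodes // 6, 6, samples).mean(axis=1)   # average every group of 6 nodes

print(data.nbytes, "->", step2.nbytes, "bytes")  # a 36x reduction in this toy case
```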

28:04 And that was the first experiment.

28:06 It ran on its own.

28:07 There was no further computation on that first experiment.

28:11 Yeah.

28:11 So, one of the things that's interesting is you guys are getting so much data that you can't write the raw data to disk.

28:17 Yeah.

28:18 So, you've kind of got to process it and filter it down and do this averaging.

28:21 And then you can finally save that bit, which is probably still a lot of data.

28:24 Yeah.

28:25 Yeah.

28:25 Yeah.

28:25 So, for example, the data that we generated off the GPUs during those three hours was about 2.6 petabytes.

28:33 And what we ended up with on disk was about 110 terabytes.

28:37 Wow.

28:38 So, yeah.

28:38 And that's in three hours.

28:39 Yeah.

28:39 That's in three hours.

28:41 Yeah.

28:41 Yeah.


28:43 As I was saying, we decided to average in this particular case, but on the real thing,

28:47 you may actually need to write all that data to disk.

28:50 And that's why we did some other experiments in that direction.

28:55 So, when I think of what I'm visualizing is there's just like you're saving so much data.

29:00 You know, if you have a power plant that runs on coal, there's like every day a giant train that brings in coal and it just is continuously going.

29:08 I can almost imagine like you almost are just like constantly shipping in hard drives and plugging them in.

29:13 Like, how do you deal with that?

29:14 The truck of hard drives is here today.

29:17 Quick, plug it in.

29:18 It's getting full.

29:18 Well, in the SKA, there will be a double buffer, basically.

29:22 So, as you observe, you fill one of your buffers with all the incoming data.

29:27 Once your observation finishes, you swap the buffers.

29:31 The next observation can fill the other buffer while you process the first.

29:35 During this later processing, you again reduce the amount of data by orders of magnitude.
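
A minimal sketch of that double-buffering idea. In the real system ingest and processing run concurrently; here they alternate serially for clarity, and the buffer sizes and the fake observe/reduce steps are placeholders.

```python
import numpy as np

def observe(n: int) -> np.ndarray:
    return np.random.default_rng().normal(size=n)  # stand-in for incoming data

def reduce_data(buf: np.ndarray) -> float:
    return float(buf.mean())                        # stand-in for the real pipeline

buffers = [np.empty(1_000_000), np.empty(1_000_000)]
fill, process = 0, 1                                # indices of the two buffers

for obs in range(4):
    buffers[fill][:] = observe(buffers[fill].size)  # ingest the current observation
    if obs > 0:
        # Meanwhile, the previous observation sits in the other buffer
        print(f"obs {obs - 1}: reduced to {reduce_data(buffers[process]):.4f}")
    fill, process = process, fill                   # swap buffers for the next round
```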

29:40 And that's where the next experiments come in.

29:41 So, I was talking before about the first experiment.

29:44 Well, we did a second and a third.

29:46 In the second experiment, we took the output of the first.

29:50 And we effectively reduced it even further.

29:52 So, the output of the first, which is basically this reduction of data from the correlator, gives you what we call in radio astronomy, visibilities.

30:01 So, in radio astronomy, you don't observe pixels and images directly.

30:05 You observe these visibilities that later on you have to actually image.

30:09 You have to create an image from them.

30:12 And that takes much longer, as Kevin was saying.

30:15 It's a much more complicated process.

30:16 So, that's why you can do it a bit offline.

30:20 And we did that during the second experiment.

30:22 We took all the 110 terabytes of visibilities and we created images for each of the channels.

30:28 So, if you have many images for each of the channels, you end up with an image cube.

30:32 That's what they're called in radio astronomy.

30:34 Or if you want, you can play it as a movie as you go across the different channels.

30:39 And that image cube, it turned out to be like 3.3 gigabytes or something.

30:44 Like, again, a massive reduction of data.
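
A toy version of that per-channel imaging and cube stacking, assuming NumPy. The inverse FFT below is only a stand-in for the real gridding and deconvolution step; the shapes are arbitrary.

```python
import numpy as np

n_channels, n_pix = 8, 256
rng = np.random.default_rng(2)

images = []
for ch in range(n_channels):
    # Fake complex visibilities for this channel
    vis = rng.normal(size=(n_pix, n_pix)) + 1j * rng.normal(size=(n_pix, n_pix))
    image = np.abs(np.fft.ifft2(vis))   # placeholder "imaging" step
    images.append(image)

cube = np.stack(images)                  # image cube: (channel, y, x)
print(cube.shape, f"{cube.nbytes / 1e6:.1f} MB")
```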

30:47 Yeah, that starts to get to the level you can write that down.

30:50 Yeah, yeah, yeah, exactly.

30:51 But it's no problem, yeah.

30:52 Yeah, exactly.

30:53 And that you can also distribute across the world more easily through the internet.

30:57 And again, in the SKA, there will be something like that.

31:01 There will be the main computer that will do the main reductions.

31:05 And after the main reductions are done, data is sent over to what are called the SKA regional centers.

31:12 And that's where the final science will be done.

31:14 Yeah, so it sounds a little bit like the Large Hadron Collider, which does a ton of computation and filtering and averaging and whatnot of the data.

31:22 But then it streams a bunch out to probably places like Oak Ridge and other places where it gets further processed and further processed.

31:29 It sounds like you might be doing something similar here in the end.

31:32 Yeah, yeah, exactly.

31:33 It's all about kind of reducing the size of the data, depending on your science use case, and then distributing that.

31:41 So, let me ask you guys something.

31:43 You're running the simulation on Summit in Oak Ridge, which is the fastest supercomputer in the world, or nearly so.

31:50 What are you going to do in the real one?

31:52 Like, are you going to build one of the largest supercomputers in Australia and then another one in South Africa?

31:57 Is that pretty much what you have to do?

31:58 Yes, I know.

31:59 So, by that time, it won't be the fastest, right?

32:04 Sure.

32:04 Compared to current standards.

32:06 We'll do it on our iPhones by then.

32:08 I mean, what is it?

32:08 Yeah, exactly.

32:09 Everyone will collaborate a little bit.

32:11 No, what's the plan for dealing with this?

32:13 Because it sounds like you've got to move a serious bit of compute next to this system.

32:19 Yeah.

32:19 So, the plan is effectively run something in the order, if I'm not mistaken, of 100 petaflops.

32:26 I think 150 petaflops is the size of the supercomputer that should be built, which is kind of comparable to what Summit is doing now.

32:33 When is the time frame for this?

32:36 About 10 years, give or take.

32:38 This portion of Talk Python to Me is sponsored by Clubhouse.

32:43 Clubhouse is a fast and enjoyable project management platform that breaks down silos and brings teams together to ship value, not features.

32:50 Great teams choose Clubhouse because they get flexible workflows where they can easily customize workflow state for teams or projects of any size.

32:58 Advanced filtering, quickly filtering by project or team to see how everything is progressing.

33:03 And effective sprint planning, setting their weekly priorities with iterations and then letting Clubhouse run the schedule.

33:09 All of the core features are completely free for teams with up to 10 users.

33:14 And as Talk Python listeners, you'll get two free months on any paid plan with unlimited users and access to the premium features.

33:21 So get started today.

33:22 Just click the Clubhouse link in your podcast player show notes or on the episode page.

33:28 It'd be fun to dig into some of the Python code and some of the architecture that you guys had to put in place to make this happen, right?

33:36 So a lot of these types of systems, they have a lot of C++ in place.

33:40 They also have a lot of interesting Python going on, I'm sure, in the data science visualization side, but also maybe more in the core of the system.

33:48 Yeah.

33:49 First, I should probably mention the execution framework that we use for this.

33:53 So instead of running things with MPI, we have been using an execution framework that we developed at ICRAR, our institute.

34:02 It's called DALiuGE.

34:02 Kind of difficult to pronounce, even more difficult to write.

34:04 But I will give you a link to that.

34:07 Yeah, super.

34:08 The idea of this execution framework is a bit like Dask, which people are more familiar with, in the sense that you build a graph with your computations, and then you execute that graph in a number of workers, right?

34:19 Now, the big difference between Dask and DALiuGE, our execution framework, is that Dask is very dynamic in its nature.

34:27 You can bring workers up and down, and then the scheduler will dynamically adjust the load to what you have available, right?

34:35 Whereas DALiuGE is more designed for the SKA case in particular, but it's still pretty generic.

34:42 But one of the main design decisions was to work with a static deployment.

34:47 So instead of trying to be dynamic in nature and try to move data from here to there and restart the computation here and whatever, we try to be very static.

34:57 Because moving data from one place to another is a very expensive operation.

35:02 If the compute is what's expensive and the data is not that bad, moving it around to balance the compute and getting that to happen is really important.

35:11 But when you have so much data that it's more than the internet.

35:15 Yeah.

35:16 You don't want to be moving around the internet data.

35:20 You're already at probably near a limit of moving it around just to get it somewhere.

35:24 So, like, you know, combinatorially passing it around is not really what you're after.

35:29 Exactly.

35:30 Yeah.

35:30 So instead of focusing on that dynamism that Dask gives you, we focus on having a very good schedule up front of your computations.

35:40 So you know exactly how long each one is going to take.

35:43 You know exactly how much data is going to be where.

35:46 And you keep it like that, basically.
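
To make the graph-building model concrete, here is a sketch using Dask, the framework Rodrigo names as the familiar comparison, not DALiuGE itself. DALiuGE fixes the schedule and data placement up front, which this snippet does not show, and the per-channel functions are placeholders.

```python
from dask import delayed

def load_channel(ch):
    return list(range(ch, ch + 100))   # stand-in for reading one channel's data

def reduce_channel(data):
    return sum(data) / len(data)       # stand-in for per-channel processing

# Building the graph is lazy and cheap; nothing runs yet.
partials = [delayed(reduce_channel)(delayed(load_channel)(ch)) for ch in range(8)]
result = delayed(sum)(partials)

print(result.compute())                # execute the whole graph on the workers
```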

35:48 That sounds very cool.

35:49 That's the main difference.

35:50 And we have been developing this in a prototype fashion, but now using it in the real world as well.

35:56 We used it for this summit demonstration.

35:59 So all the things that we run, all this big simulation, all the processes that we have to spawn and so on, all these 4,600 nodes, we did it using our execution framework.

36:08 And just to give a very quick overlook, the execution framework is using 0MQ to send messages between the different entities.

36:18 And they're called managers and node managers.

36:20 So we send events across the different node managers and we use 0RPC, which is an RPC framework built on top of 0MQ.

36:28 We use that to also do a couple of remote calls between different node managers.

36:34 All the scheduling of the graph is done using Python.

36:37 There is some interfacing with the Metis library, which is written in C, but there is a Python wrapper for that already.

36:44 Right.

36:45 But the rest is all Python.

36:46 0MQ looks really interesting.

36:48 And I haven't done anything with it, but it has a nice Python library.

36:52 Yeah.

36:52 Yeah.

36:52 Yeah.

36:52 It's very, very nice.

36:53 PyZMQ, right?

36:55 Yes.

36:55 Yeah.

36:56 That's right.

36:56 Yeah.

36:57 It seems like something that would be really useful if you're sending a lot of messages around and whatnot.

37:01 And then something I had not even heard of, which you had brought up, is 0RPC, which, so it basically sends messages out over 0MQ and then waits for a response or something like that, kind of to come back as another message?

37:16 Yeah, exactly.

37:17 Is that how?

37:18 Okay.

37:18 It's what you would expect from an RPC framework, right?

37:21 You can get a reference on a remote object and invoke methods and get replies, pass down parameters and so on.

37:30 And all of that then travels through 0MQ.

37:33 I think it's using message pack for the serialization and then 0MQ for the actual networking.

37:42 And on top of that, I think 0RPC has bindings for different languages.

37:47 So you can also do inter-language RPC with 0RPC.
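
A minimal zerorpc sketch in the spirit of those node-manager calls, assuming zerorpc is installed; the NodeManager class, its methods, and the port are made up for illustration.

```python
import sys

import zerorpc

class NodeManager:
    """Hypothetical stand-in for a node manager exposing a couple of calls."""

    def status(self) -> str:
        return "ok"

    def start_session(self, session_id: str) -> str:
        return f"session {session_id} started"

if __name__ == "__main__":
    if sys.argv[1:] == ["server"]:
        server = zerorpc.Server(NodeManager())
        server.bind("tcp://0.0.0.0:4242")   # port chosen arbitrarily for the example
        server.run()
    else:
        client = zerorpc.Client()
        client.connect("tcp://127.0.0.1:4242")
        print(client.status())
        print(client.start_session("obs-001"))
```

Run it once with the server argument, then again without it in another terminal to make the remote calls; the arguments and replies travel over ZeroMQ underneath.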

37:52 Right.

37:53 Okay.

37:54 Well, that looks really interesting.

37:56 And to me, one of the big challenges it sounds like of programming this system is it's just so big and so distributed that you really need these layers.

38:05 You know, you talk to this sort of, okay, I'm going to talk to the 0RPC.

38:09 It talks to 0MQ, which then might talk to this distributed scheduling service that then figures out how stuff actually runs, right?

38:17 Like there's just layer after layer.

38:18 Is that a big challenge?

38:20 Definitely.

38:21 Yeah.

38:21 Yeah.

38:21 You have to try to keep, as you were saying, all your layers as clean as possible.

38:26 Before we even settled on 0RPC, we also tried other RPC frameworks like Pyro4, and I don't remember which else.

38:34 So on top of, you know, having to build layer on top of layer, we also have to support different ones at the same time.

38:40 I think we still have the support there and you can kind of turn it on, but it's not really what we use.

38:45 We just use 0RPC.

38:46 Yeah, sure.

38:46 Another interesting library that you guys have and you actually maintain is iJSON.

38:52 Yes.

38:53 We use and maintain iJSON, as well as CRC32C.

38:56 So iJSON, briefly described, is a way to iterate over very long JSON streams of data without having a big memory consumption.

39:06 So you parse, parse, parse the JSON iteratively and you get kind of parsing events out of that.

39:11 Well, there are different parsing levels.

39:14 You can have, like, full objects, and you can kind of query what kind of objects you want to get from your JSON stream and so on.

39:21 And all of that is done iteratively.

39:23 So you get, like, an iterator. You know, we are preparing version 3.0, and in that one you will get asynchronous iterables as well if you are in the asyncio world.

39:34 And we got into this because we were, again, dealing with very big computational graphs, which we express as JSON content.

39:42 That's how we transmit this from one site to another.
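
A small ijson usage sketch matching that description; the file name and the assumption that it holds one long "nodes" array are invented for the example.

```python
import ijson

with open("huge_graph.json", "rb") as f:
    # Yields one fully built dict per element of the "nodes" array, lazily,
    # so the whole document never has to fit in memory.
    for i, node in enumerate(ijson.items(f, "nodes.item")):
        print(node)
        if i == 4:      # stop early; the rest of the file is never parsed
            break
```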

39:45 Yeah.

39:46 And the last thing you want to do is like load gigabytes of JSON and deserialize it.

39:50 Yeah, yeah, exactly.

39:51 I just want these sub items here and then the rest of the time I don't care about.

39:56 Or maybe I just want the first one.

39:58 Like a first would be fine.

39:59 Even if there's 10 million, just give me the first.

40:02 Yeah, yeah, exactly.

40:04 Yeah, iJSON is super cool.

40:04 And I think people, there's probably a lot of people who could take that and use it.

40:07 You know, I was just, I had someone in the office hours for my online courses saying,

40:12 I'm working with something that's a huge amount of data, something like Google BigQuery or something like that.

40:18 It was like, it's too hard to load up all of this at once.

40:21 So I got to take little bits and load it.

40:23 And like, well, you know, have you thought about IJSON, right?

40:26 That'd be really cool.

40:27 And use a process like that or something along those lines.

40:29 Yeah.

40:30 You know, the other thing I kind of touched on it just a minute ago, but maybe you could speak a little bit more to it, you guys.

40:36 Something that just amazes me in general, but this is such a big scale of it that I think it's even

40:42 more interesting is a lot of times how we develop code in the small and then deploy it to somewhere in the cloud in a much, much bigger,

40:51 more complex system than maybe we're used to working on.

40:56 So, you know, the example that comes to mind for me is like, there's some developer at a coffee shop.

41:01 They're working on, you know, a MacBook Air or something completely weak like that, right?

41:07 And they're running a single instance of like a dev server.

41:11 And then they push something to GitHub.

41:12 It automatically gets picked up by CICD.

41:15 It pushed over and it kicks off, you know, like a whole new version of some giant app running in a Kubernetes cluster across,

41:22 who knows, 10 servers and a bunch of nodes and pods and so on.

41:26 And then it's like, I'm going to be able to do this thing.

41:29 And then like, it's scaled out to this huge system for you guys.

41:34 Like, how does that work?

41:36 Right.

41:36 How do you debug this thing?

41:37 How do you reason about like, it's little algorithms that are running.

41:42 Can you set a breakpoint?

41:42 Anything, right?

41:44 Is that like a thing that you can do?

41:45 Or is that like just literally too much?

41:47 It's just, it's impossible.

41:48 Can you, do you have like a Kubernetes cluster locally that allows you to kind of simulate this?

41:54 Or do you have to program on the giant thing?

41:56 No, you don't need to program on the giant thing.

41:59 So for all these summit experiments, we started, as you were saying, like on the small, on our own laptop computers.

42:06 I, and also different platforms.

42:08 I usually run a Linux machine, but some, some people in our team or most people in our team use Macs.

42:15 Yeah, we started on the very small.

42:17 And for that, you have to make sure that on the very small scale, you know exactly what's going on.

42:22 And you know that everything is working as you expect, so you don't have unexpected errors, unexpected troubles in the future, right?

42:30 Right.

42:30 For example, when developing DALiuGE, we have very good test coverage for the whole code base.

42:36 And we run all the tests, you know, without internet connection on a single node.

42:40 Because at that scale, you have to make sure that it's working, that everything is working fine.

42:44 And from then on, you start kind of escalating and testing more complex things.

42:48 But you have to have a very solid foundation.

42:50 That's true for the development of DALiuGE, but also for the development of the code that we use in the Summit demonstration.

42:56 So beyond our laptops, we then went onto, you know, a server that had one GPU, then a small cluster with two, three nodes that had a couple of GPUs each.

43:08 The Summit system in Oak Ridge, it's a Power 9 system.

43:13 It's not an Intel system.

43:15 Again, we, before jumping into Summit, we jumped into a cloud provider in the US that offers Power 9 machines.

43:21 We made sure that everything worked there in a single node.

43:23 Little by little, you start tackling problems as they come before hitting the full machine.

43:29 Yeah.

43:30 And I guess for you guys, there's another level of challenging where the machine itself, you can't just go to it.

43:38 It's not like I could go to the cloud now and ask for a Kubernetes cluster if I'm willing to pay for a little bit of it.

43:45 And I could just do that whenever I want.

43:47 But I suspect this large computer is pretty much booked out.

43:51 And you can't just get it whenever you want to make it go full power, right?

43:55 Yeah.

43:56 Well, first of all, there are all the kind of the paperwork involved, you know, in getting permission and so on.

44:00 They have to send you a physical key.

44:02 So they sent me a physical key from the US into Australia that I have to use when I log into the computer.

44:07 So, yeah, it's not your everyday AWS systems, right?

44:11 Yeah.

44:12 And once there, all the systems work with queues.

44:16 So you submit your jobs into a queue.

44:19 And then the queue schedule decides what runs when.

44:22 So, yeah, you're competing with a lot of people.

44:25 Depending on how many resources you're asking, you will be delayed or not and so on.

44:29 Yeah.

44:29 But in Summit itself, we also started scaling little by little.

44:33 We started with experiments with 6 nodes, 10 nodes, 60 nodes.

44:36 And little by little, we, again, started finding more and more problems.

44:40 You know, things that you never really think about.

44:42 Or very, very transient errors that only happen when you are spawning, you know, tens of thousands of processes.

44:49 And one of them fails.

44:50 And you didn't see it before because you didn't spawn as many processes before.

44:54 Yeah, those are tricky to catch.

44:56 Yeah, yeah.

44:57 You're not in control at that point, right?

45:00 Your control was it's like up and running.

45:02 Everything's fine.

45:03 Now your code runs, right?

45:04 Yeah.

45:04 And it's only once you find those errors that you can start to reason about them.

45:09 And that's also very difficult, right?

45:11 You cannot just go and attach yourself to thousands of processes at the same time and kind of step through them.

45:19 You have to set a break point on Summit.

45:21 Yeah, you will have to log a lot of stuff and then reason very heavily about what could be the possible cause.

45:29 So we caught a lot of those.

45:31 And then the final bit was the stress on the file system.

45:36 So all of these clusters, they usually have a central file system that is shared across the cluster.

45:41 But obviously, as you use more nodes, you are putting more stress into the central file system when reading and writing data, right?

45:50 So that becomes a problem.

45:52 Yeah.

45:52 Yeah.

45:54 So what was it like when you basically hit enter to submit the job for the 27,000 GPUs?

46:01 Were you like, we better get it right.

46:06 Yeah, we all assembled into a room.

46:08 We did a countdown and we hit enter.

46:10 Yeah.

46:10 We basically had one shot doing the full simulation because we were given a time allocation of 20,000 node hours.

46:18 We knew that in the big experiment, we were going to be using about 15,000.

46:22 So it was either going to work or not, right?

46:26 But it did.

46:27 It did?

46:28 So all the gradual scaling up, all the testing, it all worked.

46:33 It paid off.

46:33 It paid off.

46:34 It paid off.

46:34 Yeah.

46:35 Did you get like a weird news report in Tennessee that it was suddenly hotter a little bit that day than they expected?

46:42 I wouldn't be surprised.

46:46 Like there's a warm breeze coming from the east.

46:49 I don't know what that is.

46:49 But yeah, that thing must have been really screaming.

46:52 That's quite something.

46:53 All right.

46:53 So I guess we could probably wrap it up in terms of time.

46:56 But this is super interesting.

46:58 So maybe you guys could just tell us maybe some of the lessons learned.

47:01 We kind of touched on them a little bit, right?

47:03 This like scale it up a bit at a time.

47:05 But what are some of the lessons you all learned?

47:06 Yeah.

47:07 For me, it was, I think it was the full process mostly.

47:10 I also maybe learned a bit about Summit in particular.

47:14 But it was more the process of scaling all this exercise up.

47:19 That was really challenging.

47:22 It really stressed the importance of having very solid foundations before you take the next step.

47:27 Because otherwise, if you are kind of taking steps in the dark, you will continue hitting walls.

47:33 Sure.

47:33 Kevin, how about you?

47:34 I'll reiterate what Rodrigo said.

47:36 I mean, a lot of the early work with DALiuGE, obviously, was testing it.

47:40 There's a big project called CHILES, which is a very deep observation using the telescope in Socorro.

47:48 So four years.

47:49 And we did all of that on the Amazon cloud.

47:52 And we slowly built it up, using a slightly different software stack.

47:57 We didn't have all the problems with the GPUs.

47:59 But we would start with two or three nodes, parallelize it.

48:04 Does this work?

48:05 Are we getting what we want?

48:06 And then slowly wind it up.

48:09 So, I mean, it doesn't sound much now when we compare it to Summit.

48:14 I used to run it on about 200-odd nodes.

48:17 Okay.

48:17 Very cool.

48:18 Now, I guess let's wrap it up with the final two questions for you guys.

48:23 And you're welcome to throw out some of your own that you're maintaining or pick a different one.

48:27 But how about, yeah, notable PyPI package?

48:31 We'll start with that one.

48:32 NumPy.

48:33 NumPy.

48:33 All right.

48:33 Awesome.

48:34 Yeah.

48:34 I'm sure that's a foundation of a huge amount of work that you all are doing.

48:37 Yeah.

48:37 I think I will do some more stressing on iJSON just because I maintain it.

48:43 And because there is this new version coming.

48:45 So, yeah.

48:46 Go out.

48:47 Try it.

48:48 It's pretty cool.

48:49 Yeah.

48:49 It looks really, really useful whenever you have a ton of, like, very large JSON or you need to not load it into memory.

48:55 One I picked up from your other podcast is Typer.

48:58 That's beautiful for writing command line interfaces.

49:01 Oh, yeah.

49:02 Typer is great.

49:03 I think we covered that on Python Bytes.

49:04 That's right.

49:05 Yeah.

49:05 That's right.

49:06 It's a cracker.

49:07 Thank you for that one.

49:08 Yeah.

49:08 You're welcome.

49:09 That's super.

49:10 All right.

49:10 Now, when you all are writing some Python code or really any code, what editor are you using?

49:14 PyCharm.

49:15 PyCharm.

49:15 All right.

49:15 I use Eclipse.

49:17 I've been using Eclipse for the last, like, 15 years for writing Java, C++, Python.

49:21 So, Eclipse for Python comes with PyDev, which is the...

49:25 All right.

49:26 You get PyDev and that adds it in, right?

49:27 Yeah.

49:29 But the other way around, I used IntelliJ prior to going to ICRAR.

49:34 So, I just...

49:35 You know, it has a nice look and feel.

49:37 Yeah, it does.

49:38 Yeah.

49:38 It's a pretty easy transition from IntelliJ over to PyCharm.

49:41 All right, you guys.

49:42 That's probably a good place to leave it, or we'll get short on time.

49:46 But thank you for sharing.

49:47 This is super interesting.

49:48 Final call to action.

49:49 People want to learn more about the SKA.

49:52 They want to learn about some of these libraries you're working on, more about radio astronomy.

49:56 What do you tell them?

49:57 There is also a lot of material out there, too.

49:59 If you're interested in the topic, there's tons of material.

50:03 Just go to the SKA telescope organization website, to the ICRAR website.

50:07 I'm sure YouTube will be full of videos as well, to learn about all these different concepts.

50:14 Cool.

50:14 All right.

50:15 Well, thank you both for being here.

50:17 This was a lot of fun.

50:18 And I really enjoyed learning about radio telescopes.

50:22 And I didn't even realize, Rodrigo, that you are the maintainer of iJSON,

50:26 which is a nice little bonus.

50:27 Very cool.

50:28 Yeah.

50:28 Yeah.

50:28 Well, I took over just last year, I believe.

50:32 So I'm not the original creator.

50:34 I just became the maintainer after I started kind of contributing more to it.

50:38 Super.

50:39 Well, thank you, Rodrigo.

50:40 Thank you, Kevin.

50:41 Have a great day.

50:42 Yeah, you too.

50:43 Bye.

50:43 Bye.

50:43 Bye.

50:44 This has been another episode of Talk Python to Me.

50:47 Our guests on this episode have been Rodrigo Tobar and Kevin Vinsen.

50:52 And it's been brought to you by Linode and Clubhouse.

50:55 Start your next Python project on Linode's state-of-the-art cloud service.

50:59 Just visit talkpython.fm/Linode, L-I-N-O-D-E.

51:03 You'll automatically get a $20 credit when you create a new account.

51:07 Clubhouse is a fast and enjoyable project management platform that breaks down silos

51:12 and brings teams together to ship value, not features.

51:15 Fall in love with project planning.

51:16 Visit talkpython.fm/Clubhouse.

51:20 Want to level up your Python?

51:22 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

51:27 Or if you're looking for something more advanced, check out our new async course that digs into

51:32 all the different types of async programming you can do in Python.

51:35 And of course, if you're interested in more than one of these, be sure to check out our

51:39 everything bundle.

51:40 It's like a subscription that never expires.

51:42 Be sure to subscribe to the show.

51:44 Open your favorite podcatcher and search for Python.

51:46 We should be right at the top.

51:47 You can also find the iTunes feed at /itunes, the Google Play feed at /play,

51:52 and the direct RSS feed at /rss on talkpython.fm.

51:57 This is your host, Michael Kennedy.

51:58 Thanks so much for listening.

52:00 I really appreciate it.

52:01 Now get out there and write some Python code.

