
#257: Exploring the galaxy with the fastest supercomputer, Python, and radio astronomy Transcript

Recorded on Friday, Mar 27, 2020.

00:00 KENNEDY: With radio astronomy, we can look across many light years of distance and see incredible details, such as the chemical makeup of a given region.

00:08 KENNEDY: Kevin Vinsen and Rodrigo Tobar from ICRAR are using the world's fastest supercomputer, along with some sweet Python, to process the equivalent of 1,600 hours of standard definition YouTube video per second. This is Talk Python to Me, Episode 257, recorded March 26, 2020.

00:41 KENNEDY: Welcome to Talk Python to Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython.

00:59 KENNEDY: This episode is brought to you by Linode and Clubhouse. Please check out what they're offering during their segments. It really helps support the show.

01:07 KENNEDY: Kevin, Rodrigo, welcome to Talk Python to Me!

01:09 VINSEN: Thanks

01:10 KENNEDY: Great to have you both on this show. This topic is something that I'm both fascinated with and actually not very knowledgeable about. So it's awesome, and it has a really cool bunch of Python going on as well. So I think we're gonna have a lot of fun talking about radio telescopes and just processing ridiculous amounts of data with them.

01:27 TOBAR: Yeah, I think I was talking with Kevin about this the other day — we've kind of lost sight of how big these numbers sound. Because if you're always working within this realm, you don't realize that this is actually pretty big.

01:40 KENNEDY: Yeah, that's a bit of an understatement. So we'll definitely dig into all the data and everything that's going on. It's pretty impressive. It's certainly...

01:49 KENNEDY: well, I'll leave it until we get to it. But it's some crazy, crazy numbers that you all are doing. Before we get into that stuff, though, let's just start briefly with how you each got into programming and Python. Maybe Kevin, you go first.

02:12 VINSEN: I did my degree in physics back in the UK and then sort of drifted around doing various languages. I programmed in C, C++, Prolog, LISP, Smalltalk. And from that, you can tell I'm quite old.

02:15 VINSEN: When I came to join ICRAR in 2009, Python was the de facto language for an awful lot of astronomy. So I learned Python.

02:25 KENNEDY: Yeah. What was that learning-Python experience like?

02:27 VINSEN: Piece of cake, really. Much, much easier than Java and C++; the syntax is just so much cleaner.

02:35 KENNEDY: Yeah, it is. And, you know, it sounds like you have experience with a lot of different languages, right? Like LISP, Smalltalk, C++ — a lot of different examples. And so, you know, you come to Python and you're like, oh, this is a weird language. It doesn't have braces or semicolons; it doesn't really have a lot of those syntactical elements in there.

02:59 VINSEN: Where do I put my semicolons?! I did often want to add semicolons. It took a while.

03:01 KENNEDY: I do find it funny that you can still write them if you just feel the need, you know? Yeah, I'll just put them in at the end. It's gonna be okay.

03:09 VINSEN: I move between IntelliJ and PyCharm, and it tells you, "You don't really need that."

03:15 KENNEDY: If you really need that comfort blanket, I suspect you could turn off that code inspection rule in PyCharm and just put in the semicolons, but you may not be accepted by your fellow Python programmers from then on.

03:29 KENNEDY: All right, all right. Rodrigo, how about you?

03:30 TOBAR: Well, since I was very little I always liked computers, so I decided to go on and study for a computing degree without really knowing exactly what computing was about, like programming and all.

03:42 TOBAR: So I began learning some languages, and I became involved with the observatories. I'm originally from Chile, and in Chile there are many observatories because the conditions are so good for observing. There was this group of students doing collaborations with some of the observatories in Chile, and that's how I got into the business, basically. When I left uni, I moved to the European Southern Observatory headquarters in Germany. I worked over there for a couple of years and then moved here to ICRAR in Australia to continue working. Python in particular, I really started doing down here in Australia. I had done a couple of basic scripts before, but nothing much to it. I really got into it because it's clearly what we were using.

04:31 KENNEDY: Well, yeah, that sounds really fun. And Chile really is one of those places for astronomy, especially radio astronomy, right? Is that where _Contact_ was filmed?

04:42 TOBAR: No, _Contact_ was filmed in the U.S., in New Mexico, close to the southern border.

04:48 KENNEDY: It was set there, though, right? In that general area — it was definitely South America somewhere. Arecibo, maybe? I think so.

04:57 TOBAR: Oh, yeah, yeah, yeah. Puerto Rico. I forgot.

05:03 KENNEDY: Okay. Yeah, okay, Puerto Rico. Yeah. OK, so it's not exactly the same, but there's definitely, with the mountains there, a bunch of observatories, right?

05:11 TOBAR: Yes, yes — it's very big on the optical side as well. So for optical and radio observing, you have different sets of requirements, if you want. For optical, you want super, super clear skies.

05:22 TOBAR: Whereas with radio telescopes, you can have clouds and still observe. And in the north of Chile there is a huge desert, which is very, very dry. That's perfect for optical observing.

05:50 KENNEDY: Right, right. So, yeah, I hadn't really thought about that. Of course, for optical stuff, the higher the better, the clearer the better.

05:58 KENNEDY: But what are the requirements for radio telescopes?

06:05 TOBAR: It depends on the frequency, the radio frequency that you're observing. If you're in the high frequencies, it's basically the amount of water in the air.

05:54 TOBAR: It's called PWV — precipitable water vapor; I forget the exact term, I'm sorry. For the lower frequencies of radio astronomy, water is fine, but we can see interference. Basically, any device out there is emitting radio-frequency waves, so you want very isolated places for that.

06:24 KENNEDY: Okay, I see. As we'll learn, you can measure things like water and stuff very far away with radio telescopes, right? Yes. So I suspect having water in the air is a problem.

06:42 KENNEDY: You don't want that. That's interesting.

06:28 VINSEN: It's like how a microwave oven works, you know — it excites the water. It's the same sort of basic principle,

06:36 VINSEN: and that's why the ALMA telescope, which is in the Atacama Desert that Rodrigo was mentioning, has to be so high.

06:44 VINSEN: So there is no moisture up there. Whereas the stuff we tend to work on, generally,

06:49 VINSEN: can be down at sea level, or a little bit higher.

06:53 KENNEDY: Okay, interesting. And what do you guys do day to day? Are you both doing astronomy, basically, day to day,

07:00 KENNEDY: or code for astronomy?

07:14 VINSEN: Code for astronomy, very much. I mean, most of my work is helping, well, hardcore astronomers do things faster.

07:21 VINSEN: So, for example, a group who were doing some optical work — it was taking them 42 days to do something. They then passed it over to us, and we got it down to 18 hours.

07:20 KENNEDY: That's awesome. That means you can do so much more science, right?

07:24 VINSEN: Yeah, it was a classic divide-and-conquer problem; they parallelized like mad. Although, most of our astronomy tasks are embarrassingly parallel: we scatter, and we don't really do a gather until the very end.

07:39 KENNEDY: I see. So it's almost like

07:42 KENNEDY: you could almost do individual computation on a per-pixel basis, or the equivalent of a per-pixel basis.

07:48 VINSEN: We tend to work in frequency channels more than pixels, but yes. So we would just process one particular frequency, or one band of frequencies, on one machine, another band on another, and another on another.

08:02 VINSEN: The other work I do is quite a bit of machine learning for detecting RFI and gravitational waves, doing corrections. We're actually now moving some of our astronomy work into ocean wave investigations, trying to look at whether we can predict the swell heights — so you know whether it's gonna be a good day to go surfing.

08:23 KENNEDY: right? Okay,

08:25 KENNEDY: Now, that would be a really unexpected consequence — or outcome, or capability — from studying gravitational waves: better surf predictions.

08:35 VINSEN: They're just waves and propagation speeds. One's rather quicker than the other.

08:41 KENNEDY: Yeah, I guess so. Yeah, the whole gravitational wave detection stuff is some pretty cutting-edge science, and it's really interesting. And it's cool that you're using machine learning to try to understand it.

08:51 VINSEN: We have a small group working on it. We've got some data from the detectors. It's a very active area of research; there's a lot of groups around the world working on this.

09:03 KENNEDY: Yeah, I think it's kind of amazing. There's a lot of gravity-oriented stuff in astronomy right now. We have the gravitational wave detections of colliding black holes, we got the first picture of a black hole in the last year and a half or so, whenever that was — a lot going on around there.

09:21 VINSEN: Yeah, and then I guess the other thing is teaching.

09:21 KENNEDY: I guess if you're at a university, eventually you might end up interacting with a student or two. Very cool.

09:28 VINSEN: A necessary evil.

09:33 KENNEDY: All right. Rodrigo, what about you?

09:40 TOBAR: Well, it's kind of similar. I'm a software person who became involved in astronomy. So I basically help astronomers develop software in different languages for different purposes — not only for radio astronomy but also optical astronomy, and also for a theoretical group, so people doing simulations of galaxy formation and such. So, kind of all over the place. Me and the other people in the group, we specialize in this area of helping astronomers with the software side of things.

10:08 KENNEDY: How much do you end up helping them with standard software engineering things? Like, "Hey, I need to teach you source control. This is git and GitHub. Let's spend an hour talking about that." Or are they pretty much good to go?

10:21 TOBAR: Yeah, it depends on the generation, I would say. Older generations are harder to move to that side of things. But newer people, younger people, come with all those concepts already built in, right? When they were born, GitHub was already there,

10:39 TOBAR: so you don't have to push that far. The help is mainly on the software design side of things, you know: how do you structure your software, how do you tackle that particular problem, how do you organize your code, how do you optimize things for your particular architecture, and so on?

10:55 KENNEDY: Okay, cool. And you're also working on the SKA construction — the Square Kilometre Array.

11:02 TOBAR: Yes. Yes, that's a full topic; I guess we'll talk more about it later. But we are one of the main institutions working on the Square Kilometre Array project.

11:12 KENNEDY: Yeah. So it's interesting — I don't know if it works for light, but it does for radio — that if you put multiple detectors sort of densely, but not actually connected as one giant antenna or something, you can put them together like a bigger detector, right? A bigger lens, in the radio world. So that's the idea, right?

11:34 TOBAR: Yes, and that's exactly it. It's called interferometry. Basically, if you've got, say, three antennas — A, B, and C —

11:41 TOBAR: what you do is you take measurements individually from A, from B, and from C, and then you correlate every pair. So you correlate the signals from A and B, from B and C, and from A and C. And you do that with a correlator, which is the machine doing this mixing of signals, and out comes one correlated signal, which is as if you had one big antenna.

12:07 TOBAR: I think that's also what happens on the optical side — I'm not sure, but I think you can do interferometry there too; I'm not sure how the mechanism works in that case.
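
To make the pairwise-correlation idea Rodrigo describes concrete, here is a minimal NumPy sketch. The "antenna signals" are made-up random data, and a real correlator works per frequency channel with proper delay tracking — this only shows the pairing:

```python
# Toy sketch of pairwise correlation between three antennas (invented data).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(42)
n_samples = 4096

# Simulated voltage streams: a common sky signal plus per-antenna noise.
sky = rng.normal(size=n_samples)
signals = {name: sky + 0.5 * rng.normal(size=n_samples) for name in "ABC"}

# Correlate every pair: A-B, A-C, B-C — exactly the pairs mentioned above.
for a, b in combinations("ABC", 2):
    corr = np.mean(signals[a] * signals[b])  # zero-lag cross-correlation
    print(f"{a} x {b}: {corr:.3f}")
```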

12:16 KENNEDY: So this SKA project, the Square Kilometre Array — it's an international project that you all are working on, involving 13 countries that are, I guess, full members of the project, and four others who are just participating, right?

12:32 VINSEN: Yeah, that's right. The square kilometre is the collecting area, because we started to run out of adjectives.

12:39 VINSEN: You know, there's the Very Large Array, the Extremely Large Telescope — where do you go from there? So the name is the collecting area of the final system. Now, we're going to be building this thing in two phases. Phase 1 is gonna be 10% of the final telescope,

13:01 VINSEN: and it is being built in two countries. So the low frequency component is coming to Australia — to Western Australia —

13:09 VINSEN: and the mid frequency is going into the Karoo in South Africa. So there will be 197 15-meter dishes in South Africa and 131,072 antennas in Western Australia. There's a fair bit of kit going out — about 650 million euros just for the first part.

13:16 KENNEDY: Wow — 131,000 antennas bringing in all this data. That is a huge number of antennas, and it's...

13:37 VINSEN: It's a huge amount of data. 550 gigabytes per second.

13:53 KENNEDY: 550 gigabytes a second. I don't really have a great way to understand that number, honestly. Like,

14:02 KENNEDY: you've got to think of, like, large cloud services like YouTube or Netflix or something like that, right?

14:02 VINSEN: It's like 16,000 downloads of standard definition videos from YouTube every second.

14:14 TOBAR: Or, say, if you take your hard drive — your 500 GB hard drive — you'd fill one of those every second.
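
To put those analogies side by side, here is a quick back-of-the-envelope check using only the numbers quoted in the episode; the derived SD bitrate is an inference, not a quoted figure:

```python
# Sanity-check the quoted rates: 550 GB/s vs. "1,600 hours of SD video/second".
data_rate_gb_s = 550        # correlator output, gigabytes per second
youtube_hours_s = 1600      # figure quoted in the episode intro

implied_gb_per_hour = data_rate_gb_s / youtube_hours_s       # ~0.34 GB/hour
implied_mbps = implied_gb_per_hour * 1000 * 8 / 3600         # ~0.76 Mbit/s
print(f"Implied SD bitrate: {implied_mbps:.2f} Mbit/s")      # plausible for SD

# And the hard drive analogy: just over one 500 GB drive filled per second.
print(f"{data_rate_gb_s / 500:.1f} x 500 GB drives per second")
```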

14:25 KENNEDY: Yeah, that's a lot of data. Also, it takes a lot of power.

14:29 VINSEN: Yeah, and that's one of the key things. We would like to be as green as possible, but we've got a power cap at the moment of 5 megawatts, because the most power-hungry system on the planet is 10 MW.

14:43 VINSEN: So that's still a challenge we have to address.

14:47 KENNEDY: Yeah, you almost need your own power plant.

14:50 VINSEN: We can't put wind turbines up there; they'd generate too much radio interference.

14:58 KENNEDY: Okay. Is it the blades that generate the RFI, or is it the generators?

15:05 VINSEN: It's the generators.

14:50 KENNEDY: This portion of Talk Python to Me is brought to you by Linode. Whether you're working on a personal project or managing your enterprise's infrastructure, Linode has the pricing, support, and scale that you need to take your project to the next level. With 11 data centers worldwide, including their newest data center in Sydney, Australia, enterprise-grade hardware, S3-compatible storage, and the next-generation network, Linode delivers the performance that you expect at a price that you don't. Get started on Linode today with a $20 credit, and you get access to native SSD storage, a 40-gigabit network, industry-leading processors, their revamped cloud manager at cloud.linode.com, and root access to your server, along with their newest API and a Python CLI. Just visit talkpython.fm/linode when creating a new Linode account, and you'll automatically get $20 credit for your next project. Oh, and one last thing: they're hiring. Go to linode.com/careers to find out more. Let them know that we sent you.

16:08 KENNEDY: You know, one of the things I think might be fun to talk about a little, before we dive into the Python side of this, is that there's a website for the SKA that has six amazing facts. Maybe I could just throw those out there really quick, and you guys can comment on them.

16:15 VINSEN: One thing is these facts are about the final system. So this is where we are going to get to.

16:23 KENNEDY: Right. Okay. So we're only really working on stage one now, and it's gonna be some time beyond that, right?

16:25 VINSEN: Oh Yeah

16:25 KENNEDY: Awesome. So, the first amazing fact is that the SKA central computer will have the processing power of 100 million PCs. Rodrigo, is that because there's a bunch of GPUs there, or really just a lot of CPUs in there?

16:56 TOBAR: It's both. So the final design of the supercomputer for the SKA is still not fully decided, but it's definitely going to be a mixture of both. All of this is based on the way we calculate how many computations we will need, and therefore that's what gives you this size.

17:13 KENNEDY: Well, okay. The next one is: the dishes will produce 10 times as much data traffic as the global Internet.

17:13 TOBAR: Yes,

17:22 KENNEDY: that's crazy.

17:23 TOBAR: Yeah, I know. That's because you have so many dishes, right? This is before it goes into the correlator, of course. What comes off the correlator is like half a terabyte per second, which is obviously not what the global Internet traffic is.

17:39 TOBAR: But what comes out of the individual antennas? Yeah, definitely — it's bigger than the Internet traffic.

17:44 KENNEDY: There's a bunch of fiber optics that brings it back to these correlators, which then process it and average it.

17:52 KENNEDY: Yeah, crazy. And then, I guess finally, the aperture arrays could produce up to 100 times the global Internet traffic. So yeah, I think this is a pretty interesting one. Honestly, the one that's most exciting to me is the one about detecting an airport radar equivalent on a planet tens of light years away.

18:12 TOBAR: That's the one everyone is waiting for, yeah.

18:14 VINSEN: There are two planets that we know of at the moment that fit within that range and could potentially hold life. I mean, our nearest neighbour, Alpha Centauri — well, Proxima Centauri, actually — has got two planets around it. But it's a red dwarf, which means it's quite bursty, with lots of solar flares. Life as we know it would probably struggle to evolve there. You want a nice star like ours.

18:36 KENNEDY: You want to get, like, a cleansing radiation spray every 10 years or whatever?

18:41 VINSEN: Yeah, that's right. It's no good for you

18:58 KENNEDY: There's probably not enough sunscreen to help you with that one.

19:03 KENNEDY: Yeah, that's one of the sad things for me about all of this space stuff. It's just so big that it's really challenging to actually explore it, interact with it, measure it. And even if you do get measurements back, it's like, "Well, that was 100 years ago," and it would take 100 years to send them a message.

19:24 VINSEN: Well, we've only got to wait four years. Last year, or the year before, I took out a bunch of primary school students. We talked the European Space Agency into lending us their dish, and we sent messages to Alpha Centauri.

19:41 VINSEN: We wait 4.2 years for it to get there, and 4.2 years for them to send a reply back. So, about eight and a half years.

19:29 KENNEDY: Yeah, that's not too bad, actually. Not too bad.

19:30 KENNEDY: Cool. So I guess maybe the next thing that's interesting to dig into, before we get fully into the programming side of things, is just: what kind of questions are you trying to answer? I mean, it's super cool to have this giant radio telescope, with 131,000 antennas together in this giant array, but you get some measurements off of it — then what?

20:14 VINSEN: Well, one thing — we've been joking about it — is the cradle of life: are we alone? One of the things radio telescopes see is molecules in space: water, hydrogen sulphide, ammonia, carbon monoxide. But they can also see things like methanol, one of the molecules of simple sugars, and aminoacetonitrile.

20:38 VINSEN: And if you look up at the night sky, at the constellation Orion, there is a nebula in there that has these chemicals, around the stars in the Trapezium.

20:50 VINSEN: Now, that's three quarters of what we are.

20:56 KENNEDY: Yeah — if those gases and small particles coalesce into planets, those planets are gonna have those things, or the asteroids that crash into the planets will, right?

21:06 VINSEN: Yeah. Other things we're looking for are galaxies — testing cosmological models, origins and evolution — and cosmic magnetism, which we really don't know much about.

21:25 VINSEN: And the epoch of reionization. After the Big Bang, everything was highly ionized gas. Then, after about 300,000 years, it went neutral and dark. Then slowly, galaxies and quasars began to reionize things, so over the first billion years or so galaxies started to appear, until the universe had become transparent again. So we want to go back and have a look at that time.

22:00 VINSEN: And to do that, we need a huge collecting area, because radio photons are two million times weaker than optical photons.

21:09 KENNEDY: Right. So you've gotta have something incredibly sensitive to go far enough back in time. Yeah — and these measurements allow you to see things like hydrogen and water and carbon monoxide and so on; each molecule has its own signature in the radio waves.

21:31 VINSEN: We look at the spectrum and see a peak here, a line there. That one probably means something is being absorbed; this one, something being emitted.

22:40 KENNEDY: All right,

22:41 VINSEN: so it's just spectroscopy, which is used in optical, X-ray, ultraviolet, infrared, radio — we all do it.

22:52 KENNEDY: Okay. Yeah, it's like NMR, far away.

22:58 VINSEN: We look at the redshift to see how far away things are, because the expansion of spacetime stretches the radio waves.

23:06 KENNEDY: Yeah, I guess you gotta compensate as well

23:09 VINSEN: We can then look for that and see how far away things are.
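
As a concrete example of the redshift measurement Kevin describes, here is a tiny sketch for the 21 cm neutral-hydrogen line; the observed frequency below is invented for illustration:

```python
# Redshift of the HI 21 cm line: 1 + z = f_rest / f_observed.
REST_FREQ_MHZ = 1420.40575  # rest frequency of the 21 cm hydrogen line

def redshift(observed_mhz: float) -> float:
    return REST_FREQ_MHZ / observed_mhz - 1

# A line emitted at 1420 MHz but received at 200 MHz has been stretched
# by a factor of ~7.1, i.e. z ~ 6.1 — in the epoch-of-reionization range.
print(f"z = {redshift(200.0):.2f}")
```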

23:13 KENNEDY: Well, it's really amazing that you just send out radio waves and then get all these waves back.

23:17 VINSEN: Oh, we're not sending — we are receiving.

23:18 KENNEDY: Yeah. Yes. Okay, thank you. But you can measure these radio waves, and you can basically see with them, right? It's almost as if you've got an optical telescope, but you're computing a visual representation for humans, right?

23:36 VINSEN: Yeah. Except it takes a lot longer.

22:38 KENNEDY: Speaking of all the computation stuff, let's dig into it. So I know, Rodrigo, you're working a lot on this project, and you guys have got your hands on a pretty serious computer, right?

22:50 TOBAR: Yes. So the work that we did last year was about running simulations of all of this — not only on one or two computers, but at very big scales. Let's get to the biggest scale possible that we could achieve now, and try to come up with what the system will look like in 10 years, when we actually have to run it at that scale.

24:16 TOBAR: So we teamed up with Oak Ridge National Labs in the U.S., and they own at the moment the — not the biggest — the fastest supercomputer in the world. It's called Summit.

24:27 TOBAR: So Summit has over 4,600 nodes, and on each node you find 6 GPUs — 6 V100s — plus something like 160 cores. Each node on its own is a beast, and you've got 4,600 of them, all bridged together.

24:46 TOBAR: We wanted to run a simulation on their computer, but of course they also wanted something back, right? It wasn't a free lunch. We have been collaborating with them for a number of years, and one area of collaboration we have been working on is their ADIOS2 library. I can dig into that in a second, but basically it's an I/O framework for massively distributed programs. So with that in view, we got some time on Summit. We have an ex-PhD student of ours who is working over there right now, so he was our main contact point — his name is Jason Yuan — and we decided to run a couple of different experiments to test all these individual parts. The first experiment was to simulate an actual observation of an SKA-like telescope, and that was basically using the whole machine: we used almost all the nodes, and all the GPUs in all of those nodes, to simulate what the correlator produces when observing. So we're not simulating antennas; we're simulating the output of the correlator. The observation that we decided to simulate is exactly that epoch of reionization, which is one of these big use cases of the SKA. So we simulated the output of the correlator as if it was correlating as many antennas as the SKA. The only aspect that we had to tune down a little bit is the number of frequency channels that we observed in the simulation.

26:22 TOBAR: In the SKA you can observe up to 64,000 channels; we simulated about 28,000-29,000 — basically one channel per GPU.

26:32 KENNEDY: Yeah. So you had 27,000 GPUs running full power for six hours to generate... no, sorry, for three hours, simulating six hours of observation.

26:46 KENNEDY: Yeah, to generate all this data. Talk about that computing, right? This data just comes screaming into the supercomputer, and you have to distribute it out and basically do all this processing. Is it like image processing? Or is it like time-series processing? What are you doing?

26:04 TOBAR: In this first experiment, we first generate the data in the supercomputer, right? So we don't have to bring anything in from any sensors; we just generate it on the GPUs. And then we stream it out of the GPUs into the CPUs on each node and do some data reduction: we basically took data from different channels and averaged it together. We did that at the local node level first — you know, the 6 GPUs we coalesce into a single output signal — and then, every six nodes, another further reduction. This is something that would be similar to what you would be doing at the SKA.

26:47 TOBAR: And that basically produces something... well, I guess there are scientific reasons for when you do and don't want to do this kind of averaging, but for the epoch of reionization this is certainly something that you would want to do.

26:59 TOBAR: So we did this two-step averaging, and then we wrote the data immediately to disk. That was the first experiment, run on its own — there was no further computation in the first experiment.
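
The two-step averaging Rodrigo describes might look roughly like this toy NumPy sketch; the shapes and group sizes here are illustrative, not the real pipeline:

```python
# Toy two-step reduction: average across a node's 6 GPUs, then across
# groups of 6 nodes. All shapes are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, gpus_per_node, samples = 36, 6, 1024

# One block of channel data per GPU on every node.
per_gpu = rng.normal(size=(n_nodes, gpus_per_node, samples))

# Step 1: coalesce the 6 GPU streams within each node.
per_node = per_gpu.mean(axis=1)                                    # (36, 1024)

# Step 2: a further reduction across every group of 6 nodes.
grouped = per_node.reshape(n_nodes // 6, 6, samples).mean(axis=1)  # (6, 1024)

print(per_gpu.shape, "->", per_node.shape, "->", grouped.shape)
```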

27:12 KENNEDY: Yeah. So one of the things that's interesting is you guys are getting so much data, you can't write the raw data to disk.

27:18 KENNEDY: You've kind of got to process it, filter it down, and do this averaging; then you can finally save that, which is probably still a lot of data.

27:25 TOBAR: Yeah. For example, the data that was generated off the GPUs during those three hours was about 2.6 petabytes, and what we ended up writing was about 110 terabytes.

27:42 KENNEDY: And that was three hours?

27:43 TOBAR: As I was saying, we decided to average in this particular case, but on the real thing you may actually want to write all the data to disk, and that's why we did some other experiments in that direction.

27:55 KENNEDY: So what I'm visualizing is: you're saving so much data... You know, if you have a power plant that runs on coal, there's a giant train that brings in coal every day, just continuously going. I can almost imagine you constantly shipping in hard drives and plugging them in. Like, how do you deal with that? A truck of hard drives is here today — quick, plug it in, we're getting full!

28:17 TOBAR: At the SKA there will be a double buffer, basically. So as you observe, you fill one of your buffers with all the incoming data.

28:28 TOBAR: Once your observation finishes, you swap the buffers. The next observation can fill the other buffer while you process the first, and during this later processing you again reduce the amount of data by orders of magnitude. So, I was talking before about the first experiment; we did a second and a third. In the second experiment, we took the output of the first and effectively reduced it even further. The output of the first, which is basically this reduction of data from the correlator, gives you what we call in radio astronomy "visibilities". You don't observe pixels or images directly when you observe — you observe visibilities, and later on you have to image them; you've got to create an image from them. That takes much longer — as Kevin was saying, it's a much more complicated process — so that's why you can do it a bit more offline. And we did that during the second experiment: we took all the 110 terabytes of visibilities and created images for each of the channels. If you have an image for each of the channels, you have an image cube, as we call it in radio astronomy. Or, if you want, you can play it as a movie as you go across the different channels in that image cube. It turned out to be, like, 3.3 gigabytes or something — again, a massive reduction of data.
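
For a rough idea of the visibilities-to-image step, here is a heavily simplified sketch: grid the (u, v) samples, then inverse-FFT to get a "dirty" image. Real imaging adds convolutional gridding, weighting, and deconvolution, and all the values here are invented:

```python
# Simplified imaging: nearest-cell gridding of visibilities + inverse FFT.
import numpy as np

rng = np.random.default_rng(1)
n_vis, grid_size = 5000, 256

# Fake visibility samples: (u, v) coordinates plus a complex measurement.
u = rng.integers(-grid_size // 2, grid_size // 2, n_vis)
v = rng.integers(-grid_size // 2, grid_size // 2, n_vis)
vis = rng.normal(size=n_vis) + 1j * rng.normal(size=n_vis)

# Accumulate the samples onto a uv-plane grid.
uv_grid = np.zeros((grid_size, grid_size), dtype=complex)
np.add.at(uv_grid, (u + grid_size // 2, v + grid_size // 2), vis)

# The inverse Fourier transform gives the dirty image; one image per
# frequency channel, stacked together, is the "image cube" described above.
dirty = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(uv_grid))).real
print(dirty.shape)
```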

29:47 KENNEDY: Yeah, that starts to get to a level where you can actually write it down.

29:53 TOBAR: Yeah, exactly. And you can also distribute that across the world more easily, through the Internet. And again, in the SKA there will be something like that: there will be the main computer that does the main reductions, and after the main reductions are done, the data is sent over to what are called the SKA Regional Centres, and that's where the final science will be done.

31:14 KENNEDY: Yeah, so it sounds a little bit like the Large Hadron Collider, which does a ton of computation and filtering and averaging and whatnot of the data, but then streams a bunch out to — probably places like Oak Ridge and others — where it gets further and further processed. It sounds like you might be doing something similar here in the end.

31:34 TOBAR: Exactly. It's a lot of reducing the size of the data, depending on your science use case, and then distributing that.

31:42 KENNEDY: So let me ask you something. You're running the simulation on Summit at Oak Ridge, which is the fastest supercomputer in the world, or nearly so. What are you gonna do in the real world? Are you gonna build one of the largest supercomputers in Australia, and then another one in South Africa? Is that pretty much what you have to do?

31:58 TOBAR: Well, by that time it won't be the fastest, right? That's just compared to current machines. But yes, we'll do it.

32:07 KENNEDY: We'll do it on our iPhones by then! No — what's the plan for dealing with this? Because it sounds like you've got to move a serious bit of compute next to this system.

32:19 TOBAR: Yeah. So the plan is to effectively run something on the order of — if I'm not mistaken — 100 petaflops. I think it's petaflops. That's the size of the supercomputer that should be built, which is kind of comparable to what Summit is doing now.

32:34 KENNEDY: What's the time frame for this?

32:36 TOBAR: About 10 years, it will take.

32:40 KENNEDY: This portion of Talk Python to Me is sponsored by Clubhouse. Clubhouse is a fast and enjoyable project management platform that breaks down silos and brings teams together to ship value, not features. Great teams choose Clubhouse because they get flexible workflows, where they can easily customize workflow states for teams or projects of any size; advanced filtering, quickly filtering by project or team to see how everything is progressing; and effective sprint planning, setting their weekly priorities with iterations and then letting Clubhouse run the schedule. All of the core features are completely free for teams with up to 10 users, and as Talk Python listeners, you'll get two free months on any paid plan with unlimited users and access to the premium features. Go get started today: just click the Clubhouse link in your podcast player's show notes or on the episode page.

33:29 KENNEDY: It'd be fun to dig into some of the Python code and some of the architecture that you guys had to put in place to make this happen. A lot of these types of systems have a lot of C++ in place, but they also have a lot of interesting Python going on — I'm sure on the data science and visualization side, but also maybe more in the core of the system.

33:49 TOBAR: Yeah. First, I should probably mention the execution framework that we use for this. So instead of running things with MPI, we have been using an execution framework that we developed at ICRAR, our institute. It's called DALiuGE — kind of difficult to pronounce, even more difficult to write, but I'll give you a link to it. The idea of this execution framework is a bit like Dask, which people are more familiar with, in the sense that you build a graph with your computations and then you execute that graph on a number of workers. Now, the big difference between Dask and DALiuGE, our execution framework, is that Dask is very dynamic in its nature: you can bring workers up and down, and the scheduler will dynamically adjust the load to what is available. Whereas DALiuGE is designed more with the SKA case in mind in particular, though it's still pretty generic.

34:43 TOBAR: But one of the main design decisions was to work with static deployments. So instead of trying to be dynamic in nature — trying to, you know, move data from here to there and restart computations elsewhere and whatever — we try to be very static, because moving data from one place to another is a very expensive operation.
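
DALiuGE has its own graph format and managers, so as a rough analogy only, here is the build-a-graph-then-execute idea expressed in Dask (pip install dask), which Rodrigo compares it to; the functions are invented stand-ins:

```python
# Build a whole computation graph up front, then execute it on workers.
from dask import delayed

@delayed
def channel_data(channel: int) -> list:
    return [float(channel)] * 4  # stand-in for one frequency channel's data

@delayed
def average(block: list) -> float:
    return sum(block) / len(block)

@delayed
def gather(results: list) -> float:
    return sum(results)  # the "gather at the very end" of a parallel job

graph = gather([average(channel_data(ch)) for ch in range(8)])
print(graph.compute())  # executes the graph -> 28.0
```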

35:02 KENNEDY: If the compute is what's expensive and the data is not that bad, moving things around to balance the compute, and making that happen, is really important. But when you have so much data that it's more than the Internet...

35:18 TOBAR: ...you're not moving it around the Internet.

35:21 KENNEDY: You're already probably near a limit just moving it around to get it somewhere. So arbitrarily passing it around is not really what you're after. Yeah.

35:31 TOBAR: So instead of focusing on the dynamism that gives you, we focus on having a very good schedule up front for your computations: you know exactly how long each one is going to take, you know exactly how much data is going to be where, and you keep that layout. That's the main difference. We had been developing this in prototype fashion, but we now use it in the real world as well — we just used it for this Summit demonstration. So all the things that we ran in this big simulation — all the processes across all these 4,600 nodes — were using our execution framework. Just to give you a very quick overview: the execution framework uses 0MQ to send messages between the different entities and managers — across node managers — so we send events between the different node managers. And we use 0RPC, which is an RPC framework built on top of 0MQ.

36:29 TOBAR: We use that to do a couple of remote calls between different node managers. All the scheduling of the graph is done using Python; there is some interfacing with the METIS library, which is written in C — there's a Python wrapper for that already. But the rest is all 0MQ.
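
A minimal sketch of that kind of event passing with pyzmq, the Python 0MQ bindings (pip install pyzmq); the addresses and event payload are invented, not DALiuGE's actual protocol:

```python
# One "node manager" publishes events; another subscribes to them.
import time
import zmq

ctx = zmq.Context()

pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")

sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "")  # no topic filter: receive everything

time.sleep(0.2)  # PUB/SUB "slow joiner": let the subscription propagate

pub.send_json({"event": "drop_completed", "node": "nm-01"})
print(sub.recv_json())  # -> {'event': 'drop_completed', 'node': 'nm-01'}
```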

36:47 KENNEDY: 0MQ looks really interesting, and I haven't done anything with it, but it has a nice Python library. Yeah — ZeroMQ.

36:56 KENNEDY: That's right. Yeah, it seems like something that would be really useful if you're sending a lot of messages around and whatnot. And then something I had not even heard of, which you brought up, is 0RPC. So it basically sends messages out over 0MQ and then waits for a response, or something like that, to come back as another message?

37:18 TOBAR: Exactly. It's what you would expect from an RPC framework, right? You can get a reference to a remote object, invoke methods, get replies, pass parameters — and all of that travels through 0MQ.

37:34 TOBAR: I think it's using MessagePack for the serialization and then 0MQ for the actual networking,

37:42 TOBAR: and on top of that, I think 0RPC has bindings for different languages, so you can also do inter-language RPC with 0RPC.
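
Here is roughly what that looks like with zerorpc (pip install zerorpc); the service, method, and port are made up for illustration:

```python
# Expose an object's methods over 0MQ and call them remotely.
import zerorpc

class NodeManager:
    def status(self) -> str:
        return "ok"  # pretend per-node health info

def serve() -> None:
    server = zerorpc.Server(NodeManager())
    server.bind("tcp://0.0.0.0:4242")
    server.run()  # blocks; run this in its own process

def call() -> None:
    client = zerorpc.Client()
    client.connect("tcp://127.0.0.1:4242")
    print(client.status())  # "ok", marshalled via MessagePack over 0MQ
```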

37:54 KENNEDY: Okay, that looks really interesting to me. One of the big challenges of programming this system, it sounds like, is that it's just so big and so distributed that you really need these layers. You talk to 0RPC, it talks to 0MQ, which then might talk to this distributed scheduling service that figures out how to actually run things, right? There's just layer after layer. Is that a big challenge?

38:21 TOBAR: Yeah, yeah. You've got to try to keep, as you were saying, all your layers as clean as possible. Before we even settled on 0RPC, we tried other RPC frameworks, like Pyro4 and some others. So on top of, you know, layer upon layer, we had to support different ones at the same time. I think that support is still there and you can turn it on; it's just not really what we use — we use 0RPC.

38:46 KENNEDY: Yeah, sure. Another interesting library that you guys use — and actually maintain — is iJSON.

38:53 TOBAR: Yes, we use and maintain iJSON. So iJSON, briefly described, is a way to iterate over very long JSON streams of data without having a big memory consumption. You parse the JSON iteratively, and you get parsing events, or — there are different levels — you can have full objects. You can kind of query what kind of objects you want to get out of the iteration, so you get, like, an iterator. We're working on version 3.0, and in that one you get async iterables as well.
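
A small example of the streaming style Rodrigo describes (pip install ijson); the document layout here is invented:

```python
# Stream items out of a large JSON document without loading it all at once.
import io
import ijson

# Imagine a multi-gigabyte file on disk instead of this in-memory blob.
big_json = io.BytesIO(b'{"nodes": [{"id": 1}, {"id": 2}, {"id": 3}]}')

# Lazily yield each object under the "nodes" array, one at a time.
for node in ijson.items(big_json, "nodes.item"):
    print(node["id"])
```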

39:35 TOBAR: And we got into this because we were, again, dealing with very big computational graphs, which we express as JSON content. That's how we move them from one side to another.

39:46 KENNEDY: Yeah, and the last thing you want to do is load gigabytes of JSON and deserialize all of it.

39:52 KENNEDY: I just want these sub-items here, and the rest I don't care about. Or maybe I just want the first one — like, the first would be fine; even if there's 10 million, just give me the first.

40:03 KENNEDY: Yeah, iJSON is super cool, and I think there's probably a lot of people who could take that and use it. You know, I just had someone in the office hours for my online courses saying, "I'm working with a huge amount of data" — something like Google BigQuery or something like that — and it was too hard to load all of it at once, so they had to take little bits and load them. And I said, well, have you thought about iJSON? That'd be really cool, and you could process it like that. And they were like, yeah!

40:31 KENNEDY: You know, the other thing — I kind of touched on it just a minute ago, but maybe you could speak a little more to it. Something that amazes me in general, but at this scale I think is even more interesting, is how often we develop code in the small and then deploy it to somewhere in the cloud — a much, much bigger, more complex system than maybe we're used to working on. So, you know, the example that comes to mind for me is:

40:59 KENNEDY: there's some developer at a coffee shop working on a MacBook Air — so completely underpowered, right? And they're running a single instance of, like, a dev server, and then they push something to GitHub. It automatically gets picked up by CI/CD, pushed over, and it kicks off, you know, a whole new version of some giant app running in a Kubernetes cluster across who knows — 10 servers and a bunch of nodes and pods and so on. That mismatch of "I work on this little tiny thing" and then it's scaled out to this huge system — for you guys, how does that work, right? How do you debug this thing? How do you reason about the little algorithms that are running? Can you set a breakpoint

41:43 KENNEDY: or anything, right? Is that a thing you can do? Or is that just literally too much — is it impossible? Do you have a Kubernetes cluster locally that allows you to kind of simulate this? Or do you have to program on the giant thing?

41:57 TOBAR: No, we don't need to program on the giant thing. So for all the Summit experiments we, as you were saying, started small on our own laptops and computers, and on different platforms, too — I usually run Linux, and most people in our team use Macs.

42:16 TOBAR: Yeah, we started very small, and for that you've got to make sure that at the very small scale you know exactly what's going on — you know that everything is working as you expect, so you don't have unexpected errors and unexpected troubles in the future. For example, when we were developing DALiuGE, we had very good test coverage across the code base, and we run all the tests without an internet connection, on a single node, because at that scale you have to make sure everything's working fine. From there you can start escalating to more complex things; you have to have a very solid foundation. That goes for the fundamentals of DALiuGE, but also for the elements of the code that we used in this Summit demonstration. So beyond our laptops, we then went onto, you know, a server that has one GPU, then a small cluster with two or three nodes and a couple of nodes with two GPUs each,

43:09 TOBAR: and the Summit system, you know, is a POWER9 system — it's not an Intel system. So before jumping onto Summit, we jumped onto a cloud provider in the U.S. that offers POWER9 machines. We made sure that everything worked there on a single node, and little by little you start tackling problems as they come, before using the big machine.

43:30 KENNEDY: Yeah. And I guess for you guys there's another level of challenge, where the machine itself — you can't just go to it. It's not like I can go to the cloud right now and ask for a Kubernetes cluster if I'm willing to pay for a little bit of it, whenever I want. I suspect this large computer is pretty much booked out, and you can't just get it whenever you want and make it go full power.

43:55 TOBAR: Yeah. Well, first of all, there's a kind of paperwork involved to get permission. They have to send you a physical key — they sent a physical key from the U.S. to Australia that I have to use when I log into the computer. So yeah, it's not your everyday AWS instance, right? And once that's all set up, the systems work with queues: you submit your jobs into a queue, and then the queue scheduler decides what runs.

44:23 TOBAR: So yeah, you're competing with a lot of people, and depending on how many resources you're asking for, you will be delayed or not. At Summit itself we also started scaling little by little — side experiments with six nodes, 10 nodes, 60 nodes — and little by little we started finding more and more problems. You know, things that you never really think about, or very, very transient errors that only happen when you are spawning thousands of processes and one of them fails. You didn't see it before because you weren't spawning that many processes before.

44:55 KENNEDY: Yeah, those are tricky to catch. You don't have any control at that point — when you're in control, it's up and running and everything's fine.

45:04 TOBAR: Yeah. And once you find those errors, you have to start reasoning about them, and that's very difficult, right? You can't just go and attach yourself to thousands of processes at the same time and step through them. You have to log a lot of stuff and then reason very clearly about what the possible cause could be. So we caught a lot of those. And then the final bit is stress on the file system. All of these clusters usually have a central file system that is shared across the cluster,

45:42 TOBAR: and obviously when you use more nodes, you're putting more stress on the central file system when reading and writing data.

45:53 KENNEDY: Yeah.

45:54 KENNEDY: So what was it like when you basically hit enter to submit the job for the 27,000 GPUs? Were you like, "We'd better get it right"?

46:03 TOBAR: Yeah, it really was a little bit of "we'd better get it right." We all assembled in a room to watch it run.

46:09 TOBAR: We basically had one shot at doing the full simulation, because we were given a time allocation of 20,000 node-hours.

46:18 TOBAR: We knew that in the big experiment we were going to be using 15,000. So it was either going to work or not, right? But it did.

46:29 KENNEDY: So all the gradual scaling up, all the testing — it all paid off. Yeah. Did you get, like, a weird news report in Tennessee that it was suddenly a little hotter that day than expected?

46:44 TOBAR: I wouldn't be surprised

46:48 KENNEDY: A warm breeze coming through.

46:51 KENNEDY: Yeah, that thing must have been really screaming. That's quite something. All right, so I guess we should probably wrap it up in terms of time, but this is super interesting. Maybe you guys could just tell us some of the lessons learned — you touched on them a little bit, right, this "scale it up a bit at a time" — but what are some of the lessons you learned?

47:07 TOBAR: Yeah, for me, I think it was the process, mostly. Maybe I also learned a particular thing here and there, but it was more that process of scaling through this whole exercise that was really challenging. It really stressed the importance of having very solid foundations before you take the next step, because otherwise, if you're taking steps in the dark, you will keep hitting walls.

47:33 KENNEDY: Kevin, what about you?

47:35 VINSEN: I want to reiterate what Rodrigo said. I mean, in a lot of the early work with DALiuGE, I was the test monkey. There's a big project called CHILES, which is a very deep observation using a telescope in Socorro. It took four years, and we did the same thing — slowly building up. It was a different software stack, and we didn't have a problem with the GPUs there. We would start with two or three nodes, power through the work — are we getting what we want? — and then wind it up. Of course, it doesn't match much now, compared to Summit;

48:14 VINSEN: we just ran on about 200 nodes.

48:17 KENNEDY: I guess let's wrap it up with the final two questions for you guys — and you're welcome to throw out some of your own that you're maintaining, or pick a different one. How about a notable PyPI package? We'll start with that one.

48:32 VINSEN: NumPy.

48:33 KENNEDY: NumPy, right. Awesome. Yeah, I'm sure that's a foundation of a huge amount of the work that you're all doing.

48:38 TOBAR: Yeah, and I think I will do some more stressing of iJSON, just because there is this new version coming. So yeah, go check it out.

48:49 KENNEDY: Yeah, it looks really, really useful whenever you have a ton of very large JSON, or you need to not load it all into memory.

48:56 VINSEN: Well, one I picked up from your other podcast is Typer, which is beautiful for writing command line interfaces.

49:02 KENNEDY: Oh, yeah, Typer is great. I think we covered that on Python Bytes. That's right. Yeah.
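
For anyone curious, a tiny Typer sketch (pip install typer); the command and options are invented:

```python
# Build a CLI from a plain function with type hints.
import typer

app = typer.Typer()

@app.command()
def observe(channels: int = 64, verbose: bool = False) -> None:
    """Pretend to schedule an observation over some frequency channels."""
    if verbose:
        typer.echo(f"Scheduling {channels} channels...")
    typer.echo("done")

if __name__ == "__main__":
    app()  # e.g.: python observe.py --channels 128 --verbose
```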

49:06 VINSEN: Thank you for that one.

49:08 KENNEDY: Yeah, you're welcome. Super. All right, now, when you all are writing some Python code — or really any code — what editor are you using?

49:16 VINSEN: PyCharm.

49:16 TOBAR: I use Eclipse. I've been using Eclipse for the last 15 years for writing Java, C++, Python. So for Python, Eclipse comes with PyDev, and that's what I use.

49:30 VINSEN: I'm the other way around. I started on IntelliJ and then transitioned into PyCharm.

49:39 KENNEDY: Yeah, it's a pretty easy transition from IntelliJ to PyCharm. All right, you guys, that's probably a good place to leave it — we're getting short on time — but thank you for sharing. This is super interesting. Final call to action: people want to learn more about the SKA, about some of these libraries you're working on, more about radio astronomy — what do you tell them?

49:57 TOBAR: There's a lot of material out there if you're interested in the topic — tons of material. Just go to skatelescope.org or icrar.org, and I'm sure YouTube is full of videos as well, to learn about all of this.

50:14 KENNEDY: Cool. All right, well, thank you both for being here. This was a lot of fun, and I really enjoyed learning about radio telescopes. And I didn't even realize, Rodrigo, that you were the maintainer of iJSON, which is a nice little bonus. Very cool.

50:28 TOBAR: Yeah, well, I took over just last year, I believe. So I'm not the original creator; I'm just the maintainer. I started by kind of contributing more and more to it.

50:39 KENNEDY: Super. Well, thank you, Rodrigo. Thank you, Kevin. Have a great day.

50:45 KENNEDY: This has been another episode of Talk Python to Me. Our guests on this episode have been Rodrigo Tobar and Kevin Vinsen, and it's been brought to you by Linode and Clubhouse.

50:56 KENNEDY: Start your next Python project on Linode's state-of-the-art cloud service. Just visit talkpython.fm/linode — L-I-N-O-D-E. You'll automatically get a $20 credit when you create a new account.

51:08 KENNEDY: Clubhouse is a fast and enjoyable project management platform that breaks down silos and brings teams together to ship value, not features. Fall in love with project planning. Visit talkpython.fm/clubhouse

51:21 KENNEDY: Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course. Or, if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python. And, of course, if you're interested in more than one of these, be sure to check out our Everything Bundle — it's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code!
