Automating AI Cluster Network Design With Apstra and Terraform
Putting cloud, automation, and AI together to work for you.
Watch this hands-on demo to see how to use terraform apply to import many AI cluster-design examples into Juniper Apstra™ intent-based networking software.
You’ll learn
About design types, such as sizes of clusters, GPU compute fabrics for model training, storage fabrics, and management fabrics
How the logical rack types follow NVIDIA best practices for rail-optimized design
Transcript
0:03 You've heard about cloud, you've heard
0:05 about automation, and you've definitely
0:06 heard a lot about AI recently. We're
0:08 going to be putting all of those things
0:10 together today in this quick demo. I'm
0:11 James Kelly from Juniper Networks, and I'm
0:13 going to be talking about setting up and
0:15 automating the design of AI clusters
0:18 with Juniper Apstra and the new
0:20 Terraform provider for Apstra. Let's get
0:23 into it.
0:26 All right, let's start in our browser by
0:28 looking at this repository on GitHub.
0:30 This is the terraform-apstra-examples
0:33 repository under the Juniper
0:35 organization. This is accessible and open
0:37 to everyone.
0:39 Here you'll find many examples, and a
0:42 growing list of them as well, of things
0:44 that you can automate inside of Juniper
0:46 Apstra, which is, I should
0:48 probably explain, Juniper Networks'
0:51 intent-based management tool for data
0:54 center fabrics. It's also multi-vendor
0:57 and very well known for that.
0:59 Terraform is obviously an
1:02 infrastructure-as-code automation tool,
1:04 and the Terraform provider for Apstra
1:06 allows you to drive Apstra, and hence
1:07 your data centers, through Terraform. And
1:10 one of these kinds of data centers that
1:12 I mentioned in the opening is AI
1:15 clusters, and with AI clusters there are some
1:17 differences in the fabric design. Of
1:20 course, Apstra lets you customize fabric
1:22 design, but rather than a
1:25 customer, say, starting out with having to
1:27 design their data centers, we have many
1:29 examples that we put together for
1:31 different sizes of AI clusters that you
1:33 can just apply to Apstra in a matter of
1:35 a few seconds with Terraform, and that's
1:38 what I'm going to be showing you
1:39 today.
1:40 So one of the subfolders of this
1:43 repository is AI clusters, and a few
1:47 people, including myself, have put
1:48 together this automation to do these
1:51 examples and make, as I said, various
1:54 sizes of clusters easily creatable
1:56 inside of Apstra in terms of the
1:58 design. So you'll find a whole bunch of
2:00 things here in terms of all of the steps
2:02 that you would need. One of the very
2:04 first things you'd want to do is of
2:06 course install Terraform, if you don't
2:08 have that, say, on your laptop. There are
2:10 ways of using Terraform Cloud that are
2:12 explained in some of the other examples;
2:13 I'm not going to show that today.
2:15 Beyond that, you also need an
2:18 instance of Apstra. Now, you might already
2:19 have Apstra running in your data center,
2:21 but rather than having to
2:24 install or set up an instance of Apstra,
2:26 one of the easy ways you can access
2:28 Apstra is through Apstra Cloud Labs.
2:30 You can easily start a topology here for
2:32 free; this is open and accessible to
2:34 everyone. When you do that, you can pick
2:36 an expiration time, and after a few
2:40 minutes it will spin up an instance of
2:43 Apstra and a topology. Now, in my specific
2:46 sandbox in Apstra Cloud Labs, I elected
2:49 to just do an Apstra-only instance, so
2:52 there are no actual physical devices. Since
2:53 I'm just going to be showing the
2:55 automation of the logical design today, I
2:57 don't need any physical boxes, and you
2:59 can see that there's a simple button
3:01 here, open a new tab, that'll allow you to
3:04 log in.
3:05 So I'm going to log in with the password
3:08 that it provided here,
3:09 and the username admin, of course.
3:13 Now, this is a fresh instance of Apstra
3:15 that I really haven't done anything to,
3:17 and the welcome screen here actually
3:21 talks about building racks and designing
3:23 the networks and then creating and
3:25 deploying a blueprint, and this order is
3:27 relevant, because this is the order in which
3:28 you would typically design and then
3:31 deploy a data center blueprint.
3:33 As I mentioned, we're going to be
3:35 automating the design of AI clusters, and
3:38 that turns out to be all about
3:41 the logical devices, the racks,
3:44 and the templates; and then of course the
3:45 template is used to deploy the blueprint,
3:48 and you can stamp those out again and
3:50 again. So, as I mentioned, the design phase
3:53 is certainly customizable, but rather
3:56 than having you point and click
3:57 your way around it, you can automate
3:59 things with Terraform, and we've
4:00 automated all of these examples for you.
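To make that flow concrete in HCL: the deploy step at the end of the design pipeline can be driven by a blueprint resource. This is a minimal, hypothetical sketch assuming the apstra_datacenter_blueprint resource from the provider documentation; ai_cluster refers to a rack-based template resource like the one sketched later in this transcript.

```hcl
# Deploy (stamp out) a blueprint from a previously created template.
# Minimal sketch -- the template reference is a placeholder.
resource "apstra_datacenter_blueprint" "ai_cluster_1" {
  name        = "ai-cluster-1"
  template_id = apstra_template_rack_based.ai_cluster.id
}
```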
4:02 So let's have a look at that. Let me just
4:05 go into racks first of all and assure
4:07 you that this is a standard,
4:09 out-of-the-box Apstra instance. There's nothing
4:11 in here, nor in any of the templates, that
4:15 doesn't come out of the box with
4:16 Apstra. What we're going to see is that
4:18 there's a whole bunch of templates
4:20 related to AI cluster networks that are
4:22 in here, and we'll talk about some of the
4:25 nuances and differences in that design
4:28 and how that is important to AI use
4:31 cases such as model training. So now that
4:33 we've got this up and running, one of the
4:35 things that we need to do is actually
4:37 get this example Terraform HCL
4:40 configuration onto the laptop. So I'm
4:45 going to use GitHub Desktop, because it's
4:47 a nice visual tool and makes sense to demo
4:49 from. I've already logged into my
4:52 GitHub Desktop instance, and if I just
4:55 start typing some of this stuff, you can
4:57 see I can easily come and choose to
4:59 clone the repository that I was just
5:01 showing you in the browser.
5:03 So now all of that stuff is downloaded
5:05 to my laptop. I happen to have Visual
5:08 Studio Code installed, and this handy
5:10 button will open it up, just like that.
5:12 I'm going to just make this instance of
5:15 Visual Studio Code a little bit bigger
5:17 to match my video,
5:19 and you've got all of the examples. We're
5:22 only going to need to look at the AI
5:24 clusters bit of this. Now, I'm not going
5:27 to go into all of the HCL configuration;
5:29 the one thing you do need to see, and
5:32 actually change, is the place that you're
5:36 going to point Terraform to. So we need
5:38 to change the username and password to
5:41 match what we got from Apstra Cloud Labs,
5:43 and we need to change this Apstra URL as
5:46 well. Let's go back to our instance of
5:48 Apstra here, copy the IP address and port
5:51 number,
5:52 then go back into Visual Studio Code
5:54 and replace that Apstra URL placeholder
5:57 with this.
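For reference, the provider block being edited here looks roughly like the following. This is a minimal sketch, assuming the documented configuration of the Juniper/apstra provider; the address and credentials are placeholders, and the TLS flag reflects the self-signed certificate a lab instance typically presents.

```hcl
# provider.tf (sketch) -- point Terraform at your Apstra instance.
provider "apstra" {
  # Credentials can be embedded in the URL; the address and password
  # here are placeholders for the values from Apstra Cloud Labs.
  url = "https://admin:YourPassword@203.0.113.10:443"

  # Lab instances typically present a self-signed certificate.
  tls_validation_disabled = true
}
```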
5:58 Okay, so from there I'm just going to
6:00 save the file
6:02 and
6:04 close this down. One of the things that
6:07 you can do in Visual Studio Code is open
6:09 up a terminal.
6:11 Let's go into the AI cluster subfolder.
6:14 From here I'm going to do a terraform
6:16 init; that'll just make sure that I have
6:19 the most recent version of the Terraform
6:22 provider for Apstra.
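For context, terraform init knows which provider to download from the required_providers declaration in the example HCL. A typical declaration for this provider looks like the sketch below; the version constraint is illustrative, not the exact pin used in the repository.

```hcl
terraform {
  required_providers {
    apstra = {
      source  = "Juniper/apstra" # the Terraform provider for Apstra
      version = ">= 0.9.0"       # illustrative constraint
    }
  }
}
```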
6:24 After that, if you would like to, you can
6:27 do a terraform plan; it's an optional
6:30 step before you apply.
6:33 And it looks like I have an error in my
6:36 provider file.
6:38 Right, so one of the things that I happen
6:40 to have in my password that I need to
6:42 change is that little
6:45 symbol there; it has to be URL-encoded.
6:48 I'm going to go back into here and just
6:50 resave the file.
6:52 Didn't expect that, but that's a live
6:55 demo for you.
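To illustrate the fix (the actual character in the demo isn't shown on screen), any URL-reserved character in an embedded password has to be percent-encoded. A hypothetical password of p@ss#1 would appear in the provider URL as:

```hcl
url = "https://admin:p%40ss%231@203.0.113.10:443" # "@" becomes %40, "#" becomes %23
```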
6:59 Okay, after that it's happy, and it says
7:02 that it found 53 resources to add, nothing
7:03 to change, nothing to destroy; of course,
7:05 because we haven't applied anything yet. So
7:07 that all looks good.
7:09 Now let's do a terraform apply, and while
7:13 we're doing this, what I wanted to show
7:15 is how fast things happen in Apstra at
7:18 the same time. So I've got these
7:19 templates up in the back here, right? So
7:22 watch those templates all change, as many
7:25 of the templates are now put into
7:27 Apstra with this terraform apply. I just
7:29 have to answer yes, I'm ready to go, and
7:32 boom, you can see all that happening,
7:34 and then in the browser over here you
7:37 can see how quickly all of these things
7:39 appeared. Now, what's pretty neat about
7:41 all of these examples is that there are,
7:44 for example, different sizes of GPU clusters, some
7:47 large ones; we've got certainly quite a
7:49 bit here in terms of 256 DGX. Those are
7:53 NVIDIA servers with the H100 GPUs or the
7:56 A100 GPUs. What you'll see in these
7:59 examples, and the ones like them at smaller
8:01 sizes,
8:03 like that guy and that guy and that guy,
8:05 is that these are the back-end training fabrics
8:08 that are used for model training. And
8:11 when you're running an Ethernet training
8:13 fabric, you've got RDMA over Converged
8:15 Ethernet, RoCE, which we'll call "rocky"
8:18 for short. And that RoCE fabric has a very
8:21 special design,
8:23 recommended by NVIDIA, to drive the
8:26 maximum performance and job completion
8:28 time,
8:29 and networking performance, of course,
8:31 drives job completion time in
8:34 your network. This special design is
8:36 called a rail-optimized design, where
8:39 rail-local traffic can go over fewer
8:42 hops, because these 64 DGX servers
8:46 actually have an internal switch
8:49 between the GPUs. So you don't need to go,
8:51 for example, from GPU 1 to GPU
8:54 2 within the same server across the
8:56 top-of-rack leaf switch; it's able to do that
8:59 just inside of the server. And there's a
9:03 special feature of the rail-optimized
9:05 design in the newer versions of NVIDIA
9:07 NCCL; this feature is called PXN, and it
9:10 allows even further rail-local
9:12 optimizations, such as when two different
9:15 servers are talking to each other:
9:17 you'll have traffic that doesn't have to
9:20 go over the spine network. We can
9:23 explain that in just a second, but I
9:25 wanted to go into one of these
9:27 clusters and show you what it looks like,
9:29 of course,
9:31 and then explain some of the
9:33 rail-optimized design. From there you can
9:36 see the option to expand things and
9:38 whatnot.
9:39 This happens to be using a certain rack
9:42 type. Rather than just click that, I'm
9:44 going to actually go into
9:46 all of the racks and show you all of the
9:48 different types of racks here, for the
9:50 storage fabrics, for the management and
9:52 front-end fabrics,
9:54 and the rack that I was just looking at
9:55 is this one right here.
9:57 Now,
9:58 this does not look like a typical data
10:01 center rack, and in fact this is not a
10:03 physical rack design.
10:05 In this case, in order to accommodate
10:07 NVIDIA's rail-optimized design, what
10:10 we've done is we've built a custom rack
10:12 type inside of Apstra for what we call a
10:14 stripe; in other words, it's a group of
10:17 eight leaf switches. And why eight? Well,
10:20 because you have eight GPUs on the
10:22 server, and like I said, those servers
10:25 that you see here, the 16 of them, could
10:27 be DGX, or they could be
10:29 HGX-based, which just means you can get them
10:31 from, say, a Dell or a Supermicro or
10:33 someone else, and they follow the same
10:35 pattern
10:36 of having that internal switch and eight
10:38 GPUs, and effectively the same build-out
10:40 of hardware.
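In HCL, a stripe like this is modeled as a rack type. The condensed sketch below assumes the apstra_rack_type resource schema from the provider documentation; the logical device IDs, link speeds, and names are placeholders, and only two of the eight rail leafs and two of each server's eight links are written out.

```hcl
# A "stripe": eight rail leafs, with each GPU server cabled once to every leaf.
# Condensed sketch -- logical device IDs and speeds are placeholders;
# rail_leaf_3 through rail_leaf_8 and gpu_3 through gpu_8 follow the same pattern.
resource "apstra_rack_type" "ai_stripe" {
  name                       = "ai-training-stripe"
  fabric_connectivity_design = "l3_clos"

  leaf_switches = {
    rail_leaf_1 = {
      logical_device_id = "AOS-32x400-1" # placeholder logical device
      spine_link_count  = 1
      spine_link_speed  = "400G"
    }
    rail_leaf_2 = {
      logical_device_id = "AOS-32x400-1"
      spine_link_count  = 1
      spine_link_speed  = "400G"
    }
    # ...six more rail leafs...
  }

  generic_systems = {
    gpu_server = {
      count             = 16            # DGX/HGX-class servers per stripe
      logical_device_id = "AOS-8x400-1" # placeholder: one port per GPU rail
      links = {
        gpu_1 = {
          speed              = "400G"
          target_switch_name = "rail_leaf_1"
        }
        gpu_2 = {
          speed              = "400G"
          target_switch_name = "rail_leaf_2"
        }
        # ...one link per remaining rail leaf...
      }
    }
  }
}
```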
10:41 Now,
10:43 what you'll see in this design, called
10:45 rail-optimized, is that in this special network
10:48 for the GPUs to interconnect (again, this
10:50 is for model training, sometimes
10:53 called the back-end network), the eight
10:55 different GPUs inside each server are
10:57 individually cabled up
10:59 to the eight
11:01 different leafs inside of this stripe.
11:04 This is of course not a typical data
11:06 center design. In a typical data
11:08 center design, what you'd expect is probably
11:10 many fewer ports on each server, and
11:13 you'd expect those ports to go up to one
11:15 top-of-rack switch, or sometimes
11:18 aggregated to a pair of switches that
11:22 might be connected with ESI LAG.
11:24 So this is of course very different in
11:27 these IP fabrics. What that
11:30 PXN feature I mentioned is able to do is, when you
11:33 have traffic, let's say, going from GPU
11:36 1 of this server to GPU 1 of the
11:39 second server, they'll of course be able,
11:41 since they're cabled up to
11:43 the same leaf, to directly access each
11:46 other with just a single hop across that
11:49 top-of-rack leaf switch. Now, what if,
11:52 let's say, GPU 1 of this server needs
11:54 to talk to GPU 2 of the other server?
11:58 Well, you'd expect, you'd
12:00 probably expect, I think, that you'd go
12:02 from this leaf switch here to a spine
12:07 switch, and then down to the leaf switch
12:09 to which GPU 2 is connected.
12:13 Yeah, that might make normal sense, but as
12:16 I mentioned, this special NCCL feature
12:18 from NVIDIA, called PXN, allows this
12:20 server to understand the overall
12:22 topology, and it will use the internal
12:24 switch in the server to pass the traffic
12:26 from GPU 1 to GPU 2 locally, and then
12:30 it'll send it to the top-of-rack switch
12:32 representing all of the GPU 2s, and
12:35 then down to this second server. The cool
12:38 thing about that is that it really
12:40 optimizes for latency, and performance is
12:43 really key in these use cases,
12:45 because of course these servers are very
12:48 expensive, hundreds of thousands of
12:49 dollars; each GPU is very expensive, right,
12:52 tens of thousands of dollars. And to the
12:55 extent that your GPUs are sitting idle,
12:59 wasting time waiting for the network,
13:00 you're losing return on investment, of
13:03 course, right? So this is why: the
13:06 network may not seem like the most
13:08 important thing in AI clusters; you'd
13:10 think it's got to be the GPUs and all of
13:12 those servers, there's such a big expense,
13:14 but if your network is holding back the
13:16 performance of your model training and
13:18 your GPUs, what good are your
13:20 GPUs to you? They're not, right? This is
13:22 why the network performance is really
13:24 key when you're designing AI clusters,
13:26 and with these examples you'll
13:29 be following the best practices,
13:31 starting from a foundation that is
13:33 strong, to assure the best performance
13:35 possible.
13:37 When would traffic be going through the
13:39 spine switches, you might ask? Well, let's
13:41 go back to look at the template again,
13:43 and I'll just pick on this smaller
13:46 cluster, because it's a little bit easier
13:47 to see.
13:49 One of those times involves
13:52 these different groupings here, which we
13:55 call stripes, each with 16
13:57 servers and eight leafs: of
14:00 course, if traffic has to go
14:02 from one of these servers to a server in
14:05 a different stripe, then
14:07 the traffic will cross over the spine
14:09 network.
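In HCL, stripes are stamped out under a shared spine layer by a rack-based template. Here is a minimal sketch, assuming the apstra_template_rack_based resource schema from the provider documentation; the spine logical device ID and the counts are placeholders, and ai_stripe refers to the hypothetical rack type sketched earlier.

```hcl
# Stamp several stripes out under a common spine layer.
# Minimal sketch -- logical device ID and counts are placeholders.
resource "apstra_template_rack_based" "ai_cluster" {
  name                     = "ai-training-cluster"
  asn_allocation_scheme    = "unique"
  overlay_control_protocol = "static" # plain IP fabric for the RoCE back end

  spine = {
    logical_device_id = "AOS-64x400-1" # placeholder spine logical device
    count             = 4
  }

  rack_infos = {
    (apstra_rack_type.ai_stripe.id) = { count = 4 } # four stripes of 16 servers each
  }
}
```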
14:10 And when it's crossing the spine
14:12 network, again, there are things that you
14:14 want to design for in the spine network,
14:17 such as dynamic load balancing and
14:19 network congestion management protocols
14:22 like DCQCN, which is a combination of
14:25 ECN and PFC. You can go and read
14:28 about these things in glorious detail in
14:30 the Juniper Junos documentation, and
14:32 we'll talk about automating some of
14:34 those configurations in another video.
14:35 But for this video, I think I'm done
14:38 explaining the rail-optimized design
14:40 that is recommended as a best practice
14:42 from NVIDIA, and you've also seen how
14:45 simple it is to use Terraform to take
14:48 all of these examples, which
14:50 I hope you dig into yourself, and
14:52 apply them into an instance of
14:54 Apstra that is really accessible to
14:56 anyone. So thank you for your time and
14:58 interest in joining me on this quick
15:00 demo journey. Again, this is James Kelly
15:02 from Juniper Networks. Reach out to us at
15:04 Juniper if you're building AI clusters;
15:05 we'd love to help you.