Don’t Let Your AI Get Caught in Traffic
Demo: Automate Juniper Apstra software using Apstra Terraform provider for AI clusters.
In this hands-on demo, James Kelly shows step-by-step how to use Juniper Apstra® software to optimize AI networking fabrics to avoid the slowing effects of congestion and packet drops.
You’ll learn
How to set up a working AI cluster with three different fabrics in Apstra software
How dynamic load balancing helps alleviate congestion vs. regular ECMP load balancing
Transcript
0:00 don't get caught in traffic don't play
0:02 in traffic and stay in your lane today
0:05 we're going to be paying attention to
0:07 exactly none of that and doing the exact
0:10 opposite I'm James Kelly from Juniper
0:12 Networks we're going to be talking about
0:14 playing in traffic at high speeds in AI
0:17 clusters and exactly how you manage that
0:19 let's get into it all right let's take a
0:21 look at our handy Terraform Apstra
0:24 examples repository if you were with me
0:27 in the last demo about AI topology
0:31 design you'll know that this demo is
0:33 going to show you how to automate Apstra
0:36 with our Apstra Terraform provider for
0:39 AI clusters last time we were talking
0:41 about rail-optimized design an NVIDIA
0:43 prescribed feature and how we make that
0:45 easy with Juniper Apstra our intent-based
0:48 multivendor fabric manager at
0:50 Juniper Networks today we're going to be
0:52 talking about a different part of our
0:56 best practice recommended design and
0:58 last time we were looking at this
0:59 subfolder of the examples I'm going to
1:01 go into this example lab of a real AI
1:06 cluster that we've set up with three
1:08 different fabrics or three different
1:10 so-called blueprints in Apstra and I've
1:15 actually got an instance of Apstra spun
1:17 up in Cloud Labs you'll know that I
1:19 mentioned last time too that uh you can
1:21 delete and create your own topologies
1:23 here and uh just a few minutes after
1:28 a click of a button you'll have
1:29 access to Apstra and be able to log in
1:32 so when you open it up in a new tab you
1:34 basically have this and it asks you to
1:36 log in and it gives you the credentials
1:37 right
1:38 there also like last time I'm going to
1:42 start off by uh showing you my GitHub
1:45 Desktop I've just pulled down the
1:48 repository that I was showing you in my
1:49 browser Terraform Apstra examples that
1:52 has all of the Juniper examples in there
1:56 and we're going to be using that one of
1:57 them of course so I'm going to open this
1:59 up in Visual Studio
2:02 code like last time the first thing that
2:04 I have to do is actually point my
2:07 provider at my instance of
2:12 Apstra and I'm going to just copy and
2:15 paste this I had a quick note outside of
2:18 the screen recording that's how I did
2:20 that so easily but admin is the username
2:23 amazing catv your dollar sign is the
2:25 password and this is the IP address that
2:27 you saw in the last
2:30 browser tab let me just flash back there
2:33 so
2:34 that you know you can see exactly what
2:37 I'm talking about here's the IP address
2:40 here's the credentials all right
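The step above, pointing the provider at an Apstra instance with a username, password, and IP address, can be sketched in a few lines. This is a minimal illustration, assuming the provider is configured with a single URL of the form https://user:password@host, in which case special characters in the password (such as a dollar sign) need percent-encoding. The function name `apstra_url`, the address, and the password below are hypothetical placeholders, not values from the demo.

```python
from urllib.parse import quote, unquote, urlsplit


def apstra_url(host: str, username: str, password: str) -> str:
    """Build an https URL with percent-encoded credentials.

    Assumption: the provider accepts credentials embedded in its URL.
    safe="" forces characters such as '$' and '/' to be encoded too.
    """
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"https://{user}:{secret}@{host}"


# Hypothetical values; substitute the ones shown by your own lab instance.
url = apstra_url("192.0.2.10", "admin", "Example$Pass/1")
parts = urlsplit(url)
print(parts.hostname, parts.username, unquote(parts.password))
```

Keeping credentials in a URL inside a file is convenient for a throwaway lab, but they are easy to commit by accident, so anything longer-lived is better served by environment variables or a secrets store.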
2:44 so back to this I'm just going to now uh
2:49 save this
2:50 file and take a look at uh what we're
2:54 talking about in our blog was dynamic
2:57 load balancing and how it helps
2:59 alleviate congestion over regular ECMP
3:03 load balancing and the possibilities of
3:06 using some packet spraying in the future
3:08 and other forms of load balancing um as
3:11 part of Apstra reference designs those
3:14 intent-based very nicely validated
3:17 designs that are laid down as a data
3:20 center network fabric um we're going to
3:23 be using the layer 3 routed only designs
3:27 and as part of that it doesn't include
3:30 the configuration in Junos for dynamic
3:33 load balancing if you want to learn more
3:35 about dynamic load balancing I'm going
3:37 to keep the video short and ask you to
3:39 go and read the details in the blog it's
3:41 also a little bit of self promotion
3:42 since I wrote the blog I would encourage
3:45 you to read it I would hope um this is
3:47 the actual very simple configuration in
3:49 Junos and if you're familiar with Junos
3:51 the stanza sort of syntax should
3:54 look very familiar to you you can change
3:57 the inactivity timer in microseconds or you
4:00 can just leave it out completely and
4:01 it'll default to
4:03 256 I wanted to show that to you um also
4:07 unlike last time where we only looked at
4:09 designs I've got a blueprints file here
4:12 and this really creates the blueprints
4:15 creates some of the resources that are
4:16 necessary for the blueprints and I'm
4:18 going to be applying as I described in
4:21 the blog the configlet that was created
4:25 in that other file to two different
4:27 fabrics one for my backend GPU-to-GPU
4:31 fabric and one for my storage fabric
4:34 both of those are the RoCE very high
4:36 bandwidth fairly low flow count fabrics
4:40 that'll highly benefit from dynamic load
4:42 balancing as compared to you know other
4:44 kinds of load balancing and as an
4:47 example you could apply this to all
4:48 devices we've just you know given this
4:50 little condition in here if you only
4:52 wanted to apply it to the leaves for
4:54 example if you had let's say single
4:56 links between your leaves and spines
4:59 dynamic load balancing really wouldn't help
5:01 you at all on the way back from the
5:03 spine down to the leaf that would be an
5:05 example of how you would do that all
5:07 right so I'm not going to go into any
5:09 more about the configlet I'll let you
5:11 check that out at your own leisure what
5:13 I will show you though is that coming in
5:17 here like last time we can before I do
5:20 this I need to change into the right
5:25 folder and I'll do a terraform init we
5:29 just make sure that I have the right
5:31 version installed the most recent
5:32 version of the Apstra Terraform
5:36 provider and if I do a terraform plan
5:39 this is a you know blank slate in terms of
5:42 my instance of Apstra that I spun up it'll
5:44 say that there's 58 different resources
5:47 to
5:48 add and I will do a terraform
5:52 apply and before I type yes what I can
5:56 do I suppose is perhaps just kind
5:58 of split the screen here and let's
6:01 look at
6:03 Apstra and you'll remember how all of
6:05 the designs showed up for example under
6:08 the racks and the templates in the last
6:10 demo that I did this time if we go into
6:13 blueprints you see that there's
6:14 absolutely no blueprints here right
6:18 now as soon as I say yes here it's going
6:21 to start firing away at the abstra AP
6:25 the teror form provider for Abra is
6:27 implemented in go and that uses a go SDK
6:30 for abstra was also um open sourced and
6:34 all of this stuff is going just in a
6:36 matter of seconds creating three
6:39 different fabric configurations inside
6:42 of abstr now what this terraform
6:45 actually doesn't do that will be adding
6:47 in the future as part of my next demo as
6:49 I'm going step by step through this is
6:52 additional configlets I talked in the
6:54 blog about you know hinting at DCQCN in
6:56 the next and then looking at some
6:58 analytics um so this isn't actually
7:01 going to commit anything to the Juniper
7:03 Junos devices we'll save that for a later
7:05 demo but you can see basically now that
7:07 the Terraform stuff is all done it's
7:10 gone and created the backend GPU fabric
7:13 backend storage fabric and let's now
7:18 make this window a bit
7:19 bigger the frontend management fabric
7:22 and if you look inside of any one of
7:24 these you'll have to come into the Staged
7:27 tab and you can see that there's certain
7:29 types of resources that were allocated
7:31 this was done in that blueprints uh .tf
7:34 file and there's certain things for
7:36 example like the spine and the leaf that
7:39 aren't yet actually allocated right so
7:43 if you wanted to for example go and you
7:46 know allocate those here you could do
7:48 that manually as I said we're going to
7:49 save all of that automation for a later
7:52 demo and show that to you as well and of
7:55 course none of this is committed because
7:57 for it to be committed we have to be
7:58 using real devices as I'm putting this
8:00 together quickly uh like I said one step
8:02 at a time so we will get there in later
8:04 demos stay tuned for how this actually
8:07 shows up at the device level and then
8:10 you can kind of see for example when you
8:12 see one of these Juniper devices all of
8:14 the configuration for it and we'll take
8:17 a look at how that configlet gets
8:18 applied here in the next demo in the
8:21 meantime you can see that uh this is the
8:23 front-end fabric where we actually
8:24 didn't apply the configlet so I should look
8:26 at let's say the storage or the other
8:28 one and when I go in there if I go into
8:32 the configlets here excuse me you'll
8:34 see this configlet here DLB for AI leaves
8:39 that was the configlet that we provide uh
8:42 provisioned excuse me using
8:46 Terraform all right that concludes the
8:48 demo I hope you'll go and check out this
8:50 repository you can easily pull it down
8:52 and walk through these things using
8:54 Apstra Cloud Labs and you can see how
8:56 easy it is to create and automate stuff
8:58 in Apstra
9:00 and stay tuned for the next step on the
9:03 journey to creating and automating AI
9:06 training clusters with Juniper Networks
9:09 I'm James Kelly thank you
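
The demo's central claim, that dynamic load balancing relieves congestion better than regular ECMP on high-bandwidth fabrics with a low flow count, can be illustrated with a toy simulation. The sketch below is not how a switch implements DLB. It assumes four equal uplinks, eight synchronized RoCEv2-style flows, cumulative bytes as a stand-in for link quality, and a 256-microsecond flowlet inactivity interval (the default mentioned in the transcript). All names, addresses, and traffic parameters are illustrative.

```python
import zlib

LINKS = 4             # assumed number of equal-cost uplinks
INACTIVITY_US = 256   # flowlet inactivity interval, per the transcript
PKT_BYTES = 4096      # illustrative packet size


def ecmp_link(flow: tuple) -> int:
    """Static ECMP: a hash of the 5-tuple pins the flow to one link."""
    return zlib.crc32(repr(flow).encode()) % LINKS


def simulate(packets, dynamic: bool) -> list:
    """Return bytes carried per link for time-sorted (t_us, flow, size)."""
    load = [0] * LINKS
    last_seen, current = {}, {}
    for t, flow, size in packets:
        if dynamic:
            idle = t - last_seen.get(flow, float("-inf"))
            if idle > INACTIVITY_US:
                # Gap exceeded the timer: a new flowlet starts, so the flow
                # can move to the least-loaded link without reordering.
                current[flow] = min(range(LINKS), key=load.__getitem__)
            link = current[flow]
            last_seen[flow] = t
        else:
            link = ecmp_link(flow)
        load[link] += size
    return load


# Eight elephant flows to UDP port 4791 (RoCEv2), bursting in lockstep.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "udp")
         for i in range(8)]
packets = [(burst * 2000 + pkt * 10, flow, PKT_BYTES)
           for burst in range(5)      # 5 bursts, 2000 us apart
           for pkt in range(10)       # 10 packets per burst, 10 us apart
           for flow in flows]

ecmp = simulate(packets, dynamic=False)
dlb = simulate(packets, dynamic=True)
print("ECMP bytes per link:", ecmp)
print("DLB  bytes per link:", dlb)
```

With only eight flows, the static hash has few chances to even out, so some links can end up carrying several flows while others sit idle. The flowlet-based placement spreads the same traffic evenly in this setup because every burst is re-placed on the least-loaded link once the inactivity gap has passed.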