Automating AI Cluster Design With Juniper Apstra
Prevent network bottlenecks by automating AI cluster design.
Don’t let a network bottleneck waste expensive GPU time when building AI clusters. Find out how to run “terraform apply” to import ready-made AI cluster design examples into Juniper® Apstra® software in this video, filmed live at Tech Field Day 18.
Learn more about Juniper Apstra software.
You’ll learn
How to automate AI cluster design using examples that cover many cluster sizes, GPU compute fabrics for model training, storage fabrics, and management fabrics
How to follow the NVIDIA best practice of rail-optimized design, explained with a topology of eight leaf devices grouped into a stripe
Transcript
0:09 My name is James Kelly; I've been at Juniper since 2006. You heard Mansour talk earlier today about the most recent inflection in what's happening in the data center, and that's my topic. I'm also going to be building on what Chris talked about. So I'm going to be talking about AI clusters and doing some demonstrations of what we've got as examples for AI clusters, in terms of the different designs. I'll go through just three slides first, and then I'll actually demonstrate how you can play around with these designs yourself in Apstra Cloud Labs, and how you can look at the entire configuration of all of the switches, with all of the additional things that are put on top of the standard IP fabric reference design in Apstra, specific to AI clusters for training, where the networking problems are a little bit beyond what we see in the traditional data center space.

1:02 All right, so with that quick introduction, I wanted to set the stage for why networking is important when it comes to building AI clusters.
1:15 We're all familiar, I'm assuming, with the NVIDIA stock price, and the reason for that is that GPUs are very expensive. When building AI clusters, the networking cost, whether from a TCO perspective or a capex perspective, is relatively small compared to the overarching bill you're going to be paying, when these servers use somewhere between 10 and 16 times the power of a traditional data center server and cost hundreds of thousands of dollars; if you're talking about an NVIDIA DGX H100, which has eight of the H100 GPUs in it, it's about 400 grand. So when you're spending all of this money, why is the network important? The reason is very simple, and I put it on the slide here: if the network is a bottleneck that delays training job completion, then expensive GPU time is wasted and the training becomes network bound instead of compute bound. Obviously this is a problem. In AI training we want to be compute bound; we want to scale linearly, or as close to it as possible, with the number of GPUs we add, right? But just like a Kubernetes cluster or these other types of distributed data center or cloud applications, what holds all of those distributed applications together? That's the network, right? So at the end of the day, a model to be trained in a cluster of different GPUs is going to have to be distributed across all of the GPUs and effectively networked together. And the way this works, obviously, is that the model itself usually doesn't fit on a single GPU, right? So the model is divided up, and then, for purposes of moving in parallel, the data set, which is also humongous, is divided up too. So there's all of this parallelization that happens in an AI cluster when it comes to training, and as that parallelization needs to get reconciled, for the checking and the evolution of the model over the course of the training across the different jobs, that's where there's tons of communication over the network, right?
3:30 And these are not your average data center networks, as I said. Your average data center is probably connected on your revenue-facing ports at maybe 10 gigs per second or 25 gigs per second; 100 would be considered quite fast. Inside of an AI cluster, the GPU-to-GPU fabric is connected at 400 gigs per second per GPU. I just said that there are eight GPUs inside one of those DGX servers; each one of those has its own 400-gig NIC, so just imagine that. And then the server also has separate NICs that are used to connect to the storage cluster, and then it has a 100-gig NIC (which in these servers is the slow speed) into the frontend management network, where you have some other servers, called headend nodes, that are responsible for coordinating the training jobs across the cluster. So these servers are effectively each connected into three different networks, and the beauty of Apstra, with different blueprints, is that you can actually manage these three networks, which are effectively inside of one data center, from one pane of glass, from one Apstra, right?
4:39 Besides the challenge of the speeds and feeds, there's something called rail-optimized design that is prescribed by NVIDIA. We all know NVIDIA is the 800-pound gorilla in the GPU space; they own the entire stack, and they've done some great stuff in terms of innovating inside the server. There's this technology called NVLink. NVLink inside the server connects all of the GPUs, so that when GPU 1 needs to talk to GPU 2 it doesn't need to go out of the NIC to a switch and then back in; it can go over the NVLink. For this reason, the eight GPU NICs are actually not typically connected to the same leaf, so there's not really a concept of a top-of-rack switch here. And because there are eight GPUs inside the DGX servers, or any of the HGX servers that you'll find (they're modeled on the same pattern, from Dell, Supermicro, Lambda Labs, etc.), you'll see this pattern that I've got in the slide here: for the GPU servers at the bottom, on the GPU fabric side, those eight NICs are cabled up so that the GPU 1 NIC goes to leaf 1, GPU 2 goes to leaf 2, and so on through eight. So you're building your data center, or your cluster, in these groups of eight leaves, and at Juniper we call this group a stripe. That's basically how you build out your data center; it's very different, and kind of surprising, all the way down to the physical cabling of how these things work.
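To make the cabling concrete, this is the pattern within a single stripe, exactly as described above; nothing here goes beyond what's on the slide:

```text
Rail-optimized cabling, one stripe (8 leaves, N x 8-GPU servers):

  every server's GPU/NIC 1 -> leaf 1   (rail 1)
  every server's GPU/NIC 2 -> leaf 2   (rail 2)
  ...
  every server's GPU/NIC 8 -> leaf 8   (rail 8)
```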
6:15 Other than that, it's typically a Clos fabric, so every leaf is connected to every spine. The one thing you typically see in data center networks that's a little bit different here: as a rule of thumb, in the storage and the GPU fabrics we don't recommend doing any kind of oversubscription, so the downward-facing access from the leaves, in terms of aggregate carrying capacity, is the same as what you see in the fabric up to the spines. And you can imagine that if you're using a lot of 400-gig links from your leaves, you're using state-of-the-art switches like the Tomahawk 4 that Mansour mentioned today, for example, and with that you often see a lot of ports going up to the spines, not just a few, which is also a different physical pattern that's not necessarily familiar to many people.

7:09 When it comes to building out data centers, we also often think in terms of fixed-form-factor devices, so you can use the QFX Series, like the Tomahawk 4-based QFX5230, as a leaf; you can use it as a spine as well, but you'd need many spines. The other way you could build out your spine is with Juniper's PTX Series. We call these high-radix switches, but they're basically chassis, modular devices, where you can buy a bunch of spines and then add line cards to them over time and grow in a little bit more of a progressive manner. The cool thing is it comes in four-line-card, eight-line-card, and 16-line-card variants; in the 16-line-card variant that I've got here, you can get up to 576 ports of 400 gig inside the same chassis, and these chassis are 800-gig ready for when the Tomahawk 5-based platform comes out on the QFX side as well.

8:08 So I often get asked: at the scale you're dealing with, James, in these data centers for AI clusters, do you see us going to superspine? Generally speaking, that becomes a lot more expensive because of the 400- and eventually 800-gig optics you would need just to accommodate all of that cabling, which would be very expensive. So really, sticking to a three-stage Clos, a two-tier network, is pretty important, and that's why in large enough networks you typically see the preference for some of these high-radix spines.

8:43 So I put together this one slide that asks: what's Juniper's maximum GPU cluster that you could build in a three-stage Clos? These are roughly the numbers: if you're using the state-of-the-art QFX5230 leaf, which is 64 ports of 400 gig, in your stripes, then you can get 16 H100 servers, or 128 GPUs, in that stripe (sorry, actually 32), and then you can also connect up to 72 stripes if you used 32 of these PTX spines.
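As a back-of-the-envelope check of those figures (assuming the 64 leaf ports split evenly, 32 down and 32 up, for the 1:1 subscription described earlier, which matches the corrected count of 32 servers per stripe):

```text
GPUs per stripe = 8 leaves x 32 downlinks   = 256 GPUs (32 x 8-GPU servers)
Leaves overall  = 72 stripes x 8 leaves     = 576 leaves
                = one 400G link from each leaf to each of the 32 spines,
                  filling a 576-port PTX chassis
Maximum cluster = 72 stripes x 256 GPUs     = 18,432 GPUs
```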
9:22 So that's just a quick example. I'm now going to flip into the demonstration mode that I mentioned.

9:30 These things look a little bit different on slides, and we're used to looking at topologies that way, but inside of Apstra you can also automate this kind of rail-optimized design, where you have stripes of eight leaves. NVIDIA calls it "rail" because they effectively consider GPU 1 going to leaf 1, and all of the GPU 1s inside that stripe going to leaf 1, a rail, and they have different technologies to try to keep traffic within the rail in the ring patterns that happen inside of these training models. So oftentimes, when it comes time to show this to customers, we can build out these sorts of slides; but in Apstra, with the rack as a logical concept, you can actually just build a rack that's composed of eight different leaves. So you can use the rack logical construct in Apstra, use that as your stripe, and that's exactly what we've done with a whole bunch of A100s and H100s.
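To give a feel for what that looks like in code, here is a trimmed-down sketch of a stripe modeled as a rack type, assuming the `apstra_rack_type` resource from the Juniper/apstra Terraform provider; the logical device IDs, counts, and speeds are illustrative placeholders rather than the repository's actual values:

```hcl
resource "apstra_rack_type" "ai_stripe" {
  name                       = "gpu-stripe-h100"
  fabric_connectivity_design = "l3_clos"

  # One leaf per rail; repeat for gpu_leaf_3 through gpu_leaf_8
  # to complete the eight-leaf stripe.
  leaf_switches = {
    gpu_leaf_1 = {
      logical_device_id = "AOS-64x400-leaf" # placeholder logical device
      spine_link_count  = 4
      spine_link_speed  = "400G"
    }
    gpu_leaf_2 = {
      logical_device_id = "AOS-64x400-leaf"
      spine_link_count  = 4
      spine_link_speed  = "400G"
    }
  }

  generic_systems = {
    h100_server = {
      count             = 32
      logical_device_id = "AOS-8x400-server" # placeholder 8-NIC GPU server
      # One link per rail, so NIC n always lands on leaf n.
      links = {
        rail_1 = { speed = "400G", target_switch_name = "gpu_leaf_1" }
        rail_2 = { speed = "400G", target_switch_name = "gpu_leaf_2" }
        # ...rail_3 through rail_8 follow the same pattern.
      }
    }
  }
}
```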
10:33 This is the terraform-apstra-examples repository in the Juniper org; if you go into the different folders, there are different examples. I'm going to be demonstrating this one. It's got a whole bunch of racks and templates and other things in it for storage, but I think the GPU fabric is the most interesting one to look at, because that's where the rail-optimized design of the eight-way leaf happens. So I'm going to demonstrate this one, and it actually shows you a whole bunch of different sizes of GPU clusters of different types and how they're designed. And like I said, you can play around with this in your own instance of Apstra Cloud Labs; I've got another tab here. If you've never heard of it, go to cloudlabs.apstra.com, and you can bring up your own instance of Apstra, like the topology over here, only you don't actually need any physical devices to play around with the demos I'm going to be showing you today. The second demo I'll go through after that is the real-life, lab-based configuration of a GPU cluster that we have at Juniper, with mixed A100 servers and H100 servers in two different stripes, options for both QFX and PTX spines, and all of the additional networking configuration, which I'll talk about when I get into that demo. So I'm going to go through those in order.

11:54 One of you had the question earlier of what comes first, Terraform or the blueprint, right? All of the stuff that the blueprint is actually created from in these AI cluster designs is done over Terraform, running in this case locally from my laptop. All that you need to do when you clone this Git repo is go into these provider files and basically set your admin password and IP address to whatever Apstra Cloud Labs gives you. After that you can just run terraform apply and be off to the races; everything will show up in your instance of Apstra.
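The provider wiring he mentions amounts to only a few lines. A minimal sketch, assuming the Juniper/apstra provider, with the URL and credentials as stand-ins for whatever your Cloud Labs instance hands you:

```hcl
terraform {
  required_providers {
    apstra = {
      source = "Juniper/apstra"
    }
  }
}

provider "apstra" {
  # Credentials can ride along in the URL; Cloud Labs instances use a
  # self-signed certificate, hence the TLS override.
  url                     = "https://admin:<password>@<cloudlabs-ip>:443"
  tls_validation_disabled = true
}
```

From there it's the usual `terraform init` followed by `terraform apply`.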
12:39 So that's what I'm going to do. I've actually done this before. Before I hit terraform apply and answer yes, I'm just going to go into this instance of Apstra and show you that this is a fresh instance; I haven't done anything to it yet. Here, if you go in to look at the racks, for example, or if you were to go in to look at all of the templates (sorry, not config templates, just templates) that are present, there's a lot of out-of-the-box stuff in Apstra. So instead of adding more out-of-the-box stuff in Apstra for all these AI designs, which we could do, we've just built them in Terraform and open-sourced them, and you can import them into your instance of Apstra yourself in a matter of moments, and then you can also tweak them to whatever liking you've got. So I just wanted to show that this is indeed an empty instance of Apstra. I think those of you familiar with Terraform will have seen the "now you see it, now you don't", but in reverse: now you don't see it, and now you do. Once I hit yes here, there's a whole bunch of stuff happening inside of this, and here in my templates design I suddenly have all of these different templates that I can create a blueprint from, of all different sizes, everything between 512 GPUs and 2,048 GPUs, of both A100s and H100s. The difference between those from a networking standpoint, by the way, is that the H100s are connected with a 400-gig NIC and the older A100s are connected with a 200-gig NIC, so there are slightly different connectivity link requirements when you're setting up your Apstra rack.

14:20 If we go in to look at these templates here, you can see what we've done, right? This logical rack construct here holds together the eight leaves; I can optionally show all of the leaves here, and every single server has eight ports, like I said. There are actually also storage ports and frontend ports; those would be on different network fabrics, and I'll show you that in the next demo. But this is basically a nice way that you can get in and play around with rail-optimized designs in Apstra, if you'd like to, and start tweaking them. Whenever we have a customer demo, I ask how many GPUs they're looking at investing in, and I can literally tweak one of these and demonstrate exactly what their fabric is going to look like. So I think it's something kind of fun to play around with, and while AI clusters are still novel, and this pattern of rail-optimized design is novel, it's kind of cool.
15:19 [Audience] Do we have any sort of A/Bs on rail-optimized versus non-rail-optimized, same workload, same...? That's one of the hardest parts we always get: whenever we say you can do amazing things, now 20% more amazing than the previous solution, we have no metrics on it, like rail-optimized. Also, I'm curious where there's configuration discovery, visualization, and ultimately some level of anomaly detection and actionable change. Like, you could say we get rail-optimized, but when does Apstra begin to say, hey, things are changing, you need to change your configuration to align to it?

15:55 [James] Well, remember that these are physical designs, so they're not easy to change on the fly, but there are some hyperscalers that are using non-rail-optimized designs to do different things. Some of the news yesterday from Google at OCP was the Google Falcon reliable transport protocol: they're looking at actually converging the networks down to one and still being able to get the full carrying capacity without the problems of hot spots and cold spots in the network. And the GPU-to-GPU traffic is RDMA, in this case RDMA over Ethernet, so it requires effectively a lossless profile. When rail-optimized design was introduced, one of the things is that it's not just the fact that there's an NVLink switch inside of the server; the driver for the communications library for model training (they call it "nickel", NCCL for short) has this technology called PXN in it, PCI Express times NVLink or something, they call it PXN for short. It's able to detect when you need to send traffic from, let's say, GPU 1 inside of this server to GPU 2 inside of that server. Obviously GPU 1 and GPU 2 are not on the same rail, so they're not connected to the same leaf; what it can do is pass the traffic over to GPU 2's NIC internally on the server, and then you're basically avoiding a hop across the spine in your network. So they're solving for latency, effectively, with that PXN solution. And I think your question is a great question; I don't have the answer of when rail-optimized design is good and when it's not so good. Those are some of the experiments we're running inside the Juniper AI cluster lab, actually. Many people will, for example, run their MLCommons MLPerf tests with all of the servers plugged into the same switch; you're kind of cheating at that point, right, because you don't have any congestion. So our lab is based on this topology, with the right oversubscription ratio for the number of servers that we have, so it's sort of true to form.

18:10 [Audience] Well, just the fact that you can do it in a repeatable way, in which you could see the optimal implementation, because these are going to become more than one-time events. As workloads change and model-training capacity changes, we need to keep going back to a tool that knew why we did it and how we did it the first way. So as a methodology this is the right way to go, because what we figured out for GPUs is going to change too, just like networks.
18:38 [James] Yeah. Okay, let's move on to the next demo; I keep on swiping the wrong way. In the other folder here you'll find a different example. I've actually already applied this one, because it's got a lot more resources, so it takes a couple of minutes to apply. Here you've got not just the logical designs but actually the instantiation of all three blueprints, so all three fabrics: your frontend, your storage, and your GPU-to-GPU fabric.
19:17 ASN pools for the ASN the bgp asns in
19:20 the fabric or the spine to to Le fabric
19:24 um IP addresses from those pools all of
19:26 those things are terraformed um and so
19:30 also all of the connectivity down to the
19:32 server these are typically routed fabric
19:34 so we've used that as a model so SL 31s
19:36 down to the server um besides the
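The pool resources look roughly like this, assuming the provider's `apstra_asn_pool` and `apstra_ipv4_pool` resources; names and ranges are illustrative rather than the repository's actual values:

```hcl
resource "apstra_asn_pool" "fabric_asns" {
  name = "ai-fabric-asns"
  ranges = [
    { first = 64512, last = 64999 }, # one private ASN per switch
  ]
}

resource "apstra_ipv4_pool" "fabric_links" {
  name = "ai-fabric-links"
  subnets = [
    { network = "10.0.0.0/16" }, # carved into /31s for fabric and server links
  ]
}
```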
19:39 Besides the networking, the other thing that's really important inside of an AI cluster that's going to be used for training is congestion avoidance, and congestion management when it happens. One of the challenges we see with traditional ECMP load balancing is the creation of hot spots and cold spots in the network, in particular because we've got a lot of elephant flows in the GPU-to-GPU traffic: there are not a lot of flows, but they're very big, so the probability of the random hashing having hash collisions and generating hotspots in the network is very problematic. So we turn on dynamic load balancing, which can move flows when it sees a small break in them, so it effectively doesn't reorder packets for the end host. In addition to dynamic load balancing, we turn on something called DCQCN, data center quantized congestion notification. It's a couple of protocols that have been around for a while, for RoCE traffic (RoCE is the acronym for RDMA over Converged Ethernet). It requires this protocol called PFC, priority flow control, plus explicit congestion notification, ECN; the combination of these things Juniper calls DCQCN, and it's a pretty standard industry acronym. This is basically looking at the buffer sizes and statistically marking packets to tell the end host to slow down when there's too much congestion. So these protocols are useful too: we try to avoid congestion as best as possible, to maximize the carrying capacity of the fabric; when that's not possible and congestion happens, the host has to back off, and these protocols are the mechanism by which that gets communicated back to the end host. The NICs have to support it, and the NICs that we have certainly do; it's pretty common.
21:29 So inside all of this stuff you'll find zero Junos configuration; this is all HCL. For example, if I look at my Apstra blueprints, you see the resources for these three blueprints, and you can see that I've got some HCL to generate, for example, all of the ASN pools and all of the IPv4 pools. The only place where you do see some Junos configuration, as I mentioned, is this: Apstra has a standard reference design, and that's part of the Juniper Validated Design, the JVD stuff that Mansour talked about earlier too, a repeatable design that's very well tested, that we know is going to be high quality. On top of that we need to add the DLB (dynamic load balancing) and the DCQCN configurations, so we actually do have additional Terraform files to lay down that Junos configuration. So there's a little bit of extra stuff there, I guess you could say.
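That extra configuration rides in as configlets. A hedged sketch, assuming the provider's `apstra_configlet` resource, with an illustrative dynamic-load-balancing stanza rather than the lab's exact Junos configuration:

```hcl
resource "apstra_configlet" "dlb" {
  name = "dynamic-load-balancing"
  generators = [
    {
      config_style  = "junos"
      section       = "top_level_hierarchical"
      template_text = <<-EOT
        forwarding-options {
            enhanced-hash-key {
                ecmp-dlb {
                    flowlet;
                }
            }
        }
      EOT
    },
  ]
}
```

A second configlet along the same lines would carry the DCQCN settings, the PFC and ECN class-of-service pieces described a moment ago.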
22:30 So I've got this in my other window here, and you can see, as I mentioned, you're managing all three of these networks from one place. And this is the state after the Apstra terraform apply happens and creates 290-odd objects; I haven't done anything to it since. If we go into our backend here, you can see it's staged, and instead of a rack composed of 32 servers, which would be a really expensive lab if Juniper invested in that, we've got eight of the A100 servers and four of the H100 servers, cabled up into two different stripes. We have (it's a bit hard to read; let me turn off the links there) what we call a medium stripe and a small stripe. That's also for diversity: one is based on the QFX5220, the other on the QFX5230 (Tomahawk 3 and Tomahawk 4), and we also have a variable inside of that Terraform file that allows you to pick QFX or PTX spines; I think in this case I've got the PTX spines.

23:35 Normally when you create a blueprint in Apstra, you've got to go through the build of it: you have to assign all of the resources from the resource pools, then you have to turn all of the logical devices for the spines and the leaves into interface maps, and then eventually also into actual devices in that inventory list that Chris showed earlier. We don't have any physical devices, so that part is still yellow; this is all done in a virtual instance of Apstra in the cloud, as I mentioned. And then I've got my configlets here too. If you click on one of these configlets you can see, for example, here's my dynamic load balancing profile and all of the switches that it needs to get added to, as a little bit of extra configuration beyond the golden-standard Apstra reference design; it highlights all of these in green and tells you which ones it's applied to as well.

24:30 Other than that, I think a really interesting feature is clicking into one of these leaves: you can see all of your different spine-facing ports up here, for example, and all of your revenue-facing ports down here. Some of these are channelized for the 200-gig connectivity, because all of our ports are 400-gig native speed. The other thing is you can click on Rendered over here in the bottom right, and if you're a networking guy or a JNCIE and you want to actually look at all of the configuration, it's all here. You can scroll through it and see all of this stuff, and way down at the bottom, after all the BGP policy stuff, you've got the configlets generated as well.
25:11 So all of this stuff was generated from Terraform, and it's really nice to be able to play around with. In the actual AI lab at Juniper right now we're running a customer PoC, but I did want to give a shout-out and a thank-you to Chris, as well as Raj Subramanium, who's not here today. Raj has been furiously coding away, adding dashboard support to the Apstra Terraform provider and building an example reference dashboard for AI clusters. He gave me an instance of Apstra just an hour or so ago, and I wanted to show that as well. So now you can actually terraform all of your monitoring: things like the hot and cold spots in the network, or detecting ECMP imbalances, are native in Apstra as intent-based analytics probes, and now you can codify those dashboards for your organization as well.
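The exact resource names this adds to the provider aren't confirmed here, so treat the following as a purely hypothetical sketch of what codified monitoring could look like; `apstra_dashboard` and its attributes are stand-in names, not confirmed provider API:

```hcl
# Hypothetical sketch only: resource and attribute names are stand-ins.
resource "apstra_dashboard" "ai_fabric_health" {
  blueprint_id = apstra_datacenter_blueprint.gpu_fabric.id
  name         = "ai-cluster-health"
  # Widgets would reference the intent-based analytics probes mentioned
  # above, e.g. ECMP-imbalance and hot/cold-spot detection.
}
```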
26:05 That is of course what people doing GitOps and infrastructure as code do, right? If you've got a Kubernetes cluster as code, all of your Prometheus and your Grafana deployment is also probably done as code, so it's the same thing from an Apstra monitoring perspective. Now, this dashboard doesn't have any interesting information on it, because it's not running any traffic; that's the part where we have to actually apply this to our lab and run traffic across it, so stay tuned for that demo, probably something that I'll build in the future. But all of these probes and widgets and things like that were also dynamically generated.