Automating AI Cluster Design With Juniper Apstra
Prevent network bottlenecks by automating AI cluster design.
Don’t let a network bottleneck waste expensive GPU time when building AI clusters. Find out how to run “terraform apply” to import ready-made AI cluster design examples into Juniper® Apstra® software in this video, filmed live at Tech Field Day 18.
Learn more about Juniper Apstra software.
You’ll learn
How to automate AI cluster design using examples that cover many cluster sizes, GPU compute fabrics for model training, storage fabrics, and management fabrics
How to follow the NVIDIA best practice of rail-optimized design, explained with a topology of eight leaf devices grouped into a stripe
Transcript
0:09 My name is James Kelly; I've been at Juniper since 2006. You heard Mansour talk earlier today about the most recent inflection in what's happening in the data center, and that's my topic. I'm also going to be building on what Chris talked about. So I'm going to be talking about AI clusters and doing some demonstrations of what we've got as examples for AI clusters, in terms of the different designs. I'll go through just three slides first, and then I'll actually demonstrate how you can play around with these designs yourself in Apstra Cloud Labs, and how you can look at the entire configuration of all of the switches, with all of the additional things that are put on top of the standard IP fabric reference design in Apstra, specific to AI clusters for training, where the networking problems are a little bit beyond what we see in the traditional data center space.

1:02 All right, so with that quick introduction, I wanted to set the stage for why networking is important when it comes to building AI clusters.
1:15 We're all familiar, I'm assuming, with the NVIDIA stock price, and the reason for that is that GPUs are very expensive. When building AI clusters, the networking cost, whether from a TCO perspective or a capex perspective, is relatively small compared to the overarching bill you're going to be paying, when these servers use somewhere between 10 and 16 times the power of a traditional data center server and cost hundreds of thousands of dollars; if you're talking about an NVIDIA DGX H100, which has eight of the H100 GPUs in it, it's about 400 grand. So when you're spending all of this money, why is the network important? The reason is very simple, and I put it on the slide here: if the network is a bottleneck that delays training job completion, then expensive GPU time is wasted and the training becomes network bound instead of compute bound. Obviously this is a problem. In AI training we want to be compute bound; we want to scale linearly, or as close to it as possible, with the number of GPUs we add, right? But just like a Kubernetes cluster or these other types of distributed data center or cloud applications, what holds all of those distributed applications together? That's the network, right? So at the end of the day, a model to be trained in a cluster of different GPUs is going to have to be distributed across all of the GPUs and effectively networked together. And the way this works, obviously, is that the model itself usually doesn't fit on a single GPU, right? So the model is divided up, and then, for purposes of moving in parallel, the data set, which is also humongous, is divided up too. So there's all of this parallelization that happens in an AI cluster when it comes to training, and as that parallelization needs to get reconciled, for the checking and the evolution of the model over the course of the training across the different jobs, that's where there's tons of communication over the network, right?
3:30 And these are not your average data center networks, as I said. Your average data center is probably connected on your revenue-facing ports at maybe 10 gigs per second or 25 gigs per second; 100 would be considered quite fast. Inside of an AI cluster, the GPU-to-GPU fabric is connected at 400 gigs per second per GPU. I just said that there are eight GPUs inside one of those DGX servers; each one of those has its own 400-gig NIC, so just imagine that. And then the server also has separate NICs that are used to connect to the storage cluster, and then it has a 100-gig NIC (which in these servers is the slow speed) into the frontend management network, where you have some other servers, called headend nodes, that are responsible for coordinating the training jobs across the cluster. So these servers are effectively each connected into three different networks, and the beauty of Apstra, with different blueprints, is that you can actually manage these three networks, which are effectively inside of one data center, from one pane of glass, from one Apstra, right?
4:39 Besides the challenge of the speeds and feeds, there's something called rail-optimized design that is prescribed by NVIDIA. We all know NVIDIA is the 800-pound gorilla in the GPU space; they own the entire stack, and they've done some great stuff in terms of innovating inside the server. There's this technology called NVLink. NVLink inside the server connects all of the GPUs, so that when GPU 1 needs to talk to GPU 2 it doesn't need to go out of the NIC to a switch and then back in; it can go over the NVLink. For this reason, the eight GPU NICs are actually not typically connected to the same leaf, so there's not really a concept of a top-of-rack switch here. And because there are eight GPUs inside the DGX servers, or any of the HGX servers that you'll find (they're modeled on the same pattern, from Dell, Supermicro, Lambda Labs, etc.), you'll see this pattern that I've got in the slide here: for the GPU servers at the bottom, on the GPU fabric side, those eight NICs are cabled up so that the GPU 1 NIC goes to leaf 1, GPU 2 goes to leaf 2, and so on through eight. So you're building your data center, or your cluster, in these groups of eight leaves, and at Juniper we call this group a stripe. That's basically how you build out your data center; it's very different, and kind of surprising, all the way down to the physical cabling of how these things work.
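To make the cabling concrete, this is the pattern within a single stripe, exactly as described above; nothing here goes beyond what's on the slide:

```text
Rail-optimized cabling, one stripe (8 leaves, N x 8-GPU servers):

  every server's GPU/NIC 1 -> leaf 1   (rail 1)
  every server's GPU/NIC 2 -> leaf 2   (rail 2)
  ...
  every server's GPU/NIC 8 -> leaf 8   (rail 8)
```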
6:15 Other than that, it's typically a Clos fabric, so every leaf is connected to every spine. The one thing you typically see in data center networks that's a little bit different here: as a rule of thumb, in the storage and the GPU fabrics we don't recommend doing any kind of oversubscription, so the downward-facing access from the leaves, in terms of aggregate carrying capacity, is the same as what you see in the fabric up to the spines. And you can imagine that if you're using a lot of 400-gig links from your leaves, you're using state-of-the-art switches like the Tomahawk 4 that Mansour mentioned today, for example, and with that you often see a lot of ports going up to the spines, not just a few, which is also a different physical pattern that's not necessarily familiar to many people.

7:09 When it comes to building out data centers, we also often think in terms of fixed-form-factor devices, so you can use the QFX Series, like the Tomahawk 4-based QFX5230, as a leaf; you can use it as a spine as well, but you'd need many spines. The other way you could build out your spine is with Juniper's PTX Series. We call these high-radix switches, but they're basically chassis, modular devices, where you can buy a bunch of spines and then add line cards to them over time and grow in a little bit more of a progressive manner. The cool thing is it comes in four-line-card, eight-line-card, and 16-line-card variants; in the 16-line-card variant that I've got here, you can get up to 576 ports of 400 gig inside the same chassis, and these chassis are 800-gig ready for when the Tomahawk 5-based platform comes out on the QFX side as well.

8:08 So I often get asked: at the scale you're dealing with, James, in these data centers for AI clusters, do you see us going to superspine? Generally speaking, that becomes a lot more expensive because of the 400- and eventually 800-gig optics you would need just to accommodate all of that cabling, which would be very expensive. So really, sticking to a three-stage Clos, a two-tier network, is pretty important, and that's why in large enough networks you typically see the preference for some of these high-radix spines.

8:43 So I put together this one slide that asks: what's Juniper's maximum GPU cluster that you could build in a three-stage Clos? These are roughly the numbers: if you're using the state-of-the-art QFX5230 leaf, which is 64 ports of 400 gig, in your stripes, then you can get 16 H100 servers, or 128 GPUs, in that stripe (sorry, actually 32), and then you can also connect up to 72 stripes if you used 32 of these PTX spines.
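As a back-of-the-envelope check of those figures (assuming the 64 leaf ports split evenly, 32 down and 32 up, for the 1:1 subscription described earlier, which matches the corrected count of 32 servers per stripe):

```text
GPUs per stripe = 8 leaves x 32 downlinks   = 256 GPUs (32 x 8-GPU servers)
Leaves overall  = 72 stripes x 8 leaves     = 576 leaves
                = one 400G link from each leaf to each of the 32 spines,
                  filling a 576-port PTX chassis
Maximum cluster = 72 stripes x 256 GPUs     = 18,432 GPUs
```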
9:22 So that's just a quick example. I'm now going to flip into the demonstration mode that I mentioned.

9:30 These things look a little bit different on slides, and we're used to looking at topologies that way, but inside of Apstra you can also automate this kind of rail-optimized design, where you have stripes of eight leaves. NVIDIA calls it "rail" because they effectively consider GPU 1 going to leaf 1, and all of the GPU 1s inside that stripe going to leaf 1, a rail, and they have different technologies to try to keep traffic within the rail in the ring patterns that happen inside of these training models. So oftentimes, when it comes time to show this to customers, we can build out these sorts of slides; but in Apstra, with the rack as a logical concept, you can actually just build a rack that's composed of eight different leaves. So you can use the rack logical construct in Apstra, use that as your stripe, and that's exactly what we've done with a whole bunch of A100s and H100s.
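To give a feel for what that looks like in code, here is a trimmed-down sketch of a stripe modeled as a rack type, assuming the `apstra_rack_type` resource from the Juniper/apstra Terraform provider; the logical device IDs, counts, and speeds are illustrative placeholders rather than the repository's actual values:

```hcl
resource "apstra_rack_type" "ai_stripe" {
  name                       = "gpu-stripe-h100"
  fabric_connectivity_design = "l3_clos"

  # One leaf per rail; repeat for gpu_leaf_3 through gpu_leaf_8
  # to complete the eight-leaf stripe.
  leaf_switches = {
    gpu_leaf_1 = {
      logical_device_id = "AOS-64x400-leaf" # placeholder logical device
      spine_link_count  = 4
      spine_link_speed  = "400G"
    }
    gpu_leaf_2 = {
      logical_device_id = "AOS-64x400-leaf"
      spine_link_count  = 4
      spine_link_speed  = "400G"
    }
  }

  generic_systems = {
    h100_server = {
      count             = 32
      logical_device_id = "AOS-8x400-server" # placeholder 8-NIC GPU server
      # One link per rail, so NIC n always lands on leaf n.
      links = {
        rail_1 = { speed = "400G", target_switch_name = "gpu_leaf_1" }
        rail_2 = { speed = "400G", target_switch_name = "gpu_leaf_2" }
        # ...rail_3 through rail_8 follow the same pattern.
      }
    }
  }
}
```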
10:33 This is the terraform-apstra-examples repository in the Juniper org; if you go into the different folders, there are different examples. I'm going to be demonstrating this one. It's got a whole bunch of racks and templates and other things in it for storage, but I think the GPU fabric is the most interesting one to look at, because that's where the rail-optimized design of the eight-way leaf happens. So I'm going to demonstrate this one, and it actually shows you a whole bunch of different sizes of GPU clusters of different types and how they're designed. And like I said, you can play around with this in your own instance of Apstra Cloud Labs; I've got another tab here. If you've never heard of it, go to cloudlabs.apstra.com, and you can bring up your own instance of Apstra, like the topology over here, only you don't actually need any physical devices to play around with the demos I'm going to be showing you today. The second demo I'll go through after that is the real-life, lab-based configuration of a GPU cluster that we have at Juniper, with mixed A100 servers and H100 servers in two different stripes, options for both QFX and PTX spines, and all of the additional networking configuration, which I'll talk about when I get into that demo. So I'm going to go through those in order.

11:54 One of you had the question earlier of what comes first, Terraform or the blueprint, right? All of the stuff that the blueprint is actually created from in these AI cluster designs is done over Terraform, running in this case locally from my laptop. All that you need to do when you clone this Git repo is go into these provider files and basically set your admin password and IP address to whatever Apstra Cloud Labs gives you. After that you can just run terraform apply and be off to the races; everything will show up in your instance of Apstra.
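The provider wiring he mentions amounts to only a few lines. A minimal sketch, assuming the Juniper/apstra provider, with the URL and credentials as stand-ins for whatever your Cloud Labs instance hands you:

```hcl
terraform {
  required_providers {
    apstra = {
      source = "Juniper/apstra"
    }
  }
}

provider "apstra" {
  # Credentials can ride along in the URL; Cloud Labs instances use a
  # self-signed certificate, hence the TLS override.
  url                     = "https://admin:<password>@<cloudlabs-ip>:443"
  tls_validation_disabled = true
}
```

From there it's the usual `terraform init` followed by `terraform apply`.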
12:39 So that's what I'm going to do. I've actually done this before. Before I hit terraform apply and answer yes, I'm just going to go into this instance of Apstra and show you that this is a fresh instance; I haven't done anything to it yet. Here, if you go in to look at the racks, for example, or if you were to go in to look at all of the templates (sorry, not config templates, just templates) that are present, there's a lot of out-of-the-box stuff in Apstra. So instead of adding more out-of-the-box stuff in Apstra for all these AI designs, which we could do, we've just built them in Terraform and open-sourced them, and you can import them into your instance of Apstra yourself in a matter of moments, and then you can also tweak them to whatever liking you've got. So I just wanted to show that this is indeed an empty instance of Apstra. I think those of you familiar with Terraform will have seen the "now you see it, now you don't", but in reverse: now you don't see it, and now you do. Once I hit yes here, there's a whole bunch of stuff happening inside of this, and here in my templates design I suddenly have all of these different templates that I can create a blueprint from, of all different sizes, everything between 512 GPUs and 2,048 GPUs, of both A100s and H100s. The difference between those from a networking standpoint, by the way, is that the H100s are connected with a 400-gig NIC and the older A100s are connected with a 200-gig NIC, so there are slightly different connectivity link requirements when you're setting up your Apstra rack.

14:20 If we go in to look at these templates here, you can see what we've done, right? This logical rack construct here holds together the eight leaves; I can optionally show all of the leaves here, and every single server has eight ports, like I said. There are actually also storage ports and frontend ports; those would be on different network fabrics, and I'll show you that in the next demo. But this is basically a nice way that you can get in and play around with rail-optimized designs in Apstra, if you'd like to, and start tweaking them. Whenever we have a customer demo, I ask how many GPUs they're looking at investing in, and I can literally tweak one of these and demonstrate exactly what their fabric is going to look like. So I think it's something kind of fun to play around with, and while AI clusters are still novel, and this pattern of rail-optimized design is novel, it's kind of cool.
15:19 [Audience] Do we have any sort of A/Bs on rail-optimized versus non-rail-optimized, same workload, same...? That's one of the hardest parts we always get: whenever we say you can do amazing things, now 20% more amazing than the previous solution, we have no metrics on it, like rail-optimized. Also, I'm curious where there's configuration discovery, visualization, and ultimately some level of anomaly detection and actionable change. Like, you could say we get rail-optimized, but when does Apstra begin to say, hey, things are changing, you need to change your configuration to align to it?

15:55 [James] Well, remember that these are physical designs, so they're not easy to change on the fly, but there are some hyperscalers that are using non-rail-optimized designs to do different things. Some of the news yesterday from Google at OCP was the Google Falcon reliable transport protocol: they're looking at actually converging the networks down to one and still being able to get the full carrying capacity without the problems of hot spots and cold spots in the network. And the GPU-to-GPU traffic is RDMA, in this case RDMA over Ethernet, so it requires effectively a lossless profile. When rail-optimized design was introduced, one of the things is that it's not just the fact that there's an NVLink switch inside of the server; the driver for the communications library for model training (they call it "nickel", NCCL for short) has this technology called PXN in it, PCI Express times NVLink or something, they call it PXN for short. It's able to detect when you need to send traffic from, let's say, GPU 1 inside of this server to GPU 2 inside of that server. Obviously GPU 1 and GPU 2 are not on the same rail, so they're not connected to the same leaf; what it can do is pass the traffic over to GPU 2's NIC internally on the server, and then you're basically avoiding a hop across the spine in your network. So they're solving for latency, effectively, with that PXN solution. And I think your question is a great question; I don't have the answer of when rail-optimized design is good and when it's not so good. Those are some of the experiments we're running inside the Juniper AI cluster lab, actually. Many people will, for example, run their MLCommons MLPerf tests with all of the servers plugged into the same switch; you're kind of cheating at that point, right, because you don't have any congestion. So our lab is based on this topology, with the right oversubscription ratio for the number of servers that we have, so it's sort of true to form.

18:10 [Audience] Well, just the fact that you can do it in a repeatable way, in which you could see the optimal implementation, because these are going to become more than one-time events. As workloads change and model-training capacity changes, we need to keep going back to a tool that knew why we did it and how we did it the first way. So as a methodology this is the right way to go, because what we figured out for GPUs is going to change too, just like networks.
18:38 [James] Yeah. Okay, let's move on to the next demo; I keep on swiping the wrong way. In the other folder here you'll find a different example. I've actually already applied this one, because it's got a lot more resources, so it takes a couple of minutes to apply. Here you've got not just the logical designs but actually the instantiation of all three blueprints, so all three fabrics: your frontend, your storage, and your GPU-to-GPU fabric.
19:17 ASN pools for the ASN the bgp asns in
19:20 the fabric or the spine to to Le fabric
19:24 um IP addresses from those pools all of
19:26 those things are terraformed um and so
19:30 also all of the connectivity down to the
19:32 server these are typically routed fabric
19:34 so we've used that as a model so SL 31s
19:36 down to the server um besides the
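The pool resources look roughly like this, assuming the provider's `apstra_asn_pool` and `apstra_ipv4_pool` resources; names and ranges are illustrative rather than the repository's actual values:

```hcl
resource "apstra_asn_pool" "fabric_asns" {
  name = "ai-fabric-asns"
  ranges = [
    { first = 64512, last = 64999 }, # one private ASN per switch
  ]
}

resource "apstra_ipv4_pool" "fabric_links" {
  name = "ai-fabric-links"
  subnets = [
    { network = "10.0.0.0/16" }, # carved into /31s for fabric and server links
  ]
}
```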
19:39 Besides the networking, the other thing that's really important inside of an AI cluster that's going to be used for training is congestion avoidance, and congestion management when it happens. One of the challenges we see with traditional ECMP load balancing is the creation of hot spots and cold spots in the network, in particular because we've got a lot of elephant flows in the GPU-to-GPU traffic: there are not a lot of flows, but they're very big, so the probability of the random hashing having hash collisions and generating hotspots in the network is very problematic. So we turn on dynamic load balancing, which can move flows when it sees a small break in them, so it effectively doesn't reorder packets for the end host. In addition to dynamic load balancing, we turn on something called DCQCN, data center quantized congestion notification. It's a couple of protocols that have been around for a while, for RoCE traffic (RoCE is the acronym for RDMA over Converged Ethernet). It requires this protocol called PFC, priority flow control, plus explicit congestion notification, ECN; the combination of these things Juniper calls DCQCN, and it's a pretty standard industry acronym. This is basically looking at the buffer sizes and statistically marking packets to tell the end host to slow down when there's too much congestion. So these protocols are useful too: we try to avoid congestion as best as possible, to maximize the carrying capacity of the fabric; when that's not possible and congestion happens, the host has to back off, and these protocols are the mechanism by which that gets communicated back to the end host. The NICs have to support it, and the NICs that we have certainly do; it's pretty common.
21:29 So inside all of this stuff you'll find zero Junos configuration; this is all HCL. For example, if I look at my Apstra blueprints, you see the resources for these three blueprints, and you can see that I've got some HCL to generate, for example, all of the ASN pools and all of the IPv4 pools. The only place where you do see some Junos configuration, as I mentioned, is this: Apstra has a standard reference design, and that's part of the Juniper Validated Design, the JVD stuff that Mansour talked about earlier too, a repeatable design that's very well tested, that we know is going to be high quality. On top of that we need to add the DLB (dynamic load balancing) and the DCQCN configurations, so we actually do have additional Terraform files to lay down that Junos configuration. So there's a little bit of extra stuff there, I guess you could say.
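That extra configuration rides in as configlets. A hedged sketch, assuming the provider's `apstra_configlet` resource, with an illustrative dynamic-load-balancing stanza rather than the lab's exact Junos configuration:

```hcl
resource "apstra_configlet" "dlb" {
  name = "dynamic-load-balancing"
  generators = [
    {
      config_style  = "junos"
      section       = "top_level_hierarchical"
      template_text = <<-EOT
        forwarding-options {
            enhanced-hash-key {
                ecmp-dlb {
                    flowlet;
                }
            }
        }
      EOT
    },
  ]
}
```

A second configlet along the same lines would carry the DCQCN settings, the PFC and ECN class-of-service pieces described a moment ago.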
22:30 So I've got this in my other window here, and you can see, as I mentioned, you're managing all three of these networks from one place. And this is the state after the Apstra terraform apply happens and creates 290-odd objects; I haven't done anything to it since. If we go into our backend here, you can see it's staged, and instead of a rack composed of 32 servers, which would be a really expensive lab if Juniper invested in that, we've got eight of the A100 servers and four of the H100 servers, cabled up into two different stripes. We have (it's a bit hard to read; let me turn off the links there) what we call a medium stripe and a small stripe. That's also for diversity: one is based on the QFX5220, the other on the QFX5230 (Tomahawk 3 and Tomahawk 4), and we also have a variable inside of that Terraform file that allows you to pick QFX or PTX spines; I think in this case I've got the PTX spines.

23:35 Normally when you create a blueprint in Apstra, you've got to go through the build of it: you have to assign all of the resources from the resource pools, then you have to turn all of the logical devices for the spines and the leaves into interface maps, and then eventually also into actual devices in that inventory list that Chris showed earlier. We don't have any physical devices, so that part is still yellow; this is all done in a virtual instance of Apstra in the cloud, as I mentioned. And then I've got my configlets here too. If you click on one of these configlets you can see, for example, here's my dynamic load balancing profile and all of the switches that it needs to get added to, as a little bit of extra configuration beyond the golden-standard Apstra reference design; it highlights all of these in green and tells you which ones it's applied to as well.

24:30 Other than that, I think a really interesting feature is clicking into one of these leaves: you can see all of your different spine-facing ports up here, for example, and all of your revenue-facing ports down here. Some of these are channelized for the 200-gig connectivity, because all of our ports are 400-gig native speed. The other thing is you can click on Rendered over here in the bottom right, and if you're a networking guy or a JNCIE and you want to actually look at all of the configuration, it's all here. You can scroll through it and see all of this stuff, and way down at the bottom, after all the BGP policy stuff, you've got the configlets generated as well.
25:11 So all of this stuff was generated from Terraform, and it's really nice to be able to play around with. In the actual AI lab at Juniper right now we're running a customer PoC, but I did want to give a shout-out and a thank-you to Chris, as well as Raj Subramanium, who's not here today. Raj has been furiously coding away, adding dashboard support to the Apstra Terraform provider and building an example reference dashboard for AI clusters. He gave me an instance of Apstra just an hour or so ago, and I wanted to show that as well. So now you can actually terraform all of your monitoring: things like the hot and cold spots in the network, or detecting ECMP imbalances, are native in Apstra as intent-based analytics probes, and now you can codify those dashboards for your organization as well.
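The exact resource names this adds to the provider aren't confirmed here, so treat the following as a purely hypothetical sketch of what codified monitoring could look like; `apstra_dashboard` and its attributes are stand-in names, not confirmed provider API:

```hcl
# Hypothetical sketch only: resource and attribute names are stand-ins.
resource "apstra_dashboard" "ai_fabric_health" {
  blueprint_id = apstra_datacenter_blueprint.gpu_fabric.id
  name         = "ai-cluster-health"
  # Widgets would reference the intent-based analytics probes mentioned
  # above, e.g. ECMP-imbalance and hot/cold-spot detection.
}
```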
26:05 That is of course what people doing GitOps and infrastructure as code do, right? If you've got a Kubernetes cluster as code, all of your Prometheus and your Grafana deployment is also probably done as code, so it's the same thing from an Apstra monitoring perspective. Now, this dashboard doesn't have any interesting information on it, because it's not running any traffic; that's the part where we have to actually apply this to our lab and run traffic across it, so stay tuned for that demo, probably something that I'll build in the future. But all of these probes and widgets and things like that were also dynamically generated.