AI Networking is CRAZY!! (but is it fast enough?)
Discover how to effectively harness the power of AI Networking for your data center.
Network Chuck dives into the heart of AI networking to explain the complexities associated with training AI models in your data center. He highlights network bottlenecks that slow down progress, and describes how innovations like RDMA, RoCE, ECN, and PFC are vital to deploying AI seamlessly. Finally, Chuck explains why you should build your AI networking data center using Juniper Networks solutions.
You’ll learn
The factors that affect data center network performance when implementing AI in your network.
The advantages of Ethernet over InfiniBand when implementing AI in your data center network.
How Juniper Networks’ 800 Gig Ethernet switches and Juniper Apstra can eliminate network bottlenecks so you can take full advantage of AI networking for your data center.
Transcript
0:00 We've got a problem. The network is slowing down AI and it's getting mad at us,
0:04 I think. But seriously, AI models like ChatGPT have to be trained.
0:08 They need some knowledge before they can start helping us or take over the
0:11 world, whichever comes first. But how they're trained, this is crazy.
0:14 And also where our problem lies. So picture this,
0:17 thousands of insanely powerful and expensive GPU servers.
0:21 These things are like $350,000 a pop and they're clustered together,
0:24 crunching through hundreds of terabytes of data. Time is money here.
0:27 And also our enemy. We're tweaking everything to make the training go faster.
0:31 We need that GPT-5. And even though we're rocking the most powerful GPUs
0:34 money can buy, it doesn't matter.
0:36 The network can be, and sometimes is, the bottleneck.
0:40 It's like driving a Lamborghini in rush hour traffic.
0:42 The training process relies on these roads, or network connections.
0:46 And not only are we transferring just insane amounts of data from our storage
0:49 servers to our GPU servers. This is, like, massive amounts of data.
0:53 But the GPUs are also communicating constantly with each other,
0:56 working together to complete jobs.
0:58 But if you have any stragglers or traffic jams, that slows down everyone.
1:02 That job completion time, that lag even has a name. It's called tail latency.
1:06 The network can make or break AI training. Meta claimed that last year,
1:10 33% of AI/ML elapsed time was spent waiting on the network.
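To make that concrete, here's a minimal sketch in Python (the step times are made up) of why one straggler sets the job completion time for the whole cluster:

```python
# Minimal sketch (step times are made up): in synchronous training, every GPU
# has to finish its step before the collective operation completes, so the job
# completion time is set by the slowest worker -- the tail.

step_times_ms = [10.2, 10.4, 10.1, 10.3, 38.7]   # one straggler hit by congestion

job_completion_ms = max(step_times_ms)            # everyone waits for the tail
average_ms = sum(step_times_ms) / len(step_times_ms)

print(f"average step time:          {average_ms:.1f} ms")
print(f"actual job completion time: {job_completion_ms:.1f} ms")
print(f"time lost to tail latency:  {job_completion_ms - average_ms:.1f} ms")
```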
1:15 So we've got to do something because at this point AI is taking over the world
1:18 Skynet. It's inevitable. And all we can do is really just help it along,
1:22 score some points. Maybe we'll make nice pets or a battery. I don't know.
1:26 Put me somewhere nice and give me a juicy steak. Now, before we dive in,
1:28 you got to know it's not the network's fault, it's AI's fault.
1:31 It's their big beefy GPU servers. Did it hear me say that?
1:34 If you look inside a data center, the networks are amazing.
1:37 I mean most of the time, Netflix, Facebook, Twitter,
1:40 all these things are still up, and they work really fast. Thank you,
1:42 data centers. But AI, that's tricky.
1:46 AI came in and did some weird stuff.
1:48 Not only does it generate just an insane amount of network traffic,
1:51 like ridiculous amounts,
1:53 it's also kind of weird traffic stuff we're not used to. And honestly,
1:56 we didn't build our networks for this. Now it's not all doom and gloom.
1:59 We actually have built new types of networks to deal with AI traffic.
2:03 It's a little thing called InfiniBand, and that's a whole conversation.
2:06 But honestly, I'm not happy with it. I don't think it's our solution.
2:09 So in this video, I want to break that down.
2:10 Why is AI traffic weird and what are we doing about it?
2:13 And by we, I mean companies like Juniper, not me,
2:15 they are the sponsor of this video and assistant to the regional AI overlord.
2:19 They're scoring some serious points and they're helping us win the war. Oh,
2:22 you didn't hear about that war? Oh, not that war.
2:24 The AI networking wars. And they're intense: InfiniBand versus Ethernet.
2:29 Scared to even talk about it, but we're going to talk about it. So part one,
2:31 here we go. Let's talk about the infestation. Why do I say infestation? Well,
2:34 because AI is kind of taking over data centers and it's becoming kind of a toxic
2:38 work environment. And it's not just networking; power's taken a hit.
2:41 These data centers were not built to handle the amount of power these GPU
2:44 servers need. They're hungry.
2:45 One GPU server can consume the power allotted for an entire rack of servers.
2:50 Cooling is an issue too.
2:51 Before we could just simply air cool all of our stuff with fans,
2:54 but now we're having to look at liquid cooling. Are you serious?
2:57 Like liquid cooling in a data center? But why is that a problem?
2:59 Why are data centers kind of like what do we do? Well,
3:01 it's because data centers were meant for CPU workloads.
3:04 That's how we built them. But then something big happened,
3:06 the industry figured out that the same GPUs that give you amazing graphics and
3:10 heat your mom's basement up to face melting temperatures are also kind of
3:14 amazing at AI stuff. Things like learning and inference workloads. Yeah,
3:18 it's like a hundred times faster at that than a CPU. So the industry said,
3:21 we'll take 20,000. They loaded up GPU servers inside data centers like crazy.
3:25 Everyone's trying to have the best AI and it takes a lot of GPUs to do that.
3:29 And look at these servers. Some of these have eight GPUs.
3:31 No wonder we're having power and cooling issues.
3:33 And then of course we have the networking. Now again,
3:36 the networking in most data centers is amazingly fast,
3:38 but AI is kind of a strange beast and it all has to do with AI's training,
3:42 how it learns. And just like college, it's super expensive. You see,
3:44 for an AI model to mature or grow up, there are three steps. First,
3:48 we gather a ton of data, and in the case of things like ChatGPT,
3:51 we're talking hundreds of terabytes of data: Reddit posts, diary entries,
3:55 normal stuff like that.
3:56 I'm very hungry.
3:58 And then we take all of that data and we spoonfeed it to the GPU servers.
4:03 We're saying here,
4:03 kid, learn. And it's so much, and we're giving it to him so fast.
4:06 These links are incredibly fast. And then after it's done eating,
4:09 I think that's where my analogy ends. It's trained up, it's good.
4:13 I know everything.
4:17 It's in inference mode. It's ready to help us.
4:19 It's ready to help you with that next soup recipe. I dunno what you're doing.
4:22 Now, the first big problem with the training process is just the sheer amount of
4:25 traffic we're dealing with here. And it's not just that. AI networks,
4:28 they can't deal with latency or loss.
4:31 So essentially what we need is infinite bandwidth with no latency or loss,
4:34 something our typical Ethernet networks running TCP/IP really couldn't do.
4:38 So in came a new player called InfiniBand.
4:40 InfiniBand is meant for high-performance computing, or HPC.
4:44 And honestly, it's just brilliant marketing: InfiniBand,
4:47 infinite bandwidth. Genius. And it offered almost just that.
4:50 Right now we're seeing 400 gigabit speeds with InfiniBand. And InfiniBand,
4:55 it's different. It was purpose-built for HPC environments.
4:58 Things like AI training and if you're familiar with networking,
5:01 it doesn't use TCP/IP; it has its own stack, its own transport protocol,
5:05 which does away with a lot of the overhead that TCP/IP introduces.
5:09 So less time is spent processing data packets.
5:11 It's designed to be a lossless fabric,
5:13 which means it's not going to be dropping packets or retransmitting packets.
5:16 And the biggest thing it does and what makes it really powerful is a little
5:20 thing called RDMA,
5:21 remote direct memory access is kind of like that meme with the kid that's
5:25 skipping all the stairs.
5:26 That's what it's doing for our networks and how it transfers data.
5:29 It obliterates so much latency, it's kind of awesome. And honestly,
5:32 latency was a really big issue, which is why RDMA became a thing. Now,
5:35 where is the latency in a typical network with servers?
5:38 It normally looks like this.
5:39 Let's say we have our storage server with all your Reddit posts and diary
5:42 entries and YouTube videos, all the stuff. And it's transferring that stuff,
5:46 that data, to a GPU server. First,
5:48 that GPU server will request data from that server using the TCP/IP stack.
5:52 Keeping in mind this is like a regular network, not InfiniBand.
5:54 Then the CPU of that storage server will process that data,
5:57 retrieving it from memory,
5:58 and then that retrieved data will travel through the network stack of the
6:01 storage server's OS, which involves multiple layers of processing.
6:04 You can kind of see that latency stacking up a little bit.
6:06 The data is then sent over the network. So think ethernet cables and switches.
6:10 And this again is using TCP/IP and all the overhead that it might introduce.
6:14 And then it arrives at the GPU server and it kind of does the same thing in
6:17 reverse.
6:17 It goes through the server's network stack, arrives in memory, the CPU processes
6:21 that data, and it's transferred to the GPU's memory.
6:23 Now this is how most systems work.
6:24 Like it's how your computer transfers data and in most cases it does work fine.
6:28 But in the case of AI, with zero tolerance for latency and the massive amount
6:32 of data, it becomes so painful and noticeable. Latency was a huge problem.
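To make that usual path concrete, here's a minimal sketch of a conventional TCP/IP transfer in Python. The loopback host, port, and 1 MiB payload are just stand-ins; the point is that every byte passes through both kernels' network stacks and gets copied between kernel and user space on each side.

```python
# Minimal sketch of the conventional path RDMA avoids: a plain TCP/IP socket
# transfer. The payload is copied user space -> kernel -> NIC on the storage
# side and NIC -> kernel -> user space on the GPU side, with both CPUs and
# both OS network stacks sitting in the data path. Host/port/payload are made up.

import socket
import threading

HOST, PORT = "127.0.0.1", 50007
payload = b"x" * (1 << 20)                 # 1 MiB standing in for training data

srv = socket.create_server((HOST, PORT))   # bind + listen up front, no race

def storage_server():
    conn, _ = srv.accept()
    with conn:
        conn.sendall(payload)              # copied into the kernel's socket buffer

t = threading.Thread(target=storage_server)
t.start()

received = bytearray()
with socket.create_connection((HOST, PORT)) as sock:
    while chunk := sock.recv(65536):       # copied back out of the kernel, 64 KiB at a time
        received.extend(chunk)

t.join()
srv.close()
print(f"received {len(received)} bytes via both kernels' TCP/IP stacks")
```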
6:37 So here comes RDMA and RDMA is so cool.
6:40 It's all about removing roadblocks and removing latency.
6:43 And here's what it looks like. It's pure magic.
6:45 The GPU server will first initiate an RDMA operation,
6:48 which will allow it to skip a few stairs and by a few,
6:50 I mean a whole stinking lot, directly accessing the system memory of the storage
6:54 server, bypassing the CPU and the OS kernels of both servers. Think about that.
6:59 This is a massive deal because the data's not going through the usual network
7:02 stack and its processing,
7:03 which involves multiple layers of copying and context switching.
7:05 The data is simply transferred directly from the memory of the storage server to
7:10 the memory of the GPU server. It's just like, Hey, there you go. That simple.
7:13 And it's made even faster because of the underlying network, InfiniBand.
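Here's a toy, in-process model of that RDMA flow. It is not the real ibverbs API and every name is made up; it just shows the shape of it: the CPUs do a one-time setup (register memory, exchange keys), and then the data path is a straight memory-to-memory move the remote CPU and OS never touch.

```python
# Toy, in-process model of the RDMA flow (NOT the real ibverbs API; every name
# here is made up). The point: after a one-time setup, the transfer lands
# straight in the destination buffer with no remote CPU, no kernel network
# stack, and no intermediate copies on either side.

storage_memory = {"dataset_shard": bytearray(b"Reddit posts, diary entries...")}
gpu_memory = {}

def register_memory_region(host_memory, name):
    # Control path (CPU involved once): pin the buffer, hand out a remote key.
    return {"buffer": host_memory[name], "rkey": 0x1A2B}

def post_rdma_read(remote_mr, local_memory, dest_name):
    # Data path (modeled as the NIC doing the work): bytes move memory to
    # memory; the storage server's CPU and kernel never see this transfer.
    local_memory[dest_name] = bytes(remote_mr["buffer"])
    return "completion"   # in real RDMA this would show up on a completion queue

mr = register_memory_region(storage_memory, "dataset_shard")
status = post_rdma_read(mr, gpu_memory, "dataset_shard")
print(status, gpu_memory["dataset_shard"][:12])
```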
7:16 You're probably thinking, Chuck, I don't see a problem.
7:18 It sounds like we already solved everything. And I would say for some people, it is,
7:21 if you've got the money and the know-how and the staff, and you don't mind
7:24 overhauling a bit of your network and dealing with a few headaches.
7:27 And that brings us to some problems I have with InfiniBand, because while it does
7:30 sound great, it has some major things for me that I don't like. Number one,
7:33 it's different.
7:34 And there's nothing wrong with different except in the case where it has
7:36 different routers, different switches, different connections, and NICs,
7:40 it has its own stack. It's basically a different type of networking,
7:43 which could be fun to learn honestly.
7:44 But not a lot of network engineers know it, especially compared to ethernet.
7:47 I mean, as a network engineer myself, like any other network engineer,
7:51 you cut your teeth on Ethernet. That's what everybody runs.
7:54 So as someone who's trying to spin up AI networking in their data center,
7:57 it's going to be kind of hard to find talented staff to manage that for you
8:01 because not everyone knows InfiniBand.
8:03 And it can be complex to set up with compatibility being a huge issue because I
8:07 mean, right now most people are running ethernet and if they make the switch to
8:10 InfiniBand, they're switching from ethernet to InfiniBand,
8:13 but they're still going to have ethernet running somewhere and they're going to
8:16 have some compatibility issues. And there are things that can make that work,
8:19 but it's just more complexity. And in the world of networking, or anything in IT,
8:23 complexity is there, but you want to avoid it.
8:25 You want to focus on the simple things that work, and be complex
8:28 only when you have to. My other big beef with it: it'll often, in most cases,
8:31 be more expensive than ethernet switches and routers. And the third,
8:35 I think this is kind of a big one,
8:36 the community is not as big. With Ethernet-based networking,
8:39 if you have a problem, you can pretty much go to any forum out there, Twitter,
8:42 anything, and get some help. With InfiniBand, it's going to be harder,
8:44 not impossible, but definitely harder. So yes, InfiniBand is pretty awesome,
8:48 but it's not the only option. I told you there's a war. Ethernet isn't dead.
8:53 It didn't take this lying down. It saw what InfiniBand did and said,
8:55 you know what? Hold my coffee. And it did some pretty cool stuff.
8:59 First, speed. InfiniBand boasts a lot of speed. Ethernet's got speed too.
9:03 400 gigabit ethernet, we've got it.
9:05 And actually 800 gigabit ethernet is on the horizon.
9:08 Actually 800 gigabit ethernet is here.
9:10 Juniper announced their 800 gig ethernet boxes on January 29th. So we got that.
9:14 And 1.6 terabits per second is on the horizon now,
9:18 which is just insane to think about. I just installed 10 gigabit.
9:21 Now the big thing with InfiniBand was RDMA.
9:23 That memory-to-memory access is killer, bypassing CPU stuff and all that. Ethernet
9:28 was like, you know what? We can do it too.
9:30 RoCE: RDMA over Converged Ethernet. It's a thing. Look it up.
9:34 Or I'll just tell you about it right now actually.
9:36 That's pretty much what it is. It's RDMA over Ethernet. It had a v1,
9:40 now we're at v2. V1 could do layers one and two of the network stack, and then
9:44 v2 can go over layer three. So that sounds awesome, right?
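As a quick sketch of the difference, using the standard RoCE identifiers (Ethertype 0x8915 for v1, UDP destination port 4791 for v2), here's what each version puts on the wire:

```python
# Quick comparison of how RoCE v1 and v2 frame RDMA traffic. v1 rides directly
# inside an Ethernet frame (its own Ethertype, 0x8915), so it stays within one
# layer-2 domain; v2 wraps the same transport in UDP/IP (well-known UDP
# destination port 4791), which is why it can be routed at layer 3.

ROCE_V1 = ["Ethernet (Ethertype 0x8915)", "InfiniBand transport headers", "RDMA payload"]
ROCE_V2 = ["Ethernet", "IP", "UDP (dst port 4791)", "InfiniBand transport headers", "RDMA payload"]

for name, stack in [("RoCE v1 (L2 only)", ROCE_V1), ("RoCE v2 (routable)", ROCE_V2)]:
    print(f"{name}: " + " | ".join(stack))
```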
9:47 But it does kind of bring about a problem or at least a question.
9:50 We know that AI networking requires a lossless fabric.
9:54 It has no patience for latency. We can't be losing packets left and right.
9:57 And especially with running RDMA, you got to have a lossless fabric.
10:00 Ethernet inherently is not lossless, it's lossy. Let's break that down.
10:03 In a typical Ethernet network with TCP/IP, when things get hairy, whether
10:07 it's congestion like a traffic jam,
10:09 too much traffic going across one connection,
10:12 how TCP will handle that is by dropping packets and then retransmitting them.
10:15 This is the overhead we're talking about with TCP/IP. That causes latency.
10:19 And even with RoCE, RDMA over Converged Ethernet,
10:21 which will bypass the higher layers of TCP/IP,
10:23 it still relies on the physical and data link layer to get across and without
10:27 anything there to help it, there's going to be congestion.
10:30 And because it's sitting there in layers one and two,
10:32 it's not relying on TCP to handle any kind of retransmission or error handling.
10:35 So there's nothing stopping it from getting so congested, like rush hour
10:38 traffic where you're just like, not even stop and go, just stop.
10:41 So what has ethernet done to help with that? Two amazing things. First,
10:45 ECN or explicit congestion notification.
10:47 This is kind of magic and it's a very clever way to do this between two devices
10:51 that support it, which will be anything we deploy in AI networking.
10:54 They can essentially let each other know when things are getting too congested.
10:57 Getting back to our analogy, like a road or a traffic jam. If one end's like,
11:01 hey, I see a lot of cars coming my way, it's going to be a traffic jam,
11:04 I know it, it'll start marking cars. Put a sticker on that says, hey,
11:07 congestion's coming. Sticker: congestion's coming. We're marking packets:
11:10 congestion's coming.
11:12 So the other end will know about it and then the other end will go, oh, okay,
11:15 well I'm just going to slow it down a bit.
11:16 I'm not going to let so many cars through.
11:18 I'm going to slow it down to avoid that congestion where packets would be
11:22 dropped. And we don't have a lossless fabric. That alone is pretty cool.
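Here's a toy simulation of that ECN behavior, with all the numbers made up: a switch queue starts marking once it gets deep, and the sender halves its rate when the marks come back, so the buffer never overflows and nothing has to be dropped.

```python
# Toy simulation of the ECN idea (all numbers made up): a switch queue starts
# marking packets once its depth crosses a threshold, the receiver echoes the
# marks back, and the sender slows down *before* the buffer overflows and
# packets would have to be dropped.

QUEUE_LIMIT = 100        # packets the switch buffer can hold before dropping
MARK_THRESHOLD = 40      # start setting the congestion bit above this depth
DRAIN_PER_TICK = 8       # packets the egress link can serve each tick

send_rate, queue_depth, dropped = 16, 0, 0

for tick in range(20):
    marked = queue_depth > MARK_THRESHOLD
    if marked:
        send_rate = max(1, send_rate // 2)   # back off when marks come back
    else:
        send_rate += 1                       # probe for more bandwidth otherwise
    queue_depth += send_rate
    if queue_depth > QUEUE_LIMIT:
        dropped += queue_depth - QUEUE_LIMIT # this is exactly what ECN avoids
        queue_depth = QUEUE_LIMIT
    queue_depth = max(0, queue_depth - DRAIN_PER_TICK)
    print(f"tick {tick:2d}  rate={send_rate:2d}  queue={queue_depth:3d}  "
          f"marked={marked}  dropped={dropped}")
```

Run it and you'll see the queue hover around the mark threshold with zero drops, which is the whole point.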
11:25 But then we also have this PFC or priority flow control.
11:29 PFC can pause certain types of traffic from flowing while allowing others.
11:33 So in the scenario of AI training,
11:35 it'll prioritize any training data traffic being sent over and pause other flows
11:40 to avoid that congestion.
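And here's a toy sketch of PFC with made-up classes and thresholds: the port tracks a buffer per priority class and pauses only the class that's filling up (a hypothetical bulk class here), while the training class keeps flowing.

```python
# Toy sketch of PFC (classes and thresholds made up): the port tracks a buffer
# per priority class and sends a PAUSE frame for just the class that's filling
# up, instead of dropping its packets. Here a hypothetical bulk class gets
# paused while the training class keeps flowing.

XOFF, XON = 80, 40                      # pause above XOFF, resume below XON
DRAIN_PER_TICK = 6                      # packets served per class per tick

buffers = {"roce_training": 0, "bulk_other": 0}
paused = {"roce_training": False, "bulk_other": False}
offered = {"roce_training": 3, "bulk_other": 20}   # packets offered per tick

for tick in range(16):
    for cls in buffers:
        if not paused[cls]:             # senders honor PAUSE frames per priority
            buffers[cls] += offered[cls]
        buffers[cls] = max(0, buffers[cls] - DRAIN_PER_TICK)
        if buffers[cls] > XOFF and not paused[cls]:
            paused[cls] = True          # pause only this priority class
        elif buffers[cls] < XON and paused[cls]:
            paused[cls] = False         # resume (XON) once the buffer drains
    print(f"tick {tick:2d}  buffers={buffers}  paused={paused}")
```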
11:41 So PFC and ECN will work together to create a smooth road for your
11:45 Lamborghini just to speed down. So at this point,
11:48 Ethernet and InfiniBand are neck and neck. RDMA? Both have it. A lossless fabric?
11:52 They both do that. Now you just got to cast your vote: Ethernet or InfiniBand.
11:57 Actually, I would love to hear in the comments below what you think.
11:59 You heard my argument,
12:00 maybe do a bit of research and let me know if I'm missing something.
12:03 But like Juniper, I am team ethernet.
12:05 I'm rooting for it to win because next to IP,
12:07 Ethernet is the world's most widely adopted network technology.
12:10 It dominates every type of networking including data center networks.
12:13 Everyone already uses it. Network engineers know and love it.
12:16 And there's really no need to jump to a more proprietary protocol. Now,
12:19 I'm not saying InfiniBand is proprietary because it's not,
12:22 but it doesn't feel as open as ethernet, not as much competition,
12:24 which does affect price. And we haven't even talked about the future.
12:27 Who's going to innovate faster? Ethernet or InfiniBand?
12:30 My vote would be for Ethernet. It can innovate at the speed of the industry.
12:34 Now, to close this out, I want to play a little game.
12:35 What would my data center look like? My AI data center, if I had to build one,
12:39 what would be my diagram? Boom. Here it is. First, it's a Clos fabric,
12:43 non-blocking, kind of a fat tree topology. I know I'm throwing out words.
12:47 I've got a video about data center networking if you want to check it out.
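For the fabric itself, here's a tiny illustrative sketch (switch counts and the flow name are made up) of the leaf-spine Clos wiring rule: every leaf uplinks to every spine, so any two servers are at most leaf-spine-leaf apart and flows can be spread across all the spines, ECMP-style.

```python
# Tiny sketch of the leaf-spine (Clos) wiring rule: every leaf switch uplinks
# to every spine switch, so any two servers on different leaves are exactly
# leaf -> spine -> leaf apart, and flows can be spread across all the spines.
# The switch counts and the flow name here are made up for illustration.

SPINES = [f"spine{i}" for i in range(4)]
LEAVES = [f"leaf{i}" for i in range(8)]

uplinks = [(leaf, spine) for leaf in LEAVES for spine in SPINES]  # full mesh

def path(src_leaf, dst_leaf, flow_id):
    if src_leaf == dst_leaf:
        return [src_leaf]                        # same top-of-rack switch
    spine = SPINES[hash(flow_id) % len(SPINES)]  # ECMP-style spreading per flow
    return [src_leaf, spine, dst_leaf]

print(f"{len(uplinks)} leaf-to-spine links; every server-to-server path is at most 3 switch hops")
print(path("leaf0", "leaf5", flow_id="gpu12->gpu87"))
```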
12:49 For my spine and leaf switches, I'm rocking Juniper 800 gig Ethernet.
12:53 I can go with either the QFX series or the PTX and look at these guys. I mean,
12:56 they're insane. And then to make sure my network is always smarter than I am,
12:59 I'll be running some software defined networking.
13:01 Juniper has this thing called Apstra.
13:03 It'll run the show and make your network awesome.
13:05 It'll make sure that things like PFC and ECN are configured correctly and tuned
13:10 based on the current needs of the network so you don't have to worry about it.
13:12 And of course, they'll have the full suite of DCB or data center bridging,
13:15 which is kind of a suite of protocols that include PFC and ECN,
13:19 and some other stuff I'll list right here. So what team are you?
13:21 Team Ethernet or Team InfiniBand? Let me know below. And again,
13:24 shout out to Juniper for sponsoring this video and making videos like this
13:27 possible. I love getting to talk about networking and data centers and ai.
13:30 So let me know if you want to see more of that. And of course,
13:32 if you want to learn more about what Juniper's doing with AI and other crazy
13:35 switches and routers that cost more than your entire neighborhood,
13:38 check out the link below. I'll catch you guys next time.