The Power of Networking in the Cloud with Tom Scholl

Episode Summary

A cloud service is only as good as the team of network engineers who keep it up and running. In this episode, AWS Vice President and Distinguished Engineer Tom Scholl breaks down the importance of security and the legwork needed to support the company’s massive infrastructure. Corey picks Tom’s brain while singing the praises of the AWS DDoS Protection Team, marveling at the scale of the modern internet, and looking ahead to the next generation of network engineers who could land at AWS. If you’ve ever wondered about the inner workings of the AWS cloud, then this is the discussion for you.

Episode Show Notes & Transcript

Show Highlights:

(0:00) Intro
(1:09) The Duckbill Group sponsor read
(1:42) The importance of a good network for AWS
(3:38) Evolution of networking
(6:03) Efficiency of the AWS DDoS Protection Team
(7:29) AWS Cloud and weathering DDoS attacks
(10:03) Policing network abuse
(12:08) Walking the SES tightrope and network attacks
(15:00) Ensuring the security of the internet
(17:53) The Duckbill Group sponsor read
(18:37) Scale of the modern internet
(20:47) Migrating the AWS network firewall
(21:54) Internal network scaling
(24:27) Preparing for DDoS disruption
(29:14) Finding the next generation of network engineers
(32:15) Where to learn more about AWS cloud security

About Tom Scholl:

Tom Scholl is a VP and Distinguished Engineer at Amazon Web Services (AWS) in the infrastructure organization. His role includes working on AWS’s global network backbone, as well as focusing on denial of service detection and mitigation systems. He has been with AWS for over 13 years.

Prior to AWS, Tom was a Principal Network Engineer at nLayer and AT&T Labs (formerly SBC Telecom). He also previously held network engineering roles at OptimalPATH Digital Network and ANET Internet Services.

Links Referenced:

AWS Security Blog: https://aws.amazon.com/blogs/security/
How AWS threat intelligence deters threat actors: https://aws.amazon.com/blogs/security/how-aws-threat-intelligence-deters-threat-actors/
Using AWS Shield Advanced protection groups to improve DDoS detection and mitigation: https://aws.amazon.com/blogs/security/using-aws-shield-advanced-protection-groups-to-improve-ddos-detection-and-mitigation/
AWS re:Inforce 2024 presentation on Sonaris and MadPot: https://www.youtube.com/watch?v=38Z9csvyFDg
NANOG 2023 presentation on AWS networking infrastructure: https://www.youtube.com/watch?v=0tcR-iQce7s
AWS re:Invent 2022 presentation on AWS networking infrastructure: https://www.youtube.com/watch?v=HJNR_dX8g8c
AWS re:Invent 2022 presentation on Scaling network performance on next-gen Amazon EC2 instances: https://www.youtube.com/watch?v=jNYpWa7gf1A&t=1373s
IEEE paper on Scalable Reliable Datagram (SRD): https://ieeexplore.ieee.org/document/9167399

Sponsor
The Duckbill Group: https://www.duckbillgroup.com/

Transcript

Tom Scholl: I mean, it's definitely, you know, in the many, many terabits of capacity. And it's different layers of the network, right? Because you have to think from an availability zone, a data center, you know, how do you connect this to the rest of the world, right? So there's, you know, large amounts of capacity within a particular AWS region.

And then you actually have to interconnect that too.

Corey Quinn: Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is Tom Scholl, VP and Distinguished Engineer at AWS. Tom, thanks for joining me. AWS, haven't heard of those folks. What do you do?

Tom Scholl: Hey, thanks for having me. I am an engineer who focuses on our network, within our overall infrastructure organization.

So that includes our data centers, our hardware engineering, our supply chain, some of our network edge services, and our network infrastructure, particularly the anti-DDoS use case, as well as some of our CDN work. More specifically, I focus on our network infrastructure: our global backbone, internet transit, and peering.

And I spend a fair amount of my time in DDoS protection and disruption.

Corey Quinn: This episode is sponsored in part by my day job, the Duckbill Group. Do you have a horrifying AWS bill? That can mean a lot of things.

Predicting what it's going to be. Determining what it should be. Negotiating your next long-term contract with AWS. Or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is. To learn more, visit duckbillgroup.com. Remember, you can't duck the Duckbill Bill.

And my CEO informs me that is absolutely not our slogan.

There's, I think, a lack of awareness, societally, around the value of the network to something like this. Without it, AWS becomes probably the world's largest collection of space heaters.

Because without being able to talk to one another, computers don't tend to do a whole heck of a lot. Networking used to be incredibly expensive and incredibly top of mind for folks, because networks would break and things would stop being able to communicate. But for most of the world, it's gone to the level of being a utility: when you turn on the faucet in the bathroom, you don't wonder whether water is going to come out this time.

It just does. If it ever doesn't, that's momentous. And networks have sort of gone the same way, at least from the business user perspective, in no small part due to people who are doing the things that you do. How did you get into the space?

Tom Scholl: Well, it all started back in the 90s. I used to dial into BBSes and started learning a lot about Unix and telephony and those sorts of systems.

And eventually I got a job at an ISP, where you had to be a jack of all trades: you had to know Unix sysadmin work, run the Unix RADIUS servers and mail servers, and, hey, learn some of that network stuff too, in addition to being tech support. So you basically had to know it all, end to end, right?

And there was nothing you could say no to just because it wasn't your specialty. I did that, and eventually got a job at the phone company in the Chicagoland area, Ameritech, which later got acquired by SBC, which had Pacific Bell, SNET in Connecticut, and Southwestern Bell. I worked on building our broadband network and our internet infrastructure and got involved in the whole networking scene with NANOG and peering, the whole ecosystem.

Basically it was building large networks. Then we eventually acquired AT&T, which was an even bigger network on top of that, and I did that for a fair amount of time. Then around 2010 I joined Amazon, left briefly, came back, and have been working on the Amazon side ever since, primarily on our border network, which is basically the ISP, the transit provider, of Amazon that connects our data centers: inter-region connectivity and connectivity to and from the internet. And for the last four years I've been spending a bit more time on the DDoS space.

Corey Quinn: I had the privilege of watching your talk at NANOG in Kansas City a month or two before this recording, and it was interesting seeing how so much of what you do, especially these days, seems to shy away from a lot of the technical countermeasures for DDoS and lean much more heavily into being a human being: reaching out to network operators on the other side of the line when you start seeing bad behavior coming from their networks.

Has that been something that's always been the case and I've just been blind to it? Is this an evolution in networking culture?

Tom Scholl: So I think in the networking culture, there's always been a strong operator community, and you build a lot of relationships and friendships over time where, if there's a problem in another person's network, you are that Rolodex, right? For reaching out to somebody at a particular CDN or cloud or hosting provider and saying, "Hey, we've got an issue and we need to troubleshoot it." And so that transitioned really well into the DDoS space, where you would see the sort of abuse that might be occurring from different parts of the world.

It's like, well, who do I know there? I know some people on the networking side; let me go reach out to them. And depending on the nature of the issue, it is human contact to basically engage somebody and say, "Hey, can you route me to the right person?" So we've been doing it for decades on the network side when it came to troubleshooting, and with DDoS, it's a natural evolution to leverage those same relationships to make progress.

Corey Quinn: It's one of those areas where there doesn't seem to be a lot of public awareness that all the big hyperscalers, who compete with each other in cutthroat ways in many business respects, are very much working together against, I guess, the dark forces that attempt to destroy the internet: on security, on abuse, on network peering.

There's very much a sense of we're all in this together in every conversation I've been a part of.

Tom Scholl: That's correct. There's a very lively operator community: reaching out to people, engineer to engineer, operator to operator. When you have a problem, it's like, hey, there's a mutual thing.

It could be a mutual customer of ours, right? Or whatever it might be, but, you know, we want to get the packets to flow. We're all in this together. Let's find a way to work the problem and drive resolution. And a lot of that can be directly through email or other back channels, like Slack, where you need to reach out to people.

We've certainly found issues in other people's networks where it's like, "Hey, this thing is on fire, you need to take a look at it." In addition, we have formal ways to engage individual NOCs and things like that. But definitely, having those relationships pays off quite a bit when it comes to network abuse, DDoS, stuff like that.

Corey Quinn: AWS offers Shield, a DDoS protection product, and the basic level is rolled out to most of your endpoints. Customers benefit from that automatically. There's also Shield Advanced, which comes in at a fixed fee of $3,000 a month, which at enterprise scale is a drop in the bucket. It also does some weird economic things, like changing how WAF rules wind up being charged. But what I've found from customers who've had it and who have suffered from DDoS issues historically is that, far and away, the biggest benefit they cite is being able to coordinate more closely with the AWS DDoS prevention team.

Every story I've heard about those folks has been absolutely top flight, and that's rare, because usually when someone is undergoing an attack, they're not in a good mood. I'm just going to say it: they're angry, they're stressed out, they're wondering whether the website will ever work again. So they're inclined to lash out.

But I've heard nothing but positive stories about the team's work.

Tom Scholl: That's great to hear. And, uh, I'm sure the team will be delighted to hear that.

Corey Quinn: Because I assure you, if people have negative things to say, they find their ways to me. I'm sort of a negativity magnet by happenstance, I suppose.

Tom Scholl: No, I mean, I work with that team really closely, and they basically protect all of Amazon, in addition to customers who have, let's say, Shield Advanced, where they directly engage with them, identify the attack, come up with mitigations, and work with customers pretty closely.

So it's definitely an area that we're, we're, uh, proud to have. And, uh, definitely enjoy working with them closely.

Corey Quinn: My experiences with DDoS historically, and I know when you start a sentence like that, it sounds like it's going to go really negatively. But no, I was always firmly on the victim's side of it. I was on network staff for a time for the Freenode IRC network, which was an ever-popular target because, well, what am I going to do today?

I'm just going to give people grief on the internet, because. So there were constant challenges in dealing with SYN floods, and then more sophisticated attacks as time went on. And you saw it not just in my hobbies there; I would see it with companies where, in some cases, suspected competitors would wind up launching giant attacks at unprotected endpoints.

And it was easier to do early on, when someone had a few servers sitting in a rack in their office. You could overwhelm links pretty easily. As hyperscaling started to be a thing and people started realizing, "Oh, maybe there's something to this cloud thing," at least publicly, it seemed like a lot of those problems kind of went away.

Given that you have been talking about this for a while, including on stage to very smart network people like yourself, I get the sneaking suspicion that people just didn't give up on this. There's a, there's an awful lot of very hard work that you and people like you are putting into this. How has it evolved?

Tom Scholl: Definitely. When you think about DDoS, there are different types out there. There's what we call Layer 4 DDoS, which is basically either bandwidth saturating, bits-per-second heavy, or packets-per-second heavy, which is really there to exhaust state, right?

So traditionally, that's how we've thought about DDoS. In the last several years, there have also been Layer 7 request floods, which are basically HTTP GET and PUT attacks that overwhelm a target from a requests-per-second perspective. But what has changed is that in the last several years, there's been much more focus on identifying the infrastructure that's being used to launch these attacks, focusing on disrupting it, and engaging with the actual sources of this traffic to get them shut down.

And that comes in different forms, right? It could be spoofed traffic, which we can talk a little more about: with our global backbone and our global regions, the number of networks we connect to gives us insight into where spoofed traffic comes from. And that's a unique one, because that's been a 20-plus-year issue.

I know that goes back to IRC and smurf attacks and things like that people used to do. So that was a unique area where we stepped up and collaborated with other networks to actually chase it down. And then there are other areas where we look at things like botnets: finding the command and control servers, targeting them, and reaching out to the hosting providers and the domain registrars to get them shut down.

Those are some examples of where we've started some of that work and pushed pretty aggressively on it.
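To make the Layer 4 versus Layer 7 distinction concrete, here is a minimal Python sketch that labels a traffic sample by which dimension it overwhelms: bits per second, packets per second, or requests per second. The thresholds and the `TrafficSample` and `classify` names are hypothetical, chosen for illustration only; production detection systems baseline each protected resource rather than relying on fixed global numbers.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only.
BPS_THRESHOLD = 10e9      # 10 Gbit/s: bandwidth-saturating (Layer 4)
PPS_THRESHOLD = 5e6       # 5 million packets/s: state-exhausting (Layer 4)
RPS_THRESHOLD = 50_000    # 50,000 HTTP requests/s: request flood (Layer 7)


@dataclass
class TrafficSample:
    bits_per_second: float
    packets_per_second: float
    requests_per_second: float


def classify(sample: TrafficSample) -> list[str]:
    """Return every attack category whose threshold this sample exceeds."""
    kinds = []
    if sample.bits_per_second > BPS_THRESHOLD:
        kinds.append("L4 volumetric (bps)")
    if sample.packets_per_second > PPS_THRESHOLD:
        kinds.append("L4 packet rate (pps)")
    if sample.requests_per_second > RPS_THRESHOLD:
        kinds.append("L7 request flood (rps)")
    return kinds


# A small-packet flood: modest bandwidth but a very high packet rate.
print(classify(TrafficSample(2e9, 8e6, 1_000)))  # ['L4 packet rate (pps)']
```

A single sample can trip more than one category, which is why the function returns a list rather than a single label.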

Corey Quinn: I started my career in tech running what I thought were large-scale email systems; compared to what you folks are doing, I was at the scale of "haha, that's cute," at a university. But managing a lot of the spam that was coming in was sort of a hobby horse of mine.

I wound up getting dragged fairly far down that path. But today, if I were to set up an email server somewhere on the internet and turn it into an open relay or start sending ridiculous spam out of it, it would not be very long at all before every provider, within some rounding error, would no longer accept traffic from that server.

They would effectively black-hole it. It would wind up on a bunch of blocklists, and that would be the end of it. I'm curious why that pattern doesn't extend to a lot of the network providers who do a poor job of policing the traffic they emit. Is that just because they're so big that it's difficult for them to see it all from their side?

Is it that they're too big to block, because people are just not going to block AT&T, for example? Or is there something more to it?

Tom Scholl: I mean, I think every network has its own policy for how it deals with this. Some networks are proactive and look at whether they're sending any abuse out, and you definitely find cases where other networks could do a better job.

I know from the AWS perspective, we certainly have various detection and mitigation capabilities if we see anything anomalous leaving our network. And one of the things we've up-leveled in the last few years is looking for communication with command and control servers that might be out there on the internet and actually blocking that communication, which prevents resources from launching attacks in the first place, as well as reaching out to customers and saying, "Hey, you're talking to this thing," which our trust and safety team will go and engage on.

So I think we do a really good job of that sort of preventative work, where a number of other networks out there just haven't gotten to that area. Maybe the abuse team isn't funded appropriately. I can't really speak to how other networks operate, but it's definitely a high priority for us.
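The outbound side Tom describes, blocking communication with known command and control servers and then reaching out to the customer, might look roughly like the following sketch. Everything here is an assumption for illustration: the blocklist entries are RFC 5737 documentation addresses, the instance IDs are made up, and `check_egress` is a hypothetical helper, not an AWS API.

```python
import ipaddress

# Hypothetical C2 indicators; these are RFC 5737 documentation addresses.
C2_BLOCKLIST = {
    ipaddress.ip_address("198.51.100.23"),
    ipaddress.ip_address("203.0.113.77"),
}


def check_egress(flows):
    """Split outbound flows into allowed flows and blocked C2 contacts.

    flows: iterable of (instance_id, destination_ip) tuples.
    Returns (allowed, flagged): allowed is a list of flows that pass,
    flagged maps instance_id -> set of C2 addresses it tried to reach.
    """
    allowed, flagged = [], {}
    for instance_id, dst in flows:
        if ipaddress.ip_address(dst) in C2_BLOCKLIST:
            # Block the flow and record which resource tried to reach C2,
            # so its owner can be contacted afterwards.
            flagged.setdefault(instance_id, set()).add(dst)
        else:
            allowed.append((instance_id, dst))
    return allowed, flagged


allowed, flagged = check_egress([
    ("i-0aaa", "198.51.100.23"),  # known C2 address: blocked
    ("i-0bbb", "192.0.2.10"),     # ordinary destination: allowed
])
print(flagged)  # {'i-0aaa': {'198.51.100.23'}}
```

Blocking on the destination stops the resource from receiving attack commands at all, which is the "prevents resources from launching attacks in the first place" part; the flagged map is what would feed the customer outreach.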

Corey Quinn: Something that I do want to call out is that in the early days, even before SES came out, the EC2 IP ranges were, in some cases, a source of abusive traffic. And this is not necessarily any fault of your own. When you let anyone start using computers instantly with the swipe of a credit card, that's an incredibly powerful thing.

Not everyone is a good actor trying to build a business. Sometimes it's just, "I want everyone to see my marketing," and it devolves massively from there very quickly. And you see that tension where people sometimes find it challenging to get out of the SES sandbox for some workloads. Having worked with the SES team enough, I am of the opinion that they make the right call most of the time.

But in the early days, AWS's traffic, especially once SES launched, was viewed in the anti-spam community with some suspicion and distrust. I think on some level that's probably a function of scale: well, they're too big to really be able to communicate with anyone over there, so of course they're going to be a bad actor.

I don't see that anymore. There has been a tremendous focus somewhere on tamping out that behavior, but it's also happening in a way that doesn't inconvenience legitimate customers. That feels like an impossible tightrope to walk, but somehow you folks have done it.

Tom Scholl: Yeah, I don't work with the SES team that closely, but I'm aware of some of the efforts they've made in terms of their controls and the detection systems they've built to prevent that sort of activity.

We could follow up with you with more details on some of that.

Corey Quinn: There's more to it than, I believe, just email. That's the one that I have the best experience with. But I do not hear particular stories. When you hear about the various forms of novel network attacks and the rest, and you start looking at some of the traces that wind up getting published, here are the bad actor IPs that are helping to slam this thing.

I don't see AWS represented nearly as much as I would expect relative to the sheer number, the sheer size of the IP space that you folks control. There is clearly something highly proactive going on. That is making the internet a better place.

Tom Scholl: One of the things we've talked about in the last year is a system called MadPot, which is basically our honeypot system that we developed internally several years ago. It lets us basically be a sponge for any sort of negative activity out there.

So we can ingest that data, process it, and determine where it's coming from. And if it's coming from internal resources, such as from EC2, we engage our trust and safety teams directly to reach out to customers or take other mitigating action.

So we have some systems in place to detect that, proactively engage, and take action. That's just one example; the MadPot system has been used for a variety of other things on the DDoS side. But that, along with some of the work from our trust and safety team, is another example of how we identify and mitigate any sort of outbound malicious activity.
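A rough sketch of the triage step described above, under heavy assumptions: a honeypot records the source addresses that hit it, and each source is sorted into either internal (to be handed to a trust and safety review) or external (to be handled through outreach to other networks). The prefixes below are RFC 5737 documentation ranges standing in for a provider's own address space, and `triage_honeypot_hits` is a hypothetical name; nothing here reflects how MadPot is actually built.

```python
import ipaddress
from collections import Counter

# Hypothetical stand-ins for a provider's own address space
# (RFC 5737 documentation ranges).
INTERNAL_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def triage_honeypot_hits(source_ips):
    """Count hits per source and split them into internal and external sources.

    Internal sources would go to a trust and safety review; external sources
    would feed outreach to the originating networks.
    """
    counts = Counter(source_ips)
    internal, external = {}, {}
    for src, hits in counts.items():
        addr = ipaddress.ip_address(src)
        if any(addr in net for net in INTERNAL_PREFIXES):
            internal[src] = hits
        else:
            external[src] = hits
    return internal, external


internal, external = triage_honeypot_hits(
    ["192.0.2.5", "192.0.2.5", "203.0.113.9"]
)
print(internal)  # {'192.0.2.5': 2}
print(external)  # {'203.0.113.9': 1}
```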

Corey Quinn: It's been said for a long time that, uh, at AWS, security is job zero. And I've always interpreted that to mean protecting customers from external bad actors, the end. And then also in many cases from hypothetical insider attacks at AWS. Here's how we guarantee that even Amazonians can't access your data when it's stored here.

Countless white papers on this, to the point where, okay, if there's something inaccurate in here, I'm certainly not going to be the one to find it. I take that at face value just based on the sheer amount of work you folks have done. But a lot of the work that you're doing seems to be aimed not just at protecting existing customers, but also at the larger internet's well-being as a whole.

Is that accurate? Is that a wildly naive Pollyanna optimistic style misreading of the situation?

Tom Scholl: No, that's accurate. We went on this journey around 2020, which is when I started pivoting into the DDoS space, and it was about protecting not just AWS infrastructure, but our customers.

But, you know, looking at the data and collaborating with other external networks, a few of us together said, "We can actually take this further. Let's not just observe it and block it; we can take some actions here that'll be good for the internet as a whole."

And so that's how we started looking at those three different silos of attack traffic. We saw spoofed traffic coming into our network through peers, so let's go directly engage with that peer and say, "Can you trace the spoofing back, filter it, and prevent it?"

And just make that a daily habit, right? Now, that one's a little more complicated, because you have to engage with networks externally and explain to them what spoofing is. There are a lot of networks. Networks have grown. People who were there back in the day, and who might be more familiar with it, aren't there anymore.

So you also have to get over the hump of explaining, with pictures: "Hey, this is what spoofed traffic is. Yes, we know that's not your IP. Can you use your NetFlow tooling to go and figure this out?" So that was one area. And then when it came to botnets, it was just like, well, we've got our MadPot systems.

We can find where these botnet command and control servers are and the domains they're using. We can automate it and generate the notices to these hosting providers to say, "Here's the data about what's on here. It's issuing attack commands to however many thousands of resources around the world.

Please take this down." And that also goes into the Layer 7 side, where you have booters and stressors. We didn't really get too much into where these attacks come from, but the booters and stressors set up a number of machines, get open proxy lists, and just basically blast away.

And so you could try to mitigate all the proxies on the internet, or it could be better to go to the source that's actually generating the traffic and focus on that. And it was really just a few of us together. It wasn't on anyone's roadmap, really. We said, "This is something we should just go and do."

Let's get it going. And measuring the impact it's had has been pretty exciting.
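The spoofing check Tom describes can be sketched as a simple source-address test: for traffic arriving from a given peer, does the source address fall within the prefixes that peer (or its customers) could legitimately use? This is similar in spirit to the source address validation that BCP 38 asks networks to apply at their own edge, viewed here from the receiving side. The ASNs come from the RFC 5398 documentation range, the prefixes are RFC 5737 documentation ranges, and `PEER_CONE` and `spoofed_sources` are hypothetical; real cone data would be derived from sources such as BGP announcements and routing registries.

```python
import ipaddress

# Hypothetical "cone" per peer: prefixes the peer and its customers may
# legitimately use as source addresses. ASNs are from the RFC 5398
# documentation range; prefixes are RFC 5737 documentation ranges.
PEER_CONE = {
    64500: [ipaddress.ip_network("198.51.100.0/24")],
    64501: [ipaddress.ip_network("203.0.113.0/24")],
}


def spoofed_sources(samples):
    """Return flow samples whose source falls outside the sending peer's cone.

    samples: iterable of (peer_asn, source_ip) tuples, e.g. from sampled
    flow records at the network edge.
    """
    suspicious = []
    for peer_asn, src in samples:
        addr = ipaddress.ip_address(src)
        cone = PEER_CONE.get(peer_asn, [])  # unknown peer: nothing is expected
        if not any(addr in net for net in cone):
            suspicious.append((peer_asn, src))
    return suspicious


print(spoofed_sources([
    (64500, "198.51.100.7"),  # inside AS64500's cone: looks legitimate
    (64500, "203.0.113.9"),   # belongs to AS64501's space: likely spoofed
]))  # [(64500, '203.0.113.9')]
```

A flagged sample like the second one is where the conversation Tom describes starts: the source address is not the peer's, so the peer is asked to trace where those packets enter its network.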

Corey Quinn: Here at the Duckbill Group, one of the things we do with, you know, my day job, is help negotiate AWS contracts. We just recently crossed five billion dollars of contract value negotiated. It solves for fun problems such as: how do you know that the contract you have with AWS is the best deal you can get?

How do you know you're not leaving money on the table? How do you know that you're not doing what I do on this podcast and on Twitter constantly and sticking your foot in your mouth? To learn more, come chat at duckbillgroup.com. Optionally, I will also do podcast voice when we talk about it. Again, that's duckbillgroup.com.

One thing that I continually have to remind myself of is the sheer scale of the modern internet. You folks recently announced Direct Connect availability in some locations at 400 gigabits per second, which is just monstrously fast. Now, I can make jokes because of how I see the world, in terms of data transfer meaning money, but ignoring the economic impact of that entirely, consider the sheer scale of peering between AWS and Comcast, given the disturbing proportion of the internet that, sometimes it feels like, you and your peers represent.

And the sheer volume of traffic that must be. It almost beggars belief to even picture that sense of scale. It feels like, even as big as I think it is, the reality is almost certainly much larger than that.

Tom Scholl: No, I mean, it's definitely, you know, in the many, many terabits of capacity, and it's different layers of the network, right?

Because you have to think from an availability zone, a data center: how do you connect this to the rest of the world, right? So there are large amounts of capacity within a particular AWS region. And then you actually have to interconnect that too, right? And so a lot of our teams focus on our backbone network topology: what routes do you need to set up backbone links on, and understanding diversity, because cables are going to get cut, terrestrial or subsea, right? So how much additional capacity do you need to provision on alternate paths to plan for these cuts, where a terrestrial cut might be short-lived, a day or two, whereas a subsea cut could last weeks or months?
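The alternate-path question can be framed as a single-failure (N-1) check: cut each path in turn and see whether the remaining paths still carry the demand. The minimal sketch below assumes a hypothetical set of parallel paths between two sites with made-up capacities in Gbit/s; real backbone planning also has to account for shared risk (for example, several fibers in one conduit or cable system) and routing constraints, which this ignores.

```python
# Hypothetical parallel paths between two sites, capacity in Gbit/s.
PATHS = {
    "terrestrial-north": 1200,
    "terrestrial-south": 800,
    "subsea-a": 1600,
}


def worst_single_cut_shortfall(paths, demand_gbps):
    """Return (worst_path, shortfall_gbps) over all single-path failures.

    For each path, assume it is cut and check whether the remaining paths can
    still carry the full demand. The shortfall is the extra capacity the
    alternates would need to survive that cut; (None, 0) means every single
    cut is survivable.
    """
    total = sum(paths.values())
    worst_path, worst_shortfall = None, 0
    for name, capacity in paths.items():
        shortfall = max(0, demand_gbps - (total - capacity))
        if shortfall > worst_shortfall:
            worst_path, worst_shortfall = name, shortfall
    return worst_path, worst_shortfall


print(worst_single_cut_shortfall(PATHS, 2500))  # ('subsea-a', 500)
```

In this made-up example, losing the subsea path is the worst case, which weighs more heavily given that a subsea repair can take weeks or months rather than a day or two.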

So you have to put a lot of planning into actually having a lot of this capacity there and standing by. And then there's the internet side. Once you get to the actual edge of the network, you have to capacity plan with all these external networks, right? And one of the things that's really been helpful for us is that in the last several years, we've taken a lot of the network technology we've used in the data center and brought it into the border, the ISP border backbone side of it, where we've taken some of these smaller commodity-chipset devices and actually used them at internet scale, which is not super common out there. That's really allowed us to get into these N by however many 100 gigs or N by however many 400 gigs.

And it's been able to allow us to scale up rapidly and stay ahead of things.
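The "N by 400 gig" framing reduces to simple arithmetic: given a peak traffic level and a target utilization that leaves headroom, how many ports of a given speed are needed? The sketch assumes a 50% utilization target purely for illustration; `ports_needed` is a hypothetical helper, not a description of how AWS sizes its links.

```python
import math


def ports_needed(peak_gbps, port_gbps=400, max_utilization=0.5):
    """Number of port_gbps links needed so peak traffic stays under max_utilization.

    Keeping max_utilization below 1 leaves headroom for spikes and for
    traffic shifted from a failed link. The 50% default is an assumption.
    """
    return math.ceil(peak_gbps / (port_gbps * max_utilization))


print(ports_needed(3000))                 # 15  (N x 400G)
print(ports_needed(3000, port_gbps=100))  # 60  (N x 100G)
```

The same traffic needs a quarter as many ports at 400G as at 100G, which is part of why higher-speed commodity devices help at the border.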

Corey Quinn: You folks, I think earlier this year, had a blog post or, I don't know if it was a blog post or a white paper. I know it was esoteric compared to a lot of the stuff that you folks put out, which, frankly, I'm here for. It talked about migrating off of a bunch of networking appliances from legacy vendors, was the vibe I got, onto the AWS managed firewall offering, and how that wasn't just a matter of handling throughput at scale, but of getting observability into what those traffic flows looked like in ways that had previously been very challenging.

Tom Scholl: I'm aware of that project and know the team really well. That was an effort to move away from hardware-based firewalls in a particular portion of the network. And it caused the team to look at Network Firewall and how we were going to leverage the capabilities of that system.

And in the end, it got us to a really good spot, because it gave us a level of compartmentalization that we like with VPCs. It gave us a level of visibility through Flow Logs and through some of the Network Firewall capabilities that we really like. So it was a good success story of how, hey, we can run these workloads on our own products.

And it's worked really well.

Corey Quinn: A question I have is about the way internal networks work there. Obviously, in the way that you're peering with other folks, you aren't rolling out your own custom special version of BGP, because, as it turns out, when it comes to the internet, interoperability is kind of a big deal. But at re:Invent two years ago, you folks talked about an internal TCP replacement protocol, SRD, or something like that.

And it was fascinating, what you talked about: it makes latency to EBS a lot lower. I think the story that got told is, "This is fascinating from a protocol perspective. Can you tell us more about it?" And the answer was no. Great. Awesome. We'll just sit here and be envious from the outside. My question is: internally at AWS, when you start getting into the large-scale internal networking piece, how much of a resemblance does it bear to what you might expect from commercial offerings, or from someone working in a Cisco lab to pass a certification, where everything just scales up from there? Is it complete Wonderland-style stuff, or is it just the basics you would expect anywhere else, writ large?

Tom Scholl: I would say that at network interconnection points, let's say between EC2 and the border network, that's where you'll typically still find things like BGP operating. We certainly use BGP; obviously, externally, we have to, to the internet. Within the data centers themselves, there's a mixture of existing open-standard routing protocols. But in the last few years, there's been some effort to focus on whether we can build additional protocols that provide more rapid convergence, right?

And more unique topologies. So there's definitely active work going on there, because some of these protocols, like OSPF, do have their own limitations, right? You could modify them and twist them and turn them in certain ways, but there are also benefits to asking, can we rethink how we do link adjacencies and how we do path calculations?
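For readers less familiar with link-state protocols, the "path calculation" step in OSPF and IS-IS is a shortest path first (SPF) computation, classically Dijkstra's algorithm, run over the link-state database. The following is a textbook sketch of that computation, not a description of the newer protocols AWS is building; the graph format and the `shortest_paths` name are assumptions.

```python
import heapq


def shortest_paths(graph, source):
    """Shortest path first (Dijkstra) over a link-state graph.

    graph: {router: {neighbor: link_cost}}, as assembled from a link-state
    database. Returns {router: (total_cost, first_hop_from_source)}.
    """
    dist = {source: 0}
    first_hop = {}
    heap = [(0, source, None)]  # (cost so far, router, first hop used)
    while heap:
        cost, node, hop = heapq.heappop(heap)
        if node in first_hop:
            continue  # already settled via a cheaper path
        first_hop[node] = hop
        for neighbor, link_cost in graph.get(node, {}).items():
            new_cost = cost + link_cost
            if neighbor not in first_hop and new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                # Direct neighbors of the source are their own first hop;
                # everything else inherits the first hop of its predecessor.
                heapq.heappush(heap, (new_cost, neighbor, neighbor if hop is None else hop))
    return {node: (dist[node], first_hop[node]) for node in first_hop}


links = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1, "D": 5},
    "C": {"A": 4, "B": 1, "D": 1},
    "D": {"B": 5, "C": 1},
}
print(shortest_paths(links, "A"))
# {'A': (0, None), 'B': (1, 'B'), 'C': (2, 'B'), 'D': (3, 'B')}
```

Every router recomputes this after a topology change, which is one reason convergence time becomes a concern as networks grow.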

So certainly within the data center space, there's been some of that innovation going on. Another part to consider is that a lot of what we do in terms of traffic steering is done through controllers. We have different software-based controllers. When you have traffic going, let's say, to the internet, routing protocols don't know much about performance.

They don't understand latency; BGP doesn't capture that. So behind the scenes, we have controllers that look at, you know, the system that feeds into CloudWatch Internet Monitor, and steer things: okay, you need to move this prefix over to this location. Okay, latency on this other path has gotten better.

Let's shift it over there. Does it fit? So it's not just the protocols themselves. It's also the controllers that manipulate the routers and their forwarding.
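To make the controller idea concrete, here is a minimal sketch of the decision Tom describes: move a prefix to a lower-latency path, but only if that path has room for the traffic ("does it fit?"). Everything here is a hypothetical illustration, including the `Egress` type, the site names, the `min_gain_ms` hysteresis threshold, and the numbers; it does not reflect AWS's controllers or any CloudWatch Internet Monitor API.

```python
from dataclasses import dataclass


@dataclass
class Egress:
    name: str             # hypothetical egress location label
    latency_ms: float     # measured latency toward the destination prefix
    headroom_gbps: float  # spare capacity on this egress


def choose_egress(current, candidates, demand_gbps, min_gain_ms=5.0):
    """Pick an egress for one destination prefix.

    Stays on `current` unless another candidate beats the best option so far
    by at least min_gain_ms (hysteresis, so small latency wobbles don't cause
    flapping) and has enough headroom for the prefix's traffic.
    """
    best = current
    for egress in candidates:
        if egress.name == current.name:
            continue
        fits = egress.headroom_gbps >= demand_gbps
        faster = egress.latency_ms + min_gain_ms <= best.latency_ms
        if fits and faster:
            best = egress
    return best


current = Egress("site-a", latency_ms=48.0, headroom_gbps=40.0)
candidates = [
    current,
    Egress("site-b", latency_ms=31.0, headroom_gbps=2.0),   # faster, but no room
    Egress("site-c", latency_ms=35.0, headroom_gbps=25.0),  # faster, and it fits
]
print(choose_egress(current, candidates, demand_gbps=10.0).name)  # site-c
```

The hysteresis threshold is a design choice worth calling out: without it, a controller reacting to every small latency change could keep shifting a prefix back and forth.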

Corey Quinn: When you take a look across the larger ecosystem, one of the things you talked about explicitly in your talk was DDoS disruption.

My historical experience with DDoS, again from the victim side, has been that the only real guaranteed way to beat it was to throw more bandwidth at it than the attacker could summon. The problem is that, with malware being what it is and the internet at today's scale, attackers more or less wind up with infinite bandwidth.

So at some point, that just becomes an arms race. What have you been doing in the area of DDoS disruption?

Tom Scholl: So, you're accurate that, yes, the attacks get bigger and bigger. It used to be hundreds of gigabits per second, and now you're seeing attacks in the low terabits per second.

To address that, you need a really large front door. That's one of the things AWS does have at our scale: large front doors like CloudFront and Application Load Balancer, where you can absorb some of that traffic.

That's certainly critical to operating in that space. Now, in terms of disruption, it really comes down to using some of our systems, like MadPot, to identify where these attacks are coming from, and then engaging with those external network operators to say, "Hey, there's a C2 server that needs to be taken down.

"It's clearly hosting bad things. Can you shut it down, please?" Or to a domain registrar: "Can you take this domain down, because it's hosting a C2?" With Layer 7 attacks, a key piece is being able to identify where those Layer 7 requests come from.

Because once you look under the hood of these things, a C2 is nothing more than a computer somewhere, listening on a port and running that software. Layer 7 request floods are interesting, because they're typically a lot of Node.js scripts running on machines with lots of memory and a proxy list that someone imports.

And there's some orchestration on top. Typically these DDoS operators have storefronts, and those storefronts are hidden a little further away from where the attacks actually get generated. So a lot of our focus has been on the actual infrastructure that generates these attacks, and on engaging directly with those networks to shut it down where possible.

Corey Quinn: There was a school of thought for a while around hackback: someone is attacking you, so you just go ahead and break into their systems and the rest. That's always been a dicey proposition at best, so when you started talking in your NANOG talk about disrupting these attacks, I thought, oh no, this is about to go somewhere disastrous.

And no, you kept it very much in the right direction. I do keep a hand in the space, just to make sure people aren't suggesting debunked ideas from the early noughts again because enough time has passed that they think, oh, well, this time it's sure to work. Your holistic approach to it has really been something of note.

Tom Scholl: Definitely on the spoofing side, there's a lot of collaboration with networks, and occasionally we get a network that can be difficult to deal with. So we'll sometimes talk to their peers as well, or maybe their upstream provider if we're not getting through to them. We'll say, "Hey, this is coming from your downstream network. What are our options here?" So we definitely focus on being nice and communicating through email or personal contacts to address whatever the issue is.

A lot of it is education. Some of these networks just don't know. It's interesting: broadband networks have done a really good job of preventing spoofing by default. You get a cable or DSL line, and you can't spoof on it. It's typically the hosting shops where we find that, oh, if you've got a dedicated server, then you can spoof.

So a lot of it comes down to education and saying, you should make this the default. Sometimes people ask their hosting provider, "Hey, I need to spoof for whatever use case." Sometimes they call it IP header modification, or IPHM, and they'll ask for the anti-spoofing restriction to be removed.

We've talked to hosters who say, "Oh, this customer asked for it to be removed." And we're like, "Well, you might want to be a little skeptical about that next time, if you can, please."
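The "default" Tom wants hosting providers to adopt is ingress filtering in the spirit of BCP 38 (RFC 2827): drop packets from a customer whose source address isn't one the customer was actually allocated. The sketch below shows that check, with a hypothetical exemption set standing in for customers granted IPHM. The port names and prefixes are made up (documentation ranges), and real deployments implement this in router ACLs or uRPF, not Python.

```python
import ipaddress

# Hypothetical allocations: the prefixes each customer port may source traffic from.
ALLOCATIONS = {
    "port-17": [ipaddress.ip_network("203.0.113.0/28")],
    "port-18": [ipaddress.ip_network("198.51.100.64/26")],
}


def source_is_valid(port, src_ip, exempt_ports=frozenset()):
    """BCP 38-style ingress check.

    Accept a packet only if its source address falls inside a prefix
    allocated to the customer port it arrived on. Ports in exempt_ports
    model customers granted "IP header modification" (IPHM): their traffic
    skips the check, which is exactly what makes spoofing possible.
    """
    if port in exempt_ports:
        return True
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net for net in ALLOCATIONS.get(port, []))


print(source_is_valid("port-17", "203.0.113.5"))              # True
print(source_is_valid("port-17", "192.0.2.77"))               # False: spoofed source
print(source_is_valid("port-17", "192.0.2.77", {"port-17"}))  # True: IPHM exemption
```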

Corey Quinn: Yeah, trust but verify. Once it's been removed, what behavior do they start exhibiting? What are you seeing go across the wire?

Tom Scholl: You see packets per second spike up, and it's all to UDP destination port 453 or 389. That's a pretty good clue. So that's some of what we try to educate networks on: this is what it looks like, and here are the different heuristics you can look for as a network operator to find this going on in your network.

We've been spending a lot of time on that education: here's how you use your off-the-shelf NetFlow tools, and some of our open source, to dig into this and find it on your own. I think that's where we've had a lot of success. There are some networks now in that mode, where they find it on their own and deal with it.

By the time we reach out to them, they say, "Hey, it's already taken care of." And it's like, that's amazing. I'm glad we've got you in a good spot now.
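The signature Tom describes (a packets-per-second spike, all aimed at a few UDP destination ports) lends itself to a simple flow-based heuristic. The sketch below assumes a simplified, hypothetical flow-record format (real NetFlow/IPFIX exports carry many more fields), an assumed list of commonly abused reflection ports, and made-up thresholds; it is not AWS's open-source tooling.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed example list of UDP ports often abused for reflection/amplification
# (DNS, NTP, CLDAP); adjust for your own network.
WATCHED_UDP_PORTS = {53, 123, 389}
UDP = 17  # IP protocol number for UDP


@dataclass
class FlowRecord:
    """Simplified, hypothetical flow record aggregated over one interval."""
    ingress_port: str  # customer port the traffic entered on
    protocol: int
    dst_port: int
    packets: int


def flag_reflection_spikes(flows, interval_s, baseline_pps, factor=10.0, min_pps=1000.0):
    """Return {port: pps} for customer ports whose UDP packet rate toward
    watched ports is at least min_pps and more than factor x its baseline."""
    packets = defaultdict(int)
    for f in flows:
        if f.protocol == UDP and f.dst_port in WATCHED_UDP_PORTS:
            packets[f.ingress_port] += f.packets
    flagged = {}
    for port, count in packets.items():
        pps = count / interval_s
        if pps >= min_pps and pps > factor * baseline_pps.get(port, 0.0):
            flagged[port] = pps
    return flagged


flows = [
    FlowRecord("port-17", UDP, 389, 3_000_000),
    FlowRecord("port-17", UDP, 53, 600_000),
    FlowRecord("port-18", 6, 443, 5_000_000),  # TCP, ignored by this heuristic
    FlowRecord("port-18", UDP, 53, 9_000),
]
result = flag_reflection_spikes(flows, interval_s=60, baseline_pps={"port-17": 200.0, "port-18": 150.0})
print(result)  # {'port-17': 60000.0}
```

Pairing a flag like this with the source-address check above (are these packets carrying sources the customer doesn't own?) is what separates a spoofed reflection attack from a customer that simply runs a busy DNS resolver.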

Corey Quinn: "It's been 10 minutes. Have you looked at the packet capture lately?" Yeah. My last question for you is this: you've been doing this a very long time, and when I was at NANOG, I talked to a bunch of other people who have been doing this a very long time.

Eventually, parts wear out and need to be replaced. As much as some of us might want to live forever, that is not an option that is currently available. Where does the next generation of people, who will do in the future what you do today, come from?

Tom Scholl: That's a great question, because I think we struggle with that too sometimes: how do you find talent? One of the Amazon leadership principles I like a lot is Learn and Be Curious.

So it's about trying to identify folks who have that learn-and-be-curious attitude: "Hey, I want to go deeper here. I want to understand this a little bit more." Don't just treat this as yet another attack; actually understand what's going on behind it, what's actually generating it.

A fair amount of it is just identifying folks who are interested and presenting opportunities for them. As senior technical leaders, you have to present opportunities for others, and sometimes it may not go the way you expect, but that's fine.

You have to learn. It's about connecting people with other folks externally, whether that's external forums or different trust groups, and saying, "Hey, I want to get you into this." I can serve as the connection to other folks, give them the opportunity to take something and run with it, and talk about it after the fact. But it definitely requires real effort to help and educate at the same time: "Let me try to explain this to you as best as I can. If you have any questions, let me know, whether they're silly, good, bad, whatever. I'm here to help. I want to make you successful." And I think as senior technical folks, we definitely need to be growing other folks.

You have to carve out the time and resources for it.

Corey Quinn: Do you find that those folks are matriculating into your org having studied networking, with that as the direction they wanted to go, or are they phasing in from other technical areas?

Tom Scholl: I've seen all types.

It's not always purely people with a networking background. I've had this conversation with folks in some of these areas who say, "Well, we're not security engineers." And I'm like, neither am I. This is just an area to immerse yourself in.

That was my journey too when I got into the DDoS space. I had always dealt with it on the receiving end: building the network infrastructure and seeing attacks come in. But I had never said, "I'm going to actually try to understand this." So I had to immerse myself in this domain.

Even internally, working with other teams inside Amazon, like trust and safety or the fraud team, it was, "Hey, I'm coming in here as a newbie. What can I learn?" And with other folks, we've definitely seen people come in from various backgrounds saying, "Okay, I want to go and learn."

Luckily, we have a lot of tools and data at our disposal that folks can pick up and go with. I think it's really about connecting people to it, particularly when you frame it around a particular outcome: "Hey, we want to address this particular issue. How do we lean in here?"

And, "Who are the different people we need to bring together?" So yeah, it's all types of backgrounds.

Corey Quinn: I really want to thank you for taking the time to talk to me today. If people want to learn more, where should they go?

Tom Scholl: So on the AWS Security Blog, we've had a number of posts about some of the things we've built.

We've talked about things like MadPot, if you search for it, and a recent one we just went public about, Sonaris, which is basically a service behind the scenes that detects people trying to go after and attack customers, and actually blocks them.

So I would recommend reading about some of the things we've done with Sonaris, MadPot, and Shield Advanced. We've got a number of blog posts out there. That's a good starting point to learn some of the things we've done in this domain.

Corey Quinn: And we will definitely make it a point to put those in the show notes.

Tom, thank you so much for speaking to me. I really appreciate it.

Tom Scholl: Oh, thank you for having me.

Corey Quinn: Tom Scholl, VP and Distinguished Engineer at AWS. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a 5 star review on your podcast platform of choice.

Whereas if you hated this podcast, please leave a 5 star review on your podcast platform of choice, along with an angry, insulting comment, so then I can block that particular platform from syndication, because that's how it works.
