Changelog & Friends — Episode 51

Kaizen! Let it crash

Gerhard is back for Kaizen 22! We're diving deep into those pesky out-of-memory errors, analyzing our new Pipedream instance status checker, and trying to figure out why someone in Asia downloads a single episode so much.

Speakers
Jerod Santo, Gerhard Lazu, Adam Stacoviak
Duration
Transcript (329 segments)
  1. Jerod Santo

    Welcome to Changelog and Friends, a weekly talk show about how good systems become bad systems. Thanks as always to our partners at Fly.io, the platform for devs who just want to ship. Build fast, run any code fearlessly at Fly.io. Okay, let's Kaizen. Well, friends, I don't know about you, but something bothers me about GitHub Actions. I love the fact that it's there. I love the fact that it's so ubiquitous. I love the fact that agents that do my coding for me believe that my CI/CD workflow begins with drafting YAML files for GitHub Actions. That's great. It's all great until, yes, until your builds start moving like molasses. GitHub Actions is slow. It's just the way it is. That's how it works. I'm sorry, but I'm not sorry, because our friends at Namespace, they fix that. Yes, we use namespace.so to do all of our builds so much faster. Namespace is like GitHub Actions, but faster. I mean like way faster. It caches everything smartly. It caches your dependencies, your Docker layers, your build artifacts, so your CI can run super fast. You get shorter feedback loops, happy developers (because we love our time), and you get fewer "I'll be back after this coffee and my build finishes" moments. That's not cool. The best part is it's drop-in. It works right alongside your existing GitHub Actions with almost zero config. It's a one-line change. So you can speed up your builds, you can delight your team, and you can finally stop pretending that build time is focus time. It's not. Learn more: go to namespace.so. That's namespace.so, just like it sounds, like I said. Go there, check them out. We use them, we love them, and you should too. Namespace.so. How else would you learn? Let it crash.

  2. Gerhard Lazu

    Exactly. The best things happen when things fail. Seriously. If it's in a controlled way, right? I think that's like something which isn't said, it's implied. It has to be a controlled failure where you have the boundary and things will not blow up. I mean, they'll blow up, but like, you know, like the fireworks sort of blowing up where it's a controlled explosion.

  3. Jerod Santo

    Yeah. Right. Tiny little crashes to learn from. Welcome everyone to Kaizen 22 with the incomparable Gerhard Lazu. He's here to let us know how he lets it crash. It's like that song, Let It Snow, Let It Snow, Let It Snow, only, you know, with the words replaced. Hey Gerhard, how are you?

  4. Gerhard Lazu

    Hey Jerod, I'm good, thank you, thank you. Had a great holiday. It was a great couple of weeks where I finally managed to disconnect. It's been, I don't know, like 20 years since I had two weeks completely off. Even my holidays are usually only a week. So this was very different, very enjoyable, and I feel so refreshed. So I'm firing on all cylinders.

  5. Jerod Santo

    You unplugged and now you're plugged back in.

  6. Gerhard Lazu

    Plug it in. I stopped it and I started it and it's like brand new. It's like Glade, man. I'm like Glade over here, man. Plug it in, plug it in, you know what I'm saying? Smell the scent, the fresh new year scent called 2026.

  7. Jerod Santo

    Some people are gonna say this is gonna be the best year ever. I've heard it said. What do you think, Gerhard?

  8. Gerhard Lazu

    They keep saying that, and I'm excited about it. They said that about 2020.

  9. Jerod Santo

    2020, we have to admit it was off to a killer start. I mean, it was really going well.

  10. Gerhard Lazu

    Right, pun intended, killer start. That was 2020, the year of COVID, and everyone was like, ah, this is going to be the best year ever. And then we had three years of misery. So I think I just want an easygoing year. You know what I mean? Last year, 2025, 1st of January, we were building shelves. We were redoing studies and whatnot. And the whole year was full on. It was nonstop. Every week there was something significant happening. And this year, I would just like for it to be a bit more chill, maybe a bit more meaningful. So that's what we're thinking. But how about you, Adam? How were your holidays?

  11. Jerod Santo

    My holidays were filled with barbecue and good times.

  12. Gerhard Lazu

    Wow, even in winter. So barbecue never stops. It knows no seasons.

  13. Jerod Santo

    Never stops in Texas. Actually, just to shower you all with a few of my pics from my most recent barbecue adventures: if you're in Zulip, go to the general channel, look for "barbecue" with three bangs after it, because why do one bang when you can do three?

  14. Gerhard Lazu

    Bang, bang, bang.

  15. Jerod Santo

    Some recent ribs. My gosh, my ribs method is on point. My spatchcock chicken method is on point. No one is disappointed at my barbecue joint.

  16. Gerhard Lazu

    Very nice. Look at that, we're going to add some meat on this slide. That's what happened in real time.

  17. Jerod Santo

    Wow, real time meat added. This is like, yeah, this is intense.

  18. Gerhard Lazu

    And that's again, just to be clear, it's Adam's barbecue. Okay, so like no joking aside, we're talking about barbecue.

  19. Jerod Santo

    Well.

  20. Gerhard Lazu

    I think we have to leave it there.

  21. Jerod Santo

    Let's move on.

  22. Gerhard Lazu

    I think we have to leave it there.

  23. Jerod Santo

    I didn't show a burger, but I do make a mean burger too. Thank you, Gerhard, for assuming that is something I do rock really good. My smashed burgers are on point.

  24. Gerhard Lazu

    Very nice, very nice. I'm looking forward to that. So one day. My favorite Christmas tree. This is what it looked like.

  25. Jerod Santo

    What is that?

  26. Gerhard Lazu

    And for those that are listening, it's a networking cabinet. There's lots of blue lights flashing. This is happening in the loft. We have many terabits of network throughput. There's some switches, there's UniFi, there's MikroTik. This is maybe five years in the works. And every Christmas, I take time to improve it little by little. So this year I went really crazy. I redid the whole thing. I redid, for example, DHCP, the network, the VLANs. Oh man, it's beautiful.

  27. Jerod Santo

    Your VLANs are beautiful?

  28. Gerhard Lazu

    They are, they are.

  29. Jerod Santo

    I wanna be a guest on your network, man. I'm gonna get blocked from everything, okay?

  30. Gerhard Lazu

    Yeah, well, well, well, there's like a big story happening in the background. And it is going to be, I think this year amazing. This will be the best network that I have run, like in my life. But the blue and the darkness and it's like, that was like one more Christmas tree in our house and this was it, where I would just go and tinker for a few hours in between Christmas dinner and all the Christmas festivities. So it was nice just to spend a bit of time tinkering with hardware. And I'm sure that many of you listening, when it comes Christmas time, when things start quieting down, you get like the little projects that you didn't have time for throughout the year. And then you have some fun. So I'm wondering, did any of you do anything fun this Christmas, but nerdy fun?

  31. Jerod Santo

    That's what I mean by that. Nerdy fun. Well, I got upset with something. That's not fun. And so I decided to just let it roll. You know what I'm trying to say? I got upset with the amount of RAM usage on my machine. And while I liked the application, I was like, you know what, I'm just kind of tired of having four gig. I think it was, no, it was like 1.2 gigs of RAM being used by CleanMyMac. Fancy little utility application, helps you tune and pay attention and stuff like that. And I decided to remake it and that was it. So I remade it, it's called MacTuner. I know there used to be a MacTuner.com, which was, I think, a Mac magazine, I believe, but MacTuner fit. I might change it, who knows? But for now it's called MacTuner. It does all the things, all the things. Analyze, clean up, uninstall. And not just that fake uninstall, the real one, where you get the dirty dirties out. You know what I'm saying? The dirties, all the dirties are out, okay.

  32. Gerhard Lazu

    My mind is still on the dirty burger that you mentioned earlier.

  33. Jerod Santo

    I mean, that's about as nerdy as I can get.

  34. Gerhard Lazu

    I made a little utility that's just for me for now. Soon to be open source though, soon to be. It will be soon. Yeah, I mean, why not, right? Share it with the world.

  35. Jerod Santo

    Well, I didn't create a MacTuner, but I found one. I also was thinking, CleanMyMac, how long am I gonna run this thing? And the answer is: as long as I ran it, 'cause I'm done now. I found a tool called Mole, M-O-L-E, which is a command line macOS cleaner that does like everything. So you may get some competition here, Adam. Maybe you can come out and, like, throw some blows down. It's like, here's why I'm better than Mole. It's got a TUI, it's all command line based. It does cleaning, optimizing, uninstalling, a DaisyDisk-style explorer, all from, yeah.

  36. Gerhard Lazu

    I'm feeling intimidated over here, okay.

  37. Jerod Santo

    You starting to sweat?

  38. Gerhard Lazu

    Well. I think he just changed his mind about open sourcing it.

  39. Jerod Santo

    Here's your domain name, Adam, betterthanmole.com, you know, better than grip.

  40. Gerhard Lazu

    That's good, I could do that.

  41. Jerod Santo

    So I've been using that, I'm very excited because who doesn't wanna just have all the things right there in their command line. And I didn't spend any tokens on it. Adam's got some tokens involved, but his also works the exact way he wants it to.

  42. Gerhard Lazu

    Yeah, yeah, absolutely. Mine does some of that stuff as well, it's kind of cool.

  43. Jerod Santo

    Sweet, open source that sucker.

  44. Gerhard Lazu

    One day.

  45. Jerod Santo

    Which day is that?

  46. Gerhard Lazu

    Not today. Definitely not right now.

  47. Jerod Santo

    But it's gonna be a one day is what it's gonna take.

  48. Gerhard Lazu

    One day.

  49. Jerod Santo

    There's a bigger launch awaiting, shall I say. There's a bigger launch awaiting 'til I'm gonna open source some things.

  50. Gerhard Lazu

    I've been using AppCleaner for many, many years. There's no TUI, there's no CLI, it's just a regular app. It's a really old one.

  51. Jerod Santo

    Yeah, it's like drag and drop onto it, right?

  52. Gerhard Lazu

    Pretty much, yeah, and you also have a list of applications. But it's so old, it is difficult to find these days, and it hasn't been updated in a very long time. So I will check Mole out.

  53. Jerod Santo

    Mole's really cool. brew install mole and you're done. So you can check it out right here while we're talking. And I liked AppZapper. I think AppZapper doesn't exist anymore. But the cool thing about that was that it would literally make the zap sound as, yeah, you drop your app on it and it zapped it. It's like that sound.

  54. Gerhard Lazu

    That's the only feature that your application needs to have, Adam, if it zaps.

  55. Jerod Santo

    Mole does not zap, so there you have it. Make it zap is our tagline, actually. Make it zap. Make it zap.

  56. Gerhard Lazu

    There you go. I think that's a very good debate, actually.

  57. Jerod Santo

    What about you, Gerhard? Besides your Christmas tree, did you?

  58. Gerhard Lazu

    I will come back to that. I will come back to the Christmas tree, yeah.

  59. Jerod Santo

    This guy's got stories, man.

  60. Gerhard Lazu

    Oh, man. Oh, yes. I have to tease them and be very disciplined, because there's too much stuff. So I have to be very careful, because it will take the whole hour and I will not shut up talking about this thing. Anyway, we will come back to that, I promise. Okay. Last time, when we finished Kaizen 21, this was one of the last thoughts that we shared: what's next. So bam, remember bam? That happened live. OOM crashes, out-of-memory crashes, and a bunch of other things. The good news is that only one thing happened: OOM crashes.

  61. Jerod Santo

    Only one thing to talk about.

  62. Gerhard Lazu

    But this rabbit hole is really, really deep.

  63. Jerod Santo

    Okay. All right, take us down the rabbit hole. The OOM, out of memory.

  64. Gerhard Lazu

    Who remembers this book? Erlang in Anger.

  65. Jerod Santo

    Erlang in Anger.

  66. Gerhard Lazu

    Stuff Goes Bad by Fred Hebert. Ferd.ca.

  67. Jerod Santo

    Now, I remember Learn You Some Erlang for Great Good, but I do not remember this one in particular. So I'm not sure why the other one hit my radar, because he wrote both of them, it seems. But when did this one come out?

  68. Gerhard Lazu

    Wow. So this one, if I look, I just switched to the browser: 2016, 2017, while he was still at Heroku. Remember Heroku? Those were the days. So about 10 years ago. And Fred, I mean, if you don't know his blog, it's just amazing. I'll just click it very quickly just to have a look. I think it's one of the best blogs out there. There's so much goodness here. So much. But one of my favorites is on queues and queuing, and how queues don't protect you from overload: Queues Don't Fix Overload. And this is so relevant to today's conversation as well. But there's a lot of stuff in the Erlang ecosystem, and there's many, many things that Fred wrote over the years that are so relevant today. So if I click on download PDF... right, by the way, this is amazing: this book is open source. You can download it, freely available, Creative Commons license. And I'm going to make this a little bit bigger so we can see what's happening. And if I search for "let it crash", it's page number one. It's in the introduction.

  69. Jerod Santo

    Page one.

  70. Gerhard Lazu

    Page one. And this idea of let it crash really comes from the Erlang ecosystem. It's very well known there because of how the Erlang VM works, with all the processes and the supervision trees; it was just built this way. And we know a thing or two about Erlang, Jerod, right? Because the application, Elixir, the Phoenix framework, runs on the same principle.

  71. Jerod Santo

    I know a thing and you know two. So that's how we get to a thing or two.

  72. Gerhard Lazu

    And Adam, I'm sure he knows the big one. But we don't know whether he's going to share it. The point is, when you think about let it crash, Jerod, from your development experience with Erlang, with Elixir, Phoenix: is there any situation, any moment, where you could experience it and you realized, huh, that's nice?

  73. Jerod Santo

    When I let it crash.

  74. Gerhard Lazu

    When you let it crash.

  75. Jerod Santo

    Well, it's nice that the BEAM seems to handle a lot of the problems with letting it crash. You know, it just goes again, or there's a supervision tree and things watching each other. And I don't have to think about it very much. I can't think of an instance in development where I was like, this is really useful. But I'm sure you could come up with one.

  76. Gerhard Lazu

    Yeah. So, you know, when we write code, we tend to write it very defensively. Typically try/catch: you feel like you need to account for every single scenario. And the let it crash philosophy is about not preventing failure, but learning from it. And what that means is you need to have a context where it's safe for things to crash, and the overall system will still remain stable. So how can you build a resilient system (and really, this is about resiliency) where the core of the system will remain running, and the system as a whole will remain running, even though parts of it may experience failures? Those failures will not bring everything down. And that's really important. So, fewer try/catch blocks. Don't code defensively; let it crash. And separate the code that solves the problem from the code that handles the failures. The more you can lean on the framework, or the VM, or whatever system you have, to deal with failures, the better you can focus on the things that are unique to your application. And Erlang is well renowned for that.
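To make that concrete, here's a minimal Elixir sketch of the shape being described; the module names are invented for illustration, and this is not Changelog's code. The worker has no defensive try/rescue, and the supervisor is the separate code that handles the failure:

```elixir
defmodule Demo.Worker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def init(_opts), do: {:ok, %{}}

  # No try/rescue here: a bad request crashes this process only...
  def handle_call({:divide, a, b}, _from, state) do
    {:reply, a / b, state} # b == 0 raises ArithmeticError and the process dies
  end
end

defmodule Demo.Supervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts) do
    # ...and the supervisor restarts it with clean state, so the rest
    # of the system keeps running.
    Supervisor.init([Demo.Worker], strategy: :one_for_one)
  end
end
```

Call `GenServer.call(Demo.Worker, {:divide, 1, 0})` and the worker dies and is immediately restarted, while the rest of the supervision tree keeps running.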

  77. Jerod Santo

    Kind of the opposite of the philosophy that Go took, as I write some Go code and I write some Elixir code. With Go, it's: handle every error condition right after you potentially raise one, and make sure that there's no error; if you're not dealing with it, then you're not writing robust software. And the other philosophy is: let it crash and deal with it elsewhere. I think they're both legitimate depending on what you're building.

  78. Gerhard Lazu

    Agreed. Well, in our case, we had a lot of crashes to deal with.

  79. Jerod Santo

    Yeah, we're dealing with Erlang style.

  80. Gerhard Lazu

    So what we are going to have a look at is all the times that Pipedream has been crashing since our last Kaizen. So since Kaizen 21, which was October 17th, we had a lot of crashes. And there's a certain property about the system, and this is Varnish specifically, that made these crashes pretty okay. And the property which I'm referring to is: when you start varnishd, the daemon, Varnish itself runs as a thread, and you have many, many threads that do different things. So when we had these out-of-memory crashes, all that happened was the thread was killed, which means that the system as a whole didn't crash, the Firecracker VM didn't crash, the application didn't need to restart. It was just a thread that was using too much memory, and it restarted within seconds, as in maybe two seconds, and everything was back to normal. Obviously the cache was cold, but it was good. And that's why the memory looked a bit interesting: it doesn't release all the memory, the VM doesn't restart, there are no hangs; it crashes and restarts really, really quickly. So that's a nice property.

  81. Jerod Santo

    Well, that confuses me. So how does Fly know about it then, if it's just happening inside of Varnish?

  82. Gerhard Lazu

    So it's looking at the process IDs: which process uses the most memory, and it's the same process that's asking for more memory. So it basically sends the signal to that process and kills that process. But that is just a thread that maps to a process ID. So Varnish itself didn't crash; it was just a thread, mapped to a process ID, that crashed, and then it was restarted by the Varnish daemon.

  83. Jerod Santo

    Okay, so where is Fly involved in that? Because Fly's aware because I see all these Fly notices and I get the Fly emails.

  84. Gerhard Lazu

    Right. So Fly's aware that there is a process on the machine that is using too much memory while more memory is being requested. And then it looks at, okay, which process do I kill? And in this case, the process with the most memory is the one that gets killed.

  85. Jerod Santo

    So Fly as a platform can actually reach in and kill that process without killing the machine, without rebooting the VM or the Firecracker or whatever.

  86. Gerhard Lazu

    So the Fly platform integrates with that functionality, which is a kernel thing, a Linux functionality. That's why an out-of-memory kill happens: even if you have a single machine, you have too much memory used, you don't have any swap; how do you give out more memory when there's no memory left and the system is becoming unstable? So you get just a single process which gets killed. In Fly's case, they surface that. They surface the fact that there was an out-of-memory event, and they send you an email when that happens. It doesn't mean that the machine had to restart. It doesn't mean that it stopped serving traffic. It just means there was something that had to go away because it was using too much memory. And when I say too much memory, obviously it's a bit more complicated than that: something was asking for memory, the kernel didn't have any more memory to allocate, so it had to look at what needs to be killed so it could allocate more, and it just so happens it would be this process, this thread. So how many crashes do you think Pipedream had since Kaizen 21, since October? We're talking about three months, maybe a bit more than that?
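As an aside before the quiz: on any Linux machine you can get a rough feel for what the OOM killer would target, because the kernel scores every process. A sketch, not something from the episode; /proc/&lt;pid&gt;/oom_score is a real kernel interface:

```sh
# Print the top 5 OOM-kill candidates: higher oom_score means the
# kernel is more likely to pick that process when memory runs out.
for pid in /proc/[0-9]*; do
  printf '%s %s %s\n' \
    "$(cat "$pid/oom_score" 2>/dev/null)" \
    "${pid#/proc/}" \
    "$(tr -d '\0' < "$pid/cmdline" 2>/dev/null | cut -c1-40)"
done | sort -rn | head -5
```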

  87. Jerod Santo

    So Gerhard has presented us a multiple choice quiz. A is 20, B is 40, C is 80, D is 160. Now I know that I personally receive an email every time this happens. And so I have a little bit of a feeler into this. I delete them so I can't go do a quick search. Adam, do you get emails when these fly things crash?

  88. Gerhard Lazu

    I don't.

  89. Jerod Santo

    Okay, good for you.

  90. Gerhard Lazu

    Not to my knowledge, and if I do, they're in a box that doesn't get looked at.

  91. Jerod Santo

    You've been saving on some email bandwidth.

  92. Gerhard Lazu

    But thank you... no, because when the email gets sent, so let's go back to this one. If I click on this one, let's take this one, and you can see everyone that gets an email; I'm just going to make this a little bit bigger. So you can see: services, Gerhard. I do get it. So there must be a filter.

  93. Jerod Santo

    He just doesn't look at it.

  94. Gerhard Lazu

    Superhuman saving me, nice. That's okay. So what do we think?

  95. Jerod Santo

    Good thing other people are looking at it.

  96. Gerhard Lazu

    It's not an Adam problem, that's the thing. So that's a good thing. He's doing the right thing. He's just saving his inbox for more important messages.

  97. Jerod Santo

    They ran an LLM on that, set it to the side. So I feel like 160 is too many. I don't think I've gotten 160 emails since October on this particular thread. 20 feels like not enough. I've certainly gotten more than 20 emails. So I'm between 40 and 80, and I'm going to think that... gosh, that's a tough one. I'm going to go with 40.

  98. Gerhard Lazu

    Adam, what do you think? I'd go with 40 as well.

  99. Jerod Santo

    Oh, I got it. Yes.

  100. Gerhard Lazu

    43, exactly. The price is right.

  101. Jerod Santo

    The price is right, all right, cool.

  102. Gerhard Lazu

    Yeah.

  103. Jerod Santo

    43 crashes from October to December, through the end of the year, roughly.

  104. Gerhard Lazu

    Yeah, and then obviously there were periods when we had quite a few. So, if we were to think about what could be happening in Varnish that it was running out of memory and crashing... this is us trying to think about the sort of traffic that we serve, trying to think about everything. I mean, now we see every single request that hits Changelog, the CDN as well, and it's a lot of requests.

  105. Jerod Santo

    Yeah.

  106. Gerhard Lazu

    So there was something in the system that was using way too much memory. And as a result, the process, or the thread in this case, was crashing.

  107. Jerod Santo

    I mean, I could guess it, but I might even have some insight. So should I just say it, or do I add on a guess? My guess, based on... also, I saw some emails flying through, but already I would have suspected that we just have too many large files. These 60 to 80 to 100 megabyte MP3 files loading into memory, flying every which direction. And you just can't load up that much memory without some sort of fancy freeing mechanism. It's just trying to hold all these MP3s in RAM, I think, and it just can't do it. So that's my guess.

  108. Gerhard Lazu

    Yeah, that was a good guess. And I think the next question is going to be to the audience, because we know too much.

  109. Jerod Santo

    How are they going to answer it? It's not real time.

  110. Gerhard Lazu

    Just think about it. Like, we will give some time for people to think.

  111. Jerod Santo

    Okay, we'll do like a delay here. So if they have, what's it called, the feature where it skips silences turned on? They're not going to have any time to think about this.

  112. Gerhard Lazu

    Right, okay.

  113. Jerod Santo

    So quickly turn that feature off. Give yourself some time to think. Go ahead.

  114. Gerhard Lazu

    Yeah, or pause. We can also say pause. Now is a good time to pause: what could be the problem? So, you're right. There are all those large files. We had many, many MP3 files, all large, all trying to be cached in memory. And that was a problem. So what is many? Well, at this point we have thousands of MP3 files across all the podcasts, since the beginning of time. And large means anywhere from 30 or 40 megabytes to 100-plus megabytes. Just think: if you had to load a thousand files at a hundred megabytes each, that's a lot of memory you would need to have available. And the problem is that once you store these large files, as we discovered, you get memory fragmentation. Imagine that you have all the memory available, and you keep storing all these files, and at some point there's no more memory left. So what do you do? Well, you need to see what you can evict from memory so that you can store the new file. So imagine that you evict a few of those objects, but maybe they aren't big enough, or you haven't evicted them fast enough. Then you have this big file that can't fit anywhere, because the holes that you have in memory aren't big enough for it. And there's no defragmentation or anything like that running in the background, which means that even though technically you kind of have space in memory, for this specific file you may not. And then it can't be stored in memory. Now, the counter in Varnish is actually called, I kid you not, n_lru_nuked. So I think the connection to the nuke, and to the book, and to let it crash is right there. An LRU nuke is basically a forced eviction: an event where an object has to be evicted from the cache just to make room for a new one, because the storage is full. So you can see how many times this has happened, and that's an important metric: if we look at it, we can see we had too many of these events, right? Many objects were being nuked from memory to make room for new objects, but sometimes they wouldn't fit. So how badly did it nuke? We can measure this, or we can look at this. And this is what that looks like from a memory perspective. You can see that the instance was running with maybe four gigs of memory, and then we had a massive spike, within minutes, like one or two minutes, to 16 gigabytes. That's a lot of data that had to fit in memory. And you can already see where this is going: scrapers and bots and LLMs. We have so many things happening. And then you can see the memory: it went up, the thread was killed, the child was killed, the Varnish one, the memory came down, and then it went up again. So in the graph that we see here, we can see the first spike, then, maybe a minute apart, the second spike, another crash; it took a little while to restore, we're talking maybe 10 seconds, and then we stabilized around 10 gigabytes. From a CPU perspective, we got 100% CPU utilization when this happened. Everything is full on; the instance is really struggling to allocate and deallocate and free up memory. And more importantly, we have a lot of traffic flowing through. So how much? 2.29 gigabytes... or gigabits, specifically. 2.29 gigabits.

  115. Jerod Santo

    Per second.

  116. Gerhard Lazu

    Per second, exactly. And these happen so quickly, have like a huge rush of traffic coming in and then nothing.
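For anyone following along at home, those forced evictions and the storage headroom are visible directly in the counters; a quick sketch, assuming a shell on the instance (the counter names are real, while the storage labels, SMA for the malloc store and SMF for the file store, depend on how varnishd's -s flags are named):

```sh
# n_lru_nuked counts forced evictions; g_bytes / g_space show used and
# free space for each storage backend.
varnishstat -1 | grep -E 'n_lru_nuked|g_bytes|g_space'
```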

  117. Jerod Santo

    Well friends, I'm here again with a good friend of mine, Kyle Galbraith, co-founder and CEO of Depot.dev. Slow builds suck, Depot knows it. Kyle, tell me, how do you go about making builds faster? What's the secret?

  118. Gerhard Lazu

    When it comes to optimizing build times, to driving build times to zero, you really have to take a step back and think about the core components that make up a build. You have your CPUs, you have your networks, you have your disks. All of that comes into play when you're talking about reducing build time. And so some of the things that we do at Depot: we're always running on the latest generation of ARM CPUs and AMD CPUs from Amazon; those in general are anywhere between 30 and 40% faster than GitHub's own hosted runners. And then we do a lot of cache tricks. Way back in the early days, when we first started Depot, we focused on container image builds, but now we're doing the same types of cache tricks inside of GitHub Actions, where we essentially multiplex uploads and downloads of the GitHub Actions cache inside of our runners, so that we're going directly to blob storage with as high of throughput as humanly possible. We do other things inside of a GitHub Actions runner, like we cordon off portions of memory to act as disk, so that any kind of integration tests that you're doing inside of CI that do a lot of operations to disk (think testing database migrations in CI), by using RAM disks inside of the runner, it's not going to a physical drive, it's going to memory, and that's orders of magnitude faster. The other part of build performance is the stuff that's not the tech side of it; it's the observability side. You can't actually make a build faster if you don't know where it should be faster. And we look for patterns and commonalities across customers, and that's what drives our product roadmap: this is the next thing we'll start optimizing for.

  119. Jerod Santo

    Okay, so when you build with Depot, you're getting this essential goodness: the relentless pursuit of very, very fast builds, near-zero build times. And that's cool. Kyle and his team are relentless in this pursuit. You should use them: depot.dev, free to start, check it out, a one-line change in your GitHub Actions. depot.dev.

  120. Gerhard Lazu

    So why is more traffic coming into the instance than going out? So this is the traffic that the instance is receiving. So we're receiving 2.29 gigabits, but we're only sending 145 megabits. Now is a good time to pause and think about why this is happening.

  121. Jerod Santo

    Yeah, don't skip silence. So when we say the instance, we mean the Varnish instance.

  122. Gerhard Lazu

    The Varnish instance, yeah.

  123. Jerod Santo

    Which sits between our end user, whatever that is, our users, and our application. Well, actually, and our Cloudflare, not our application.

  124. Gerhard Lazu

    All our backends, and we have a couple of backends.

  125. Jerod Santo

    Yes, but in the case of MP3 files, it's our R2 origin. So Varnish is receiving a bunch of data and sending back significantly less data, an order of magnitude less. And what's it receiving? I don't know, man. I mean, my guess would be, like, we're uploading MP3s. Now, that's gonna go straight through the app to R2. Just a DDoS? I mean, what is it?

  126. Gerhard Lazu

    I don't know. Yeah, so it is a DDoS, but it's specifically downloading MP3 files or starting to download MP3 files, but never finishing.

  127. Jerod Santo

    Hanging.

  128. Gerhard Lazu

    Right, so you get all these requests for MP3 files, for large files; Varnish is going and fetching them as quickly as it can, pulling all this data in so it has it in memory, but the client is never around long enough. Exactly: they basically abort, but Varnish is still pulling in all the data. Now, there is a property. It's called beresp.do_stream = true. What this does, it tells Varnish not to buffer the entire backend response if the client is slow, right? So I'm not going to fetch the entire MP3 file if you only want the first minute or two, or a range, or something like that. Now, this is on by default. So by default, that's how Varnish behaves, so we wouldn't need to enable it. But if the object is uncacheable, it cannot be stored in cache. You see where I'm going with this? Memory: you can't store it in memory. So you keep pulling these files over and over again, and maybe even just fragments of them. So even though the client never receives them, you may be pulling hundreds of files, and the client just goes away. So you're not pulling the entire file, but you're still pulling enough, and not able to fit it anywhere, and it just becomes a mess.
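In VCL terms, the pattern looks something like this; a sketch of the general idea, not necessarily the exact code in the pipeline repo. It assumes a file storage backend named "disk" was declared at varnishd startup (see the sketch a little further down), and the backend here is a dummy so the file loads on its own:

```vcl
vcl 4.1;

# Dummy backend, just so this snippet is loadable standalone.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_backend_response {
    # On by default in modern Varnish: stream the backend response to
    # the client as it arrives instead of buffering the whole object.
    set beresp.do_stream = true;

    # Keep the large audio files out of the RAM cache by pointing them
    # at the pre-allocated file storage.
    if (bereq.url ~ "\.mp3$") {
        set beresp.storage = storage.disk;
    }
}
```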

  129. Jerod Santo

    This reminds me of the 90s when you used to go jean shopping, right? And you'd go into-

  130. Gerhard Lazu

    Tell us.

  131. Jerod Santo

    Do tell us. I'd be in, like, Abercrombie & Fitch, which I would never shop at,

  132. Gerhard Lazu

    but let's just imagine I did, right? I'd go in there and be like, I like all these jeans, get them all. I'm trying them all on and I just bounced.

  133. Jerod Santo

    Yeah, the person goes to collect them all. They come back and you're not there.

  134. Gerhard Lazu

    Here's the dressing room full of jeans and Adam's gone. Bye-bye, see you.

  135. Jerod Santo

    This really sounds like you're speaking from experience. Was this like a, was this a prank?

  136. Gerhard Lazu

    I just made it up just now, just creative like that. On the fly, creativity. That's a good one. That's a good one. On the fly, yes. So-

  137. Jerod Santo

    It is on the fly.

  138. Gerhard Lazu

    There it is. Onthefly.io, boom.

  139. Jerod Santo

    Well, what could we do then? What's going on here?

  140. Gerhard Lazu

    Exactly. So this was one of the things where I had to deep dive and understand what on earth is going on. Like, where do we store... like, what's happening? So there's a lot, lot more that went into this pull request. It's pull request 44. I'm calling it the elephant in the room. I'm going to switch to the browser just to have a look at that. So the title of the pull request is Storing MP3 Files in the File Cache. But that's just the tip, right? The most obvious thing is: well, you either have lots and lots of memory to give Varnish, which honestly would be impractical, in the sense that it would be way too expensive to store all these files in memory, or, the next best thing, you have something like a file cache. And by the way, we're talking about open source Varnish. That's really important. Anyone can use this. Anyone can configure this. You can configure a file cache, which will basically pre-allocate a file on disk, and that's where these large files will be stored. Pull request 44, the one that we're looking at, is in the pipeline repository. That's what this adds. But there's significantly more stuff. Let me go... there are quite a few files. I highlighted a few, so I'm going to look at this one. So it's not just that. You also need to tune, for example, thread pools. You need to tune the minimum, the maximum. You need to tune the backend workspace, how many memory structures get allocated. You need to configure the nuke limit. And there's a couple more things that we had to go through just to make things stable. I just came to very quickly mention these things; you can go and have a look at the pull request to see what else went into it. So that was the one file. The other one was the regions. That's another thing. Not all regions would suffer from this. So you don't want to allocate too much memory or too much CPU to regions that maybe don't get a lot of traffic. And you would think that this is easy, but oh man, I have a surprise for you. You can't mix and match sizes easily in Fly. You can't say, create application groups, and this group will be the small group, and that group will be the big group, and this is just one application. It's not straightforward. So, again, this is how I solved it. Maybe someone listening to this will tell me, hey Gerhard, you're wrong. I would love to know that, seriously. The way I solved it is: we deploy in all the regions, because you specify the size once. You say, my starting size is the large instance type. It has a certain number of cores, a certain amount of memory, and by the way, the disk is the same in all of them, because that's another problem, so we will sidebar that, or put a pin in that. So when it comes to the initial deployment, you deploy the one size across all the application instances, and then you need to go and check which instances should be scaled down, so that the regions that don't need the capacity can just be brought down. And you do a rolling deploy, in that you replace one for one; you have plenty of capacity to handle the traffic while instances are being rolled, all that good stuff. But we have hot regions and we have cold regions, and there are quite a few things here. Again, if someone knows how to do this better, I would love to hear about that. And we have the TOML, we have the primary region, there's a couple of things here. We'll come back to services and HTTP services. That's a fun one.
    We'll leave that for a little bit later. In fly.just, we can see how we do the flyctl deploy. We disable HA because we want only one instance per region. We have 15 regions in total. We specify the CPUs, the memory, all that good stuff, including environment variables. Oh, that's another thing: we need to adjust the Varnish size based on the memory the instance has, right? We need to say, hey, Varnish, you get 70%. And that's the other thing that this does. Same thing for the file size. It can't take up the entire disk; we tell it, based on the disk that we provision, how much space it should use from the disk that gets created. There's a scaling there, so that's another good one. I'm going through the pull request. Is there anything else? Oh man, this was a pain. So recreating... like, writing tests for this. Everything is tested, in the sense of which requests would go... or basically, which files would get cached in the file store and which files would be cached in the memory store. So how do you write the tests? Some Varnish logging is included. You have to have anchors. There are quite a few things. So that's assets, backend.vtc. And part of this was a huge refactoring. So if you look at the lines of code, I wouldn't say it's that many: 1,500 were added and 1,470 were deleted. So not much changed; the net is 30 new lines. But there was a huge, massive refactoring as part of this. Again, this was, I think, two, three days of figuring it out, trying things, refactoring things. And if you think that an LLM can help you, well, you try this. It takes longer to go through those iterations; if you know what you're looking for, it tends to be easier. Anyway, it's very dense, very specific, very difficult to make sure that it's doing the right thing. But it's all there. We have the mock backends, we're reusing things. We split the VCLs; by the way, once you finish the splits, it's easier to reuse them. So there's quite a few things there. Now, this is Kaizen, so we are wondering what improved. After all this work, right, we rolled it out; what improved? And to answer this question, we need to figure out which region is the busiest one. So out of all the regions that we serve, we have 15 in total; which ones get the most traffic, those are the hot regions. We're looking at the Grafana dashboard for our Fly application, the current Pipedream instance. And we can see that SJC, San Jose, California, is a nice big red circle, which means it has the most traffic. And also NRT, which is Tokyo, apparently.
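As an aside before the region tour continues: pulled together, the knobs just listed live on the varnishd command line. A condensed sketch; the parameter names are real varnishd options, but the sizes, paths, and numbers are illustrative guesses, not the actual values from pull request 44:

```sh
# RAM cache for small/hot objects (the ~70%-of-instance-memory figure
# mentioned above), a pre-allocated file cache for the large MP3s, and
# the thread pool / backend workspace / nuke limit tuning.
varnishd \
  -f /etc/varnish/default.vcl \
  -s malloc,12G \
  -s disk=file,/var/cache/varnish/storage.bin,40G \
  -p thread_pool_min=100 \
  -p thread_pool_max=4000 \
  -p workspace_backend=256k \
  -p nuke_limit=1000
```

The `disk=file,...` storage is what the earlier VCL sketch points MP3s at.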

  141. Jerod Santo

    We're big in Japan.

  142. Gerhard Lazu

    Yeah, and Europe, there's quite a few, see if I'm going to pull this down a little bit. Let's see, no, I wanted to go here.

  143. Jerod Santo

    What about our new continent, are we big there?

  144. Gerhard Lazu

    The new continent, Australia?

  145. Jerod Santo

    Oh, there's a new one, there's a new, new one.

  146. Gerhard Lazu

    Well, what's it called? Which is a new, new one.

  147. Jerod Santo

    I don't know, there's a headline I heard. I thought you would get the joke. No. Over the holiday, there was speculation there was a new continent being announced.

  148. Gerhard Lazu

    Narnia?

  149. Jerod Santo

    Maybe, could have been Narnia.

  150. Gerhard Lazu

    No, no, no, so.

  151. Jerod Santo

    With the closet.

  152. Gerhard Lazu

    Right now, even this list, basically, if you think about it, kind of makes sense, right? It's U.S. East, U.S. West, Europe (we have quite a few instances in Europe, we have four, it's more geographically spread there), and we have Asia. So these are the big ones. Australia, Africa, and South America are not as busy. Those are the busy regions, cool. So which instance would you like us to have a look at? So I have a queue right here.

  153. Jerod Santo

    SJC, baby, let's go, let's go, baby.

  154. Gerhard Lazu

    SJC, baby, all right, let's see that. So I'm running flyctl ssh console. I'm using two flags: dash s, which is the short one for dash dash select; it'll prompt me which instance I want to select. And then I have dash C, capital C; it's different than lowercase c, they do different things. I give it the command to run. And it's varnishstat dash one, which will give me all the statistics from Varnish at that point in time, since this instance started running. I will select SJC. There you go. And it gives me all this data, which is all the counters that Varnish is incrementing; it's keeping track of different things: the origins, backends, the memory pool, the disk pool, the log counters. There's so much stuff. I'm really, really impressed by how many things Varnish has. So this is what we're going to do. We, because AI, right? We're going to copy all of this, and we're going to ask AI what it thinks about it, okay? There's just too much data here, so let's be serious about it. So, question to you: which is your favorite AI, Jerod? Which one do you use?
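For reference, the one-liner being described looks like this; the flags are exactly the ones mentioned, while the app name is a placeholder:

```sh
# -s / --select: prompt for which instance (region) to connect to.
# -C: run a single command instead of opening an interactive shell.
fly ssh console -s -a changelog-pipedream -C "varnishstat -1"
```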

  155. Jerod Santo

    Oh, I don't like any of them. I would probably start with Claude, and then I would go to Grok, and then I would go to ChatGPT third.

  156. Gerhard Lazu

    Okay, so Claude, which one, which version, which model?

  157. Jerod Santo

    Opus, man, give us the Opus.

  158. Gerhard Lazu

    Opus, okay, so we're looking at abacus.ai, something I've been using for a long, long time. It allows you, I'm only paying $10 per month for it, not sponsored, you know, not affiliated in any way. It's just something that I've picked for myself, and I can basically pick any model, and I can just run this. So I have something prepared, so I'm going to drop this. It's all the data, and we're going to read through something that I prepared ahead of time.

  159. Jerod Santo

    You pre-prompted this?

  160. Gerhard Lazu

    I pre-prompted this, exactly.

  161. Jerod Santo

    Okay, engineering this prompt for weeks.

  162. Gerhard Lazu

    Exactly. The prompt, not really, but... oh, that's a long prompt. So we're going to read it, and in the meantime, Adam will think about his favorite LLM to try, and I have mine, so we'll try three LLMs to see what they say. So I'm going to read the prompt now while everybody thinks about whatever LLM they would be using. "You are a Varnish 7 expert. You need to prepare four distinct responses and be explicit about the person that you're addressing. One, a seasoned sysadmin that has been living and breathing infrastructure for the last 20 years. Be precise, think deeply, and approach the setup from a hardware perspective. Two, an Elixir application developer that embraces Erlang's let it crash concept. You need to give it straight, give it fast, and keep it relevant to their application. It's the app and the nightly backends. Assets and feeds are important, but less relevant. Cloudflare R2. Three, the business person that is selling this thing. They care about costs, efficiency, and simplicity. Keep it high level and relevant for someone that doesn't care about the tech, but cares about the outcomes. And four, the audience of a podcast where this is being discussed. Make it general, relatable, and fun. Make analogies, keep it light and engaging." I have "fun" too many times. We don't want to make it too fun.

  163. Jerod Santo

    That's a lot of fun.

  164. Gerhard Lazu

    So yeah, that's one too many funs.

  165. Jerod Santo

    That's right, wow.

  166. Gerhard Lazu

    Now that you understand your audience, please analyze the following varnishstat output for SJC. Look, I already knew which one you would pick.

  167. Jerod Santo

    How did you know I'd go for the big one?

  168. Gerhard Lazu

    I have no idea. "Focus on things that work well, things that could be improved, and anything else that you find interesting. And by the way, ignore the synthetic requests." It will keep mentioning these. Like, I get so fed up with this. We have health checks that run every five seconds, so they are normal. So I'm going to copy this, I'm going to run this, and I'm also going to open a new window for Adam. So, which LLM should we pick, Adam? Which is your favorite?

  169. Jerod Santo

    You mean model?

  170. Gerhard Lazu

    Model, yeah. Which model?

  171. Jerod Santo

    We just used it, but I'd probably back up to, like, Codex.

  172. Gerhard Lazu

    Codex?

  173. Jerod Santo

    Which is, like, GPT-5 latest. GPT-5-1. 5-1, 5-2.

  174. Gerhard Lazu

    There you go. So GPT Codex. My favorite one is Gemini. So I'm going to drop it, and let's see how do they compare.

  175. Jerod Santo

    Okay, Gemini, you're in a different tab now, so Abacus can't do Gemini?

  176. Gerhard Lazu

    It might, but I have, like, my own Pro account. So that's something else. Like, I use Veo quite a bit, I use Nano Banana, quite a few things. Transcripts, it's all part of the package. So it can, but that's what I prefer. Cool, so: Claude Opus 4.5. For the seasoned sysadmin. This is you, this is you. This is me. This is me, exactly. Thank you for noticing, like, who's who.

  177. Jerod Santo

    I'm following.

  178. Gerhard Lazu

    So, what's working well? Rock solid stability. By the way, the instance has been running for 5.4 days. We had all these improvements shipped, and we are able to observe how our busiest instance works, and that's what this is, basically. That was... the window moved. Cool. So, after 5.4 days: zero child panics, zero crashes. Zero thread failures. This is important. It means no threads died, no threads had to be restarted. Everything is healthy on this instance. It didn't crash. So this instance didn't crash. Zero lock contention across all subsystems; your CPU cache lines are happy. Excellent hit ratio, 93%. We like that. We really like that. We have backend connection pooling with a two-to-one reuse ratio, and memory pressure is minimal: 132 LRU nukes in the last five days. So very few objects had to be removed from memory. Thread pools: 300 threads, zero queuing, zero drops. That's perfect. Areas to investigate: disk storage allocator failures. We have SMF disk c_fail events. We are hitting storage fragmentation. The disk is 97% full. We have 48 gigabytes used. That's how many MP3 files are stored. By the way, how many MP3 files total do you think we have?

  179. Jerod Santo

    Size or file count?

  180. Gerhard Lazu

    Size. Size.

  181. Jerod Santo

    Well, if we had 1,000 episodes at 100 megs each, which neither of those things are true, that'd be 100 gigs, right? So 100 is too big, but 1,000 is too small. I'm gonna say 80 gigs.

  182. Gerhard Lazu

    Adam, can you guess? That math checks out.

  183. Jerod Santo

    I was gonna say like a terabyte, but that's probably raw WAV files versus not, but.

  184. Gerhard Lazu

    All the files have to be stored in R2, and this includes all the assets, but we know that the MP3 files are the biggest. It's close to 250 gigabytes. We may have some duplicates. I don't know. I haven't checked, but that's how much data we have in R2.

  185. Jerod Santo

    Yeah, well, we also have Plus Plus for the last couple of years, which means every episode has two files, not just one. So that makes sense.

  186. Gerhard Lazu

    So we should go higher. Now, we use this in every single region. So maybe we want to reduce the number of regions,

  187. Jerod Santo

    but I think- Or even a third category, called super hot.

  188. Gerhard Lazu

    Super hot, yes.

  189. Jerod Santo

    Which is like SJC and Tokyo, right?

  190. Gerhard Lazu

    That's possible, yeah. There's four which we know are really, really hot. Yeah, yeah. But honestly, this is happening across multiple regions, and we'll get to some interesting things. So, okay. Synthetic responses, grace hits, all good. Next is for the Elixir developer, and I think this is you, Jerod. Do you wanna read it out?

  191. Jerod Santo

    Oh, well, the TL;DR is Varnish is doing its job. Your app backend is well-protected. You want me to read the whole thing?

  192. Gerhard Lazu

    If you want. I mean, how it's shielded.

  193. Jerod Santo

    It's 95% shielded. No failures, zero backend failures. That's because of, you know, my code doesn't really let it crash very often.

  194. Gerhard Lazu

    Exactly. Your code is, yeah, it crashes internally, not externally. That's right.

  195. Jerod Santo

    My thing is doing its thing. It is generating some uncacheable responses, but you know, we do have some that we just don't want to be cached. Ooh, one fetch failure, negligible. Yeah, I agree. You don't need to worry about that. And in the end, it says, whoever wrote this is really good at what they do.

  196. Gerhard Lazu

    I agree. That's exactly what it says.

  197. Jerod Santo

    And congratulations on such a great hire.

  198. Gerhard Lazu

    Yeah, I agree. I agree. I think the hire needs a promotion and a bonus. There you go. All right, for the business person, the caching layer is performing excellently. Adam, do you recognize yourself, or shall I continue with this?

  199. Jerod Santo

    You can read it.

  200. Gerhard Lazu

    93% of requests never touch your servers. Massive cost savings on compute. Do you know how many requests per second the application is serving? Like maximum, by the way. What's the maximum RPS for this amazing Elixir Phoenix application for the homepage?

  201. Jerod Santo

    Probably a lot. Gosh, thousands?

  202. Gerhard Lazu

    Tens of thousands? Maximum. Okay, Jerod?

  203. Jerod Santo

    100,000?

  204. Gerhard Lazu

    A database connection is involved. Concurrently, yes.

  205. Jerod Santo

    I don't know, I'd say not very many. To our homepage? I'd say, like, 12. 12 requests a second?

  206. Gerhard Lazu

    Yeah, 17.

  207. Jerod Santo

    17, I'm right in there, baby.

  208. Gerhard Lazu

    Some of those are cached, so 17 requests per second. So if all these requests were hitting the application, it would need so much compute to serve that. So much caching... well, obviously, we've removed all the caching. Now, we're joking about this, because we purposefully removed all the caching from the application. I remember that, a couple of years back, because we said this has no place in the application: the application gets restarted, we need to store this somewhere, we need to cluster. It was just really messy to handle it at that layer, which is why we introduced this. Five-plus days running without any issues. By the way, this was the last deploy, so maybe by the next Kaizen, if we do no more deploys, we'll be able to see how well it handles. Zero failures on the infrastructure side, and three terabytes of data served to users. Three terabytes. So in five days, this one instance served three terabytes, without your application servers breaking a sweat. Storage is getting full, so we basically need more storage. For the podcast audience...

  209. Jerod Santo

    Oh yeah, that's gonna be fun.

  210. Gerhard Lazu

    Imagine a really good receptionist at a busy office. This Varnish server is like having someone at the front desk who remembers everything. Out of a hundred people who walk in asking questions, 93 of them get their answers immediately from the receptionist, without ever bothering the experts in the back office. What's cool: it's been running for over five days straight without a coffee break or a single mistake. That sounds cruel to me, but let's go with it. It served three terabytes of data. That's like streaming about a thousand HD movies. This one instance streamed a thousand HD movies in five days. And the experts only had to answer 7% of the questions. The one quirk: the filing cabinet is getting full. It's like when your receptionist's desk drawers are stuffed and they occasionally have to throw away old notes to make room for new ones. Not a crisis, just time to get a bigger cabinet. Okay, I think that's the last of the fun stats, the 300 workers. That's good fun there.

  211. Jerod Santo

    Good job on that one.

  212. Gerhard Lazu

    Do we care about GPT or Gemini? We can only use one. We can only pick one.

  213. Jerod Santo

    Gemini's getting some good hotness. Let's check Gemini. We'll see how it adds up. Oh, it's still thinking.

  214. Gerhard Lazu

    Let's see. I think it's finished. Maybe that's, let me just close that. Did it finish? I think it did. All right, so let's go up. Slow thinking. I did like the thinking. Could have gone pro as well. Show thinking. Show thinking. Show, slow think.

  215. Jerod Santo

    I thought it said slow thinking. I was like, come on, speed up, Gemini.

  216. Gerhard Lazu

    Show thinking. There is quite a lot there. Anyway, we're not going to look into that. So: the instance has been up for 5.3 days. The MGT uptime, I like it; it's telling me which of that long list of counters is important. From a system perspective, the threading model is perfectly dialed in: 300 threads across two pools, with zero threads_limited and zero thread queue length. The kernel isn't context switching unnecessarily, and we aren't starving for execution units. However, we have a critical IO and storage allocation bottleneck in the SMF, the file stevedore. I'm not sure what "stevedore" means. That's a weird one.

  217. Jerod Santo

    I'm not sure Gemini knows what "stevedore" means either.

  218. Gerhard Lazu

    The problem is SMF disk c_fail is at 19,000-plus against the allocation requests, a 73% failure rate for disk allocations. In g_space, we have only 145 megabytes free, while we have 48 gigabytes used. You're effectively out of contiguous blocks, or hitting the ceiling of the pre-allocated file. That sounds right. Varnish is forced to fall back to SMA RAM, or simply not cache large objects. While SMA has 2.1 gigabytes free, the disk is thrashing. The hardware: if this is spinning rust, your IO is likely suffering. It's not, it's NVMe. So yeah, there are quite a few things there. Interesting: this is low, relatively, suggesting the lock contention on the shared memory log is negligible. Nice, we have low lock contention. The bottleneck is strictly at the storage layer. We basically just need a bigger disk. Elixir developer, let it crash proponent. I like it. Your app backend is rock solid. Well done, Jerod. Varnish hasn't seen a single backend fail or an unhealthy event in five days. The let it crash philosophy is working: if the app is crashing, it's recovering so fast Varnish doesn't even notice. I don't think it is. I think the application is good. The good news is we're seeing an effective 75% hit rate. Now this is different, right? We had 93% there. So I would like to understand the discrepancy. That's another reason why you should use more than one, by the way. And then obviously double-check the data, because sometimes they give you things which aren't true. The crash concern: the disk cache is crashing. Again, this is not for the application developer; I think this is for the sysadmin in me. I think... the straight talk: your MAIN.s_synth is high. So, we have health checks; those are the synthetic responses. This is, again, a Varnish thing. The business person: efficiency, how much traffic we're currently serving from Varnish. I think I know what's happened. I don't think it's taking into account the synthetic requests. Those should be removed from the total number of requests.

  219. Jerod Santo

    So you think Claude has the right number?

  220. Gerhard Lazu

    I think so, yeah. Yeah, I think so. This means for customers, we have cost efficiency. That's good. The risk, there's the bottom line. I think this was the fun part, but I think this one is the library analogy. I think we can stop it here.

  221. Jerod Santo

    So we've got the library analogy versus the secretary analogy.

  222. Gerhard Lazu

    I think that was a better one. I got a barista one. I thought it was like a very good one.

  223. Jerod Santo

    Oh yeah, for queuing or for what?

  224. Gerhard Lazu

    For queuing, yeah. Like, the barista analogy I thought was very good. This one is using books and whatnot. The library hasn't burned out. That is fun. That is fun. So I think the LLM is getting a bit funnier. The nightly feeds in the app are still humming along. Nice. So that's what we have. And that was only half the problem.

  225. Jerod Santo

    Well friends, this episode is brought to you by Squarespace, the all-in-one platform for building your online presence. Whether that's a portfolio, a consulting business, or finally shipping that side project landing page you've been meaning to do but never get to. Here's the thing. You mass-produce code on the daily. You deploy new services, new infrastructure, new hardware. You're versioning your APIs, you're serving all over the place. But when someone asks you about your own personal website, it's like, ah, I'm still working on it. Does that sound familiar? Squarespace exists so you don't have to treat your personal site like a weekend project that never ships. Pick a template, drag and drop your way to something that actually looks good, and move on with your life. No wrestling with CSS. No, I'll just build my own static site generator again. It's just done. If you do consulting or freelance work on the side, Squarespace handles the entire workflow. Showcase your services, let clients book time directly on your calendar, send professional invoices, and get paid online. It's the boring infrastructure that you don't want to build for yourself. And for those of you out there who are doing courses or gated content or educational stuff, tutorials, workshops, that intro-to-whatever series you keep talking about, you can set up a membership area with a paywall and start earning recurring revenue. Set your price, gate the content, and you're done. And they've also added Blueprint AI. This generates a custom site based on your industry, your goals, your style preferences. It's not gonna replace your design skills by any means, but it'll get you about 80% of the way there in about five minutes. And here's the call to action. This is what I want you to do. Go to squarespace.com slash changelog for a free trial. And when you're ready to launch, use our offer code changelog and save 10% off your first purchase of a website or a domain. Again, squarespace.com slash changelog.

  226. Gerhard Lazu

    That was only half the problem. So we're like at the midpoint.

  227. Jerod Santo

    I was feeling good. I feel like we had it all fixed. What else is the problem?

  228. Gerhard Lazu

    Oh, wow. This is like when all the fun begins. So you remember this, Jerod?

  229. Jerod Santo

    MP3 requests intermittently hang in Newark, New Jersey. This was our good friend, John Spurlock, who's been on the show before and is a podcast nerd. In fact, he runs op3.dev and other podcast nerdery things. And so he really knows his stuff. And so when he reports issues, I don't say, did you try rebooting? I take it seriously. So I shared it with you and he actually did some additional digging for us.

  230. Gerhard Lazu

    Go ahead. So in terms of you tested this, I think you had issues as well. So we've confirmed this for sure.

  231. Jerod Santo

    I did. Like, certain times, certain files; actually, it would be all requests at certain times. I assume that it was that particular POP, as we could call them, or pipe in the Pipedream, that was hanging, and then it would go away. And he actually had the same problem. He had a Friday night deploy of Friends and he was trying to listen to it on Friday. He couldn't get to it; by Saturday morning, he could get to it. So it's intermittent hanging, very difficult to diagnose, very difficult, I assume, to debug. And then it just comes back to normal. I thought it was maybe the out-of-memory thing, like it's just in some sort of fugue state until it reboots and then it works again. But you go ahead.

  232. Gerhard Lazu

    That's what I thought. That's why I did a deep dive on this. This was November, end of November. So in November I was just trying to figure out what on earth was going on, just from the side, since I didn't have too much time. But if you look at this response, there's quite a few things there. This is my initial one, an investigation trying to understand what's happening. Giving a couple of debug headers, a couple of extra headers that the request can be made of, sorry, can be run with, so we just get a bit more detail. Forcing regions as well. So there's quite a few things there. I was checking into that. This is Don McKinnon. He also had issues today. So he pasted some results. So thank you, thank you, Don, for adding this. This was helpful. So, I'm still scrolling, I'm still scrolling. There we go. Super helpful. I have confirmed, so I've confirmed that the requests have been hanging. You are getting the hangs this afternoon as well. This was only three weeks ago. So this has been going on for a while. I dug deeper and I found the problem. The problem was that in the fly config, we had the concurrency set to connections, not requests. So it's possible to configure an application, again, you're configuring the fly proxy that sits in front of the application, to limit how much traffic hits your application. So requests: how many requests should the fly proxy forward to your application before it stops, because you don't want it to get overloaded. So before it starts throttling, it starts slowing clients down, and then that's when you start seeing fly edge errors. Connections you would use for something that has long-running connections, like a database, for example. In our case, it's not a database, it's an HTTP application, so requests would have been the right concurrency. I have no idea why I picked connections. It was the wrong one. But the effect was, as you can see here, we had 2,700 long-running connections on that edge, so on that region. In this case it was, I think, the Newark one, EWR, right? So EWR had all these connections opened. The clients weren't closing the connections. The proxy was full. No more connections could be forwarded to the application. Long-running connections are usually clients which are not doing the right thing. Right, you shouldn't have that many long-running connections. So the problem was a misconfiguration on our side, which meant that slow connections, long-running connections, were basically blocking other connections from coming through. So that was the problem there. And I thought that was it. But, but there was more. So there was a last comment last week. We now have a check that runs every hour. And what was interesting, and I'll talk about the check as well, we had response bodies timing out in two regions. So 13 regions were fine, but even after this configuration, there were two regions, IAD and EWR, where when we were using HTTP/2, and for some reason this is important, when we were using HTTP/2, the fly proxy would see this, and it would not forward the connection correctly. As in, it would start, it would serve the response, like we could see the headers coming back from our instances; what we wouldn't get is the body. So the body would always be zero bytes served. And we could see this happening. 
    We could see the connections that, by the way, were open; they shouldn't have been open, because the application changed, so these connections should have been dropped. There was something not quite right. My suspicion is with the fly proxy layer, because when we were forcing HTTP/1, everything was working fine. And by the way, the fly proxy, when it talks to our Varnish instance, is using HTTP/1, and you can see that in the headers. So the proxy to the Varnish was fine, but the client to the proxy was not fine. And HTTP/2 is a very complex protocol. There are so many things which just don't work the way people would expect. So anyway, the issue fixed itself. That's the important thing. So opening this-
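
    As a sketch, the concurrency fix being described would look something like this in fly.toml; the port and limit values here are placeholders, not Pipedream's real numbers:

    ```toml
    # fly.toml -- for an HTTP app, concurrency should count requests,
    # not connections; "connections" is meant for long-lived protocols
    # like database connections.
    [http_service]
      internal_port = 9000          # placeholder port

      [http_service.concurrency]
        type       = "requests"     # was "connections" -- the bug
        soft_limit = 200            # placeholder: proxy starts easing off here
        hard_limit = 250            # placeholder: proxy stops forwarding here
    ```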

  233. Jerod Santo

    Not super satisfying.

  234. Gerhard Lazu

    Yeah, that was very nice to see. And there was something, myelorus, how would you read this? My-

  235. Jerod Santo

    Myelorus.

  236. Gerhard Lazu

    Elorus, there you go. So there's someone on the fly community forum that was very helpful. They noticed that we had a misconfiguration in our fly.toml. We were using services as well as http_service. And this is bad, by the way, this is very, very bad. So everything was happy. Like, we could push this config, the applications were running, everything was fine. But because we had these two things together, it was apparently creating some issues. And all we did was explicitly set the idle timeout. And the idle timeout, that's the one where if, after 60 seconds, the connection isn't doing anything, it will be forcefully terminated by the proxy. So that part was important. So anyway, we made the change, we pushed the change, but even before we pushed the change, the proxy started behaving. And now there's pull request 49: we right-sized it, we made a few changes. I captured all the details, the configuration, the commands; it's all there if you want to read it. But most importantly, now we have a check that runs against all regions every hour on the hour. CI/CD is using Hurl. And what I'm thinking is, shall we try running that locally to see how it behaves? Because that's how it started; I was running it locally. So on the left-hand side, I'm back in the terminal. On the left-hand side, I am monitoring my internet connection. Remember that Christmas tree? This is related to that Christmas tree. So I'm at the top of the Christmas tree. I'm at the gateway, the core router. It's a MikroTik CCR2004. Pretty good, 10 gigabits per second maximum. Now my internet connection isn't 10 gigabits, but it's 2.5, which is plenty for this test. So every second it's showing me how many packets and how many bits we're receiving and transmitting. Okay, and again, we are recording, everything's happening live, so you can see it jumping, right, as we're pushing more data to Riverside. Cool, so I'm going to run now just check. It's one of the commands, the just commands, that we have in the pipely repository, and all check does is run Hurl with a couple of flags. It downloads an MP3 file. It downloads feeds. It basically connects to all the different backends and sees how quickly it can get data back. The transfer is quick, about eight seconds. I'm going to run it again; as I run this, pay attention to the left-hand side. It will go to 120 megabits per second. So that's that MP3 file being downloaded. So every single time this runs, a full MP3 file gets downloaded, alongside a few other things. Okay, I can open the reports. We're not going to look into that, because we're going to run something more interesting now. We do check all, and what check all does is run the same command against all the regions. I'm at 2.3 gigabits per second. We're downloading all the files. We can see the responses coming back. EWR just sped by, IAD sped by, so all the different endpoints are returning. Now I'm based in London; obviously, the further away you are, the slower it gets. So for example, this was South America, that's LAX. So a couple of instances are slower to respond. And all this happens via headers. When you connect to Fly, you can tell it, hey, I want to connect to a specific region, and that's what routes the request to that region. That's cool. And again, it's all captured in that pull request, and you can see what it looks like. The check all one, Johannesburg, that's usually slow. And the slowest one is Tokyo for me. 
    Sydney as well can be slow. So we still haven't received the responses from there. We should get those shortly. You can see I'm pulling now 50 megabits, 20 megabits. It's just slowing down. And it's just the connections between here and there. The last one, there we go, is Tokyo. In 60 seconds, I pulled about two gigs, roughly. It's a lot of data that gets pulled down, the feeds and all of that. And anyone can run it. I would recommend you not run this, because we have to pay for this bandwidth, but our CI runs it just to make sure that everything works. And it runs every hour; I think I'm going to tune that down. You can see there were no more connections hanging. So we got to the bottom of that as well.
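
    The per-region check being demoed could be approximated with a Hurl file like the sketch below. The URL, filename, region value, and assertions are assumptions for illustration, not pipely's actual check; Fly's proxy does accept a Fly-Prefer-Region request header for this kind of routing:

    ```hurl
    # check.hurl -- ask the Fly proxy to route us to one region, then
    # confirm we get a full MP3 body back, not just headers.
    GET https://cdn.changelog.com/uploads/podcast/456/episode.mp3
    Fly-Prefer-Region: ewr

    HTTP 200
    [Asserts]
    header "Content-Type" == "audio/mpeg"
    # ~100 MB: the whole file came through, not a stalled body.
    bytes count > 100000000
    ```

    Run with something like `hurl --test --max-time 100 check.hurl`; the 100-second cap mirrors the per-file timeout mentioned below.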

  237. Jerod Santo

    If it ever comes back, because it went away on its own. If it comes back on its own, we're going to know about it.

  238. Gerhard Lazu

    Exactly. Now we have a system that is able to inform us when there's a problem. So let's go to three. We're on page number three. This one, for example, took more than five minutes, right? So sometimes when the connectivity is a bit slow, some regions can be slow, that's when you get these timeouts. So this is capped at five minutes. The last one that failed was a while ago. So you can see we're January 5th. There we go. There's one that failed January 4th. Check all instances. So let's see run, and we'll see exactly which region failed. Execution, NRT, that's Tokyo. And as you can see, we have a hundred seconds, right? So if after a hundred seconds, it doesn't download, it just times out. And we were pulling data, but it didn't finish downloading the entire MP3. And we're downloading a hundred and something megabytes.

  239. Jerod Santo

    Very cool. So, I mean, not cool that it didn't finish, but cool that that was a while ago. And we can actually test this. Now, do we need to be doing such a large file? Is that part of the test? Or could we test a smaller file and still get the same results?

  240. Gerhard Lazu

    We could, yes. This was the file that was reported, so we'd need to find a smaller MP3 file. Absolutely, I think we can also reduce the frequency. We don't have to run it every hour. That was always in preparation for this conversation.

  241. Jerod Santo

    What about episode four, five, six?

  242. Gerhard Lazu

    That's coming up. That's the deepest rabbit hole, so I'm leaving that for last. That's coming, Adam.

  243. Jerod Santo

    One thing I suggested though in our Zulip, but I don't think this is, I didn't check to see if this is even a thing, but to validate, you know, if the fly CLI could validate the TOML file for you. Because you could have checked the TOML file for syntax errors, or just do's and don'ts, essentially. And it didn't.

  244. Gerhard Lazu

    It does have a validation subcommand. Syntactically it's correct, the config is valid. I mean, it was applied, but because it combines two things, it shouldn't be. So at least I would expect a warning, like, hey, you're using both services and http_service.

  245. Jerod Santo

    Yeah, validate syntax and validate, you know, expected, true TOML file config: don't combine or conflate two values, or overwrite one, that kind of thing. That's how I would defensively do something like that in a CLI, to protect my user from a poor config. They could have just not been holding it wrong for so long.

  246. Gerhard Lazu

    Yep. I agree. So it's the impact of that configuration indeed. Yeah. So this is something we can see, again, in the same logs. We can see this one here would go to 15 megabytes per second; that's about 120 megabits. When you have these peaks, when you see this in the Fly metrics, it's usually when the benchmarks run or when the checks run, because they put significant pressure on the instances, and we can see them and pick them up straight away. So that's what this is. All right. So remember this guy? This guy was saying, March 29th, so it's almost two years ago, when this guy was saying: we will run into all sorts of issues that we end up sinking all kinds of time into. So this guy had a good hunch. This is Jerod, March 29th. And we just went through a couple of examples of issues that we had to deal with as part of this. But because of this, we understand the traffic, and we understand how the application behaves and the backends behave at a very deep level. So you're right, Jerod. We did sink all sorts of time into it. How many lines? Let's see how many lines we have now. So how many lines?

  247. Jerod Santo

    You wanted 20 lines.

  248. Gerhard Lazu

    590 lines. 590 lines we have in total in the Varnish config. It's more than 20 lines. By the way, we have the roadmap to 2.0. This is 1.0 that we tagged and shipped. It solved a lot of issues. But that was the easy stuff. Okay, so for everyone that stuck with us, something really good is coming up. And Adam was already mentioning it: episode four, five, six. There's something special about episode four, five, six. So what is special about it? What stands out to you, Jerod?

  249. Jerod Santo

    Oh, it's just getting rocked with downloads.

  250. Gerhard Lazu

    So episode four, five, six, OAuth, it's complicated. By the way, this was recorded in 2021. It was published in August 2021. For some reason, it's been downloaded a lot in recent months. It has over 1 million downloads. This is the most popular episode on the changelog ever.

  251. Jerod Santo

    The most downloaded episode.

  252. Gerhard Lazu

    It's crazy. It's crazy.

  253. Jerod Santo

    Oh, so you guys looked into this.

  254. Gerhard Lazu

    We did, yes. We dug into this.

  255. Jerod Santo

    Okay, I didn't know you guys were doing this.

  256. Gerhard Lazu

    So we just had a quick look to understand what is happening here. So we have Honeycomb opened up. Remember, every single request which comes through Pipedream, through pipely, every single request, we send to Honeycomb. We're able to look at it. This is the last 60 days. And I have filtering done in such a way that I'm only looking at this one file. How many times has this file been downloaded in the last two months? And you can see the peaks, right? And by the way, this is gigabytes, and the period is four hours. So we are peaking at about, well, actually, this peak was here: we had 200, almost 300... 400. Anyway, close to 400 gigabytes in a four-hour period.

  257. Jerod Santo

    That's just too much.

  258. Gerhard Lazu

    Or I think so. It's just too much. I know this is a great episode, great conversation.

  259. Jerod Santo

    But who- I remember that conversation. It was good.

  260. Gerhard Lazu

    Like who is downloading this file 400, I don't know, times or actually more than 400 times every four hours consistently for months on end.

  261. Jerod Santo

    And- Super fan.

  262. Gerhard Lazu

    Super fan. So we can see the different regions. Now, this is spread across the entire world. It's not just one region. This is really, really big. If there was a DDoS attack, I think this is what it would look like. And in the last six months, sorry, in the last two months, 60 days, we served 30 terabytes in San Jose, California alone. In Tokyo, we served 15 terabytes. This is a big number. And if you look in this column, the distinct IPs, the client IPs, we had over 10,000 IPs downloading this file. So this is not one or two IPs. This is thousands and thousands of IPs which keep downloading this file over and over and over again. So I don't know how we would block 10,000 IPs. Right? The VCL would be crazy.

  263. Jerod Santo

    Well, that episode was starring Aaron Parecki, who is a very talented person. And he is the co-founder of Indie Web Camp and a big fan of the Indie Web, as well as OAuth, obviously. So my hunch is Aaron's very interested in being the most downloaded episode ever. And he controls a fleet of machines from all around the world. And he points them wherever he wishes. And he thinks, you know what I'm gonna do? I'm gonna get to the number one spot on these guys' download charts. And so I'm thinking Aaron Parecki is, you know, the man with the mask on, we pull a mask off. That's him this whole time. What do you think, Gerhard?

  264. Gerhard Lazu

    I think that we need to speak, see, I don't wanna say the specific language. I think we need to go to Asia. I think we need to visit a couple of cities in Asia.

  265. Jerod Santo

    Okay.

  266. Gerhard Lazu

    Find the IPs which are responsible for this, because this is a crazy amount of traffic. Asia, it just so happens, if we look at it, so Asia is basically the continent where we are getting the most downloads from, because of this one episode. And this is actual traffic being served. These are not just HEAD requests or GET requests; these are bytes being sent to thousands and thousands of machines in Asia every single hour. So whoever's doing this, please stop.

  267. Jerod Santo

    Please stop. It's a cycle. So we need to, like, knock on doors. We need to go over there and knock on some doors and say, excuse me, is this IP address at this home? And then they might say yes, and we say, would you please stop? What's going on over here? What could they possibly benefit from this? Like, what could they be getting?

  268. Gerhard Lazu

    Maybe, maybe we're the speed test. Someone is using us to speed test their connection. Who knows?

  269. Jerod Santo

    Yeah, maybe.

  270. Gerhard Lazu

    That's the only thing I can imagine.

  271. Jerod Santo

    But that's a lot of IP addresses.

  272. Gerhard Lazu

    It is.

  273. Jerod Santo

    And it's across multiple regions. Which?

  274. Gerhard Lazu

    Multiple data centers, yes. So multiple regions, fly regions are serving these IPs, yes. They're all coming from Asia, by the way. Again, I don't want to mention any names, because there's no bad guys here. We just want to assume that someone left the oven on.

  275. Jerod Santo

    I don't know, man. It's like the blinker on when you're driving. It's like, hey, you're not turning. It's time to turn that blinker off.

  276. Gerhard Lazu

    So the way I can see us mitigating this, and this is a hard problem because of the number of IPs which are hitting us, is we can basically start blocking entire net blocks, entire network blocks. Unfortunately, some genuine listeners might be caught in this, and basically the changelog will not be available, or at least the MP3s will not be available, to a portion of users. The other one is, obviously, and we should, this is like the next problem, we should enable some throttling, because there's more stuff happening here. So we don't have any sort of throttling. We assume fairness, we're assuming goodwill, we're assuming decency, and we're not seeing that here.
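
    For context, the net-block option is plain VCL: an ACL matched in vcl_recv. A minimal sketch, with documentation ranges standing in for any real offenders:

    ```vcl
    # Hypothetical example ranges only -- not actual offenders.
    acl blocked_nets {
        "203.0.113.0"/24;    # TEST-NET-3, placeholder
        "198.51.100.0"/24;   # TEST-NET-2, placeholder
    }

    sub vcl_recv {
        if (client.ip ~ blocked_nets) {
            return (synth(403, "Forbidden"));
        }
    }
    ```

    The catch is exactly what's described above: with thousands of shifting IPs, that ACL never ends, and any genuine listener inside a blocked range gets cut off with it.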

  277. Jerod Santo

    That's the internet.

  278. Gerhard Lazu

    So to be honest, whoever's doing this, it's not LLMs. Look, we have that problem as well, but in this case, it's not LLMs. This is something completely different. So my hope is that someone that listens to this episode, maybe we put this in the intro: whoever's downloading episode four, five, six, please stop, because otherwise we'll need to take the next step. I know it's a bit of a cat-and-mouse game, but that's what will need to happen, because we need to pay for this bandwidth.

  279. Jerod Santo

    This is only Varnish, right? This is only the cache layer where this is happening?

  280. Gerhard Lazu

    This is only the cache layer, yes, yep.

  281. Jerod Santo

    And so what mechanisms are in Varnish to do throttling or rate limiting, or anything like that whatsoever?

  282. Gerhard Lazu

    There's VMODs, which are basically modules that Varnish loads that give it extra functionality. One such VMOD, and I've looked at this, it is free and open source, is the throttle VMOD. Now, that means that we need to start keeping track of IPs, and it will use a bit more memory. That's okay, we have more memory. And then we need to start basically applying limits to how many downloads specific IPs can do. And we can limit it to MP3 files only. So if we have a bot, or if we have, for example, an RSS aggregator or something like that, we're okay serving those requests, because again, that's what Varnish is meant to do. The problem here is that we're serving a lot of bytes for MP3s, the same MP3, and that cannot be real traffic.
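
    The module being described sounds like vmod-vsthrottle from the varnish-modules collection, which keeps an in-memory bucket per key, hence the extra memory for tracking IPs. A minimal sketch, with placeholder limits:

    ```vcl
    vcl 4.1;

    import vsthrottle;

    sub vcl_recv {
        # Only rate-limit MP3 downloads; feeds, pages, and RSS
        # aggregators stay untouched. Placeholder policy: more than
        # 10 MP3 requests per minute from one client identity (the
        # client IP by default) earns a 429 and a 5-minute block.
        if (req.url ~ "\.mp3$" &&
            vsthrottle.is_denied(client.identity, 10, 60s, 300s)) {
            return (synth(429, "Too Many Requests"));
        }
    }
    ```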

  283. Jerod Santo

    Yeah, I mean, even in this case, you could tie it potentially just to this MP3, like you just said, which is not an all-MP3s scenario. If you request this MP3 with this kind of request signature, of X per whatever... I mean, I didn't examine the actual signature of the requests, but that's how I'd probably investigate it: begin to isolate. Does that require us to write a lot of defensive code against that kind of scenario?

  284. Gerhard Lazu

    I don't think so, it's just system configuration. We just need to add more configuration. And back to Jerod's point, we're just chasing new problems now that we didn't even think we would have. But we have what looks to me like an actor that's not very, I want to say this in a nice way, an unfriendly actor that is not very happy. And they are very angrily downloading our MP3 over and over and over again, thousands of times, across thousands of IPs. And this is not cool, because ultimately we end up paying for this bandwidth and it is not helping anyone. But that's one, and it's not the only one. So we have one more. You can see here, for example, this is the last seven days: we have seven terabytes that were transferred in the last seven days. Seven terabytes? Maybe it's more than that. It needs to be more, and, did I actually... that geo code does not exist. I was expecting to see more than that. Anyway, Asia is the one where we can see that pattern. But we also have spikes in Europe sometimes, and it's this spike which I wanted to focus on. We know that someone that connects to Frankfurt downloaded the static favicon 170,000 times in the span of, I don't know, an hour or two; so they downloaded this over two, three hours. So it's requests like this that are putting stress on these instances. And potentially, that was a pass request as well, which means it went past the cache, which means they must have had a cookie set or something like that that was basically preventing the cache from working in this case, which, again, is how it's supposed to work. So anyway, that was, unfortunately, not the best thing that we could have ended on, but it's a thing, and it's food for thought: more work to be done. There are many things that we didn't get to talk about, that we didn't have time for. For example, we didn't talk about the nightly. By the way, nightly is now being served by Pipedream as well. And the reason why we had to do this is because it sometimes would get scraped, would get hit really heavily. It's a very small app, it's Nginx. But if I open it, so let's just click on that one, and that's pull request 46: before, it was basically topping out at 141 requests per second. Now it's 1,300, so it's almost a 10x, an order of magnitude faster. The latency went way, way down. And the only thing we had to do was basically put Varnish in front of it.

  285. Jerod Santo

    Nice, well, that was nice.

  286. Gerhard Lazu

    Yeah, that's one more thing there. And you can go and have a look at how it works. There's a benchmark here, a small benchmark. That's it. We have one last one for the road. But before we do that, anything else we want to talk about before I share one last thought?

  287. Jerod Santo

    I suppose, what do we do? You know, if we know these downloads are happening, and we're here on the podcast just politely asking them to stop, do we just let it keep happening?

  288. Gerhard Lazu

    Well, we could set up some sort of throttling. I think it would be the easiest thing. Now, it will impact everyone. I don't want to start blocking, again, IP ranges, net blocks, because we don't know who's going to be caught in there. They may change to other IP blocks. So that's entirely possible. We don't know how this will work. We can't block an entire country, an entire continent, especially if it's a big one. I don't think that's reasonable. So really, throttling is, I think, the fairest thing. And then we can throttle MP3s specifically. Because we do have, for example, I see them, like, we have a Python client and a Go client that every week come and download all our MP3s. I don't know why they do that, but every seven days, they basically request every single MP3 that we have. So they're scraping the website and then pulling everything down. I don't know why.

  289. Jerod Santo

    Yeah.

  290. Gerhard Lazu

    Again, the more I was looking at it, and again, because I was working so deep in this, I started noticing these behaviors that you would normally not see. So it's one of the advantages, I suppose, of working so close with the traffic, with all the requests, and having this level of understanding and visibility into every single request. So it really helps. Down to the IP level.

  291. Jerod Santo

    Something like that, the Go client and the Python client, where would you, would that be a Honeycomb thing? Where would that be?

  292. Gerhard Lazu

    Yeah, it's Honeycomb, yeah. You can filter by user agent, for example. And you can see that there'll be, for example... let's say, no, I don't want to show any IPs or anything like that, so that's why I'm not going to screen share that. But once we start digging into that, you can say group by client, by user agent, and you can say filter by MP3s, so, like, URL contains MP3. And that will be able to group. And you can say, oh, and by the way, only show me where there's more than, for example, 100 downloads. And then you'll start seeing the outliers, which are the clients that are downloading certain MP3s, or MP3s in general, excessively. Now, that can be spoofed. That's the other thing. We have, for example, the request agent, like, the user agent, it's empty. It's an empty string. That also happens, right? Because you don't have to send the header if you don't want to.
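
    Expressed against Honeycomb's Query API, that kind of noisy-MP3-clients question might look roughly like the sketch below; the column names (user_agent, url) are guesses at the event schema, not pipely's actual fields:

    ```json
    {
      "time_range": 604800,
      "breakdowns": ["user_agent"],
      "calculations": [{ "op": "COUNT" }],
      "filters": [
        { "column": "url", "op": "contains", "value": ".mp3" }
      ],
      "havings": [
        { "calculate_op": "COUNT", "op": ">", "value": 100 }
      ],
      "orders": [{ "op": "COUNT", "order": "descending" }]
    }
    ```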

  293. Jerod Santo

    Yeah, you can also send whatever you want to.

  294. Gerhard Lazu

    So that can be spoofed.

  295. Jerod Santo

    Yeah. It's like, whenever you build systems like this, and then even when you observe them, I guess you don't expect, well, that's what I originally thought, but you kind of hope that clients, AKA people, behave. You know, they're going to use the system for the system's purpose, not to, once a week, download and scrape the entire thing. And I mean, in that case, somebody could have their own web archiver, and they could have altruistic reasons for it. I think that's kind of silly, but, you know, once per week downloading the entire contents onto somebody's disk seems like: I want your thing, I want to keep getting your thing, and if it ever changes, I want to make sure I have that snapshot. I don't understand it. It doesn't make any sense. Like, what would make anybody do that? What is the purpose and motivation to keep doing that? To even commit the compute or the script or the time to do that? Like, what are they getting from it? I don't know.

  296. Gerhard Lazu

    We need to go over there and knock on some doors, man. I'm going to ask them. Yeah. Why are you doing this?

  297. Jerod Santo

    Every door in Asia, do you listen to the changelog? Yes, I do. How many times?

  298. Gerhard Lazu

    Yeah. Tell us about four, five, six. You know what four, five, six means, don't you? Yeah, it's just, see, this is how, so this is, I think, a really delicate and a really important point to discuss because this is how good systems become bad systems.

  299. Jerod Santo

    It's true, yeah. You have to treat everybody bad.

  300. Gerhard Lazu

    Exactly. Like, we don't want to be doing this, but we are forced to do something against something which isn't good. So it's not benefiting anyone, and we have to step in and do something about it. Now we have to do it. I was expecting this to stop, but it's still going, even to this day. We made Varnish, I mean, now that it's stable, able to serve more traffic. We just had the biggest spike, because now the system is more stable, but it means that bad actors, again, I shouldn't be using that, unhappy people, unhappy clients,

  301. Jerod Santo

    unhappy clients, yeah. The only person you can offend is the one who's doing this and I'm fine with it.

  302. Gerhard Lazu

    Yeah.

  303. Jerod Santo

    They need to knock it off. Here's a, this might be a cudgel, but if we're trying to solve the problem of them taking our bandwidth for something that's no longer relevant or interesting, and it's been out there for years, what if we could just toggle certain episodes? And this might be a cat-and-mouse game as well, but at a certain point it's like, well, just give them the R2 URL and not the CDN URL, and just let Cloudflare deal with it. You know, let them download directly from Cloudflare, and we're just out of the equation then. We don't care about the stats, we don't care about anything. We've served this file plenty of times through our CDN; now we're going to just let R2 serve it. What do you think about that idea?

  304. Gerhard Lazu

    I can see this being a very simple fix for this specific episode, right? Because we can just serve basically a Location header, and we just do a redirect, and that's it, we're done with it. So it'd be like another synthetic response.
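
    Sketched in VCL, that synthetic-response redirect could look like this; the path and the R2 hostname are placeholders, not the real URLs:

    ```vcl
    sub vcl_recv {
        # Placeholder path for the one hot episode.
        if (req.url ~ "^/uploads/podcast/456/") {
            return (synth(750));
        }
    }

    sub vcl_synth {
        if (resp.status == 750) {
            # Send the client straight to object storage; Cloudflare
            # serves the bytes and Varnish is out of the equation.
            set resp.status = 302;
            set resp.http.Location = "https://media.example.r2.dev" + req.url;
            return (deliver);
        }
    }
    ```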

  305. Jerod Santo

    We could do that. The question is if they're actually malicious, then they switch to a new episode and start doing that one.

  306. Gerhard Lazu

    Exactly, exactly. And we have other clients which are, for example, we've seen that with pass, where they're basically busting the cache and purposefully going to R2 directly, and Varnish just, almost, acts like a proxy in this case. So we have that as well. Every now and then we have this random client that comes and downloads all the episodes, and that's not the problem. So I think that some sort of a throttle would make sense, which would keep the system fair to everybody, but the throttle will need to be high enough so that it doesn't impact anyone else. Now, if, for example, our audience grows or we become more popular and we get more requests, obviously we would need to be aware of where the limit is and start increasing the limits, right? Once we are throttling too much, maybe. But that seems to be more long-term, and it seems a more, I don't know, well-engineered approach in a way. But certainly the simplest thing would be just, like, take this one URL. I mean, that could be done in minutes, roll it out. And then we would stop this abuse for this specific MP3. That would be the easiest thing for sure. So yeah, I can see how pragmatic that approach is, and I like the pragmatism.

  307. Jerod Santo

    Well, it's at least worth checking to see if, you know, the mouse is still alive over there, you know? And if they are, then we'll know that this is a cat and mouse game. But if it's just like somebody left the blinker on, we're just gonna turn their blinker off for them and see if it just, the problem goes away. And if it changes to a new MP3, then yeah, we need more generic solutions. We may not need that at all.

  308. Gerhard Lazu

    I do have to say that the internet these days is very different from the internet even a year ago. With the rise of LLMs and AIs, I'm starting to see patterns in our traffic which are unlike any other time. We have these very big spikes when a lot of data is being requested in very short periods of time, and the user agents, they don't make much sense. I mean, I know they're spoofed. There are many IPs which are being used. So it's almost like there's, I don't know, some system which wants a lot of our content and is doing silly things, because some requests just don't make sense. Like, for example, what benefit does the static favicon have? What's up with that? That just makes no sense.

  309. Jerod Santo

    It's a small file, maybe it's a heartbeat or a version of a heartbeat.

  310. Gerhard Lazu

    Maybe, but this is the first time I've seen this specific file being downloaded this many times, I haven't seen this before, which makes me think, is this a trend that we'll start seeing more and more requests that don't make sense? And then you start having to set up like some form of protection for all sorts of clients that are just doing the wrong thing.

  311. Jerod Santo

    You need like a defensive layer by default.

  312. Gerhard Lazu

    Exactly, exactly, yeah. And something that would be fair to regular clients. Like, for example, when I want to do a benchmark, I mean, sure, it's me, and I wouldn't want other people to do that, because I'm testing the system, making sure the real-world, the production system everywhere in the world is working correctly, and I'm aware of what that means and what it costs. And by the way, my IPs are removed from all the stats, because otherwise you'd see those massive benchmarks. So we account for that, but we can't account for all these weird clients. It's a challenge. I think it's a good one, but it just sets us up, you know, when you become older, it feels like this is an adult problem, right? So we got the thing barely working, we got it out there, we made it stable, reliable, all that; now we're hitting what feels like a new layer of problems, and this to me is a hint as to the next phase.

  313. Jerod Santo

    Oh, to be a kid again. Yeah, well, one positive thing, I think, is the robustness of our observability. Being able to have this visibility is great, because otherwise we're like, you know, wow, pat ourselves on the back, Aaron Parecki, let's get you back on the pod, because man, you are big all over the world. That's amazing. Those are some mega downloads, you know? So what's your one last thing for the road ahead? I agree.

  314. Gerhard Lazu

    What's my one last thing? So we'll keep it short, we'll keep it fun. I mentioned the Christmas tree, I mentioned the various things which I had going over the holidays. So makeitwork.club, that's the place. You're there, both of you are there, so you can join whenever you want. Next Thursday, yeah, next Thursday, I'm going to talk about the 100 gigabit WAN. The 100 gigabit WAN. So why would I need such a thing?

  315. Jerod Santo

    Smoking.

  316. Gerhard Lazu

    It's smoking, for sure. So I thought the CCR2004, it has four CPUs, it has multiple 10 gigabit SFP+ ports, it even has 24 SFP, sorry, SFP28 ports, but it doesn't have a switch chip. And people that know a little bit about hardware know you want a switch chip for the hardware offloading of L3 and even L4. So after I bought the CCR2004, it was almost like a Christmas present, I thought, surely this will be enough for the rest of my life. And no, I had to get the flagship. So I'll be talking about that, the LAN, the setup, quite a few things coming up. And it just goes to show how much I enjoy the hardware side of things as well, the networking side of things. Like, I shaved two milliseconds off my WAN, it's amazing. Little things like that. It was already good, it was already sub-five milliseconds, but I wanted sub-three milliseconds. It is now 2.4 milliseconds. So what does it mean, why would I do this? First of all, I'm all about improving, and every winter I improve the network. In this specific instance, I wanted the pages just to be snappier, things to load a lot quicker, to handle a bit more traffic, but also to not have an impact. Like, I was running that benchmark at 2.5. Look, I'm going to do another one right now, let's see. I have a speed test right here. Speed test, London, let's go for this one. So we're recording, we're streaming, right? And I'm just pulling 2.5, 2.6 gigabits down, and there's no interruption on my network, right? It's just my bread and butter, that's how I work. And by the way, if you see any buffering or any slowing down, let me know. I see Adam a bit more pixelated; maybe you can see him pixelated too, I don't know. But yeah, I just pulled six gigabytes, three down, three up, and it's just what I do every day. I work with this stuff, and yeah, I enjoy it. And by the way, this is the slower gateway router. I'm getting the proper one set up, and I'll talk about that. And there's so many things there; VLANing is quite a thing. I have a new IPv4 block, by the way. So some would say that I'm preparing for hosting something, and maybe I am, I don't know, we'll see how that works. But I just realized that my home connection, obviously, couldn't serve all the MP3s being downloaded; that would really cripple my connection if that was happening. But I'm at 2.5 gigabits, the next one will be five gigabits, and the hardware can do it. And five gigabits, I mean, that's like a decent server. If you can do five gigabits all day, every day, yeah, gigabits, gigabits per second, that's pretty decent. So I'm just waiting for more internet.

  317. Jerod Santo

    I was gonna say you're gonna have a hundred gigabit WAN, but you're not gonna have a connection for it, right?

  318. Gerhard Lazu

    Right, so very few places in the world have that. So if I was in Switzerland, I would get 25 gigabits.

  319. Jerod Santo

    Now, would you move? Would you move for this?

  320. Gerhard Lazu

    Of course. The only reason, the only reason to move is, yeah, the 25 gigabit connection. But I know that a hundred gig is coming, so we'll see. Either they ship it by the time I move, or I move and then they ship it. So it's one or the other. The important thing is I have the router to handle that. You'll be ready. You will be ready. Exactly. So I'm a prepper. I'm prepping for that.

  321. Jerod Santo

    Prepping for a good internet.

  322. Gerhard Lazu

    And interestingly, five years ago, when I got the previous router, I did the same thing. It's a forum post, it's a follow-up. So I just did a follow-up recently at this milestone. I've been at this for some number of years, and I like optimizing my network and making sure that it's in tip-top condition.

  323. Jerod Santo

    Relentless. I love it. So relentless. Good stuff, Gerhard. Well, that's a happy note to end on, right? That's a happy note to end on: observability and a hundred gigabit.

  324. Gerhard Lazu

    No.

  325. Jerod Santo

    There you go. All right. Well, the good news for Kaizen is we have a lot to work on.

  326. Gerhard Lazu

    Always, always.

  327. Jerod Santo

    So it seems.

  328. Gerhard Lazu

    Yeah. We know how to, we know how to pick them, don't we?

  329. Jerod Santo

    Oh my gosh. The rabbit hole goes deep and we keep going in. Kaizen. My friends. Kaizen. Kaizen. All right, Kaizen 22 is in the bag. Join the discussion in our Community Zulip. Head to changelog.com slash community to sign up for $0. And of course, check out all of Gerhard's passions at makeitwork.club. Thanks again to our partners at Fly.io and to our Beat Freakin' residents, Breakmaster Cylinder. Next week on the pod, news on Monday, Damian Tanner from Layer Code on Wednesday and Techno Tim catches Adam up with the state of home lab tech on Friday. Have a great weekend. Recommend us to a friend if you like the show and let's talk again real soon.