This AWS outage has reminded me of, and bolstered my confidence in, the idea that there are practical limits on how well we can manage complexity.
As a codebase ages and its services grow in scale and scope, complexity increases. Developers know this. I don't believe you can linearly scale your support to accommodate the unknown unknowns that arise; I'm not even sure you can do it by scaling support exponentially. There is a minimum expected resolution time, set by your complexity, that you cannot go under.
I think the outages like this that have been resolved quickly are the exceptional cases; this is the norm we should expect.
Doesn't an increasingly LLM-centric codebase only compound this problem, under the assumption that people are lazily screening LLMs' output when using them?