I've been through plenty of operational outages. I don't want to be harsh, but this could have been written by one of my colleagues, the type of colleague I really wish I didn't work with. Console logging for hours? Randomly disabling things? Sometimes when you feel "imposter syndrome" you shouldn't ignore it, and maybe you should up your game a bit.
In fact, I have dealt with an extremely similar situation, where a bunch of calls to one of our APIs were failing silently, but only after they had taken card payment transactions. Dealing with the developers of this system was like pulling teeth. Once we got them to stop stammering and chipping in with their ideas (half a day into the ongoing issue), it took 10 minutes to find the culprit by simply going through the system task by task until we got to the failing one: confirmation emails were unable to send, so the API server failed the entire order despite payment having been taken.
This only required two things: knowledge of the system and a systematic approach to fault finding. You would think the developers would have at least the first, being the ones who wrote it, but sometimes even that is a big ask.
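For what it's worth, the task-by-task walk is simple enough to sketch in a few lines of Python. The step names and the simulated email failure below are made up for illustration and are not the actual system; the point is only that you run each stage in order and stop at the first one that breaks.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical order-processing steps, named for illustration only.
def take_payment(order: dict) -> None:
    order["paid"] = True

def reserve_stock(order: dict) -> None:
    order["reserved"] = True

def send_confirmation_email(order: dict) -> None:
    # Simulated fault standing in for the mail failure in the anecdote.
    raise ConnectionError("mail server unreachable")

def find_failing_step(
    steps: List[Callable[[dict], None]], order: dict
) -> Optional[Tuple[str, Exception]]:
    """Run each step in order; return (name, error) for the first failure."""
    for step in steps:
        try:
            step(order)
        except Exception as exc:
            return step.__name__, exc
    return None  # every step succeeded

STEPS = [take_payment, reserve_stock, send_confirmation_email]
order = {"id": 1}
result = find_failing_step(STEPS, order)
print(result)
```

Note that by the time the email step fails, the order is already marked as paid, which is the same shape of problem as above: a late, non-critical step taking down the whole order after the money has moved.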
Maybe I'm just burnt out from this industry and incompetent people but... come on... no excuses really.
And then I’d add: start by reading the error message. In his panic, he seems to have thought it was a red herring. Error messages are gold. They give you a concrete thing to work backwards from.