
I never really liked flamegraphs much but I am going to put that aside for a bit and try to be as objective as possible.

I don't find the use case presented here compelling. Cutting out the "yo we will save you $x billion in compute costs" pitch, the tools presented here seem to be…stacktraces for your kernels. Stacktraces that go from your Python code through the driver shim to the kernel and finally onto the GPU. Neat. I don't actually know very much about what Intel has in this area, so perhaps this is a step forward for them? If so, I will always applaud people figuring out how to piece together symbols and whatnot to make profiling work.

However, I am still not very impressed. Sure, there are some workloads where it is nice to know that 70% of your time is spent in some GEMM. But I think the real optimization doesn't look like that at all. For most "real" workloads, you already know the basics of how your kernels look and execute. Nobody is burning a million dollars an hour on a training run without knowing what each and every one of the important kernels is. Some of them were probably written by hand. Some might be written in higher-level PyTorch/Triton/JAX/whatever. Still others might be built on some general library. But the people who do this are not stupid, and they aren't going to be caught unawares when a random kernel suddenly pops up on their flamegraph. They should already know what is there. And most of these tools have debugging facilities to dump intermediate state in forms that tools understand. Often this is incomplete and buggy, I know. But it's there and people do use them.
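
For example, the kind of dump I mean, as a rough PyTorch sketch (assumes a CUDA box; the model and batch here are stand-ins for a real workload):

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(4096, 4096).cuda()
    batch = torch.randn(64, 4096, device="cuda")

    # Record host-side ops and the GPU kernels they launch
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(batch)
        torch.cuda.synchronize()

    # The GEMM dominates, as anyone running this already knew it would
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    prof.export_chrome_trace("trace.json")  # timeline for Perfetto / chrome://tracing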

What these people are optimizing are things that flamegraphs do not show. That's things like latency in kernel launches, or synchronization overhead with the host. It's global memory traffic and warp stalls. Sure, the tools to profile this are immature compared to what the hyperscalers have for CPUs, but they exist and are heavily used. I don't buy the argument that knowing your Python code calls a kernel through __cuda12_ioctl_whatever is actually helpful. This seems like a solution in search of a problem, or maybe a basic diagnostic tool at best.
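
To make that concrete, here is a toy PyTorch measurement of the split between launch cost, device time, and host sync (rough illustration, not a proper benchmark):

    import time
    import torch

    x = torch.randn(8192, 8192, device="cuda")
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    t0 = time.perf_counter()
    start.record()
    y = x @ x                 # launch is async; control returns to the host immediately
    end.record()
    t1 = time.perf_counter()  # host-side launch overhead ends roughly here
    torch.cuda.synchronize()  # the host pays the wait here, not at launch
    t2 = time.perf_counter()

    print(f"host launch overhead: {(t1 - t0) * 1e6:.0f} us")
    print(f"device kernel time:   {start.elapsed_time(end):.2f} ms")
    print(f"host wait at sync:    {(t2 - t1) * 1e3:.2f} ms")

A CPU-stack flamegraph folds all three of those into the same frames.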



> What these people are optimizing are things that flamegraphs do not show. That's things like latency in kernel launches, or synchronization overhead with the...

What OP is showing is an example of what can be shown on flamegraphs. They are a generic visualisation tool, so if you want to include latency or whatever (financial cost, maybe?) you are free to do so.
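
Concretely, the collapsed-stack input to flamegraph.pl is just "frames count" lines, and the count can be any additive metric: microseconds, bytes, dollars. A sketch (the kernel names and latencies below are invented):

    # invented stacks, weighted by launch latency in microseconds
    samples = [
        ("python;train_step;aten::mm;cuLaunchKernel;gemm_fp16", 1840),
        ("python;train_step;aten::copy_;cudaMemcpyAsync;HtoD",   620),
        ("python;train_step;allreduce;nccl;ring_kernel",         950),
    ]
    with open("latency.folded", "w") as f:
        for stack, us in samples:
            f.write(f"{stack} {us}\n")
    # then: flamegraph.pl --countname=us latency.folded > latency.svg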

As for the rest, Intel is here providing tools for developers who would like to optimize the software stacks on their platform. Invaluable if you would like to efficiently support non-NVIDIA hardware.


Flamegraphs categorically cannot represent timeseries data. That's not what they are designed to do and they don't have a way to display it.


That is not true; they definitely can represent some timeseries data in specific ways. But that's not even connected to what I said - I specifically mentioned latency, which can be included in profiling data. Or am I misunderstanding what you are trying to say?
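
For instance, one crude trick is to bucket samples by time window and make the window the root frame, so each slice of the run gets its own subtree (toy data, invented stacks):

    from collections import Counter

    samples = [  # (timestamp_s, stack)
        (0.2, "python;step;gemm"), (0.7, "python;step;gemm"),
        (1.1, "python;step;allreduce"), (1.6, "python;step;allreduce"),
    ]
    buckets = Counter()
    for ts, stack in samples:
        buckets[f"t={int(ts)}s;{stack}"] += 1
    for stack, n in sorted(buckets.items()):
        print(stack, n)  # collapsed-stack lines, feed to flamegraph.pl as usual

You lose ordering within a bucket, but the time structure survives.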


How would you indicate how long a kernel takes to launch in a flamegraph?



