Hacker News

The $20 question: what can I do with this?


Multiply FP8 matrices, with FP32 scaling factors, producing a bfloat16 result matrix, on an NVIDIA Hopper or newer GPU.
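To make that concrete, here is a toy CPU emulation of the math (not the library itself, which runs on Hopper tensor cores): quantize each operand with a scale factor so it fits the FP8 E4M3 range, multiply the quantized matrices, then apply the FP32 scales to recover the result. The per-row scaling granularity and the `quantize_fp8` helper are simplifying assumptions for illustration; DeepGEMM uses finer-grained (per-block) scales, and the GPU would cast the result to bfloat16.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def quantize_fp8(x):
    """Pick a per-row FP32 scale so each row fits the FP8 range.
    The 'FP8' values are kept in fp32 here; real code would cast."""
    scale = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)
    return (x / scale).astype(np.float32), scale

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((3, 8)).astype(np.float32)

a_q, a_s = quantize_fp8(a)
b_q, b_s = quantize_fp8(b)

# FP8-style GEMM: multiply the quantized operands, then apply the
# FP32 scales to dequantize the accumulated result.
c = (a_q @ b_q.T) * a_s * b_s.T
print(np.allclose(c, a @ b.T, atol=1e-4))
```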


Just tested, and it doesn't work out of the box on the consumer 50-series (i.e. the 5080):

    Testing GEMM:
    Assertion failed: deep_gemm/jit/../include/deep_gemm/fp8_gemm.cuh:369, condition: cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) == cudaSuccess
    terminate called after throwing an instance of 'AssertionException'
      what():  Assertion failed: cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) == cudaSuccess


Perhaps your card has less per-SM shared memory than the GPUs DeepSeek uses.

Try lowering the sm90_capacity value in gemm.py: I think 128 KB is the correct value for the RTX 5080, compared to 256 KB for the H100/H800.

And probably add ", 3, 2, 1" after "6, 5, 4".
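A sketch of what those two tweaks would look like; the variable names come from the comment above, but the exact layout of gemm.py (and whether 128 KB is right for the 5080) are assumptions to verify against the actual file:

```python
KiB = 1024

# gemm.py tunes kernels against Hopper's per-SM shared memory budget.
# Lower it for a consumer card with less shared memory per SM:
sm90_capacity = 256 * KiB  # H100/H800 value cited in the comment
sm90_capacity = 128 * KiB  # suggested value for the RTX 5080

# Extend the pipeline-stage candidates so a config can still fit in
# the smaller shared memory ("add ', 3, 2, 1' after '6, 5, 4'"):
stage_candidates = (6, 5, 4, 3, 2, 1)
```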


It says:

> DeepGEMM exclusively supports NVIDIA Hopper tensor cores



