Hacker News

The $20 question: what can I do with this?


Multiply FP8 matrices, with FP32 scaling factors, producing a bfloat16 result matrix, on an NVIDIA Hopper or newer GPU.
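To make that concrete, here is a toy CPU emulation of the math (not the library itself, which runs on Hopper tensor cores): quantize each operand with a scale factor so it fits the FP8 E4M3 range, multiply the quantized matrices, then apply the FP32 scales to recover the result. The per-row scaling granularity and the `quantize_fp8` helper are simplifying assumptions for illustration; DeepGEMM uses finer-grained (per-block) scales, and the GPU would cast the result to bfloat16.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def quantize_fp8(x):
    """Pick a per-row FP32 scale so each row fits the FP8 range.
    The 'FP8' values are kept in fp32 here; real code would cast."""
    scale = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)
    return (x / scale).astype(np.float32), scale

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((3, 8)).astype(np.float32)

a_q, a_s = quantize_fp8(a)
b_q, b_s = quantize_fp8(b)

# FP8-style GEMM: multiply the quantized operands, then apply the
# FP32 scales to dequantize the accumulated result.
c = (a_q @ b_q.T) * a_s * b_s.T
print(np.allclose(c, a @ b.T, atol=1e-4))
```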


Just tested, and it doesn't work out of the box on the consumer 50-series (i.e. the 5080):

    Testing GEMM:
    Assertion failed: deep_gemm/jit/../include/deep_gemm/fp8_gemm.cuh:369, condition: cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) == cudaSuccess
    terminate called after throwing an instance of 'AssertionException'
      what():  Assertion failed: cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) == cudaSuccess


Perhaps your card has less per-SM shared memory than the GPUs DeepSeek uses.

Try lowering the sm90_capacity value in gemm.py: I think 128 KB is the correct value for the RTX 5080, compared to 256 KB for the H100/H800.

And probably add ", 3, 2, 1" after "6, 5, 4".
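A sketch of what those two tweaks would look like; the variable names come from the comment above, but the exact layout of gemm.py (and whether 128 KB is right for the 5080) are assumptions to verify against the actual file:

```python
KiB = 1024

# gemm.py tunes kernels against Hopper's per-SM shared memory budget.
# Lower it for a consumer card with less shared memory per SM:
sm90_capacity = 256 * KiB  # H100/H800 value cited in the comment
sm90_capacity = 128 * KiB  # suggested value for the RTX 5080

# Extend the pipeline-stage candidates so a config can still fit in
# the smaller shared memory ("add ', 3, 2, 1' after '6, 5, 4'"):
stage_candidates = (6, 5, 4, 3, 2, 1)
```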


It says:

> DeepGEMM exclusively supports NVIDIA Hopper tensor cores



