I'm really surprised that Haskell didn't take off here. Strong types plus lazy evaluation seem perfect for orchestrating asynchronous GPU operations.
I don't have much experience with CL, but I was always a bit put off by the mixture of paradigms it has, i.e. it's not a pure (or even mostly pure) functional language.
I am looking at the GitHub repo for the Malt library developed in the book (https://github.com/themetaschemer/malt). It looks like they use Racket (Scheme) vectors to implement tensors. I experimented with loading simple Keras models into Racket several years ago: the built-in math library's matrix support was fast enough for my needs, so the book software may be both pedagogical and useful for smaller problems. A rough sketch of what vector-backed tensors might look like is below.
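To make that concrete, here is a minimal, hypothetical sketch (my own illustration, not Malt's actual implementation): a rank-1 tensor as a plain Racket vector with an elementwise op, and the built-in math/matrix library doing a dense layer's affine transform. All the function names are made up for illustration.

    #lang racket
    (require math/matrix)

    ;; Hypothetical sketch (not Malt's code): a rank-1 tensor as a plain
    ;; Racket vector, with elementwise addition over its entries.
    (define (tensor . xs) (list->vector xs))

    (define (elementwise+ t u)
      (build-vector (vector-length t)
                    (lambda (i) (+ (vector-ref t i) (vector-ref u i)))))

    (elementwise+ (tensor 1 2 3) (tensor 10 20 30))  ; => '#(11 22 33)

    ;; The built-in math/matrix library can carry the heavier linear algebra,
    ;; e.g. a dense layer's affine transform W*x + b:
    (define W (matrix [[1 2] [3 4]]))
    (define x (col-matrix [5 6]))
    (define b (col-matrix [1 1]))
    (matrix+ (matrix* W x) b)  ; => column matrix [18 40]

Nothing here is optimized; it is just to show that plain vectors plus math/matrix already cover the pedagogical cases.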
EDIT: I have not tried the OpenBLAS Racket bindings here (https://github.com/soegaard/sci), but perhaps the book's low-level tensor and tensor-op code could be optimized with them.
EDIT: Half-Life 3 confirmed: "Presents key ideas of machine learning using a small, manageable subset of the Scheme language"