Maybe I'm an outlier, but I don't see much value in running tiny local models vs. using a more powerful desktop in my house to host a larger and far more usable model. I run Open WebUI and connect it to my own llama.cpp/koboldcpp instance that runs a 4-bit 70B model, and I can connect to it from anywhere easily with Tailscale. For questions that even the 70B can't handle, I have Open WebUI hit OpenRouter and can choose between all the flagship models.
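For anyone curious what that looks like in practice, here's a rough sketch (not the poster's exact setup): llama.cpp's llama-server exposes an OpenAI-compatible API, so a single client script can talk to the desktop over Tailscale and fall back to OpenRouter for harder questions. The hostname, model names, and routing logic below are placeholders I've made up for illustration.

```python
# Rough sketch, assuming a llama.cpp server started with something like:
#   llama-server -m llama-3.3-70b-q4_k_m.gguf --host 0.0.0.0 --port 8080
# It speaks the OpenAI chat-completions API, so the same client code works
# against the local 70B (over Tailscale) and against OpenRouter.
import os
from openai import OpenAI

LOCAL = OpenAI(
    base_url="http://my-desktop.tailnet-name.ts.net:8080/v1",  # hypothetical Tailscale MagicDNS name
    api_key="not-needed-locally",
)
FALLBACK = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def ask(prompt: str, hard: bool = False) -> str:
    # Route "hard" questions to a flagship model on OpenRouter; model IDs are placeholders.
    client, model = (FALLBACK, "anthropic/claude-3.5-sonnet") if hard else (LOCAL, "local-70b")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("Summarize the tradeoffs of 4-bit quantization."))
```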
Every time I've tried a tiny model it's been too questionable to trust.
Have you tried Gemma 27B? I’ve been using it with llamafile and it’s pretty incredible. I think the winds are changing a bit and small models are becoming much more capable, so it’s worth giving some of the smaller ones a shot if it’s been a while. I can run Gemma 27B on my 32 GB MacBook Pro and it’s pretty capable with code too.