This generally isn't true. Cloud vendors have to make back the cost of electricity and the cost of the GPUs. If you already bought the Mac for other purposes, also using it for LLM generation means your marginal cost is just the electricity.
Also, vendors need to make a profit! So tack a little extra on as well.
However, you're right that it will be much slower. Even just an 8xH100 node can do 100+ tps for GLM-4.7 at FP8; no Mac gets anywhere close to that decode speed. And for long prompts the difference will be even more stark: prefill is compute-bound, and the H100s have vastly more compute than any Mac.
A question on the 100+ tps: is that for short prompts? For large contexts that generate a chunk of tokens at 120k+ context, I was seeing 30-50 tps, and that's with a 95% KV cache hit rate. I'm wondering if I'm simply doing something wrong here...
Assuming you're using speculative decoding, it depends on how well the speculator predicts your prompts: weird prompts are slower, but e.g. TypeScript code diffs should be very fast. For SGLang, you also want a larger chunked prefill size and larger max batch sizes for CUDA graphs than the defaults, in my experience.
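For concreteness, this is the kind of launch line I mean; the flag values are illustrative rather than tuned recommendations, and the model path is a placeholder:

    # Illustrative SGLang launch. --chunked-prefill-size raises the prefill
    # chunk processed per step; --cuda-graph-max-bs captures CUDA graphs for
    # larger decode batches. Values are placeholders, not recommendations.
    python -m sglang.launch_server \
      --model-path <your-model> \
      --tp 8 \
      --chunked-prefill-size 16384 \
      --cuda-graph-max-bs 256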
When other models would grep, then read the results, then search, then read the results, then read 100 lines from a file, then read the results, Composer 1 is trained to grep AND search AND read in one round trip.
It may read 15 files, and then make small edits in all 15 files at once (see the sketch below).
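To make the round-trip difference concrete, here's a minimal TypeScript sketch; the tool names and types are made up for illustration and have nothing to do with Cursor's actual harness:

    // Hypothetical tool runner; names and types are illustrative only.
    type ToolCall = { tool: "grep" | "search" | "read"; args: string };

    async function execTool(call: ToolCall): Promise<string> {
      // Stand-in for a real grep/search/file-read implementation.
      return `${call.tool}(${call.args}) => ...`;
    }

    // Sequential agent: each result goes back to the model before the
    // next call is issued, so N tools cost N model round trips.
    async function sequential(calls: ToolCall[]): Promise<string[]> {
      const results: string[] = [];
      for (const call of calls) results.push(await execTool(call));
      return results;
    }

    // Batched agent: the model emits all calls in one turn and the
    // harness runs them concurrently: N tools, one model round trip.
    async function batched(calls: ToolCall[]): Promise<string[]> {
      return Promise.all(calls.map(execTool));
    }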
Just ask an LLM to write one on top of OpenRouter, the AI SDK, and Bun,
to take your .md input file and save the outputs as .md files (or whatever you need).
Take https://github.com/T3-Content/auto-draftify as an example.
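The core of such a script is only a dozen lines. A sketch, assuming the @openrouter/ai-sdk-provider package for the AI SDK; the model slug and file names are placeholders:

    // Minimal md-in/md-out script for Bun; slug and paths are placeholders.
    import { generateText } from "ai";
    import { createOpenRouter } from "@openrouter/ai-sdk-provider";

    const openrouter = createOpenRouter({
      apiKey: process.env.OPENROUTER_API_KEY,
    });

    const input = await Bun.file("input.md").text();

    const { text } = await generateText({
      model: openrouter("anthropic/claude-sonnet-4"), // placeholder slug
      prompt: `Rewrite this draft into a polished post:\n\n${input}`,
    });

    await Bun.write("output.md", text);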
I think with the majority of TypeScript projects using Prettier, 2 spaces is more likely to be the default [0]
The linked page literally says to ignore it [1]
> STOP READING IMMEDIATELY
> THIS PAGE PROBABLY DOES NOT PERTAIN TO YOU
> These are Coding Guidelines for Contributors to TypeScript. This is NOT a prescriptive guideline for the TypeScript community.
4 spaces is a historical holdover, used as the default for all languages in VSCode [2]
And you can only generate something like $20 worth of tokens a month that way
Cloud tokens generated on TPUs will always be cheaper and way faster than anything you can make at home