That kind of thing is surprisingly hard to implement. To date I've not seen any ...

danenania · 2025-07-29T21:05:32 1753823132

Probably the only way to do it reliably would be to intercept the prompt with a specially trained classifier? I think you're right that once it gets to the main model, nothing really works.

jerjerjer · 2025-07-30T14:16:27 1753884987

> That kind of thing is surprisingly hard to implement.

If response contains prompt text verbatim (or it is below some distance metric) replace the response text.

Not saying it's trivial to implement (and probably it is hard to do in a pure LLM way), but I don't think it's too hard.

More like it's not really a big secret.