Hmmm.. The more I think about this, the more it seems that font kerning is a major leak for redaction. Even if the boxes have randomness applied to them, the words around a blacked-out area have exact positions, and those positions constrain the hidden text so that only certain letter/space combinations could fit between them. With a little knowledge of the rendering algorithm and some educated guessing about the text, a brute-force search may be able to do a very credible job of recovering the actual text. This isn't my field. Has anyone out there actually worked on this problem?
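To make the idea concrete, here is a minimal sketch of that brute-force search. Everything in it is an assumption for illustration: the `ADVANCE` and `KERN` tables are made-up numbers rather than metrics from a real font, and `text_width` is a toy stand-in for whatever the real layout engine does.

```python
# Toy advance widths and kerning pairs in font units.
# These numbers are invented for illustration, not taken from any real font.
ADVANCE = {"a": 520, "b": 560, "e": 510, "m": 830, "n": 560,
           "o": 540, "s": 430, "y": 500}
KERN = {("y", "e"): -20, ("n", "o"): -5}


def text_width(text):
    """Sum of glyph advances plus kerning for each adjacent pair."""
    width = sum(ADVANCE[ch] for ch in text)
    width += sum(KERN.get(pair, 0) for pair in zip(text, text[1:]))
    return width


def candidates_that_fit(candidates, gap_width, tolerance=0):
    """Keep only the candidates whose rendered width matches the gap."""
    return [c for c in candidates
            if abs(text_width(c) - gap_width) <= tolerance]


# Hypothetical redaction: the neighbouring words leave a 1420-unit gap.
print(candidates_that_fit(["yes", "no", "maybe"], gap_width=1420))
```

With these toy numbers only "yes" survives. A real attack would need the actual font metrics and the renderer's rounding behaviour, which is presumably where most of the work is.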
There was a recent vulnerability where researchers were able to extract information from an encrypted chat session with an LLM by analyzing the packet sizes and timings of the underlying SSL connection. A classic side-channel attack. It seems possible to draw a parallel between the two.
> the more any font kerning is likely a major leak for redaction
Now I want a font that randomly adjusts the kerning automagically, to be used by people in standard word processors, not some graphics app. That way, every time the same word appears in the document, its kerning is different for each occurrence.
Most people cannot detect differences in kerning; it takes extreme adjustments before anyone notices, and even then the words would need to be aligned above/below each other for people to see the difference. However, a computer program analyzing the size of a bounding box would notice single-pixel differences. So randomly adjusting the kerning between each pair of letters by a pixel or so per word would go unnoticed by the vast majority of readers, but could play absolute havoc with algorithms trying to decipher possible word combinations based on bounding-box size.
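A rough sketch of why jitter would hurt the attacker, under stated assumptions: the `ADVANCE` table is invented, base kerning is ignored, and the jitter is modelled as an independent integer offset of at most `max_jitter` font units on each gap between letters. None of this reflects how a real font or word processor works.

```python
import random

# Invented advance widths in font units; not from a real font.
ADVANCE = {"a": 520, "e": 510, "n": 560, "o": 540, "s": 430, "y": 500}


def text_width(text):
    """Width with no kerning at all (toy model)."""
    return sum(ADVANCE[ch] for ch in text)


def jittered_width(text, rng, max_jitter):
    """Width after adding a random offset to every gap between letters."""
    gaps = max(len(text) - 1, 0)
    jitter = sum(rng.randint(-max_jitter, max_jitter) for _ in range(gaps))
    return text_width(text) + jitter


def consistent_with_jitter(candidates, observed, max_jitter):
    """Candidates whose width could have produced `observed` under the jitter."""
    result = []
    for c in candidates:
        slack = max_jitter * max(len(c) - 1, 0)  # worst-case total offset
        if abs(text_width(c) - observed) <= slack:
            result.append(c)
    return result


words = ["yes", "no", "sea", "soy"]
print(consistent_with_jitter(words, observed=1440, max_jitter=0))
print(consistent_with_jitter(words, observed=1450, max_jitter=15))
```

Without jitter, a 1440-unit box pins down "yes" exactly. With up to 15 units of jitter per gap, "yes", "sea" and "soy" all become plausible for a 1450-unit box, so the width alone no longer separates them. Whether that holds up depends on the jitter being genuinely unpredictable to the attacker.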
Really depends on the length and predictability of the redaction, but yes. If it's short and contextually it's only likely to be either "yes" or "no", you've got it. If it's longer and could contain an unknown person's name along with some other words, well, that's harder.
I feel like this creates a hash value, and the real question is how unique a value it represents and how easy it is to narrow down by throwing a dictionary at it. Similarly, unknown names could likely be teased out much like a reused one-time pad. If they appear in multiple sentences, then their randomness quickly repeats and becomes something that could potentially be isolated from the rest of the words around them. This would probably be a fun problem for a cryptography class to work on.
If so, then finding the redacted string would be similar to trying to brute-force a hash (though presumably slower, since text layout algorithms are probably more complex than a single hash invocation).
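One way to get a feel for how "unique" the width is as a hash: bucket a dictionary by rendered width and look at the collisions. This is only a sketch with an invented `ADVANCE` table and no kerning, so treat the numbers as placeholders.

```python
from collections import defaultdict

# Invented advance widths in font units; not from a real font.
ADVANCE = {"a": 520, "e": 510, "n": 560, "o": 540, "s": 430, "y": 500}


def text_width(text):
    """Width as a plain sum of advances (no kerning in this toy model)."""
    return sum(ADVANCE[ch] for ch in text)


def bucket_by_width(words):
    """Group words by width, the way a hash table groups keys by hash."""
    buckets = defaultdict(list)
    for word in words:
        buckets[text_width(word)].append(word)
    return dict(buckets)


buckets = bucket_by_width(["no", "on", "yes", "say"])
for width, group in sorted(buckets.items()):
    print(width, group)
```

In this toy model "no" and "on" collide because, without kerning, anagrams always have the same width. Pair kerning would tend to split such buckets, which is one way it could add information beyond the rough length.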
Unlikely to be possible except for the smallest redactions, like if you have a single name redacted and a list of candidates. But I think kerning wouldn't help you much more than just knowing the rough length anyway.