Hacker News

No, we worked with researchers who developed that kind of system, but didn't broadcast our work b/c the research was too sensitive. Seems the cat is out of the bag now though.

I think the combination of AI and font-metrics is going to be wild though. You ought to be able to make a system that can figure out likely words based on the unredacted ones and the redaction's size. I haven't seen any redaction system yet that protects against this.
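
To make the width-matching idea concrete, here is a minimal sketch. It assumes the attacker knows the font and can measure the redaction box; the advance widths in `ADVANCE`, the `DEFAULT_ADVANCE` fallback, and the name list are all invented for illustration, and kerning is ignored.

```python
# Hypothetical advance widths (font units) for a proportional font.
# These values are invented for illustration, not read from a real font file.
ADVANCE = {"i": 222, "j": 222, "l": 222, "f": 278, "t": 278,
           "r": 333, "w": 722, "m": 833}
DEFAULT_ADVANCE = 556  # assumed width for every character not listed above


def text_width(word):
    """Sum the per-character advance widths (no kerning)."""
    return sum(ADVANCE.get(ch, DEFAULT_ADVANCE) for ch in word.lower())


def candidates(redaction_width, vocabulary, tolerance=10):
    """Return the words whose rendered width fits the redaction box."""
    return [w for w in vocabulary
            if abs(text_width(w) - redaction_width) <= tolerance]


names = ["Alice", "Bob", "Carol", "William", "Jill"]
box = text_width("Alice")  # pretend this was measured from the document
print(candidates(box, names))
```

A language model would then only have to rank the few words that survive the width filter against the surrounding unredacted text.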





I thought glyph-spacing attacks were an old idea; I recall reading about them 10-20 years ago, unless I’m misremembering. Can you clarify why it was considered “too sensitive” if the whole point of this effort is to showcase these attacks?

It’s a fine line. Most redactions are for the good, to protect someone or something. For example even in the Epstein files, where some redactions are being abused, most redactions are protecting victims.

If there’s a way to undo huge amounts of redactions, that’d certainly be a net negative. Sort of like if encryption were suddenly broken, you wouldn’t publish a paper saying so.

Our goal has always been to educate about the problem so that it can be addressed. We didn’t have resources to push on the font metrics approach, so we stayed mostly quiet about it.


> If there’s a way to undo huge amounts of redactions, that’d certainly be a net negative. Sort of like if encryption were suddenly broken, you wouldn’t publish a paper saying so.

I can't state emphatically enough how this is not the right mental playbook.

If you have found a vulnerability, it's likely someone else has too. By sitting on it, you only create more future victims.

Disclosure will lead to fixing this issue, mitigating its presence, or switching tools/workflows, possibly a combination of these. Sitting on it only ensures that folks who think they are protected actually aren't.


We’re familiar with vulnerability disclosure philosophies, but what if the problem can’t be fixed because there’s no forward secrecy for the hundreds of millions of documents that are already out there?

It’s tricky stuff and we have limited resources, unfortunately.


> but what if the problem can’t be fixed because there’s no forward secrecy for the hundreds of millions of documents that are already out there?

What if you are not the only folks who have found and exploited this vulnerability?

You can play the "what if" game to justify not doing the right thing all day long, when really it should be one "if" that guides you. What if someone else found this?


So what is the state of the art in redaction? Re-publish the document with an insert that says [redaction] so that no (or maybe minimal) length side-channel exists? I imagine someone thinks about clever ideas and it would be fun to read about them and the trade-offs.

Given that hiding among and behind victims is how abusers continue, I’m not so sure redactions really are all that beneficial when you count future victims in the pool of interested parties. And the public interest certainly isn’t helped by secrecy and redactions and selective release.

While protecting victims is noble, something like this really needs the light of day and a truth and reconciliation commission so that everyone associated with the crime ring is punished and accounted for.

And no, if you do find somehow all encryption is mathematically broken, it’s your duty to publicize it even if existing secrets are jeopardized (you mitigate as best you can obviously in the short term) because it’s likely people more powerful than you might have that knowledge anyway and are engaged in asymmetric warfare.


> I haven't seen any redaction system yet that protects against this.

The linked article suggests widening redacted areas more than needed with some randomization applied to the width. Strikes me that that wouldn't do much except add a few more possible solutions.


Yeah, the more robust protection is to widen to a constant. But in the general case that could require reflowing the PDF. And honestly, single-word redactions are probably useless against cheap AI that can fill in the gaps with high accuracy.
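
A small sketch of why randomized padding only adds a few candidates while a constant width hides much more. The `WIDTHS` table is hypothetical (precomputed rendered widths in font units), and the padding is assumed to be drawn from a known range `[0, max_pad]`.

```python
# Hypothetical rendered widths (font units) for a few candidate names.
WIDTHS = {"Alice": 2112, "Bob": 1668, "Carol": 2223, "William": 2999, "Jill": 888}


def candidates_random_pad(observed_width, max_pad):
    """Box = true width + random pad in [0, max_pad].

    The true width must therefore lie in [observed - max_pad, observed].
    """
    return [name for name, w in WIDTHS.items()
            if observed_width - max_pad <= w <= observed_width]


def candidates_constant(constant_width):
    """Every box has the same width, so any word that fits is a candidate."""
    return [name for name, w in WIDTHS.items() if w <= constant_width]


# "Alice" (2112) redacted with a random pad of 150, attacker knows max_pad=200.
print(candidates_random_pad(2112 + 150, max_pad=200))
# Same word redacted with a fixed-size box of 3000 units.
print(candidates_constant(3000))
```

With the random pad the attacker's list only grows from one name to two; with the constant box, width no longer separates any of the five names, at the cost of reflowing the text around a larger box.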

Depends what you're trying to hide.

If the redaction is a person's name, and there's nothing else to give the person's identity away, single word redaction probably works reasonably well, AI or no AI.


  > If the redaction is a person's name
I'm not sure if you're aware, but people's names are variable in length. We are talking about a system that can identify single-character differences. So that does reduce the search space, especially since names are not all possible letter permutations. Combine that with the fact that it isn't uncommon to see partial first letters show up. You can even see some instances in the Epstein files.

Of course, you can also take this further. Even if you can't recover names, you can get meta information about how many parties are involved by recognizing that different-length redactions correspond to different entities. While a same-length redaction doesn't guarantee the same entity, it is a hint.
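
As a rough illustration of that meta-information leak, the sketch below groups measured redaction widths into clusters. The widths and the `tolerance` value are made up, and it assumes every redaction is set in the same font and size, so the same string always produces the same width.

```python
def group_by_width(widths, tolerance=2.0):
    """Cluster sorted widths; a gap larger than `tolerance` starts a new group."""
    groups = []
    for w in sorted(widths):
        if groups and w - groups[-1][-1] <= tolerance:
            groups[-1].append(w)
        else:
            groups.append([w])
    return groups


# Hypothetical box widths (in points) measured from one document.
measured = [41.2, 63.0, 41.5, 88.1, 62.4]
groups = group_by_width(measured)
print(len(groups), groups)
```

The number of groups is only a lower bound on the number of distinct redacted strings: two different names can share a width, but under the stated assumptions one name cannot produce two clearly different widths.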


It is also common for authors to misspell names (proper nouns) in an attempt to determine who leaks docs (and to force non-matches for FOIA requests).

If you want to fingerprint text, you can also do it with small, insignificant changes that don't change the meaning.

If you have a number of such locations with alternatives, then you can make a number of identifiable versions by combining alternates.
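
A minimal sketch of combining alternates: with n locations that each have two interchangeable wordings, you get 2^n distinguishable copies. The `TEMPLATE` sentence and the `ALTERNATIVES` word pairs are hypothetical examples.

```python
from itertools import product

# Hypothetical text with three locations, each with two equivalent wordings.
TEMPLATE = "The meeting {0} on Tuesday and {1} about an hour. {2} attended."
ALTERNATIVES = [
    ("began", "started"),
    ("lasted", "took"),
    ("Nobody else", "No one else"),
]


def make_versions():
    """Map each bit pattern (one bit per location) to its text variant."""
    versions = {}
    for bits in product((0, 1), repeat=len(ALTERNATIVES)):
        choices = [alts[b] for alts, b in zip(ALTERNATIVES, bits)]
        versions[bits] = TEMPLATE.format(*choices)
    return versions


def identify(leaked_text, versions):
    """Return the bit pattern of the copy that matches the leaked text."""
    for bits, text in versions.items():
        if text == leaked_text:
            return bits
    return None


versions = make_versions()
print(len(versions))  # 2**3 copies, one per recipient
print(identify(versions[(1, 0, 1)], versions))
```

Exact matching like this breaks if the leaker paraphrases, so a real scheme would presumably compare each location separately.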


Random side fact but this was also a thing map makers did back in the day. Including fake towns. In that way they could identify who was stealing their work.

This is going to be a disaster IMO because AI will just hallucinate what it thinks is the most probable redacted word and people will take that as gospel.

"Don't redact, or we will hallucinate something worse and make people believe it as gospel" is a nice deterrent.

We don’t need a “deterrent” against things being redacted in publicly released documents. We can have transparency without the whole world finding out the names of victims and witnesses, people’s phone numbers and SSNs, etc., every time a document is released.

Maybe we should all just use monospace fonts for everything.


