often times you will have requirements that the documents you release be digital...

pottertheotter · 2025-12-24T01:23:44 1766539424

This made me think of something I came across recently that’s almost the opposite problem of requiring PDFs to be searchable. A local government would publish PDFs where the text is clearly readable on screen, but the selectable text layer is intentionally scrambled, so copy/paste or search returns garbage. It's a very hostile thing to do, especially with public data!

2ICofafireteam · 2025-12-27T01:32:03 1766799123

I have encountered PDFs that would exhibit this behavior in one browser but not in another.

One fun thing I encountered from local government is releasing files with potato quality resolution and not considering the page size.

I had a FOI request that returned mainly Arch D sized drawings but they were in a 94 DPI PDF rendered as letter sized. It was a fun conversation trying to explain to an annoyed city employee that putting those large drawings in a 94 DPI letter size page effectively made it 30-ish DPI.
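For anyone who wants to check the "30-ish DPI" figure, here is a rough sketch of the arithmetic. It assumes the nominal sizes (Arch D = 24 x 36 in, US Letter = 8.5 x 11 in) and that the drawing was scaled to fit the page with its aspect ratio preserved; `effective_dpi` is just an illustrative helper name, not anything from a PDF library.

```python
# Nominal sheet sizes in inches, landscape (width, height).
# Assumption: the drawing was shrunk to fit the page, aspect ratio preserved.
ARCH_D = (36.0, 24.0)
LETTER = (11.0, 8.5)


def effective_dpi(raster_dpi, original, page):
    """Resolution relative to the original sheet after shrinking it onto `page`."""
    # The tighter of the two axes determines the overall scale factor.
    scale = min(page[0] / original[0], page[1] / original[1])
    return raster_dpi * scale


print(round(effective_dpi(94, ARCH_D, LETTER), 1))  # prints 28.7
```

The long edge is the limiting one (11 / 36 ≈ 0.31), so 94 DPI on the letter page works out to roughly 29 pixels per inch of the original drawing.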

eviks · 2025-12-24T07:20:22 1766560822

Hostile indeed, and also happens in user-facing documents like product manuals!

8note · 2025-12-24T00:00:52 1766534452

Run some OCR on them afterward to recreate the text layer?

albert_e · 2025-12-24T05:30:16 1766554216

With the aggressive push of LLMs and generative AI, I am expecting a lot of OCR features to become "smarter" by default, namely to go beyond mechanical OCR and start inserting hallucinations and semantically/contextually "more correct" information into OCR output.

It's not hard to imagine some powerful LLMs being able to undo light redactions that are deducible from context.

blharr · 2025-12-24T21:42:58 1766612578

Or worse, making up names or information instead of writing the redaction.