
Does anyone have recommendations for a good tool that allows both programmatic inspection and modification of PDF primitives? For example, let's say someone wants to iterate through every embedded image in a PDF, apply some form of signal processing to the images in place, and then re-save the PDF.


My tool (PDFSyntax[1], mentioned in this thread) is a Python library that is able to both inspect and transform PDF files.

Depending on your transformation use case, you may write an incremental update that appends only a few bytes to the end of the original file instead of rewriting it entirely. To my knowledge, this feature of the PDF specification is often overlooked, and not many libraries implement it.
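
To illustrate the idea (this is not PDFSyntax's API, just a stdlib-only sketch with hypothetical helper names): an incremental update leaves the original bytes untouched and appends new objects plus a new cross-reference section ending in its own `startxref` and `%%EOF`. Counting `%%EOF` markers is only a rough heuristic, since linearized files and embedded data can contain extra markers, and the toy bytes below are not a valid PDF.

```python
import re


def count_eof_markers(data: bytes) -> int:
    """Rough revision count: each appended update ends with its own %%EOF."""
    return len(re.findall(rb"%%EOF", data))


def last_startxref(data: bytes) -> int:
    """Return the offset given by the final 'startxref <n> %%EOF' trailer."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not matches:
        raise ValueError("no startxref found")
    return int(matches[-1])


# Toy byte strings that only mimic the trailer layout (not a real PDF).
original = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\nstartxref\n9\n%%EOF\n"

# An incremental update appends after the old %%EOF; nothing is rewritten.
updated = original + b"1 0 obj\n<< /Updated true >>\nendobj\nstartxref\n120\n%%EOF\n"
```

A reader that understands incremental updates starts from the last `startxref`, so the appended section takes precedence while the earlier revision stays byte-for-byte intact.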

It is a work in progress and I have not developed functions for images yet, though.

[1] https://github.com/desgeeko/pdfsyntax


I’ve used pikepdf[1] for text processing before. To use it for the task you outline, you’ll probably need to thoroughly investigate how bitmaps can be represented in PDFs. (Or maybe not, if you only need to deal with a known finite set of PDFs or PDF producers.)

[1] https://pikepdf.readthedocs.io/en/latest/
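
For what it's worth, here is a minimal sketch of how the image task might look with pikepdf plus Pillow, loosely following the image-replacement approach in the pikepdf docs. The function names and the "invert grayscale" step are my own placeholders. It assumes simple page-level bitmaps that Pillow can decode, and it ignores soft masks, inline images, images nested in form XObjects, and non-gray color spaces, which is exactly where the investigation mentioned above comes in.

```python
import zlib


def invert_gray(pixels: bytes) -> bytes:
    """Placeholder 'signal processing' step: invert 8-bit grayscale samples."""
    return bytes(255 - b for b in pixels)


def process_images(src: str, dst: str) -> int:
    """Rewrite each page-level image in src as inverted 8-bit grayscale."""
    # Imported lazily so invert_gray() can be used without pikepdf/Pillow installed.
    from pikepdf import Name, Pdf, PdfImage

    count = 0
    seen = set()  # an image object can be shared by several pages
    with Pdf.open(src) as pdf:
        for page in pdf.pages:
            for _name, raw in page.images.items():
                if raw.objgen in seen:
                    continue  # avoid processing a shared image twice
                seen.add(raw.objgen)
                # Decode to a Pillow image and flatten to 8-bit grayscale.
                gray = PdfImage(raw).as_pil_image().convert("L")
                data = invert_gray(gray.tobytes())
                # Replace the stream data in place and fix up the metadata.
                raw.write(zlib.compress(data), filter=Name("/FlateDecode"))
                raw.ColorSpace = Name("/DeviceGray")
                raw.BitsPerComponent = 8
                raw.Width, raw.Height = gray.size
                count += 1
        pdf.save(dst)
    return count
```

Something like `process_images("in.pdf", "out.pdf")` (hypothetical file names) would then write a modified copy; converting everything to grayscale is a simplification to keep the sketch short.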


I've been using several Python libraries for working with PDFs. At least one of them allows you to walk the AST. (will look up in a bit and edit this comment)


I've been using pypdf for working with PDFs in Python. My uses are pretty humble. I create Jupyter notebooks for managing sheet music that I receive in PDF format, allowing me to do things like break up a book of tunes into individual files, and so forth. This in turn makes it easier to pull up individual tunes on my tablet during a performance. But it looks like you can treat the PDF as a tree structure. I've used that feature for writing some recursive functions.
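
A rough sketch of the "break up a book of tunes" step, assuming you already know the 0-based page on which each tune starts. `tune_ranges`, `split_book`, and the output naming are made up for illustration; the pypdf calls are the basic `PdfReader` / `PdfWriter.add_page` / `PdfWriter.write` ones.

```python
def tune_ranges(starts, total_pages):
    """Turn 0-based start pages into half-open (start, end) page ranges."""
    starts = sorted(starts)
    ends = starts[1:] + [total_pages]
    return list(zip(starts, ends))


def split_book(src, starts, prefix="tune"):
    """Write one PDF per tune and return the list of output paths."""
    # Imported lazily so tune_ranges() can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    paths = []
    ranges = tune_ranges(starts, len(reader.pages))
    for i, (start, end) in enumerate(ranges, 1):
        writer = PdfWriter()
        for page_no in range(start, end):
            writer.add_page(reader.pages[page_no])
        path = f"{prefix}_{i:02d}.pdf"
        with open(path, "wb") as fh:
            writer.write(fh)
        paths.append(path)
    return paths
```

For a 10-page book with tunes starting on pages 0, 3, and 7, `tune_ranges([0, 3, 7], 10)` gives the three page ranges that `split_book` would write out.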


yeah, I've been using pypdf mainly, camelot-py for some table stuff, and a bit of pdfminer

I've been needing something to see the x/y bounds of tables to fix some edge cases with camelot; there seem to be some good links in the comments here


I'd suggest coding something on top of one of the popular libraries for PDF manipulation. I've used pdf-rs for the tool.



