LayoutLM [1] is the closest thing I have seen to what you are asking. It is aimed at documents, but it essentially takes positional and visual information into account, alongside the text itself, when extracting fields. For example, extracting the total from the line that reads TOTAL. I think this would be the best place to start.
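To make the "positional information" part concrete, here is a rough sketch. To be clear, this is not LayoutLM and not its API. It is a hand-written heuristic over OCR words with bounding boxes, showing the kind of layout signal such a model learns on its own. The `Word` class, the `find_total` helper, and the sample coordinates are all made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR token with its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


# Loose pattern for a money-like token, e.g. 19.50 or $1,234.00
AMOUNT = re.compile(r"^\$?\d[\d,]*(\.\d+)?$")


def find_total(words, label="TOTAL"):
    """Return the first amount to the right of `label` on the same line."""
    for anchor in words:
        if anchor.text.strip(":").upper() != label:
            continue
        mid_y = (anchor.y0 + anchor.y1) / 2
        # Keep amounts that start right of the label and whose box
        # vertically spans the label's midline (i.e. same text line).
        same_line = [
            w for w in words
            if w.x0 >= anchor.x1 and w.y0 <= mid_y <= w.y1 and AMOUNT.match(w.text)
        ]
        if same_line:
            return min(same_line, key=lambda w: w.x0).text
    return None


# Made-up receipt layout: label on the left, amount on the right.
words = [
    Word("SUBTOTAL", 10, 100, 80, 112), Word("18.00", 200, 100, 240, 112),
    Word("TAX", 10, 120, 40, 132), Word("1.50", 200, 120, 240, 132),
    Word("TOTAL", 10, 140, 60, 152), Word("19.50", 200, 140, 240, 152),
]
print(find_total(words))
```

A rule like this breaks as soon as the layout changes (amount below the label, skewed scans, multiple columns), which is the argument for a model that learns from text, position, and image together instead.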
1. https://arxiv.org/abs/2204.08387