Hacker News | ksampath02's comments

One interesting part of this model's pretraining process is how the team used Qwen2.5-VL and Qwen2.5 to parse public unstructured data, expanding the corpus from 18T to 36T tokens. The ability to do this consistently will push legacy companies to train their own models on proprietary data and sharpen their edge.
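The expansion loop described above could be sketched roughly as follows. This is a minimal, hypothetical skeleton, not Qwen's actual pipeline: `vlm_transcribe` and `quality_filter` are stand-in names for where the vision-language model (parsing documents into text) and the text model (filtering low-quality output) would plug in.

```python
# Hypothetical sketch of a corpus-expansion pipeline: a VLM transcribes
# unstructured documents into text, and a text model filters the output
# before it is added to the pretraining corpus. Both helpers are stubs.

def vlm_transcribe(doc: bytes) -> str:
    # Stand-in for a Qwen2.5-VL call that OCRs/parses the raw document.
    return doc.decode("utf-8", errors="ignore")

def quality_filter(text: str, min_len: int = 20) -> bool:
    # Stand-in for a Qwen2.5-based quality/relevance filter.
    return len(text.strip()) >= min_len

def expand_corpus(docs: list[bytes]) -> list[str]:
    """Parse each document and keep only text that passes the filter."""
    corpus = []
    for doc in docs:
        text = vlm_transcribe(doc)
        if quality_filter(text):
            corpus.append(text)
    return corpus
```

The real system would batch VLM inference over billions of pages and apply much heavier deduplication and filtering, but the shape of the loop is the same.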


You could try Aryn DocParse, which segments your documents before running OCR: https://www.aryn.ai/ (full disclosure: I work there).


I will try that, thanks.


May he rest in peace; his name will be remembered and his impact felt for a long time.

