Hacker News | ksampath02's comments

One interesting part of this model's pretraining process is how the team used Qwen2.5-VL and Qwen2.5 to parse public unstructured data, expanding the corpus from 18T to 36T tokens. The ability to do this consistently will push legacy companies to train their own models on proprietary data and sharpen their edge.
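The expansion loop described above could be sketched roughly as follows. This is a minimal, hypothetical skeleton, not Qwen's actual pipeline: `vlm_transcribe` and `quality_filter` are stand-in names for where the vision-language model (parsing documents into text) and the text model (filtering low-quality output) would plug in.

```python
# Hypothetical sketch of a corpus-expansion pipeline: a VLM transcribes
# unstructured documents into text, and a text model filters the output
# before it is added to the pretraining corpus. Both helpers are stubs.

def vlm_transcribe(doc: bytes) -> str:
    # Stand-in for a Qwen2.5-VL call that OCRs/parses the raw document.
    return doc.decode("utf-8", errors="ignore")

def quality_filter(text: str, min_len: int = 20) -> bool:
    # Stand-in for a Qwen2.5-based quality/relevance filter.
    return len(text.strip()) >= min_len

def expand_corpus(docs: list[bytes]) -> list[str]:
    """Parse each document and keep only text that passes the filter."""
    corpus = []
    for doc in docs:
        text = vlm_transcribe(doc)
        if quality_filter(text):
            corpus.append(text)
    return corpus
```

The real system would batch VLM inference over billions of pages and apply much heavier deduplication and filtering, but the shape of the loop is the same.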


You could try Aryn DocParse, which segments your documents before running OCR: https://www.aryn.ai/ (full disclosure: I work there).


I will try that, thanks.


May he rest in peace; his name will be remembered and his impact felt for a long time.

