
I assume you're ingesting PDFs. If so, how are you handling tables accurately?


If it were me, I would ingest the raw filings from SEC EDGAR and use the robust XML documentation to build very accurately annotated data tables to feed to my LLM.
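To make that a bit more concrete, here is a minimal sketch of pulling annotated facts out of an XBRL instance document with Python's standard library. It assumes a plain XBRL XML instance (not inline XBRL embedded in HTML) with simple units; the SAMPLE document, its values, and the function names are made up for illustration and do not come from a real filing.

```python
import xml.etree.ElementTree as ET

# Standard XBRL instance namespace.
NS = {"xbrli": "http://www.xbrl.org/2003/instance"}

# Made-up miniature instance document; the numbers are illustrative only.
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:us-gaap="http://fasb.org/us-gaap/2023">
  <xbrli:context id="FY2023">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000000000</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2023-01-01</xbrli:startDate>
      <xbrli:endDate>2023-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <xbrli:unit id="usd"><xbrli:measure>iso4217:USD</xbrli:measure></xbrli:unit>
  <us-gaap:Revenues contextRef="FY2023" unitRef="usd" decimals="0">1000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2023" unitRef="usd" decimals="0">150</us-gaap:NetIncomeLoss>
</xbrli:xbrl>"""


def period_label(period):
    """Return 'instant' or 'start/end' for an xbrli:period element."""
    instant = period.findtext("xbrli:instant", namespaces=NS)
    if instant:
        return instant.strip()
    start = period.findtext("xbrli:startDate", namespaces=NS)
    end = period.findtext("xbrli:endDate", namespaces=NS)
    return f"{start}/{end}"


def annotated_facts(xml_text):
    """Flatten an XBRL instance into rows of concept, value, unit, and period."""
    root = ET.fromstring(xml_text)
    # Map context ids to a readable period.
    periods = {
        ctx.get("id"): period_label(ctx.find("xbrli:period", NS))
        for ctx in root.findall("xbrli:context", NS)
    }
    # Map unit ids to their measure (simple units only; ratios come back as None).
    units = {
        unit.get("id"): unit.findtext("xbrli:measure", namespaces=NS)
        for unit in root.findall("xbrli:unit", NS)
    }
    rows = []
    for element in root:
        context_id = element.get("contextRef")
        if context_id is None:
            continue  # contexts and units are not facts
        rows.append({
            "concept": element.tag.split("}", 1)[-1],  # drop the namespace URI
            "value": (element.text or "").strip(),
            "unit": units.get(element.get("unitRef")),
            "period": periods.get(context_id),
        })
    return rows


for row in annotated_facts(SAMPLE):
    print(row)
```

Each row carries its concept name, unit, and reporting period, so the table handed to the LLM is already labeled instead of being guessed from page layout.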


A coworker presented a demo of this the other day: asking an LLM (I think it was OpenAI) to extract the text from a PDF, with each page of the PDF passed as an image. It was able to take a table and turn it into a hierarchical representation of the data (i.e., a column with bullets under it for each row, then the next column, etc.).

If you haven't tried it, it may be worth a shot.
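In case the layout is hard to picture, here is a rough sketch of the column-then-bullets shape described above. It only shows the output format, not how the model produced it; the function name and the sample table are made up for illustration.

```python
def table_to_outline(header, rows):
    """Render a table column by column: each header, then one bullet per row."""
    lines = []
    for col_index, name in enumerate(header):
        lines.append(name)
        for row in rows:
            lines.append(f"  - {row[col_index]}")
    return "\n".join(lines)


# Made-up sample table, not taken from any filing.
header = ["Year", "Revenue"]
rows = [["2022", "10"], ["2023", "12"]]
print(table_to_outline(header, rows))
# Year
#   - 2022
#   - 2023
# Revenue
#   - 10
#   - 12
```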


AWS Textract now has a feature that returns a table cell based on a query, if I’m not mistaken. I’ve seen nothing else like it and would be very interested to hear of other solutions.
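For reference, a minimal sketch of how I understand the Textract Queries feature to be used through the AnalyzeDocument API. The two helper functions, the alias, and the sample question are my own for illustration; the request and response shapes follow my reading of the API (QUERY blocks linked to QUERY_RESULT blocks through ANSWER relationships), so check them against the current docs before relying on this.

```python
def build_query_request(document_bytes, question, alias="cell"):
    """Build keyword arguments for Textract AnalyzeDocument with one query."""
    return {
        "Document": {"Bytes": document_bytes},
        "FeatureTypes": ["TABLES", "QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": question, "Alias": alias}]},
    }


def answers_by_alias(response):
    """Collect answer texts per query alias from an AnalyzeDocument response."""
    blocks = {block["Id"]: block for block in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        key = query.get("Alias") or query.get("Text")
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "ANSWER":
                answers[key] = [
                    blocks[block_id]["Text"]
                    for block_id in relationship["Ids"]
                    if block_id in blocks
                ]
    return answers


# With AWS credentials configured, the call itself would look roughly like:
#   import boto3
#   request = build_query_request(page_png_bytes, "What was total revenue in 2023?")
#   response = boto3.client("textract").analyze_document(**request)
#   print(answers_by_alias(response))
```

Keeping the request building and response parsing as plain functions means they can be exercised against a saved response without calling AWS.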



