I assume you're ingesting PDFs. If so, how are you handling tables accurately?

Kon-Peki · on April 21, 2024

If it were me, I would ingest the raw filings from SEC EDGAR and use the robust XML documentation to create very accurately annotated data tables to feed to my LLM.
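
A minimal sketch of that idea, assuming the filing comes with an XBRL instance document (the XML layer behind EDGAR financial data). Each tagged fact is joined to its reporting period and unit, so every row carries its own annotations. The sample XML, its numbers, the placeholder CIK, and the `facts_to_rows` helper are all made up for illustration; real filings are far larger and often use inline XBRL instead.

```python
import xml.etree.ElementTree as ET

XBRLI = "http://www.xbrl.org/2003/instance"

# Illustrative instance document; values and CIK are placeholders, not real data.
SAMPLE = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="http://fasb.org/us-gaap/2023">
  <context id="FY2023">
    <entity><identifier scheme="http://www.sec.gov/CIK">0000000000</identifier></entity>
    <period><startDate>2023-01-01</startDate><endDate>2023-12-31</endDate></period>
  </context>
  <unit id="usd"><measure>iso4217:USD</measure></unit>
  <us-gaap:Revenues contextRef="FY2023" unitRef="usd" decimals="0">1000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2023" unitRef="usd" decimals="0">150</us-gaap:NetIncomeLoss>
</xbrl>"""


def q(name):
    """Qualify a tag name with the XBRL instance namespace."""
    return f"{{{XBRLI}}}{name}"


def facts_to_rows(xml_text):
    """Return one annotated row per fact: concept, value, unit, period."""
    root = ET.fromstring(xml_text)

    # Index contexts by id so each fact can look up its reporting period.
    periods = {}
    for ctx in root.findall(q("context")):
        period = ctx.find(q("period"))
        start = period.findtext(q("startDate"))
        end = period.findtext(q("endDate")) or period.findtext(q("instant"))
        periods[ctx.get("id")] = (start, end)

    # Index simple units by id (divide-style units would map to None here).
    units = {u.get("id"): u.findtext(q("measure")) for u in root.findall(q("unit"))}

    rows = []
    for el in root:
        ctx_id = el.get("contextRef")
        if ctx_id is None:
            continue  # contexts, units, and other non-fact elements
        start, end = periods[ctx_id]
        rows.append({
            "concept": el.tag.split("}", 1)[1],  # drop the "{namespace}" prefix
            "value": (el.text or "").strip(),
            "unit": units.get(el.get("unitRef")),
            "period_start": start,
            "period_end": end,
        })
    return rows


for row in facts_to_rows(SAMPLE):
    print(row)
```

Rows like these can be rendered as a small table or as JSON before going into the prompt, which avoids having to recover the table layout from a PDF at all.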

scrollbar · on April 21, 2024

A coworker presented a demo of this the other day: asking an LLM (I think it was OpenAI) to extract the text from a PDF, with each page of the PDF passed as an image. It was able to take a table and turn it into a hierarchical representation of the data (i.e. a column with bullets under it for each row, then the next column, etc.).

If you haven't tried it, it may be worth a shot.
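
To make the output shape concrete, here is a small sketch of the column-first outline described above. It is not the demo itself and makes no LLM call; it only shows what that hierarchical form looks like for a table that has already been parsed. The function name and the sample values are hypothetical.

```python
def table_to_column_outline(header, rows):
    """Render a table column by column: column name, then one bullet per row."""
    lines = []
    for col_idx, col_name in enumerate(header):
        lines.append(col_name)
        for row in rows:
            lines.append(f"  - {row[col_idx]}")
    return "\n".join(lines)


# Made-up table purely for illustration.
header = ["Year", "Revenue"]
rows = [["2022", "10"], ["2023", "12"]]
print(table_to_column_outline(header, rows))
```

One trade-off of this layout is that a row's cells end up far apart in the text, so row identity depends on bullet order staying consistent across columns.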

coastermug · on April 21, 2024

AWS Textract now has the functionality to return a table cell based on a query, if I’m not mistaken. I’ve seen nothing else similar to this and would be very interested if there are other solutions.
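
For reference, a rough sketch of what that looks like, assuming this refers to Textract's Queries feature of `AnalyzeDocument`. The `answers_from_response` and `query_page` helpers, the alias, and the question text are hypothetical; `query_page` needs `boto3` plus AWS credentials and is not called here, and the fake response only mimics the block types relevant to queries.

```python
def answers_from_response(response):
    """Map each query alias (or query text) to (answer text, confidence)."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        key = block["Query"].get("Alias") or block["Query"]["Text"]
        answers[key] = None  # stays None when no answer was found
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            for result_id in rel["Ids"]:
                result = blocks[result_id]  # a QUERY_RESULT block
                answers[key] = (result.get("Text"), result.get("Confidence"))
    return answers


def query_page(image_bytes, questions):
    """Ask Textract natural-language questions about one page image.

    `questions` maps a hypothetical alias to the question text.
    """
    import boto3  # imported lazily so the parsing helper works without it

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES", "QUERIES"],
        QueriesConfig={
            "Queries": [
                {"Text": text, "Alias": alias} for alias, text in questions.items()
            ]
        },
    )
    return answers_from_response(response)


# Hand-written response fragment for illustration only.
fake_response = {
    "Blocks": [
        {
            "Id": "q1",
            "BlockType": "QUERY",
            "Query": {"Text": "What is total revenue for 2023?", "Alias": "revenue_2023"},
            "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}],
        },
        {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "1,000", "Confidence": 97.0},
    ]
}
print(answers_from_response(fake_response))
```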