If it was me, I would be ingesting the raw filings from SEC EDGAR and using the robust xml documentation to create very accurately annotated data tables that would be fed to my LLM
A coworker presented a demo the other day of this - asking LLM (I think it was OpenAI) to extract the text from a PDF - each page of the PDF passed as an image. It was able to take a table and turn it into a hierarchical representation of the data (ie. Column with bullets under it for each row, then next column, etc.)
AWS textract now has the functionality to offer a table cell based on a query - if I’m not mistaken. I’ve seen nothing similar to this and would be very interested if there are other solutions.