Can anyone comment on an open source multi-modal LLM that can produce structured... | Hacker News


jn2clark on Jan 4, 2024 | on: TinyGPT-V: Efficient Multimodal Large Language Mod...

Can anyone comment on an open source multi-modal LLM that can produce structured outputs based on an image? I have not found a good open source one yet (this one included); it seems that only closed source models can do this reliably well. Any suggestions are very welcome!

isaacfung on Jan 4, 2024

Something like this?

https://imgur.com/a/hPAaZUv

https://huggingface.co/spaces/Qwen/Qwen-VL-Plus

You can also ask it to give you bounding boxes of objects.
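If you want those boxes as structured data rather than free text, a small parser is usually enough. Below is a rough sketch, not a confirmed recipe: it assumes the reply uses the `<ref>label</ref><box>(x1,y1),(x2,y2)</box>` tags with coordinates on a 0-1000 grid, as described for Qwen-VL's grounding output. The hosted demo may format things differently, and `parse_boxes` and the sample reply are made up for illustration.

```python
import re

# Assumed tag format (Qwen-VL style grounding output, not verified for the demo):
#   <ref>label</ref><box>(x1,y1),(x2,y2)</box>
# Coordinates are assumed to be on a 0-1000 grid relative to the image size.
BOX_PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref>\s*"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)


def parse_boxes(text, width, height, grid=1000):
    """Turn box tags in model text into a list of dicts with pixel coordinates."""
    results = []
    for m in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        results.append({
            "label": m.group("label").strip(),
            # Scale from the assumed grid to pixels using integer arithmetic.
            "box": [
                x1 * width // grid,
                y1 * height // grid,
                x2 * width // grid,
                y2 * height // grid,
            ],
        })
    return results


# Hypothetical model reply, invented for this example.
sample = (
    "<ref>a dog</ref><box>(100,200),(500,800)</box> and "
    "<ref>a ball</ref><box>(600,700),(700,800)</box>"
)
boxes = parse_boxes(sample, width=640, height=480)
```

The result is a plain list of dicts, so it can go straight into `json.dumps` or a schema validator. Text without any box tags simply yields an empty list.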

addandsubtract on Jan 4, 2024

I've only used LLaVA / BakLLaVA. It falls under the LLAMA 2 Community License. Not sure if you consider that open source or not.
