Hacker News

I thought local LLMs were unable to summarize large documents due to limited token counts or something like that? Can someone ELI5?


You batch them. If the token limit is 32k, for example, you summarize in batches that fit within 32k tokens (including the output), then summarize all the partial summaries.
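The batch-then-summarize-the-summaries approach above can be sketched roughly like this. The `summarize` function is a placeholder for whatever LLM call you'd actually make (local or API), and the chunking is done by characters here purely to keep the sketch self-contained; real code would count tokens and reserve room for the output.

```python
CHUNK_CHARS = 100_000  # rough stand-in for ~25k tokens at ~4 chars/token

def summarize(text: str) -> str:
    # Placeholder for an LLM call; truncation keeps the sketch runnable.
    return text[:200]

def recursive_summarize(doc: str) -> str:
    # Base case: the document fits in one context window.
    if len(doc) <= CHUNK_CHARS:
        return summarize(doc)
    # Otherwise, summarize fixed-size chunks...
    chunks = [doc[i:i + CHUNK_CHARS] for i in range(0, len(doc), CHUNK_CHARS)]
    partials = [summarize(c) for c in chunks]
    # ...then summarize the concatenated partial summaries, recursing
    # in case they still exceed the window.
    return recursive_summarize("\n".join(partials))
```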

It's what we were doing at our company until Anthropic and others released larger-context-window LLMs. We do the transcription locally (whisperX) and the summarization via API. Though we've tried it with local LLMs, too.


Well, it'll always depend on the length of the meeting being summarized. But they're using Mistral, which clocks at 32k context. At an average of 150 spoken words per minute, with 1 token ~= 1 word (which is rather pessimistic), that's about 3h30m of meeting. So I guess that's okay?
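The back-of-the-envelope math works out like this (the 1-token-per-word ratio is the pessimistic assumption from above; English text usually runs closer to ~1.3 tokens per word):

```python
context_tokens = 32_000   # Mistral's advertised window
words_per_minute = 150    # average speaking rate
tokens_per_word = 1       # pessimistic assumption

minutes = context_tokens / (words_per_minute * tokens_per_word)
print(f"{minutes // 60:.0f}h{minutes % 60:.0f}m")  # prints 3h33m
```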


  mistral which clocks at 32k context
I may be wrong, but my understanding was/is:

- Mistral 7B can handle 32k context, but only via sliding window attention (a 4,096-token window), so it never attends to all 32k tokens at once.

- Mixtral (note the 'x') 8x7B can handle 32k context without resorting to sliding window attention.

I wonder whether Mistral would do a better job summarizing a long (32k token) doc all at once, or using recursive summarization.


Hmm. Interesting question. We had no issues using Mixtral 8x7B for this, perhaps reinforcing your point. We use fine-tuned Mistral-7B instances, but not for long-context stuff.

Maybe a neat eval to try.
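If someone wanted to run that eval, a bare-bones harness might look like this. Everything here is hypothetical: `call_model` stands in for whatever client you use (llama.cpp, vLLM, an API, ...), and keyword recall is a crude proxy for summary quality, not a serious metric.

```python
def call_model(prompt: str) -> str:
    # Placeholder LLM call; returns the prompt's tail so the sketch runs.
    return prompt[-500:]

def keyword_recall(summary: str, keywords: list[str]) -> float:
    # Fraction of expected key terms that survive into the summary.
    hits = sum(1 for k in keywords if k.lower() in summary.lower())
    return hits / len(keywords)

def evaluate(doc: str, keywords: list[str]) -> dict:
    # One-shot: feed the whole document in a single prompt.
    one_shot = call_model(f"Summarize:\n{doc}")
    # Recursive: summarize halves, then summarize the partial summaries.
    mid = len(doc) // 2
    partials = [call_model(f"Summarize:\n{doc[:mid]}"),
                call_model(f"Summarize:\n{doc[mid:]}")]
    recursive = call_model("Summarize:\n" + "\n".join(partials))
    return {"one_shot": keyword_recall(one_shot, keywords),
            "recursive": keyword_recall(recursive, keywords)}
```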



