I was about to release an app based on the new Assistant API but just a day before the release the response times increased to 8s flat. When I have function calls, that meant up to a minute to get a response.
I had to rip out everything built on the Assistant API and reimplement it with the Chat API. That turned out to be great, because context management in the Assistant API was very bad: after a few back-and-forth messages the cost ballooned to over 10K tokens per message.
When I looked closely at the Assistant API and Chat API, I noticed that the Assistant API is just a wrapper over the Chat API and acts as a web service that stores the previous messages (so the slow responses were probably due to the web server that keeps track of the context). So I went ahead and implemented my own Assistant API, which gives me more control. For example, I set a max token cost per message, and if the context balloons past that, I send the context to OpenAI and ask it for a summary of all the facts so far, then use that summary as a system prompt, which compresses my context back into reasonable territory.
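The summarize-when-over-budget idea described above could look roughly like the sketch below. It is not the commenter's code: `compress_context`, `rough_token_count`, the `keep_last` parameter, and the 4-characters-per-token estimate are all assumptions for illustration, and `summarize` stands in for whatever call asks the model for a summary.

```python
from typing import Callable

Message = dict[str, str]


def rough_token_count(messages: list[Message]) -> int:
    """Crude estimate of about 4 characters per token (an assumption, not a real tokenizer)."""
    return sum(len(m["content"]) for m in messages) // 4


def compress_context(
    messages: list[Message],
    max_tokens: int,
    summarize: Callable[[list[Message]], str],
    keep_last: int = 2,
    count_tokens: Callable[[list[Message]], int] = rough_token_count,
) -> list[Message]:
    """Return messages unchanged if under budget, else a summary plus the most recent turns."""
    if count_tokens(messages) <= max_tokens:
        return messages
    # Split into older turns (to be summarized) and recent turns (kept verbatim).
    cut = max(len(messages) - keep_last, 0)
    older, recent = messages[:cut], messages[cut:]
    if not older:
        return messages  # nothing left to fold into a summary
    # `summarize` is a hypothetical callable, e.g. a Chat API request asking
    # for "a summary with all the facts so far".
    summary = summarize(older)
    system = {"role": "system", "content": f"Summary of the conversation so far: {summary}"}
    return [system] + recent
```

Keeping the last few turns verbatim is a design choice of this sketch; summarizing the whole context, as the comment describes, corresponds to `keep_last=0`.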
It does considerably more than (poorly) managing the context window. It also (poorly) enables persistent document storage, knowledge retrieval, function calling and code execution.
It's useful if you just need to hook up a chat assistant and don't want to bother with the busywork of doing it yourself. All you care about is loading the messages from the thread (which are conveniently kept for you) and adding new messages.
So, the Assistant API in OpenAI is just a wrapper over the Chat API. They let you choose which model you would like to use, so if you fine-tune a model you should be able to use it.
However, I have never tried fine-tuning; I rely on RAG, and the Assistant API does provide some tools to make this a bit easier. What tools? They provide an "editor interface" where you can set function calls, upload some files and access the code interpreter.
So if you are making a chatbot for Company Y, you can create an assistant which has information about Company Y in the system prompt and also can access up to date information about the company through function calls you define and the files you upload.
If you use only the Chat API, you will have to handle this stuff yourself. Actually, though I'm using the Chat API, I do use the Assistant Editor UI to manage the functions and the system prompts. What I do is retrieve the assistant info from the OpenAI Assistant API and then use it with the Chat API. This way I don't have to bother with creating my own UI or fiddling with text files or the code.
As the Assistant API is just a wrapper, most of the data structures I receive from the Assistant API work directly in the Chat API.
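A minimal sketch of that reuse, assuming the assistant definition has already been fetched and converted to a plain dict (for example with the official Python client's `client.beta.assistants.retrieve(...)` followed by `.model_dump()`). The function name `assistant_to_chat_kwargs` is made up for illustration. Only `function` tools are carried over, since hosted tools like the code interpreter are executed by the Assistants service.

```python
from typing import Any


def assistant_to_chat_kwargs(
    assistant: dict[str, Any], messages: list[dict[str, Any]]
) -> dict[str, Any]:
    """Build Chat Completions request arguments from an assistant definition."""
    # Function tools have the same {"type": "function", "function": {...}}
    # shape in both APIs, so they can be passed through untouched.
    tools = [t for t in assistant.get("tools", []) if t.get("type") == "function"]

    chat_messages = list(messages)
    if assistant.get("instructions"):
        # The assistant's instructions play the role of the system prompt.
        chat_messages.insert(0, {"role": "system", "content": assistant["instructions"]})

    kwargs: dict[str, Any] = {"model": assistant["model"], "messages": chat_messages}
    if tools:  # leave the key out entirely when there are no function tools
        kwargs["tools"] = tools
    return kwargs
```

The result would then be passed on as `client.chat.completions.create(**kwargs)`.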
Finally! I've been using the assistants api in building an ai mock interviewer (https://comp.lol) but the responses were painfully slow when using the latest iterations of the gpt-4 model. This will make things so much more responsive
I'd still want to see the entire response all at once. Having it stream in while I read it would be very distracting and make it difficult for me to read.
It's a request the front-end developer should be confronted with, not OpenAI.
The website could just as well buffer the incoming stream until the user clicks an area to request the next block of the response, once they have finished reading the initial sentences.
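As a rough illustration of that buffering idea (in Python rather than front-end code, and with a made-up helper name `buffer_sentences`), the generator below collects streamed fragments and only releases them in blocks of complete sentences. The sentence splitting is a naive regex and whitespace between sentences is normalized to a single space, both simplifying assumptions.

```python
import re
from typing import Iterable, Iterator

# Split after sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def buffer_sentences(chunks: Iterable[str], per_block: int = 2) -> Iterator[str]:
    """Group streamed text fragments into blocks of `per_block` complete sentences."""
    pending = ""            # text that may still be an unfinished sentence
    ready: list[str] = []   # finished sentences not yet released
    for chunk in chunks:
        pending += chunk
        parts = _SENTENCE_END.split(pending)
        # Everything but the last part is a finished sentence; the tail may still grow.
        ready.extend(parts[:-1])
        pending = parts[-1]
        while len(ready) >= per_block:
            yield " ".join(ready[:per_block])
            ready = ready[per_block:]
    # Flush whatever is left once the stream ends.
    tail = " ".join(ready + ([pending] if pending.strip() else []))
    if tail:
        yield tail
```

A UI could call `next()` on this generator each time the reader asks for more, instead of repainting on every token.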
yes, it's like surfing porn in the early internet years using a dialup modem. One line at a time until you finally can see enough of the picture (reply) to realize that it was not the reply you were looking for.
LLM streaming must be a cost saving feature to prevent you from overloading the servers by asking too many questions within a short time frame. Annoying feature IMHO
How is hiding it behind a loading spinner any better? You still can't spam it with questions since you need to wait for it to finish. With streaming you can at least hit the stop button if it looks incorrect, so you actually spam it more with it enabled.
For me, the constant visual changes of new parts being streamed in are annoying, and straining on the eyes. Ideally, web frontends would honor `prefers-reduced-motion` and buffer the response when set.
Personally, I've fallen in love with that visual effect of streaming text you're talking about. It's a bit pavlovian, but I think in my head it signifies that I'm reading something high signal (even though it isn't always).
It's more about UX, to reduce the perceived delay. LLMs inherently stream their responses, but if you wait until the LLM has finished inference, the user is sitting around twiddling their thumbs.
This was one of the limitations of the Assistants API that made me entirely ignore it up until now.
I am curious if the Assistants API lets you edit/remove/retry messages yet. I don't see anything implying this has changed. It's annoying that the Assistants API doesn't give you enough control to support basic things that the ChatGPT app does.
Like the other commenter said, edit/remove/retry messages can be implemented by the API client already. The API doesn't maintain state so every new message in a "conversation" includes previous messages as context. To edit a message you would re-submit the conversation history with the desired changes.
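For the stateless Chat API this amounts to plain list manipulation on the client. A small sketch, with hypothetical helper names `edit_message` and `retry_last_reply`:

```python
Message = dict[str, str]


def edit_message(history: list[Message], index: int, new_content: str) -> list[Message]:
    """Replace message `index` and drop everything after it, ready to resubmit."""
    if not 0 <= index < len(history):
        raise IndexError(f"no message at index {index}")
    edited = {**history[index], "content": new_content}
    # Later messages were replies to the old wording, so they are discarded.
    return [*history[:index], edited]


def retry_last_reply(history: list[Message]) -> list[Message]:
    """Drop a trailing assistant reply so the same prompt can be sent again."""
    if history and history[-1]["role"] == "assistant":
        return history[:-1]
    return list(history)
```

Either result is then sent as the `messages` array of the next request. Neither helper mutates the original history.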
I get what you're asking for though. It would be nice if this was easier. But that would require OpenAI changing their API model to one where conversation history is stored on their server. It would be more of a "ChatGPT conversation API" than just a GPT-4/3.5 API.
This was indeed true in the beginning, and I don’t know if this has changed. Inserting messages with the Assistant role is crucial for many reasons, such as if you want to implement caching, or otherwise edit/compress a previous assistant response for cost or other reasons.
At the time I implemented a work-around in Langroid[1]: since you can only insert a “user” role message, prepend the content with ASSISTANT: whenever you want it to be treated as an assistant role. This actually works as expected and I was able to do caching. I explained it in this forum:
For all the brilliance in the AI and infra departments of OpenAI, their official Python library (which is the flagship one as I understand) feels pretty unidiomatic, designed without much thought for common patterns in the language.
2012 JavaScript called, it wants its callbacks wrapped in objects back. Why do we have a context manager named "stream" on which you call `.until_done()`? This could've been an iterator, or better, an asynchronous iterator, since this is streaming over the network. We could be destructuring instances of named tuples with pattern matching, or even just doing `"".join(delta.text for delta in prompt (...)`. But no, "here, subclass this instead", the wrapper around a web API tells me.
The `stream` context manager actually does expose an async iterator (in the async client), so you could instead do this for the simple case:
    # async client, so the context manager is entered with `async with`
    async with client.beta.threads.runs.create_and_stream(…) as stream:
        async for text in stream.text_deltas:
            print(text, end="", flush=True)
which I think is roughly what you want.
Perhaps the docs should be updated to highlight this simple case earlier.
We are also considering expanding this design, and perhaps replacing the callbacks, like so:
    with client.beta.threads.runs.create_and_stream(…) as stream:
        async for event in stream.all_events:
            if event.type == 'text_delta':
                print(event.delta.value, end='')
            elif event.type == 'run_step_delta':
                event.snapshot.id
                event.delta.step_details...
which I think is also more in line with what you expect. (you could also `match event: case TextDelta: …`).
Note that the context manager is required because otherwise there's no way to tell if you `break` out of the loop (or otherwise stop listening to the stream) which means we can't close the request (and you both keep burning tokens and leak resources in your app).
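To illustrate why the context manager matters, here is a self-contained sketch that uses a fake connection instead of the real SDK. `FakeConnection` and `open_stream` are hypothetical names; the point is only that the `finally` clause runs when the `with` block is left, including after an early `break`.

```python
from contextlib import contextmanager
from typing import Iterator


class FakeConnection:
    """Hypothetical stand-in for an open HTTP response that streams text."""

    def __init__(self, parts: list[str]) -> None:
        self._parts = parts
        self.closed = False

    def __iter__(self) -> Iterator[str]:
        for part in self._parts:
            if self.closed:
                return
            yield part

    def close(self) -> None:
        self.closed = True


@contextmanager
def open_stream(parts: list[str]) -> Iterator[FakeConnection]:
    conn = FakeConnection(parts)
    try:
        yield conn
    finally:
        # Runs however the `with` block is left: normal completion,
        # an early `break` out of the loop, or an exception.
        conn.close()


with open_stream(["a", "b", "c"]) as stream:
    for text in stream:
        if text == "b":
            break  # stop listening early

print(stream.closed)  # the connection was closed on leaving the block
```

A bare iterator has no equivalent hook: after the `break`, nothing tells the producer that the consumer has gone away.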
does your team do usability tests on the apis before launching them?
if you got 3-5 developers to try and use one of the sdks to build something, i bet you'd see common trends.
e.g. we recently had to update an assistant with new data everyday and get 1 response, and this is what the engineer came up with. probably it could be improved, but this is really ugly
just to add to this, it's not helped by the docs. either they don't exist, or the seo isn't working right.
e.g. search term for me "openai assistant service function call node". The first 2 results are community forums, not what i'm looking for. The 3rd is seemingly the official one but doesn't actually answer the question (how to use the assistance service with node and function calling) with an example. The 4th is in python.
I'm sorry for your experience, and thanks very much for sharing the code snippet - that's helpful!
We did indeed code up some sample apps and highlighted this exact concern. We have some helpers planned to make it smoother, which we hope to launch before Assistants GA. For streaming beta, we were focused just on the streaming part of these helpers.
Is there a technical reason why log probs aren't available when using function calling? It's not a problem, I've already found a workaround. I was just curious haha.
In general I feel like the function calling/tool use is a bit cumbersome and restrictive so I prefer to write the typescript in the functions namespace myself and just use json_mode.
My experience is their official Python library was easy to use, no surprises, everything is typed and generated from the OpenAPI spec in a thoughtful way.
The tools are great because they don't invent their own DSL, they "just" use JSON schemas.
Maybe they ought to contribute changes to OpenAPI to support streaming APIs better.
In contrast so many startups make their own annotation-driven DSLs for Python with their branding slapped over everything. It gives desperate-for-lock-in vibes. The last people OpenAI should be taking advice from for their API design is this forum.
How is suggesting the use of iterators and named tuples related to creating domain specific languages? If anything I'd say they're a much more generic and universally recognizable approach than having users subclass `AssistantEventHandler` to be passed to `client.beta.threads.runs.create_and_stream`, the context manager. This is very much a long way past just using JSON schemas but that part is ok - there's a REST API, and there's a library. If you're keen on the simplicity of JSON schema then by all means use the API with `requests` or your preferred http client library. Since that's always an option, it stands to reason that the point of having a dedicated library is to provide thoughtful abstractions that make it easier to use the service.
What I'm arguing is precisely that the abstractions in the library (such as the `AssistantEventHandler` shown in the article) are ineffective in making things simpler. They force you to over-engineer solutions and distribute state unnecessarily and be aware of that specific class interface when it could've just been something you use in a `for x in y` loop like everyone would know to do without spending an afternoon looking over docs and figuring out how the underlying implicit FSM works.
On the second point, there was an issue at launch where it would not find a relevant fragment and would appear to load the whole file into the context. Unsure if this has changed, but the escalating costs freaked out quite a few folks on the OpenAI discussion forums.
Throwing a feature request in here just in case someone from OpenAI sees it.
I'd really like it if the streaming versions of their APIs could return a token usage count at the end.
The non-streaming APIs do this right now:
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "A short fun fact about pigeons"
      }
    ]
  }'
Returns:
{
  "id": "chatcmpl-92UiIWQaf442wq7Eyp7kF8ge0e3fE",
  "object": "chat.completion",
  "created": 1710381746,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Pigeons are one of the few bird species that can drink water by sucking it up through their beaks, rather than tilting their heads back to swallow."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 33,
    "total_tokens": 47
  },
  "system_fingerprint": "fp_4f0b692a78"
}
Note the "usage" block there telling me how many tokens were used (which tells me how much this cost).
But if I add "stream": true I get back an SSE stream that looks like this:
There's no "usage" block, which means I have to try and account for the tokens myself. This is really inconvenient!
I noticed the other day that the Claude streaming API returns a "usage" block with the last message. I'd love it if OpenAI's API did the same thing.
I need this right now because I'm starting to build features for end users of my own software, and I want to be able to give them X,000 tokens "free" before starting to charge them for extras. Counting those tokens myself (probably using tiktoken) is code I'd rather not have to write - especially since features like tools/functions or images make counting tokens a lot less obvious.
We do the token counting on our end by literally just running tiktoken on the content chunks (although I think usually it's one token per chunk). It's a bit annoying, and I too expected they'd have the usage block, but it's one line of code if you already have tiktoken available. I've found the accounting on my side lines up well with what we see on our usage dashboard.
As an FYI, this is fine for rough usage, but it's not accurate. The OpenAI APIs inject various tokens you are unaware of into the input for things like function calling.
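A sketch of that client-side counting, with the caveat above in mind: it only measures the visible completion text, not prompt tokens or anything the API injects. The tokenizer is passed in as `encode` so the example stays self-contained; with tiktoken installed, `tiktoken.encoding_for_model("gpt-3.5-turbo").encode` would be one choice. The function name is made up, and the sketch assumes a single choice per chunk (`n=1`).

```python
from typing import Any, Callable, Iterable, Sequence


def count_streamed_tokens(
    chunks: Iterable[dict[str, Any]],
    encode: Callable[[str], Sequence[Any]],
) -> int:
    """Estimate completion tokens from the content deltas of a streamed chat completion."""
    text = ""
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            # Each streamed chunk carries a partial message under "delta";
            # role-only and final chunks have no content.
            text += choice.get("delta", {}).get("content") or ""
    # Encode the joined text once rather than chunk by chunk, so a token
    # split across two chunks is not counted twice.
    return len(encode(text))


# Usage with a stand-in tokenizer (whitespace split) just to show the flow.
example_chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": ""}}]},
    {"choices": [{"delta": {"content": "Pigeons "}}]},
    {"choices": [{"delta": {"content": "can drink"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
estimated = count_streamed_tokens(example_chunks, encode=str.split)
```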
I immediately implemented streaming into my rocketchat gpt bot, was definitely a distraction but my colleagues liked it. No more waiting until the complete response is sent.
Openai banned my account for suspicious payment activities, and I never was able to talk to a real person. Just several layers of chat bots posing as people.
I literally want to give them my money and can't. Every few weeks for shirts and giggles i send an email to them saying, "any update on this?"
I suspected as much when one of their support "personnel" used the phrase "I apologize for the earlier confusion..." (there was no confusion, I was simply contradicting what they were saying)
This is one of the reasons I tend to use any of their options through Azure where available. Azure support has a more straightforward (though still sometimes slow) process for account issues.
My Anthropic account was suspended for suspicious activity, even though I never used it. I had forgotten I had signed up, and tried to sign up using a new email with the same phone number. Locked out forever.
This website is now like 30% about this probability-based autocomplete nonsense. Feels like all that bitcoin hype and the "running everything on blockchain" fad of a few years ago. Now it's running everything through a "large autocomplete" model.
I really hope this will fade and the focus will turn back to highlighting some broader actual human ingenuity in IT, rather than a constant stream of "we used autocomplete for this new thing" or "we built this new API for this glorified autocomplete".
Seriously though, it's not going away no matter how much anyone hates it. Emails and blogs will continue to be written with it, letters of recommendation will be/are written with it, Presidential speeches will be written with it, academic articles will be / are written with it (almost all ml and cs research is), news is written with it...
It's not going to stop, but it will _probably_/_very likely_ get better.
There is no tool, no human, no method to determine if text is generated with one of these models at high F-score (only sometimes high precision, low recall domains for silly examples).
We're stuck with it. Like the English teacher and their despised spell check.
It occurs to me that over time, reading comprehension will become significantly more important than the ability to write. Anyone will be able to write something smart-sounding with AI's help, but it'll take real skill to make sure the output is correct and appropriate.
Yes, customers will love anything that helps them. You can get customers to love you by adding any kind of automation for stuff they had to do by hand up to that point. Does this mean there should be 10 articles per day shared about "I added XLSX import to my app, so my customers don't have to do data entry via dialogs"?
My point is about the repetitiveness of LLM topics, not about the usefulness of LLMs themselves. And LLMs are glorified autocomplete. Their internals are maybe interesting, but that's often not what's being discussed here or even written about in the shared articles.
I've gotten so used to having an LLM integrated into my editor that when I work on the occasional spreadsheet (or really anything with syntax that I only use occasionally and no integrated AI) it's pretty jarring to have to go to another tab to look up what function to use for a formula (even if that other tab is ChatGPT).
Nah it's got legs as a google replacement / competitor if they keep costs lower and take a smaller rent. WHEN they start advertising they'll explode. Which is why google is trying to snuff them out in the cradle (sorry about the visual).
"This 1990 paper demonstrated how neural networks could learn to represent and reason about part-whole hierarchical relationships, using family trees as the example domain.
By training on examples of family relations like parent-child and grandparent-grandchild, the neural network was able to capture the underlying logical patterns and reason about new family tree instances not seen during training.
This seminal work highlighted that neural networks can go beyond just memorizing training examples, and instead learn abstract representations that enable reasoning and generalization"
> We know next to nothing about how the human brain works