Implement caching to reduce response times and costs. Round-trip times to LLM providers can be lengthy, upwards of 2-3 seconds per request. With caching, you can return results for identical queries in milliseconds. You also won't pay the LLM provider for the generated response.
Caching is a common concept in computing that aims to serve future requests faster. Once implemented, the application can reuse previously retrieved or computed data instead of re-running the process.
An LLM is just another part of your application, and you can implement a cache layer to reduce latency and costs where it makes sense. Each API call to an LLM costs money and introduces latency to your feature. Caching identical calls means you can return the same response faster without paying for additional tokens.
Note that caching is only useful when you want to return the exact same response.
✅ Good use of caching: Your app highlights example queries during onboarding. You still want to stream a chat response, but the examples are exactly the same every time.
✅ Good use of caching: Two users in an org ask the exact same question against the exact same data. You can return the answer quickly without calling the LLM API.
❌ Bad use of caching: A user gets a bad response and selects "try again" to get a different answer. In this case, you want the model to re-process the query to return a different response.
Design your application to serve cached data when possible, and to bypass or invalidate the cache when a fresh response is needed (a minimal sketch follows below). Generating output tokens is the highest-latency step of an LLM request, so as a best practice, reduce the number of tokens the model has to process and generate to make responses faster and cheaper. You can do this with shorter prompts, fine-tuning, or caching.
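As an illustration, here's a minimal sketch of an in-memory cache keyed on the request payload. The `ChatRequest` type and `callLLM` function are placeholders for your own client code, not part of any Velvet API.

```typescript
// Minimal in-memory cache sketch: identical request payloads return the
// stored response; anything else falls through to the LLM call.
type ChatRequest = {
  model: string;
  messages: { role: string; content: string }[];
};

const cache = new Map<string, string>();

async function cachedCompletion(
  request: ChatRequest,
  callLLM: (req: ChatRequest) => Promise<string> // placeholder for your LLM client
): Promise<string> {
  const key = JSON.stringify(request); // identical payloads produce identical keys

  const hit = cache.get(key);
  if (hit !== undefined) {
    return hit; // cache hit: no API call, no extra tokens
  }

  const response = await callLLM(request); // cache miss: pay for the call once
  cache.set(key, response);
  return response;
}

// To force a fresh answer (the "try again" case above), skip the cache
// entirely or remove the stale entry with cache.delete(key).
```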
There are a variety of ways to implement caching, including pulling data from a database, a filesystem, or an in-memory cache. To start caching requests with Velvet, add a 'velvet-cache-enabled' header set to 'true'.
For example, send a request like the one below with cache enabled. All subsequent requests with an identical payload will return the cached response.
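Here's a minimal sketch using the OpenAI Node SDK routed through the Velvet gateway. The base URL and model are placeholders, so check the Velvet docs for the exact gateway URL; only the 'velvet-cache-enabled' header comes from this guide.

```typescript
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://<your-velvet-gateway>/v1", // placeholder: your gateway URL from the Velvet docs
  defaultHeaders: {
    "velvet-cache-enabled": "true", // cache responses for identical payloads
  },
});

// The first call with this payload hits the LLM; identical payloads afterward
// return the cached response without another provider call.
const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Show me three example queries." }],
});

console.log(completion.choices[0].message.content);
```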
Customers see upwards of an 87% reduction in response times when they use caching. Your application still needs to retrieve the data, but you won't need to make an additional API call to the LLM provider to process the request.
Read more about our latency benchmarks here, including caching tests.
Once Velvet caching is enabled, we'll return a cache status of 'HIT', 'MISS', or 'NONE/UNKNOWN'. The response body is returned in the same format as the LLM provider's, so you won't need to modify your code to handle cached responses.
Velvet warehouses cached responses and attaches additional metadata to each one, including the cache key, value, and status.
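For illustration, you could model that metadata with a type like the one below. Only the concepts (key, value, status) come from this guide; the field names and exact shape in your logs are assumptions, so refer to the Velvet docs for the real schema.

```typescript
// Hypothetical shape for the cache metadata Velvet warehouses with each request.
type VelvetCacheStatus = "HIT" | "MISS" | "NONE/UNKNOWN";

interface VelvetCacheMetadata {
  key: string;               // identifier derived from the request payload
  value: unknown;            // the cached response body, in the provider's format
  status: VelvetCacheStatus; // whether this request was served from cache
}
```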
To implement caching in your app, follow the Velvet caching docs here. You can also read additional optimization resources from OpenAI on reducing costs, optimizing latency, and improving accuracy.
Want to learn more? Read our docs, schedule a call, or email us.