Engineering

Cache LLM requests to reduce latency and costs

Implement caching to reduce response times and costs. Round-trip times to LLM providers can be lengthy, often 2-3 seconds or more per request. With caching, you can return results for identical queries in milliseconds, and you won't pay the LLM provider for the generated response.

When to use caching

Caching is a common concept in computing that aims to serve future requests faster. Once implemented, the application can reuse previously retrieved or computed data instead of re-running the process.

An LLM is just another part of your application, and you can implement a cache layer to reduce latency and costs where it makes sense. Each API call to an LLM costs money and introduces latency to your feature. Caching identical calls means you can return the same response faster without paying for additional tokens.

Note that caching is only useful when you want to return the exact same response.

✅ Good use of caching: Your app highlights example queries during onboarding. You still want to stream a chat response, but the examples are exactly the same every time.

✅ Good use of caching: Two users in an org ask the exact same question against the exact same data. You can return the answer quickly without calling the LLM API.

❌ Bad use of caching: A user gets a bad response and selects "try again" to get a different answer. In this case, you want the model to reprocess the query and generate a fresh response.

How to implement caching

Design your application to use cached data when possible, and to bypass the cache when the request differs or a fresh response is needed. Generating output tokens is the highest-latency step of an LLM request, so reducing output tokens makes responses faster and cheaper. You can do this with shorter prompts, fine-tuning, or caching.
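As an illustration, here's a minimal sketch of application-level caching that keys on the exact request payload, using the OpenAI Python SDK. The model and prompt are placeholders, and a production setup would likely swap the process-local dict for a shared store.

```python
import hashlib
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Process-local cache for demo purposes; swap for a database, filesystem,
# or shared store (e.g. Redis) in production.
_cache: dict[str, dict] = {}


def cached_chat_completion(payload: dict) -> dict:
    # Key the cache on the exact request payload, so only identical requests hit it.
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # cache hit: no API call, no additional tokens billed
    response = client.chat.completions.create(**payload)
    _cache[key] = response.model_dump()
    return _cache[key]


# Identical payloads return the cached response; any change misses the cache.
payload = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Suggest three example queries to try."}],
}
first = cached_chat_completion(payload)   # calls the LLM provider
second = cached_chat_completion(payload)  # served from the cache in milliseconds
```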

There are a variety of ways to implement caching, including pulling data from a database, a filesystem, or an in-memory cache. To start caching requests with Velvet, add a 'velvet-cache-enabled' header set to 'true'.

For example, you might send a request like the one below with the cache header enabled. All subsequent requests with an identical payload will return the cached response.
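A sketch of what that request could look like through Velvet, using the OpenAI Python SDK. The gateway base URL below is a placeholder for illustration only; use the URL (and any additional headers) from the Velvet caching docs.

```python
from openai import OpenAI

# Route requests through the Velvet gateway instead of calling the provider directly.
# NOTE: the base_url below is a placeholder; use the gateway URL from the Velvet docs.
client = OpenAI(
    base_url="https://<your-velvet-gateway>/v1",
    default_headers={"velvet-cache-enabled": "true"},  # enables Velvet caching
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Suggest three example queries to try."}],
)
# Sending this exact payload again returns the cached response instead of
# generating new output tokens.
print(response.choices[0].message.content)
```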

Reduce response times to milliseconds, and marginal costs to 0

Customers see upwards of an 87% reduction in response times when they use caching. Your application still needs to retrieve the data, but you won't need to make an additional API call to the LLM provider to process the request.

Read more about our latency benchmarks here, including caching tests.

Once Velvet caching is enabled, we'll return a status of 'HIT', 'MISS', or 'NONE/UNKNOWN'. The response body is returned in the same format the LLM provider uses, so you won't need to modify your code to handle cached responses.

Velvet warehouses cached responses and attaches additional metadata to each one, including the cache key, value, and status.
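If you want to branch on whether a response came from the cache, something like the following could work. Note that the 'velvet-cache-status' header name here is purely an assumption for illustration; the actual field name and where it appears (response header versus log metadata) are defined in the Velvet docs.

```python
from openai import OpenAI

# Client routed through the Velvet gateway (placeholder base_url, as above).
client = OpenAI(
    base_url="https://<your-velvet-gateway>/v1",
    default_headers={"velvet-cache-enabled": "true"},
)

# Read the raw HTTP response so gateway metadata is visible alongside the body.
# NOTE: "velvet-cache-status" is an assumed header name for illustration only;
# the actual field name and location are defined in the Velvet docs.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Suggest three example queries to try."}],
)
cache_status = raw.headers.get("velvet-cache-status", "UNKNOWN")  # e.g. HIT or MISS
completion = raw.parse()  # parsed into the usual chat completion object

print(cache_status, completion.choices[0].message.content)
```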

To implement caching in your app, follow the Velvet caching docs here. You can also read OpenAI's optimization guides on reducing costs, optimizing latency, and improving accuracy.

Want to learn more? Read our docs, schedule a call, or email us.

