Deploying your Large Language Model (LLM) is not necessarily the final step in productionizing your Generative AI application. An often forgotten, yet crucial, part of the MLOps lifecycle is properly load testing your LLM and ensuring it is ready to withstand your expected production traffic. At a high level, load testing is the practice of testing your application, or in this case your model, with the traffic it would expect in a production environment to ensure that it is performant.
In the past we’ve discussed load testing traditional ML models using open source Python tools such as Locust. Locust helps capture general performance metrics such as requests per second (RPS) and latency percentiles on a per-request basis. While this is effective with more traditional APIs and ML models, it doesn’t capture the full story for LLMs.
LLMs traditionally have a much lower RPS and higher latency than traditional ML models due to their size and larger compute requirements. The RPS metric alone does not paint an accurate picture either, because requests can vary greatly depending on the input to the LLM. For instance, one query might ask the model to summarize a large chunk of text while another might only require a one-word response.
This is why tokens are seen as a much more accurate representation of an LLM’s performance. At a high level, a token is a chunk of text: whenever an LLM processes your input, it “tokenizes” that input. What exactly constitutes a token depends on the specific LLM and its tokenizer, but you can think of it as a word, a part of a word, or a short sequence of characters.
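To make that concrete, here is a minimal sketch using the open-source tiktoken library. The encoding shown is the one used by several OpenAI models and is purely illustrative; Claude and other LLMs ship their own tokenizers, so the exact counts will differ:

import tiktoken

# Purely illustrative: each LLM has its own tokenizer, so counts will vary by model
encoding = tiktoken.get_encoding("cl100k_base")
text = "Load testing LLMs helps you understand token throughput."
tokens = encoding.encode(text)

print(f"{len(text)} characters -> {len(tokens)} tokens")
print(encoding.decode(tokens[:3]))  # the first few tokens decoded back into text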

In this article we’ll explore how to generate token-based metrics so you can understand how your LLM is performing from a serving/deployment perspective. By the end you’ll have an idea of how to set up a load-testing tool to benchmark different LLMs, whether you are evaluating many models, different deployment configurations, or a combination of both.
Let’s get hands on! If you are more of a video-based learner, feel free to follow my corresponding YouTube video down below:
NOTE: This article assumes a basic understanding of Python, LLMs, and Amazon Bedrock/SageMaker. If you are new to Amazon Bedrock please refer to my starter guide here. If you want to learn more about SageMaker JumpStart LLM deployments refer to the video here.
DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.
Table of Contents
- LLM-Specific Metrics
- LLMPerf Intro
- Applying LLMPerf to Amazon Bedrock
- Additional Resources & Conclusion
LLM-Specific Metrics
As we briefly discussed in the introduction with regard to LLM hosting, token-based metrics generally provide a much better representation of how your LLM responds to different payload sizes or types of queries (summarization vs. QnA).
Traditionally we have always tracked RPS and latency, which we will still see here, but more so at a token level. Here are some of the metrics to be aware of before we get started with load testing:
- Time to First Token: The time it takes for the first token to be generated. This is especially handy when streaming; for instance, when using ChatGPT, we start processing information as soon as the first piece of text (token) appears.
- Total Output Tokens Per Second: The total number of tokens generated per second. You can think of this as a more granular alternative to the requests per second we traditionally track.
These are the major metrics that we’ll focus on, and there are a few others, such as inter-token latency, that will also be displayed as part of the load tests. Keep in mind that the expected input and output token sizes are parameters that also influence these metrics; we specifically play with these parameters to get an accurate understanding of how our LLM performs in response to different generation tasks.
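To make these definitions concrete, here is a rough sketch of how time to first token, output tokens per second, and inter-token latency could be measured by hand from a single streaming call through LiteLLM (the client library we use for load testing later in this article). It assumes your AWS credentials are already configured for Bedrock access, and it approximates each streamed chunk as one token, which is close enough for intuition but no replacement for a proper load test:

import time
from litellm import completion

start = time.perf_counter()
chunk_times, text_pieces = [], []

# Stream a response and record a timestamp for every content chunk received
response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": "Summarize load testing in two sentences."}],
    stream=True,
)
for chunk in response:
    piece = getattr(chunk.choices[0].delta, "content", None) or ""
    if piece:
        chunk_times.append(time.perf_counter())
        text_pieces.append(piece)

# Time to First Token: delay until the first streamed chunk arrives
ttft = chunk_times[0] - start

# Output tokens per second, approximating one streamed chunk as one token
tokens_per_second = len(chunk_times) / (chunk_times[-1] - start)

# Inter-token latency: average gap between consecutive chunks
gaps = [later - earlier for earlier, later in zip(chunk_times, chunk_times[1:])]
inter_token_latency = sum(gaps) / max(len(gaps), 1)

print(f"TTFT: {ttft:.2f}s | ~{tokens_per_second:.1f} output tokens/s | "
      f"inter-token latency: {inter_token_latency * 1000:.0f}ms")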
Now let’s take a look at a tool that enables us to toggle these parameters and display the relevant metrics we need.
LLMPerf Intro
LLMPerf is built on top of Ray, a popular distributed computing Python framework. LLMPerf specifically leverages Ray to create distributed load tests where we can simulate real-time production level traffic.
Note that any load-testing tool is only going to be able to generate your expected amount of traffic if the client machine it runs on has enough compute power to match that load. For instance, as you scale the concurrency or throughput expected for your model, you’d also want to scale the client machine(s) where you are running your load test.
Now, within LLMPerf there are a few exposed parameters that are tailored for LLM load testing, as we’ve discussed:
- Model: This is the model provider and your hosted model that you’re working with. For our use-case it’ll be Amazon Bedrock and Claude 3 Sonnet specifically.
- LLM API: This is the API format in which the payload should be structured. We use LiteLLM which provides a standardized payload structure across different model providers, thus simplifying the setup process for us especially if we want to test different models hosted on different platforms.
- Input Tokens: The mean input token length; you can also specify a standard deviation for this number.
- Output Tokens: The mean output token length; you can also specify a standard deviation for this number.
- Concurrent Requests: The number of concurrent requests for the load test to simulate.
- Test Duration: You can control the duration of the test; this parameter is specified in seconds.
LLMPerf exposes all these parameters through its token_benchmark_ray.py script, which we configure with our specific values. Let’s take a look now at how we can configure this for Amazon Bedrock.
Applying LLMPerf to Amazon Bedrock
Setup
For this example we’ll be working in a SageMaker Classic Notebook Instance with a conda_python3 kernel and an ml.g5.12xlarge instance. Note that you want to select an instance that has enough compute to generate the traffic load that you want to simulate. Ensure that you also have your AWS credentials set up for LLMPerf to access the hosted model, be it on Bedrock or SageMaker.
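As a quick sanity check before launching anything, you can confirm which identity and region the notebook is using with boto3. This is a minimal sketch; the identity it prints must have permissions to invoke your model on Bedrock (or SageMaker):

import boto3

# Confirm the AWS identity and region the load test will run under
session = boto3.Session()
sts = session.client("sts")
print(sts.get_caller_identity()["Arn"])  # IAM identity LLMPerf will use
print(session.region_name)               # should match the region hosting your model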
LiteLLM Configuration
We first configure our LLM API structure of choice, which in this case is LiteLLM. LiteLLM supports various model providers; here we configure the completion API to work with Amazon Bedrock:
import os
from litellm import completion

# AWS credentials and region for the Bedrock-hosted model
os.environ["AWS_ACCESS_KEY_ID"] = "Enter your access key ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Enter your secret access key"
os.environ["AWS_REGION_NAME"] = "us-east-1"

# Invoke Claude 3 Sonnet through LiteLLM's completion API
response = completion(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"content": "Who is Roger Federer?", "role": "user"}]
)
output = response.choices[0].message.content
print(output)
To work with Bedrock we configure the Model ID to point towards Claude 3 Sonnet and pass in our prompt. The neat part with LiteLLM is that the messages key has a consistent format across model providers.
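To illustrate that consistency, the same messages list could be sent to an entirely different provider just by swapping the model string. The second model name below is purely illustrative and would require that provider's API key to be configured:

# Same messages payload, different provider: only the model string changes
same_messages = [{"role": "user", "content": "Who is Roger Federer?"}]

bedrock_response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",  # explicit Bedrock routing
    messages=same_messages
)
openai_response = completion(
    model="gpt-4o",  # illustrative only; requires OPENAI_API_KEY to be set
    messages=same_messages
)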
Once that runs successfully, we can focus on configuring LLMPerf for Bedrock specifically.
LLMPerf Bedrock Integration
To execute a load test with LLMPerf we can simply use the provided token_benchmark_ray.py script and pass in the parameters we discussed earlier:
- Input Tokens Mean & Standard Deviation
- Output Tokens Mean & Standard Deviation
- Max number of requests for test
- Duration of test
- Concurrent requests
In this case we also specify our API format to be LiteLLM and we can execute the load test with a simple shell script like the following:
%%sh
python llmperf/token_benchmark_ray.py \
--model bedrock/anthropic.claude-3-sonnet-20240229-v1:0 \
--mean-input-tokens 1024 \
--stddev-input-tokens 200 \
--mean-output-tokens 1024 \
--stddev-output-tokens 200 \
--max-num-completed-requests 30 \
--num-concurrent-requests 1 \
--timeout 300 \
--llm-api litellm \
--results-dir bedrock-outputs
In this case we keep the concurrency low, but feel free to toggle this number depending on what you’re expecting in production. Our test will run for 300 seconds, and after that duration you should see an output directory with two files: one with statistics for each individual inference and one with the mean metrics across all requests in the duration of the test.
We can make this look a little neater by parsing the summary file with pandas:
import json
from pathlib import Path
import pandas as pd

# Load JSON files
individual_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_individual_responses.json")
summary_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_summary.json")

with open(individual_path, "r") as f:
    individual_data = json.load(f)
with open(summary_path, "r") as f:
    summary_data = json.load(f)

# Per-request statistics as a DataFrame
df = pd.DataFrame(individual_data)

# Print summary metrics
summary_metrics = {
    "Model": summary_data.get("model"),
    "Mean Input Tokens": summary_data.get("mean_input_tokens"),
    "Stddev Input Tokens": summary_data.get("stddev_input_tokens"),
    "Mean Output Tokens": summary_data.get("mean_output_tokens"),
    "Stddev Output Tokens": summary_data.get("stddev_output_tokens"),
    "Mean TTFT (s)": summary_data.get("results_ttft_s_mean"),
    "Mean Inter-token Latency (s)": summary_data.get("results_inter_token_latency_s_mean"),
    "Mean Output Throughput (tokens/s)": summary_data.get("results_mean_output_throughput_token_per_s"),
    "Completed Requests": summary_data.get("results_num_completed_requests"),
    "Error Rate": summary_data.get("results_error_rate")
}

print("Claude 3 Sonnet - Performance Summary:\n")
for k, v in summary_metrics.items():
    print(f"{k}: {v}")
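Since the individual responses are already loaded into a DataFrame above, you can also look beyond the means, for example at latency percentiles. The column names below are my assumption of what LLMPerf records per request, so check the keys in your own individual_responses.json first:

# Per-request latency percentiles (column names assumed; verify against your output file)
for col in ["ttft_s", "end_to_end_latency_s"]:
    if col in df.columns:
        print(f"{col}: p50={df[col].quantile(0.50):.2f}s, "
              f"p90={df[col].quantile(0.90):.2f}s, p99={df[col].quantile(0.99):.2f}s")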
The final load test results will look something like the following:

As we can see, the output shows the input parameters that we configured, along with the corresponding results: time to first token (s) and throughput in terms of mean output tokens per second.
In a real-world use case you might use LLMPerf across many different model providers and run tests across these platforms. Used at scale, this tool lets you holistically identify the right model and deployment stack for your use case.
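As a rough sketch of what that could look like, you could wrap the same script in a loop over the models and concurrency levels you want to compare. The model list below is purely illustrative, and each provider you add would need its own credentials configured for LiteLLM:

import subprocess

# Illustrative sweep: each run writes results to its own directory for later comparison
models = [
    "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    "bedrock/anthropic.claude-3-haiku-20240307-v1:0",  # hypothetical second candidate
]
for model in models:
    for concurrency in [1, 5]:
        results_dir = f"outputs/{model.split('/')[-1]}-c{concurrency}"
        subprocess.run(
            [
                "python", "llmperf/token_benchmark_ray.py",
                "--model", model,
                "--mean-input-tokens", "1024", "--stddev-input-tokens", "200",
                "--mean-output-tokens", "1024", "--stddev-output-tokens", "200",
                "--max-num-completed-requests", "30",
                "--num-concurrent-requests", str(concurrency),
                "--timeout", "300",
                "--llm-api", "litellm",
                "--results-dir", results_dir,
            ],
            check=True,
        )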
Additional Resources & Conclusion
The entire code for the sample can be found at this associated GitHub repository. If you also want to work with SageMaker endpoints, you can find a Llama JumpStart deployment load testing sample here.
All in all, load testing and evaluation are both crucial to ensuring that your LLM performs well against your expected traffic before pushing to production. In future articles we’ll cover not just the evaluation portion, but how we can create a holistic test with both components.
As always, thank you for reading, and feel free to leave any feedback and connect with me on LinkedIn and X.