In this article, we will go over the required fundamentals to understand how benchmarking works and then show how we can use the results of the benchmarks to create fair markets.
Intro
The Nosana grid contains about two thousand nodes with various hardware configurations, which are actively running AI models. Since the start of the Nosana Test Grid phase, these nodes have mostly been running image generation or transcription jobs through Stable Diffusion and Whisper. Although these jobs are suitable for making sure nodes are functioning properly, they do not provide any additional benefit from an AI use case perspective.
So to make the best use of the nodes until the launch of the mainnet at the beginning of 2024, we started looking for opportunities to run jobs that would be useful to the Nosana community. As Large Language Models (LLMs) are projected to be the central pillar of Nosana’s AI demand in the foreseeable future, we decided to hire a dedicated AI specialist team to start working on a large-scale LLM benchmarking project on the Nosana Grid. This project aims to provide information that will help clients make better-informed decisions, help the Nosana team implement a fairer market system, and contribute valuable information to the LLM research community. In this blog post, we will go over the required fundamentals to understand how benchmarking works and then show how we can use the results of the benchmarks to create fair markets.
LLM Fundamentals
Let’s start with the fundamentals of LLMs. What is an LLM? How does it work? And what do we need to run one?
Architecture
To anyone reading up on LLMs, they might seem like complex neural networks used in artificial intelligence. While this is true to some extent, in practice, LLMs essentially consist of two easy-to-understand files: the model weights file and the model code file. The model weights file contains the parameters of the model and determines the model size, which is measured in bytes. The model code file contains instructions on how to load and run the model.
When looking at llama3.1–70B, the identifier 70B means the model contains 70 billion parameters. The parameters of the model are stored with 16-bit floating-point precision, which equals 2 bytes per parameter, making the model weights file 140 gigabytes in size.
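This arithmetic is easy to reproduce. The sketch below is purely illustrative; the parameter counts are the ones mentioned in this article, and 1 GB is taken as 10⁹ bytes.

```python
# Rough model-weights size: number of parameters times bytes per parameter.
def weights_size_gb(num_params: float, bytes_per_param: float = 2) -> float:
    """Approximate weights file size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

print(weights_size_gb(70e9))  # llama3.1-70B at 16-bit precision: ~140 GB
print(weights_size_gb(8e9))   # llama3.1-8B at 16-bit precision: ~16 GB
```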
Each parameter in the model weights file corresponds to a connection weight in the architecture described in the model code file. For most modern-day LLMs, this architecture is called a transformer. The image below shows a generalized transformer architecture used for producing text.
A detailed explanation of transformers is beyond the scope of this article, so we will focus only on the part that is most important for this benchmarking research. For an LLM to output text, it needs to perform computations at every layer, and to do this, its parameters and certain cached computations need to be loaded into memory at the respective layers of the model.
Inference
Now that we know what an LLM is, let’s see how we actually produce language with them. LLMs are trained on the task of next token prediction. Tokens are units of text that correspond to words or parts of words, and they are the vocabulary that is understood by LLMs. So as far as LLMs are concerned, producing language is nothing more than correctly predicting the next word or subword given the preceding ones. This process of producing tokens with LLMs is called inference. The speed of inference is an important factor in the usability of LLMs, and it is influenced by the model size, architecture, and the hardware & software configuration on which it is run.
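To make tokens a bit more concrete, the sketch below splits a sentence into tokens using the tiktoken library and a GPT-style tokenizer. This tokenizer is used purely for illustration; Llama models ship with their own tokenizer, but the principle is the same.

```python
import tiktoken  # pip install tiktoken

# Encode a sentence into token ids, then show which word and subword pieces they map to.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Benchmarking large language models on the Nosana Grid")
print(tokens)                              # a list of integer token ids
print([enc.decode([t]) for t in tokens])   # the pieces of text behind those ids
```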
So how does inference work? To facilitate inference, LLMs go through two main stages: the prefill phase and the decoding phase.
In the prefill phase, the model processes all input tokens simultaneously to compute all the necessary information for generating subsequent tokens. In practice, the duration of this phase corresponds to the time you wait until the LLM starts generating its response. The prefill phase is highly parallelized and makes efficient use of the computing capacity, passing through the model only once.
At the start of the decoding phase, the model uses the cached information computed during the prefill phase to generate a token. From this point on, for every newly generated token, the previous token needs to pass through the network together with the cached computations. This process of repeatedly going through the network is not computationally intensive, as computations only have to be performed for a single token. Instead, the decoding phase is memory-intensive because the cached information has to be moved around to perform the necessary computations.
The prefill phase is compute-bound, while the decoding phase is memory-bound. A process is considered compute-bound when it requires significant computation and its speed is limited primarily by the processing power of the hardware. A process is memory-bound when its performance is limited by the rate at which data moves to and from memory. This rate is called the memory bandwidth.
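A simple way to think about this distinction is to compare how long the arithmetic takes against how long the data movement takes; whichever is larger is the bottleneck. The sketch below is a deliberately simplified illustration of that comparison, using the published specs of an RTX 4090 (introduced below) as default values and the common rule of thumb of roughly 2 FLOPs per parameter per token.

```python
# A workload is compute-bound when the arithmetic takes longer than the data
# movement, and memory-bound otherwise. This is a simplified illustration.
def bottleneck(flops: float, bytes_moved: float,
               peak_flops: float = 82.58e12,        # RTX 4090 compute capacity
               bandwidth_bytes_s: float = 1008e9    # RTX 4090 memory bandwidth
               ) -> str:
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth_bytes_s
    return "compute-bound" if compute_time > memory_time else "memory-bound"

# Decoding one token with an 8B-parameter model: roughly 2 FLOPs per parameter,
# while all ~16 GB of weights must be streamed through memory.
print(bottleneck(flops=2 * 8e9, bytes_moved=16e9))  # memory-bound
```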
Hardware Requirements
Alright, so we need compute and memory capacity to run LLMs. This is where our GPUs come in. Let’s figure out what we would be able to run with the Nosana grid’s most popular GPU, the RTX 4090. When considering GPUs, the available memory is referred to as VRAM, the memory bandwidth is expressed in bytes per second, and the computational capacity is expressed in FLOPS, floating-point operations per second. The RTX 4090 has 24 GB of VRAM, a memory bandwidth of 1,008 GB/s, and delivers 82.58 teraFLOPS.
The hard requirement for running an LLM is having enough VRAM to store its parameters plus the cached computations. For llama3.1–8B with 16-bit floating-point precision, this comes to approximately 18GB of VRAM. Our RTX 4090 will be able to handle that.
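As a back-of-the-envelope check, we can compare the weights size plus some headroom for the cached computations against the available VRAM. The sketch below assumes a rough 15% overhead on top of the weights; the real overhead depends on context length, batch size, and the inference framework.

```python
# Very rough VRAM fit check: weights plus an assumed cache/overhead margin.
def fits_in_vram(num_params: float, vram_gb: float,
                 bytes_per_param: float = 2, overhead: float = 0.15) -> bool:
    required_gb = num_params * bytes_per_param / 1e9 * (1 + overhead)
    return required_gb <= vram_gb

print(fits_in_vram(8e9, 24))   # llama3.1-8B on an RTX 4090 (24 GB VRAM): True
print(fits_in_vram(70e9, 24))  # llama3.1-70B on an RTX 4090: False
```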
Figuring out if a model can be run is fairly straightforward, but how it will perform is harder to determine, as it depends on many variables. That being said, given the constraint of a memory-bound process, we can give a rough theoretical estimate of the inference speed by dividing the model size by the memory bandwidth, which gives the time needed to produce one token.
18 GB ÷ 1,008 GB/s ≈ 18 ms
This works out to roughly 18 ms per token, or about 56 tokens per second. It is a rough estimate for an inference run with a single input sequence, which is a predominantly memory-bound process. However, if we increased the number of sequences processed at the same time, the workload could shift from memory-bound to compute-bound, and the FLOPS of the GPU would start to play a more important role.
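The same estimate can be turned around to compare GPUs by their published memory bandwidth. The sketch below uses the 18 GB weights-plus-cache figure from above and vendor-listed bandwidths for two common Nosana GPUs (the RTX 3090 figure is the vendor spec, not taken from this article); these are theoretical upper bounds, not measured results.

```python
# Theoretical upper bound on single-sequence decoding speed: every new token
# requires streaming the model (plus cache) through memory once.
def est_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

# Vendor-listed memory bandwidths in GB/s.
gpus = {"RTX 4090": 1008, "RTX 3090": 936}
for name, bandwidth in gpus.items():
    print(f"{name}: ~{est_tokens_per_second(18, bandwidth):.0f} tokens/s")
```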
Optimization
We have covered the fundamentals of LLMs and understand which variables are important for running them. By using optimization techniques, we can tweak these variables, resulting in beneficial tradeoffs that will allow us to run bigger models or increase inference speed. There are many knobs to turn when it comes to optimization, but the three most important ones are quantization, computation enhancement and caching strategies.
Quantization reduces the precision of the model’s weights to lower-bit representations. For example, if we used quantization to convert the 16-bit floating-point precision of our llama3.1–8B model to 8-bit integers, we would reduce the model size from 16GB to 8GB. While this can have an impact on the accuracy of the model, the significant decrease in memory usage makes it a viable optimization technique.
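Because the footprint is just parameters times bytes per parameter, the effect of quantization on size is easy to tabulate. The precisions below are common examples; which ones are actually available depends on the model and the inference framework.

```python
# Weights footprint of an 8-billion-parameter model at different precisions.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"llama3.1-8B at {precision}: ~{8e9 * nbytes / 1e9:.0f} GB")
```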
Computation enhancement focuses on optimizing the operations performed within the model, such as the attention mechanism. This mechanism is an important part of the transformer’s success but is also computationally expensive. By modifying the order of computations or by fusing together certain model layers, we can reduce the amount of data that needs to be read from and written to memory.
Caching strategies involve the reduction of the cached computations that are kept in memory. By simplifying the structure of the cached computations, it is possible to significantly reduce the memory footprint in exchange for a slight decrease in model accuracy.
Inference Frameworks
We have briefly gone over the main optimization techniques, and it is already becoming apparent that implementing any of these changes ourselves could become complicated. Luckily, there are various LLM inference frameworks that provide an interface to a wide range of models with built-in options for optimization. As the field of LLM inference is rapidly evolving, these frameworks roll out updates frequently, and there is no clear-cut best framework.
One such framework is Ollama. We will give it a special mention here because it is the framework that was used to gather the initial benchmarking results. Ollama originated as a user-friendly framework with the goal of democratizing the use of LLMs. The Ollama team has impressively succeeded in achieving this goal, as it is undoubtedly the easiest framework for anyone with modest hardware to spin up an LLM. It is especially fitting for running and testing consumer-grade hardware, as its optimization techniques seamlessly allow models of any size to be run on GPUs with any amount of VRAM.
Nosana Benchmarking
Enough preliminaries. It is time to get into the actual benchmarking. Let’s start off by going over the data we collected and how we collected it.
Benchmarking Setup
As mentioned, Ollama was picked as the initial framework for benchmarking due to its compatibility with consumer-grade hardware. With this framework in place, we implemented a custom-made benchmarking script to gather data in two distinct categories: model performance and system specifications.
The performance data contains variables on the total number of tokens produced and how long it took to produce them, as well as the clock speed and the wattage of the GPU. The system specifications data contains variables on an extensive set of system configurations that can have either large or small effects on model performance. The tables below illustrate the kinds of variables and their potential values.
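To give an impression of what such a record looks like, here is an illustrative sketch. The field names are examples chosen for this article, not the exact schema of the Nosana benchmarking script.

```python
from dataclasses import dataclass

# Illustrative shape of a single benchmark record; field names are examples,
# not the exact schema used by the Nosana benchmarking script.
@dataclass
class BenchmarkRecord:
    # model performance
    model_name: str
    tokens_generated: int
    generation_time_s: float
    gpu_clock_mhz: int
    gpu_power_w: float
    # system specifications
    gpu_type: str
    vram_gb: int
    driver_version: str
    cuda_version: str

    @property
    def tokens_per_second(self) -> float:
        return self.tokens_generated / self.generation_time_s
```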
With the to-be-collected variables defined, we now had to pick a model for benchmarking. Given the notable performance of the newly launched Llama models and the wide variety of VRAM capacities among Nosana Nodes, we decided to pick llama3.1–8B, which fits on all GPUs.
Now for the actual procedure of benchmarking, we had to create a method compatible with the current job posting structure of Nosana. A job has a maximum length of X hours, giving us plenty of time to load in one or more models, prompt them, and measure their performance. During a job, every model got prompted with randomly sampled sequences such as “Write a step-by-step guide on how to bake a chocolate cake from scratch.” The content of a prompt does not influence the performance of the model, but it does determine the length of the response, so to make sure the LLMs spent most of their time on actual inference, the pool of prompts we made encouraged longer answers. At the end of each job, the output contains the model performance and system specs variables, which we extracted and added to a large dataset.
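For readers who want a feel for what a single benchmark iteration looks like, below is a minimal sketch against a local Ollama server. It is an illustration of the kind of measurement involved, not the actual Nosana benchmarking script; the prompt pool, model tag, and timeout are placeholder values.

```python
import random
import requests  # pip install requests

# Example prompt pool; the real pool encourages long answers.
PROMPTS = [
    "Write a step-by-step guide on how to bake a chocolate cake from scratch.",
    "Explain in detail how a transformer language model generates text.",
]

def run_once(model: str = "llama3.1:8b") -> float:
    """Run one prompt against a local Ollama server and return tokens per second."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": random.choice(PROMPTS), "stream": False},
        timeout=600,
    ).json()
    # eval_count / eval_duration cover the decoding phase; durations are in nanoseconds.
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

if __name__ == "__main__":
    print(f"~{run_once():.1f} tokens/s")
```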
Evaluation
Before we get into the results, let’s quickly look at the key evaluation metrics for LLM inference: inference speed and time to first token (TTFT). The inference speed is measured in tokens produced per second during the decoding phase and largely determines how long a user has to wait for a full response. The TTFT measures how long a user has to wait for the first response token, which is a crucial component in the usability and desirability of many LLM applications.
In this first article, where we test consecutive single-user queries, we will mainly focus on inference speed as a measure of performance, not TTFT. The TTFT measurement depends on the prefill phase, which, as mentioned, is a compute-bound process. In our setting, where we process one query at a time, the amount of computation needed during the prefill phase is low, resulting in uniformly low TTFTs for all GPUs. Inference speed, on the other hand, which measures how fast the decoding process is executed, is heavily dependent on memory bandwidth. There is a large variety of memory bandwidth capacities among the GPUs on the Nosana Grid, so focusing on the inference speed metric will provide us with the most insightful observations.
As a final note on our evaluation procedure, it is worth mentioning that most LLM inference benchmarking research focuses on the hardware’s capacity to handle concurrent users. When dealing with concurrent users, the model has to handle multiple queries at the same time. This setting makes it possible to maximally utilize the GPU’s memory and computational capacities, especially for enterprise hardware with large amounts of VRAM. We have deliberately chosen to perform this initial benchmark with consecutive single queries, or 1 concurrent user, to define a setting in which we can evaluate all GPUs on the Nosana Grid. This means that our results will help us identify well-performing and underperforming nodes across all markets, which enables the implementation of a fair market structure. However, the current setup does not benchmark the actual maximum capabilities of the nodes, which will be a topic in one of our upcoming articles.
Results
The dataset we used contains information on 10,596 jobs performed by 550 unique nodes in 15 Nosana Markets. Within these nodes, there are 39 unique types of GPUs. The RTX 4090 and 3090 are by far the most common GPUs, with counts of 122 and 101, respectively.
Market Performance
As an initial goal of our benchmarking research, we set out to create fair markets. In the presented results, we have aggregated the data at the market level, so we can show the performance of each market and highlight opportunities for improvement.
In the above visual, we see the average tokens per second for each market. All the way at the top, we have the H100 with 111 tokens per second, and at the bottom, we have the RTX 4060 with 42 tokens per second. At first glance, this graph does not indicate anything out of the ordinary. On average, the general trend seems to be that the more expensive the GPU, the better the performance.
When we look at the performance variation within markets, indicated by the black bar, we get some more interesting findings. A larger bar means that the performance of individual nodes within that market differs more. Varied performance within markets is undesirable for both clients and node runners. When clients use a Nosana Node for inference computation, they want reliable performance suitable for their application. When node runners provide compute, they want to be paid based on the quality of compute they provide. A high variance within markets interferes with both of these objectives.
Completely fair markets would therefore mean a variance of zero within each of them. However, getting to zero variance within each market would require drastic solutions that impede the functionality of the Nosana Grid. Fortunately, though, designing the markets to minimize performance variance is something we can do. For example, we can implement a minimum performance threshold based on the average of each market. Every node performing worse than the average minus the performance threshold would be removed from the market. This would not only reduce the variance of the market, which is caused predominantly by underperforming nodes, but also increase the average tokens per second within the market.
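To make the rule concrete, here is a minimal sketch of the proposed threshold applied to a hypothetical market; the node names and speeds are made up for illustration.

```python
# Drop nodes whose tokens/s falls more than `threshold` below the market average.
def apply_threshold(node_speeds: dict[str, float], threshold: float = 0.20) -> dict[str, float]:
    market_avg = sum(node_speeds.values()) / len(node_speeds)
    cutoff = market_avg * (1 - threshold)
    return {node: tps for node, tps in node_speeds.items() if tps >= cutoff}

# Hypothetical market: three healthy nodes and one underperformer.
market = {"node-a": 58.0, "node-b": 55.0, "node-c": 57.0, "node-d": 25.0}
kept = apply_threshold(market)
print(kept)                            # node-d is removed
print(sum(kept.values()) / len(kept))  # the market average goes up
```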
The visual below illustrates what would happen to the market if a 20% or a stricter 10% threshold were to be implemented.
As we can see, for markets with high variation, the threshold causes a significant increase in performance. This is because the threshold removes a larger number of underperforming nodes within these markets. As a result, these markets become fairer because they provide more reliable compute to clients and pay out similar amounts for similar-quality compute.
Performance Monitoring
After we observed the variance between nodes, we started analyzing the variables that cause performance fluctuations. Even though we used a large set of hardware specs for our analysis, the results were unambiguous and pointed to two main factors responsible for performance: GPU type and wattage. GPU type is the foundation that determines the performance range of a node, but wattage plays an arguably more crucial role by determining the location within this range. For example, a 3070 GPU running at full power can outperform a 4090 GPU that is not getting enough power, showing how proper wattage allocation can be just as important as the GPU model itself.
With this knowledge, we are now able to categorize three types of node runners that deviate from the expected performance. Spoofers are malicious node runners that fake hardware configurations and performance. Underclockers are economically greedy node runners that do not provide enough power to their hardware setup. The third category consists of node runners with unforeseen technical issues. As the Nosana team, we want to remove spoofers, keep underclockers in check, and help any node runner facing technical difficulties. Because model inference performance cannot be faked, monitoring this metric helps us identify which category of underperforming node runner we are dealing with and take appropriate measures to balance the markets.
Node Leaderboard
As a first practical step towards fair markets, we introduce the Nosana Node Leaderboard. Here, we track the performance of each node within the market and display relevant hardware configurations. This allows us, together with the community, to monitor the performance of Nosana Nodes in a transparent way. Go check it out!
Next Steps
By doing research using the Nosana Grid, we aim to accomplish two main goals. First, to create the optimal Nosana experience by incorporating data-driven insights into our decision making. Second, to contribute valuable research findings to the broader large language model community. In this article, we mainly focused on the first goal, as the current benchmark results are practically useful for the Nosana Grid but do not provide information about the maximum performance capacity of specific model-hardware combinations in realistic settings. In our next article, we’ll explore maximum performance in real-world conditions.
About Nosana
Nosana is an open-source cloud computing marketplace dedicated to AI inference. Their mission is simple: make GPU computing more accessible at a fraction of the cost. The platform has two main goals: providing AI users with flexible GPU access and allowing GPU owners to earn passive income by renting out their hardware.
By offering affordable GPU power, Nosana enables AI users to train and deploy models faster without expensive hardware investments, all powered by the $NOS token. Access compute for a fraction of the cost or become a compute supplier at Nosana.io.
Website | Documentation | Twitter | Discord | Telegram | LinkedIn
Originally published at https://nosana.io.