Scenario 3: Generation Heavy Benchmarks in Generative AI
The generation-heavy scenario covers use cases where the model's response is much longer than the prompt, for example a long job description generated from a short bullet list of requirements.
The generation-heavy scenario is performed with the following token lengths:
- The prompt length is fixed at 100 tokens.
- The response length is fixed at 1,000 tokens.
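As a rough illustration of this scenario's shape, the sketch below builds a benchmark-style request with a prompt padded to about 100 tokens and the response capped at 1,000 tokens. The payload field names (`prompt`, `max_tokens`) and the one-word-per-token estimate are assumptions for illustration, not the service's actual API:

```python
def build_generation_heavy_request(bullets):
    """Build a generation-heavy payload: ~100-token prompt, 1,000-token response cap.

    The field names (prompt, max_tokens) are illustrative assumptions, not the
    actual request schema of any specific inference service.
    """
    prompt = "Write a detailed job description based on these points:\n"
    prompt += "\n".join(f"- {b}" for b in bullets)
    # Very rough token estimate: ~1 token per whitespace-separated word.
    words = prompt.split()
    # Pad the prompt up to roughly 100 "tokens" to match the scenario spec.
    while len(words) < 100:
        words.append("detail")
    return {
        "prompt": " ".join(words),
        "max_tokens": 1000,  # fixed response length for this scenario
    }

payload = build_generation_heavy_request(["5+ years Python", "Remote", "Team lead"])
```

In a real harness, many such payloads would be issued in parallel at each concurrency level while per-request timings are recorded.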
Important: The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it hosts. Traffic scenarios depend on:
- The number of concurrent requests.
- The number of tokens in the prompt.
- The number of tokens in the response.
- The variance of prompt and response token counts across requests.
Review the terms used in the hosting dedicated AI cluster benchmarks. For a list of scenarios and their descriptions, see Chat and Text Generation Scenarios. The generation-heavy scenario is performed in the following regions.
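To make the four reported metrics concrete, the sketch below computes them from per-request measurements of a load test. The definitions used here are common conventions assumed for illustration, not the benchmark's published methodology:

```python
import statistics

def summarize_benchmark(requests):
    """Summarize per-request measurements into the four benchmark metrics.

    Each request is a dict with start_s, end_s (wall-clock seconds) and
    output_tokens. The metric definitions below are assumed conventions,
    not taken from the benchmark's stated methodology.
    """
    wall_s = max(r["end_s"] for r in requests) - min(r["start_s"] for r in requests)
    latencies = [r["end_s"] - r["start_s"] for r in requests]
    return {
        # Token-level inference speed: per-request generation rate, averaged.
        "speed": statistics.mean(
            r["output_tokens"] / lat for r, lat in zip(requests, latencies)
        ),
        # Token-level throughput: total generated tokens over the whole run.
        "throughput": sum(r["output_tokens"] for r in requests) / wall_s,
        # Request-level latency: mean end-to-end seconds per request.
        "latency_s": statistics.mean(latencies),
        # Request-level throughput: completed requests per minute (RPM).
        "rpm": len(requests) * 60 / wall_s,
    }
```

Note the distinction this makes visible: inference speed is a per-request rate (it falls as concurrency rises), while token-level throughput aggregates all concurrent streams (it rises with concurrency).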
Brazil East (Sao Paulo)
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 50.18 | 50.14 | 20.43 | 2.94 |
| 2 | 49.28 | 97.61 | 20.78 | 5.72 |
| 4 | 48.22 | 186.82 | 21.32 | 10.94 |
| 8 | 47.20 | 365.89 | 21.75 | 21.43 |
| 16 | 44.69 | 650.22 | 22.89 | 38.03 |
| 32 | 37.29 | 989.98 | 27.31 | 58.04 |
| 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
| 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
| 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
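A useful sanity check on these tables: in a closed-loop load test, request-level throughput roughly follows concurrency × 60 / latency. The helper below sketches that relation; the assumption that each concurrent stream issues its next request immediately after the previous one completes is mine, not stated by the benchmark:

```python
def approx_rpm(concurrency, latency_s):
    """Approximate requests/minute for a closed-loop load test.

    Assumes each of `concurrency` parallel streams issues its next request
    as soon as the previous one completes (an assumption about the test
    harness, not part of the published benchmark methodology).
    """
    return concurrency * 60 / latency_s

# Checked against the Meta Llama 3.1 (405B) row at concurrency 2:
# approx_rpm(2, 16.14) gives about 7.43 RPM, close to the reported 7.12 RPM.
```

The gap between the estimate and the reported value reflects harness overhead and variance across requests; the approximation degrades at high concurrency, where queueing dominates.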
Germany Central (Frankfurt)
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 50.18 | 50.14 | 20.43 | 2.94 |
| 2 | 49.28 | 97.61 | 20.78 | 5.72 |
| 4 | 48.22 | 186.82 | 21.32 | 10.94 |
| 8 | 47.20 | 365.89 | 21.75 | 21.43 |
| 16 | 44.69 | 650.22 | 22.89 | 38.03 |
| 32 | 37.29 | 989.98 | 27.31 | 58.04 |
| 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
| 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
| 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
UK South (London)
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 50.18 | 50.14 | 20.43 | 2.94 |
| 2 | 49.28 | 97.61 | 20.78 | 5.72 |
| 4 | 48.22 | 186.82 | 21.32 | 10.94 |
| 8 | 47.20 | 365.89 | 21.75 | 21.43 |
| 16 | 44.69 | 650.22 | 22.89 | 38.03 |
| 32 | 37.29 | 989.98 | 27.31 | 58.04 |
| 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
| 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
| 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
US Midwest (Chicago)
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 30.53 | 30.51 | 33.58 | 1.79 |
| 2 | 29.78 | 59.01 | 34.42 | 3.45 |
| 4 | 28.88 | 112.35 | 35.48 | 6.58 |
| 8 | 27.67 | 215.18 | 36.99 | 12.61 |
| 16 | 24.85 | 364.06 | 40.99 | 21.34 |
| 32 | 20.51 | 552.34 | 49.60 | 32.35 |
| 64 | 16.12 | 900.39 | 59.36 | 52.72 |
| 128 | 10.17 | 980.45 | 100.27 | 57.43 |
| 256 | 6.30 | 1,334.59 | 162.08 | 78.19 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |

- Model: cohere.command (Cohere Command 52B), hosted on one Large Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 35.78 | 33.43 | 10.98 | 5.33 |
| 8 | 31.41 | 99.67 | 13.87 | 16.61 |
| 32 | 28.49 | 237.10 | 19.48 | 40.24 |
| 128 | 23.01 | 326.93 | 53.13 | 54.89 |

- Model: cohere.command-light (Cohere Command Light 6B), hosted on one Small Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 80.38 | 83.61 | 9.19 | 6.34 |
| 8 | 45.96 | 278.91 | 13.89 | 22.46 |
| 32 | 23.90 | 493.78 | 27.34 | 41.13 |
| 128 | 5.12 | 565.06 | 82.15 | 44.89 |

- Model: meta.llama-2-70b-chat (Llama 2 70B), hosted on one Llama2 70 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 18.12 | 17.58 | 21.44 | 2.72 |
| 8 | 15.96 | 64.28 | 26.83 | 8.91 |
| 32 | 13.72 | 195.48 | 29.43 | 27.99 |
| 128 | 8.61 | 541.75 | 48.50 | 71.52 |