Scenario 2: Retrieval-Augmented Generation (RAG) Benchmarks in Generative AI
The RAG scenario pairs a very long prompt with a short response, a shape that also mimics summarization use cases.
- The prompt length is fixed at 2,000 tokens.
- The response length is fixed at 200 tokens.
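To reproduce this traffic shape against an endpoint of your own, you can pin the prompt and response budgets per request. The sketch below is illustrative only: the request payload is generic (not the OCI SDK's actual API), and the 2,000-token prompt is approximated with filler words, since exact counts depend on the model's tokenizer.

```python
import random

PROMPT_TOKENS = 2_000   # fixed prompt length in the RAG scenario
RESPONSE_TOKENS = 200   # fixed response length in the RAG scenario

def make_fixed_prompt(token_count: int, words_per_token: float = 0.75) -> str:
    """Approximate a prompt of `token_count` tokens with filler words.

    ~0.75 English words per token is a rough rule of thumb; for exact
    counts, measure with the tokenizer of the model under test.
    """
    words = ["retrieval", "context", "passage", "document", "query"]
    return " ".join(random.choice(words)
                    for _ in range(int(token_count * words_per_token)))

def build_request() -> dict:
    # Generic request payload; adapt field names to your client library.
    return {
        "prompt": make_fixed_prompt(PROMPT_TOKENS),
        "max_tokens": RESPONSE_TOKENS,  # cap the response at 200 tokens
    }
```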
Important
The performance (inference speed, throughput, and latency) of a hosting dedicated AI cluster depends on the traffic scenario going through the model that it hosts. A traffic scenario is defined by:
- The number of concurrent requests.
- The number of tokens in the prompt.
- The number of tokens in the response.
- The variance in the number of prompt tokens and response tokens across requests.
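The tables below report four metrics per concurrency level. The exact measurement methodology isn't spelled out on this page, so the following is a minimal sketch of one common way to compute such metrics from per-request timing records; the field names, and the choice to exclude prefill time from inference speed, are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    start: float         # wall-clock time (s) the request was sent
    first_token: float   # wall-clock time (s) the first token arrived
    end: float           # wall-clock time (s) the last token arrived
    output_tokens: int   # number of tokens in the response

def summarize(records: list[RequestRecord]) -> dict:
    wall = max(r.end for r in records) - min(r.start for r in records)
    return {
        # Token-level inference speed: per-request decode rate after the
        # first token (prefill excluded), averaged across requests.
        "inference_speed_tps": sum(
            r.output_tokens / (r.end - r.first_token) for r in records
        ) / len(records),
        # Token-level throughput: total generated tokens per wall-clock second.
        "throughput_tps": sum(r.output_tokens for r in records) / wall,
        # Request-level latency: mean end-to-end seconds per request.
        "latency_s": sum(r.end - r.start for r in records) / len(records),
        # Request-level throughput: completed requests per minute (RPM).
        "rpm": len(records) / wall * 60,
    }
```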
Review the terms used in the hosting dedicated AI cluster benchmarks. For a list of scenarios and their descriptions, see Chat and Text Generation Scenarios. The retrieval-augmented generation scenario is benchmarked in the following regions. A short capacity-planning sketch, showing one way to read these tables, follows the first region's tables.
Brazil East (Sao Paulo)
- Model: cohere.command-r-08-2024 (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 136.91 | 133.84 | 3.19 | 18.35 |
| 2 | 128.58 | 250.14 | 3.40 | 34.21 |
| 4 | 114.22 | 434.70 | 3.81 | 59.56 |
| 8 | 93.74 | 680.93 | 4.67 | 92.38 |
| 16 | 71.06 | 1,007.40 | 5.96 | 138.94 |
| 32 | 50.30 | 1,561.75 | 8.74 | 212.91 |
| 64 | 30.71 | 1,922.54 | 14.28 | 262.99 |
| 128 | 17.99 | 2,043.92 | 25.57 | 279.72 |
| 256 | 8.83 | 2,061.45 | 46.83 | 281.73 |
- Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 105.30 | 103.49 | 4.27 | 13.81 |
| 2 | 99.67 | 195.23 | 4.51 | 26.15 |
| 4 | 92.17 | 349.80 | 4.87 | 46.61 |
| 8 | 73.11 | 532.86 | 6.08 | 71.15 |
| 16 | 54.17 | 750.15 | 6.08 | 99.56 |
| 32 | 40.22 | 1,266.6 | 11.18 | 169.29 |
| 64 | 24.62 | 1,559.01 | 18.31 | 208.03 |
| 128 | 15.35 | 1,604.24 | 31.44 | 213.95 |
| 256 | 6.96 | 1,660.81 | 58.06 | 221.39 |
- Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision, text input only), hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 47.83 | 44.33 | 4.47 | 13.30 |
| 2 | 46.14 | 82.67 | 4.79 | 24.80 |
| 4 | 45.18 | 145.33 | 5.46 | 43.60 |
| 8 | 44.67 | 234.67 | 6.74 | 70.40 |
| 16 | 43.43 | 336.00 | 9.34 | 100.80 |
| 32 | 32.74 | 394.66 | 15.61 | 118.40 |
| 64 | 33.25 | 416.00 | 30.12 | 124.80 |
| 128 | 33.28 | 405.32 | 59.98 | 121.60 |
| 256 | 33.27 | 394.60 | 116.63 | 118.38 |
- Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision, text input only), hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 105.23 | 101.67 | 1.95 | 30.50 |
| 2 | 100.86 | 191.33 | 2.08 | 57.40 |
| 4 | 96.79 | 348.00 | 2.28 | 104.40 |
| 8 | 86.60 | 568.00 | 2.77 | 170.40 |
| 16 | 72.41 | 837.33 | 3.73 | 251.20 |
| 32 | 43.23 | 1,002.66 | 6.19 | 300.80 |
| 64 | 47.43 | 1,066.65 | 11.63 | 320.00 |
| 128 | 47.45 | 1,066.62 | 23.25 | 319.99 |
| 256 | 47.41 | 1,066.60 | 45.83 | 319.98 |
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 32.94 | 25.28 | 7.91 | 7.58 |
| 2 | 31.31 | 49.05 | 8.15 | 14.71 |
| 4 | 28.85 | 87.28 | 8.85 | 26.18 |
| 8 | 24.24 | 141.04 | 10.42 | 42.31 |
| 16 | 20.31 | 219.48 | 12.52 | 65.85 |
| 32 | 15.99 | 366.75 | 16.70 | 110.03 |
| 64 | 11.03 | 485.78 | 24.63 | 145.74 |
| 128 | 8.27 | 560.24 | 41.22 | 168.07 |
| 256 | 8.01 | 583.97 | 74.21 | 175.19 |
- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 95.86 | 49.82 | 4.10 | 14.62 |
| 2 | 91.14 | 94.21 | 4.34 | 14.62 |
| 4 | 84.77 | 170.89 | 4.63 | 50.04 |
| 8 | 75.09 | 281.23 | 5.35 | 82.35 |
| 16 | 58.20 | 407.94 | 7.00 | 82.35 |
| 32 | 42.16 | 593.60 | 10.26 | 174.28 |
| 64 | 31.93 | 715.30 | 16.44 | 174.28 |
| 128 | 30.32 | 754.79 | 29.37 | 174.28 |
| 256 | 29.16 | 751.22 | 56.21 | 220.34 |
- Model: meta.llama-3-70b-instruct (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 47.78 | 47.82 | 4.28 | 14.02 |
| 2 | 45.51 | 90.14 | 4.50 | 26.42 |
| 4 | 42.24 | 164.92 | 4.81 | 48.51 |
| 8 | 37.44 | 289.82 | 5.48 | 85.13 |
| 16 | 28.00 | 421.00 | 7.19 | 123.72 |
| 32 | 18.73 | 542.99 | 10.65 | 159.56 |
| 64 | 11.63 | 668.78 | 16.17 | 196.44 |
| 128 | 6.20 | 700.83 | 32.89 | 205.70 |
| 256 | 3.97 | 756.00 | 54.71 | 222.02 |
- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 49.33 | 47.66 | 4.14 | 14.24 |
| 2 | 45.65 | 86.90 | 4.50 | 26.04 |
| 4 | 40.32 | 152.10 | 5.09 | 45.51 |
| 8 | 30.69 | 235.78 | 6.57 | 70.43 |
| 16 | 24.60 | 310.44 | 9.74 | 93.07 |
| 32 | 9.95 | 307.32 | 18.21 | 91.81 |
| 64 | 5.43 | 297.06 | 31.41 | 89.08 |
| 128 | 4.44 | 313.47 | 44.90 | 93.89 |
| 256 | 2.36 | 312.97 | 85.35 | 93.53 |
- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 107.17 | 94.33 | 4.17 | 14.12 |
| 2 | 100.71 | 176.04 | 4.44 | 26.35 |
| 4 | 90.03 | 310.18 | 4.96 | 46.44 |
| 8 | 70.71 | 493.30 | 6.26 | 73.86 |
| 16 | 53.45 | 716.66 | 8.20 | 108.07 |
| 32 | 35.60 | 929.63 | 12.22 | 139.13 |
| 64 | 21.75 | 1,150.16 | 18.41 | 172.14 |
| 128 | 17.99 | 1,209.36 | 31.93 | 181.05 |
| 256 | 9.19 | 1,213.82 | 53.31 | 181.70 |
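As mentioned in the introduction, one way to use these tables for capacity planning is to treat each row as an operating point and pick the highest benchmarked concurrency whose latency fits your budget. The sketch below embeds rows copied from the cohere.command-r-08-2024 table above; the selection helper is a planning heuristic, not an Oracle-provided tool.

```python
# (concurrency, speed t/s, throughput t/s, latency s, RPM) -- rows copied
# from the cohere.command-r-08-2024 (Small Cohere V2) table above.
ROWS = [
    (1, 136.91, 133.84, 3.19, 18.35),
    (2, 128.58, 250.14, 3.40, 34.21),
    (4, 114.22, 434.70, 3.81, 59.56),
    (8, 93.74, 680.93, 4.67, 92.38),
    (16, 71.06, 1007.40, 5.96, 138.94),
    (32, 50.30, 1561.75, 8.74, 212.91),
    (64, 30.71, 1922.54, 14.28, 262.99),
    (128, 17.99, 2043.92, 25.57, 279.72),
    (256, 8.83, 2061.45, 46.83, 281.73),
]

def best_operating_point(latency_budget_s: float):
    """Highest benchmarked concurrency whose mean latency fits the budget."""
    feasible = [r for r in ROWS if r[3] <= latency_budget_s]
    return max(feasible, key=lambda r: r[0]) if feasible else None

# With a 10 s latency budget, the highest feasible benchmarked point is
# concurrency 32: 8.74 s mean latency at about 213 requests per minute.
print(best_operating_point(10.0))
```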
Germany Central (Frankfurt)
- Model: cohere.command-r-08-2024 (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 136.91 | 133.84 | 3.19 | 18.35 |
| 2 | 128.58 | 250.14 | 3.40 | 34.21 |
| 4 | 114.22 | 434.70 | 3.81 | 59.56 |
| 8 | 93.74 | 680.93 | 4.67 | 92.38 |
| 16 | 71.06 | 1,007.40 | 5.96 | 138.94 |
| 32 | 50.30 | 1,561.75 | 8.74 | 212.91 |
| 64 | 30.71 | 1,922.54 | 14.28 | 262.99 |
| 128 | 17.99 | 2,043.92 | 25.57 | 279.72 |
| 256 | 8.83 | 2,061.45 | 46.83 | 281.73 |
- Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 105.30 | 103.49 | 4.27 | 13.81 |
| 2 | 99.67 | 195.23 | 4.51 | 26.15 |
| 4 | 92.17 | 349.80 | 4.87 | 46.61 |
| 8 | 73.11 | 532.86 | 6.08 | 71.15 |
| 16 | 54.17 | 750.15 | 6.08 | 99.56 |
| 32 | 40.22 | 1,266.6 | 11.18 | 169.29 |
| 64 | 24.62 | 1,559.01 | 18.31 | 208.03 |
| 128 | 15.35 | 1,604.24 | 31.44 | 213.95 |
| 256 | 6.96 | 1,660.81 | 58.06 | 221.39 |
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 32.94 | 25.28 | 7.91 | 7.58 |
| 2 | 31.31 | 49.05 | 8.15 | 14.71 |
| 4 | 28.85 | 87.28 | 8.85 | 26.18 |
| 8 | 24.24 | 141.04 | 10.42 | 42.31 |
| 16 | 20.31 | 219.48 | 12.52 | 65.85 |
| 32 | 15.99 | 366.75 | 16.70 | 110.03 |
| 64 | 11.03 | 485.78 | 24.63 | 145.74 |
| 128 | 8.27 | 560.24 | 41.22 | 168.07 |
| 256 | 8.01 | 583.97 | 74.21 | 175.19 |
- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 95.86 | 49.82 | 4.10 | 14.62 |
| 2 | 91.14 | 94.21 | 4.34 | 14.62 |
| 4 | 84.77 | 170.89 | 4.63 | 50.04 |
| 8 | 75.09 | 281.23 | 5.35 | 82.35 |
| 16 | 58.20 | 407.94 | 7.00 | 82.35 |
| 32 | 42.16 | 593.60 | 10.26 | 174.28 |
| 64 | 31.93 | 715.30 | 16.44 | 174.28 |
| 128 | 30.32 | 754.79 | 29.37 | 174.28 |
| 256 | 29.16 | 751.22 | 56.21 | 220.34 |
- Model: meta.llama-3-70b-instruct (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 47.78 | 47.82 | 4.28 | 14.02 |
| 2 | 45.51 | 90.14 | 4.50 | 26.42 |
| 4 | 42.24 | 164.92 | 4.81 | 48.51 |
| 8 | 37.44 | 289.82 | 5.48 | 85.13 |
| 16 | 28.00 | 421.00 | 7.19 | 123.72 |
| 32 | 18.73 | 542.99 | 10.65 | 159.56 |
| 64 | 11.63 | 668.78 | 16.17 | 196.44 |
| 128 | 6.20 | 700.83 | 32.89 | 205.70 |
| 256 | 3.97 | 756.00 | 54.71 | 222.02 |
- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 49.33 | 47.66 | 4.14 | 14.24 |
| 2 | 45.65 | 86.90 | 4.50 | 26.04 |
| 4 | 40.32 | 152.10 | 5.09 | 45.51 |
| 8 | 30.69 | 235.78 | 6.57 | 70.43 |
| 16 | 24.60 | 310.44 | 9.74 | 93.07 |
| 32 | 9.95 | 307.32 | 18.21 | 91.81 |
| 64 | 5.43 | 297.06 | 31.41 | 89.08 |
| 128 | 4.44 | 313.47 | 44.90 | 93.89 |
| 256 | 2.36 | 312.97 | 85.35 | 93.53 |
- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 107.17 | 94.33 | 4.17 | 14.12 |
| 2 | 100.71 | 176.04 | 4.44 | 26.35 |
| 4 | 90.03 | 310.18 | 4.96 | 46.44 |
| 8 | 70.71 | 493.30 | 6.26 | 73.86 |
| 16 | 53.45 | 716.66 | 8.20 | 108.07 |
| 32 | 35.60 | 929.63 | 12.22 | 139.13 |
| 64 | 21.75 | 1,150.16 | 18.41 | 172.14 |
| 128 | 17.99 | 1,209.36 | 31.93 | 181.05 |
| 256 | 9.19 | 1,213.82 | 53.31 | 181.70 |
UK South (London)
- Model: cohere.command-r-08-2024 (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 136.91 | 133.84 | 3.19 | 18.35 |
| 2 | 128.58 | 250.14 | 3.40 | 34.21 |
| 4 | 114.22 | 434.70 | 3.81 | 59.56 |
| 8 | 93.74 | 680.93 | 4.67 | 92.38 |
| 16 | 71.06 | 1,007.40 | 5.96 | 138.94 |
| 32 | 50.30 | 1,561.75 | 8.74 | 212.91 |
| 64 | 30.71 | 1,922.54 | 14.28 | 262.99 |
| 128 | 17.99 | 2,043.92 | 25.57 | 279.72 |
| 256 | 8.83 | 2,061.45 | 46.83 | 281.73 |
- Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 105.30 | 103.49 | 4.27 | 13.81 |
| 2 | 99.67 | 195.23 | 4.51 | 26.15 |
| 4 | 92.17 | 349.80 | 4.87 | 46.61 |
| 8 | 73.11 | 532.86 | 6.08 | 71.15 |
| 16 | 54.17 | 750.15 | 6.08 | 99.56 |
| 32 | 40.22 | 1,266.6 | 11.18 | 169.29 |
| 64 | 24.62 | 1,559.01 | 18.31 | 208.03 |
| 128 | 15.35 | 1,604.24 | 31.44 | 213.95 |
| 256 | 6.96 | 1,660.81 | 58.06 | 221.39 |
- Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision, text input only), hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 47.83 | 44.33 | 4.47 | 13.30 |
| 2 | 46.14 | 82.67 | 4.79 | 24.80 |
| 4 | 45.18 | 145.33 | 5.46 | 43.60 |
| 8 | 44.67 | 234.67 | 6.74 | 70.40 |
| 16 | 43.43 | 336.00 | 9.34 | 100.80 |
| 32 | 32.74 | 394.66 | 15.61 | 118.40 |
| 64 | 33.25 | 416.00 | 30.12 | 124.80 |
| 128 | 33.28 | 405.32 | 59.98 | 121.60 |
| 256 | 33.27 | 394.60 | 116.63 | 118.38 |
- Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision, text input only), hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 105.23 | 101.67 | 1.95 | 30.50 |
| 2 | 100.86 | 191.33 | 2.08 | 57.40 |
| 4 | 96.79 | 348.00 | 2.28 | 104.40 |
| 8 | 86.60 | 568.00 | 2.77 | 170.40 |
| 16 | 72.41 | 837.33 | 3.73 | 251.20 |
| 32 | 43.23 | 1,002.66 | 6.19 | 300.80 |
| 64 | 47.43 | 1,066.65 | 11.63 | 320.00 |
| 128 | 47.45 | 1,066.62 | 23.25 | 319.99 |
| 256 | 47.41 | 1,066.60 | 45.83 | 319.98 |
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 32.94 | 25.28 | 7.91 | 7.58 |
| 2 | 31.31 | 49.05 | 8.15 | 14.71 |
| 4 | 28.85 | 87.28 | 8.85 | 26.18 |
| 8 | 24.24 | 141.04 | 10.42 | 42.31 |
| 16 | 20.31 | 219.48 | 12.52 | 65.85 |
| 32 | 15.99 | 366.75 | 16.70 | 110.03 |
| 64 | 11.03 | 485.78 | 24.63 | 145.74 |
| 128 | 8.27 | 560.24 | 41.22 | 168.07 |
| 256 | 8.01 | 583.97 | 74.21 | 175.19 |
- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 95.86 | 49.82 | 4.10 | 14.62 |
| 2 | 91.14 | 94.21 | 4.34 | 14.62 |
| 4 | 84.77 | 170.89 | 4.63 | 50.04 |
| 8 | 75.09 | 281.23 | 5.35 | 82.35 |
| 16 | 58.20 | 407.94 | 7.00 | 82.35 |
| 32 | 42.16 | 593.60 | 10.26 | 174.28 |
| 64 | 31.93 | 715.30 | 16.44 | 174.28 |
| 128 | 30.32 | 754.79 | 29.37 | 174.28 |
| 256 | 29.16 | 751.22 | 56.21 | 220.34 |
- Model: meta.llama-3-70b-instruct (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 47.78 | 47.82 | 4.28 | 14.02 |
| 2 | 45.51 | 90.14 | 4.50 | 26.42 |
| 4 | 42.24 | 164.92 | 4.81 | 48.51 |
| 8 | 37.44 | 289.82 | 5.48 | 85.13 |
| 16 | 28.00 | 421.00 | 7.19 | 123.72 |
| 32 | 18.73 | 542.99 | 10.65 | 159.56 |
| 64 | 11.63 | 668.78 | 16.17 | 196.44 |
| 128 | 6.20 | 700.83 | 32.89 | 205.70 |
| 256 | 3.97 | 756.00 | 54.71 | 222.02 |
- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 49.33 | 47.66 | 4.14 | 14.24 |
| 2 | 45.65 | 86.90 | 4.50 | 26.04 |
| 4 | 40.32 | 152.10 | 5.09 | 45.51 |
| 8 | 30.69 | 235.78 | 6.57 | 70.43 |
| 16 | 24.60 | 310.44 | 9.74 | 93.07 |
| 32 | 9.95 | 307.32 | 18.21 | 91.81 |
| 64 | 5.43 | 297.06 | 31.41 | 89.08 |
| 128 | 4.44 | 313.47 | 44.90 | 93.89 |
| 256 | 2.36 | 312.97 | 85.35 | 93.53 |
- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 107.17 | 94.33 | 4.17 | 14.12 |
| 2 | 100.71 | 176.04 | 4.44 | 26.35 |
| 4 | 90.03 | 310.18 | 4.96 | 46.44 |
| 8 | 70.71 | 493.30 | 6.26 | 73.86 |
| 16 | 53.45 | 716.66 | 8.20 | 108.07 |
| 32 | 35.60 | 929.63 | 12.22 | 139.13 |
| 64 | 21.75 | 1,150.16 | 18.41 | 172.14 |
| 128 | 17.99 | 1,209.36 | 31.93 | 181.05 |
| 256 | 9.19 | 1,213.82 | 53.31 | 181.70 |
US Midwest (Chicago)
- Model: cohere.command-r-08-2024 (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 136.91 | 133.84 | 3.19 | 18.35 |
| 2 | 128.58 | 250.14 | 3.40 | 34.21 |
| 4 | 114.22 | 434.70 | 3.81 | 59.56 |
| 8 | 93.74 | 680.93 | 4.67 | 92.38 |
| 16 | 71.06 | 1,007.40 | 5.96 | 138.94 |
| 32 | 50.30 | 1,561.75 | 8.74 | 212.91 |
| 64 | 30.71 | 1,922.54 | 14.28 | 262.99 |
| 128 | 17.99 | 2,043.92 | 25.57 | 279.72 |
| 256 | 8.83 | 2,061.45 | 46.83 | 281.73 |
- Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 105.30 | 103.49 | 4.27 | 13.81 |
| 2 | 99.67 | 195.23 | 4.51 | 26.15 |
| 4 | 92.17 | 349.80 | 4.87 | 46.61 |
| 8 | 73.11 | 532.86 | 6.08 | 71.15 |
| 16 | 54.17 | 750.15 | 6.08 | 99.56 |
| 32 | 40.22 | 1,266.6 | 11.18 | 169.29 |
| 64 | 24.62 | 1,559.01 | 18.31 | 208.03 |
| 128 | 15.35 | 1,604.24 | 31.44 | 213.95 |
| 256 | 6.96 | 1,660.81 | 58.06 | 221.39 |
- Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision, text input only), hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 47.83 | 44.33 | 4.47 | 13.30 |
| 2 | 46.14 | 82.67 | 4.79 | 24.80 |
| 4 | 45.18 | 145.33 | 5.46 | 43.60 |
| 8 | 44.67 | 234.67 | 6.74 | 70.40 |
| 16 | 43.43 | 336.00 | 9.34 | 100.80 |
| 32 | 32.74 | 394.66 | 15.61 | 118.40 |
| 64 | 33.25 | 416.00 | 30.12 | 124.80 |
| 128 | 33.28 | 405.32 | 59.98 | 121.60 |
| 256 | 33.27 | 394.60 | 116.63 | 118.38 |
- Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision, text input only), hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 105.23 | 101.67 | 1.95 | 30.50 |
| 2 | 100.86 | 191.33 | 2.08 | 57.40 |
| 4 | 96.79 | 348.00 | 2.28 | 104.40 |
| 8 | 86.60 | 568.00 | 2.77 | 170.40 |
| 16 | 72.41 | 837.33 | 3.73 | 251.20 |
| 32 | 43.23 | 1,002.66 | 6.19 | 300.80 |
| 64 | 47.43 | 1,066.65 | 11.63 | 320.00 |
| 128 | 47.45 | 1,066.62 | 23.25 | 319.99 |
| 256 | 47.41 | 1,066.60 | 45.83 | 319.98 |
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 32.94 | 25.28 | 7.91 | 7.58 |
| 2 | 31.31 | 49.05 | 8.15 | 14.71 |
| 4 | 28.85 | 87.28 | 8.85 | 26.18 |
| 8 | 24.24 | 141.04 | 10.42 | 42.31 |
| 16 | 20.31 | 219.48 | 12.52 | 65.85 |
| 32 | 15.99 | 366.75 | 16.70 | 110.03 |
| 64 | 11.03 | 485.78 | 24.63 | 145.74 |
| 128 | 8.27 | 560.24 | 41.22 | 168.07 |
| 256 | 8.01 | 583.97 | 74.21 | 175.19 |
- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 95.86 | 49.82 | 4.10 | 14.62 |
| 2 | 91.14 | 94.21 | 4.34 | 14.62 |
| 4 | 84.77 | 170.89 | 4.63 | 50.04 |
| 8 | 75.09 | 281.23 | 5.35 | 82.35 |
| 16 | 58.20 | 407.94 | 7.00 | 82.35 |
| 32 | 42.16 | 593.60 | 10.26 | 174.28 |
| 64 | 31.93 | 715.30 | 16.44 | 174.28 |
| 128 | 30.32 | 754.79 | 29.37 | 174.28 |
| 256 | 29.16 | 751.22 | 56.21 | 220.34 |
- Model: meta.llama-3-70b-instruct (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 28.84 | 28.82 | 7.11 | 8.44 |
| 2 | 26.52 | 52.69 | 7.66 | 15.51 |
| 4 | 24.23 | 94.86 | 8.38 | 27.92 |
| 8 | 20.01 | 155.97 | 10.21 | 45.76 |
| 16 | 14.34 | 216.26 | 14.12 | 63.43 |
| 32 | 9.33 | 275.28 | 21.30 | 80.89 |
| 64 | 5.68 | 334.46 | 32.55 | 98.11 |
| 128 | 3.13 | 364.18 | 64.59 | 106.94 |
| 256 | 1.59 | 359.21 | 128.67 | 105.44 |
- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 49.33 | 47.66 | 4.14 | 14.24 |
| 2 | 45.65 | 86.90 | 4.50 | 26.04 |
| 4 | 40.32 | 152.10 | 5.09 | 45.51 |
| 8 | 30.69 | 235.78 | 6.57 | 70.43 |
| 16 | 24.60 | 310.44 | 9.74 | 93.07 |
| 32 | 9.95 | 307.32 | 18.21 | 91.81 |
| 64 | 5.43 | 297.06 | 31.41 | 89.08 |
| 128 | 4.44 | 313.47 | 44.90 | 93.89 |
| 256 | 2.36 | 312.97 | 85.35 | 93.53 |
- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 107.17 | 94.33 | 4.17 | 14.12 |
| 2 | 100.71 | 176.04 | 4.44 | 26.35 |
| 4 | 90.03 | 310.18 | 4.96 | 46.44 |
| 8 | 70.71 | 493.30 | 6.26 | 73.86 |
| 16 | 53.45 | 716.66 | 8.20 | 108.07 |
| 32 | 35.60 | 929.63 | 12.22 | 139.13 |
| 64 | 21.75 | 1,150.16 | 18.41 | 172.14 |
| 128 | 17.99 | 1,209.36 | 31.93 | 181.05 |
| 256 | 9.19 | 1,213.82 | 53.31 | 181.70 |
- Model: cohere.command (Cohere Command 52 B), hosted on one Large Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 33.13 | 25.28 | 6.68 | 8.62 |
| 8 | 23.24 | 90.64 | 13.29 | 29.84 |
| 32 | 13.03 | 163.48 | 26.56 | 54.21 |
| 128 | 5.60 | 186.31 | 65.30 | 61.32 |
- Model: cohere.command-light (Cohere Command Light 6 B), hosted on one Small Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests/minute, RPM) |
|---|---|---|---|---|
| 1 | 56.71 | 50.88 | 3.14 | 17.61 |
| 8 | 24.70 | 148.42 | 6.15 | 53.93 |
| 32 | 11.06 | 235.31 | 13.37 | 85.14 |
| 128 | 3.40 | 280.3 | 31.64 | 105.77 |