Scenario 1: Stochastic Length Benchmarks in Generative AI
This scenario mimics text generation use cases where the sizes of the prompt and response are unknown ahead of time. Because the prompt and response lengths can't be known in advance, we use a stochastic approach in which both lengths follow a normal distribution:

- The prompt length follows a normal distribution with a mean of 480 tokens and a standard deviation of 240 tokens.
- The response length follows a normal distribution with a mean of 300 tokens and a standard deviation of 150 tokens.
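To make the traffic model concrete, the following Python sketch samples request shapes under these two distributions. It's an illustration of the stated distributions, not the actual benchmark harness, and clamping draws to a minimum of one token is our assumption; the source doesn't say how non-positive samples are handled.

```python
# Sample (prompt_tokens, response_tokens) pairs under the stated
# distributions. Clamping to >= 1 token is an assumption for this sketch.
import random

def sample_request_shape() -> tuple[int, int]:
    """Draw one (prompt_tokens, response_tokens) pair."""
    prompt = max(1, round(random.gauss(480, 240)))    # mean 480, sd 240
    response = max(1, round(random.gauss(300, 150)))  # mean 300, sd 150
    return prompt, response

shapes = [sample_request_shape() for _ in range(10_000)]
mean_prompt = sum(p for p, _ in shapes) / len(shapes)
mean_response = sum(r for _, r in shapes) / len(shapes)
print(f"mean prompt ~ {mean_prompt:.0f} tokens, mean response ~ {mean_response:.0f} tokens")
```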
Important

The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it's hosting. Traffic scenarios depend on the number of concurrent requests and on the number of tokens in each prompt and response.
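The four metrics reported in each table are related. As a rough consistency check (ours, not part of the published methodology): by Little's law, a system holding N requests in flight with a mean request-level latency of W seconds can complete at most N * 60 / W requests per minute. The sketch below applies that bound to a few rows copied from the first table that follows; the reported RPM values sit below the bound, as expected when not every concurrency slot is busy at all times.

```python
# Little's law upper bound on request-level throughput:
# RPM <= concurrency * 60 / request_level_latency_seconds.
# Rows are copied from the first benchmark table in this section.
rows = [  # (concurrency, request-level latency in seconds, reported RPM)
    (1, 3.89, 15.07),
    (8, 4.97, 84.62),
    (64, 8.63, 319.28),
    (256, 25.79, 453.83),
]

for concurrency, latency_s, reported_rpm in rows:
    bound_rpm = concurrency * 60 / latency_s
    print(f"concurrency {concurrency:>3}: reported {reported_rpm:7.2f} RPM, "
          f"Little's law bound {bound_rpm:7.2f} RPM")
```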
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 143.82 | 142.16 | 3.89 | 15.07 |
| 2 | 141.16 | 276.64 | 4.28 | 27.37 |
| 4 | 136.15 | 517.89 | 4.98 | 45.85 |
| 8 | 121.71 | 858.28 | 4.97 | 84.62 |
| 16 | 105.84 | 1,243.61 | 5.53 | 122.45 |
| 32 | 88.15 | 2,126.25 | 6.53 | 210.29 |
| 64 | 67.40 | 3,398.12 | 8.63 | 319.28 |
| 128 | 45.86 | 4,499.76 | 13.96 | 427.76 |
| 256 | 24.14 | 4,784.32 | 25.79 | 453.83 |

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 119.49 | 118.18 | 4.50 | 13.08 |
| 2 | 115.14 | 225.40 | 4.90 | 23.69 |
| 4 | 109.71 | 404.66 | 4.63 | 48.83 |
| 8 | 95.83 | 702.76 | 5.03 | 85.92 |
| 16 | 81.12 | 1,029.98 | 6.07 | 125.54 |
| 32 | 70.92 | 1,819.24 | 7.02 | 182.65 |
| 64 | 52.10 | 2,778.58 | 8.79 | 313.12 |
| 128 | 35.58 | 3,566.59 | 13.80 | 438.64 |
| 256 | 20.75 | 4,065.93 | 24.69 | 481.11 |

Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.62 | 52.71 | 5.43 | 10.97 |
| 2 | 52.65 | 102.99 | 5.48 | 21.65 |
| 4 | 52.06 | 205.56 | 5.58 | 42.61 |
| 8 | 51.06 | 393.93 | 5.68 | 82.31 |
| 16 | 46.755 | 715.89 | 6.08 | 152.11 |
| 32 | 39.55 | 1,152.97 | 7.80 | 228.80 |
| 64 | 31.22 | 1,663.88 | 9.36 | 353.91 |
| 128 | 23.00 | 2,055.51 | 13.94 | 433.91 |
| 256 | 17.44 | 1,873.44 | 22.85 | 427.95 |

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 48.75 | 47.98 | 6.37 | 9.40 |
| 2 | 47.28 | 92.89 | 6.63 | 18.00 |
| 4 | 45.10 | 176.53 | 6.65 | 35.80 |
| 8 | 42.53 | 333.45 | 7.04 | 67.80 |
| 16 | 38.39 | 597.84 | 7.95 | 119.70 |
| 32 | 29.86 | 929.18 | 10.12 | 187.40 |
| 64 | 30.00 | 933.09 | 20.11 | 187.20 |
| 128 | 30.03 | 934.30 | 39.85 | 186.00 |
| 256 | 30.05 | 932.61 | 76.19 | 187.79 |

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 105.74 | 104.30 | 2.75 | 21.70 |
| 2 | 103.21 | 204.22 | 2.82 | 42.40 |
| 4 | 99.41 | 393.69 | 3.10 | 77.10 |
| 8 | 93.98 | 745.29 | 3.26 | 146.70 |
| 16 | 81.62 | 1,294.14 | 3.64 | 262.60 |
| 32 | 60.55 | 1,924.74 | 4.97 | 384.40 |
| 64 | 60.54 | 1,928.70 | 10.03 | 379.40 |
| 128 | 62.57 | 1,912.53 | 19.68 | 383.09 |
| 256 | 60.00 | 1,911.45 | 38.36 | 386.14 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster

Important

You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.

The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the two tables to decide whether to host the model on the new unit; a short comparison sketch follows the second table.

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.44 | 26.84 | 11.66 | 5.10 |
| 2 | 26.56 | 51.93 | 11.44 | 10.39 |
| 4 | 25.66 | 100.31 | 11.97 | 19.89 |
| 8 | 24.98 | 193.34 | 11.96 | 39.48 |
| 16 | 20.73 | 322.99 | 14.86 | 63.76 |
| 32 | 18.39 | 562.55 | 16.50 | 114.21 |
| 64 | 15.05 | 877.61 | 20.42 | 180.76 |
| 128 | 10.79 | 1,210.61 | 29.53 | 241.73 |
| 256 | 8.67 | 1,301.65 | 47.22 | 282.78 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 32.66 | 25.79 | 10.78 | 5.56 |
| 2 | 31.36 | 50.81 | 10.06 | 11.68 |
| 4 | 29.86 | 96.01 | 10.87 | 21.52 |
| 8 | 27.89 | 170.45 | 10.87 | 34.09 |
| 16 | 24.74 | 282.52 | 13.51 | 60.35 |
| 32 | 21.51 | 457.24 | 16.73 | 91.42 |
| 64 | 17.68 | 676.90 | 18.29 | 152.47 |
| 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
| 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |

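As the note above suggests, one simple way to compare the Large Generic 2 and Large Generic 4 tables is the ratio of token-level throughput at matching concurrency levels. The sketch below is our illustration, not an official tool, with values copied from the two meta.llama-3.1-405b-instruct tables above.

```python
# Token-level throughput (tokens/second) at selected concurrency levels,
# copied from the two meta.llama-3.1-405b-instruct tables above.
large_generic_2 = {1: 26.84, 8: 193.34, 64: 877.61, 256: 1301.65}
large_generic_4 = {1: 25.79, 8: 170.45, 64: 676.90, 256: 1302.71}

for concurrency in sorted(large_generic_2):
    ratio = large_generic_2[concurrency] / large_generic_4[concurrency]
    print(f"concurrency {concurrency:>3}: Large Generic 2 delivers "
          f"{ratio:.2f}x the token throughput of Large Generic 4")
```

On these sample rows the newer unit's advantage is largest in the mid-concurrency range, which is consistent with the note's claim of better throughput from less hardware.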
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 95.50 | 51.58 | 6.12 | 9.78 |
| 2 | 92.25 | 98.89 | 6.44 | 18.53 |
| 4 | 90.51 | 184.54 | 7.37 | 30.67 |
| 8 | 83.38 | 326.71 | 7.64 | 57.06 |
| 16 | 71.45 | 509.03 | 8.77 | 90.02 |
| 32 | 58.48 | 724.23 | 10.00 | 138.82 |
| 64 | 44.74 | 1,146.92 | 14.07 | 206.58 |
| 128 | 27.00 | 1,434.57 | 22.48 | 268.58 |
| 256 | 18.03 | 1,635.95 | 41.06 | 309.97 |

Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 49.76 | 49.58 | 6.42 | 9.33 |
| 2 | 48.04 | 95.38 | 6.80 | 17.53 |
| 4 | 46.09 | 181.21 | 6.99 | 33.60 |
| 8 | 44.19 | 330.46 | 7.43 | 60.67 |
| 16 | 40.56 | 591.52 | 8.40 | 104.42 |
| 32 | 31.35 | 869.36 | 9.68 | 168.46 |
| 64 | 23.87 | 1,062.52 | 12.57 | 201.11 |
| 128 | 16.86 | 1,452.66 | 17.64 | 276.09 |
| 256 | 9.84 | 1,792.81 | 30.08 | 347.26 |

Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 51.30 | 50.46 | 4.63 | 12.75 |
| 2 | 51.06 | 97.86 | 5.07 | 23.14 |
| 4 | 47.52 | 186.75 | 5.30 | 44.48 |
| 8 | 43.55 | 305.45 | 5.68 | 75.18 |
| 16 | 36.49 | 505.11 | 6.71 | 127.88 |
| 32 | 29.02 | 768.40 | 8.84 | 177.03 |
| 64 | 18.57 | 735.37 | 14.55 | 168.00 |
| 128 | 12.59 | 809.50 | 21.27 | 186.76 |
| 256 | 6.54 | 859.45 | 38.69 | 200.42 |

Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 122.46 | 101.28 | 4.31 | 13.21 |
| 2 | 114.38 | 177.67 | 5.70 | 17.78 |
| 4 | 107.48 | 367.88 | 5.09 | 45.22 |
| 8 | 95.32 | 644.56 | 7.23 | 62.61 |
| 16 | 82.42 | 1,036.84 | 7.91 | 62.61 |
| 32 | 66.46 | 1,529.28 | 10.12 | 145.82 |
| 64 | 45.70 | 1,924.84 | 12.43 | 206.26 |
| 128 | 33.96 | 2,546.35 | 18.22 | 272.53 |
| 256 | 23.86 | 2,914.77 | 30.75 | 298.88 |

Germany Central (Frankfurt)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 143.82 | 142.16 | 3.89 | 15.07 |
| 2 | 141.16 | 276.64 | 4.28 | 27.37 |
| 4 | 136.15 | 517.89 | 4.98 | 45.85 |
| 8 | 121.71 | 858.28 | 4.97 | 84.62 |
| 16 | 105.84 | 1,243.61 | 5.53 | 122.45 |
| 32 | 88.15 | 2,126.25 | 6.53 | 210.29 |
| 64 | 67.40 | 3,398.12 | 8.63 | 319.28 |
| 128 | 45.86 | 4,499.76 | 13.96 | 427.76 |
| 256 | 24.14 | 4,784.32 | 25.79 | 453.83 |

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 119.49 | 118.18 | 4.50 | 13.08 |
| 2 | 115.14 | 225.40 | 4.90 | 23.69 |
| 4 | 109.71 | 404.66 | 4.63 | 48.83 |
| 8 | 95.83 | 702.76 | 5.03 | 85.92 |
| 16 | 81.12 | 1,029.98 | 6.07 | 125.54 |
| 32 | 70.92 | 1,819.24 | 7.02 | 182.65 |
| 64 | 52.10 | 2,778.58 | 8.79 | 313.12 |
| 128 | 35.58 | 3,566.59 | 13.80 | 438.64 |
| 256 | 20.75 | 4,065.93 | 24.69 | 481.11 |

Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.62 | 52.71 | 5.43 | 10.97 |
| 2 | 52.65 | 102.99 | 5.48 | 21.65 |
| 4 | 52.06 | 205.56 | 5.58 | 42.61 |
| 8 | 51.06 | 393.93 | 5.68 | 82.31 |
| 16 | 46.755 | 715.89 | 6.08 | 152.11 |
| 32 | 39.55 | 1,152.97 | 7.80 | 228.80 |
| 64 | 31.22 | 1,663.88 | 9.36 | 353.91 |
| 128 | 23.00 | 2,055.51 | 13.94 | 433.91 |
| 256 | 17.44 | 1,873.44 | 22.85 | 427.95 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster

Important

You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.

The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the two tables to decide whether to host the model on the new unit.

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.44 | 26.84 | 11.66 | 5.10 |
| 2 | 26.56 | 51.93 | 11.44 | 10.39 |
| 4 | 25.66 | 100.31 | 11.97 | 19.89 |
| 8 | 24.98 | 193.34 | 11.96 | 39.48 |
| 16 | 20.73 | 322.99 | 14.86 | 63.76 |
| 32 | 18.39 | 562.55 | 16.50 | 114.21 |
| 64 | 15.05 | 877.61 | 20.42 | 180.76 |
| 128 | 10.79 | 1,210.61 | 29.53 | 241.73 |
| 256 | 8.67 | 1,301.65 | 47.22 | 282.78 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 32.66 | 25.79 | 10.78 | 5.56 |
| 2 | 31.36 | 50.81 | 10.06 | 11.68 |
| 4 | 29.86 | 96.01 | 10.87 | 21.52 |
| 8 | 27.89 | 170.45 | 10.87 | 34.09 |
| 16 | 24.74 | 282.52 | 13.51 | 60.35 |
| 32 | 21.51 | 457.24 | 16.73 | 91.42 |
| 64 | 17.68 | 676.90 | 18.29 | 152.47 |
| 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
| 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 95.50 | 51.58 | 6.12 | 9.78 |
| 2 | 92.25 | 98.89 | 6.44 | 18.53 |
| 4 | 90.51 | 184.54 | 7.37 | 30.67 |
| 8 | 83.38 | 326.71 | 7.64 | 57.06 |
| 16 | 71.45 | 509.03 | 8.77 | 90.02 |
| 32 | 58.48 | 724.23 | 10.00 | 138.82 |
| 64 | 44.74 | 1,146.92 | 14.07 | 206.58 |
| 128 | 27.00 | 1,434.57 | 22.48 | 268.58 |
| 256 | 18.03 | 1,635.95 | 41.06 | 309.97 |

Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 49.76 | 49.58 | 6.42 | 9.33 |
| 2 | 48.04 | 95.38 | 6.80 | 17.53 |
| 4 | 46.09 | 181.21 | 6.99 | 33.60 |
| 8 | 44.19 | 330.46 | 7.43 | 60.67 |
| 16 | 40.56 | 591.52 | 8.40 | 104.42 |
| 32 | 31.35 | 869.36 | 9.68 | 168.46 |
| 64 | 23.87 | 1,062.52 | 12.57 | 201.11 |
| 128 | 16.86 | 1,452.66 | 17.64 | 276.09 |
| 256 | 9.84 | 1,792.81 | 30.08 | 347.26 |

Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 51.30 | 50.46 | 4.63 | 12.75 |
| 2 | 51.06 | 97.86 | 5.07 | 23.14 |
| 4 | 47.52 | 186.75 | 5.30 | 44.48 |
| 8 | 43.55 | 305.45 | 5.68 | 75.18 |
| 16 | 36.49 | 505.11 | 6.71 | 127.88 |
| 32 | 29.02 | 768.40 | 8.84 | 177.03 |
| 64 | 18.57 | 735.37 | 14.55 | 168.00 |
| 128 | 12.59 | 809.50 | 21.27 | 186.76 |
| 256 | 6.54 | 859.45 | 38.69 | 200.42 |

Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 122.46 | 101.28 | 4.31 | 13.21 |
| 2 | 114.38 | 177.67 | 5.70 | 17.78 |
| 4 | 107.48 | 367.88 | 5.09 | 45.22 |
| 8 | 95.32 | 644.56 | 7.23 | 62.61 |
| 16 | 82.42 | 1,036.84 | 7.91 | 62.61 |
| 32 | 66.46 | 1,529.28 | 10.12 | 145.82 |
| 64 | 45.70 | 1,924.84 | 12.43 | 206.26 |
| 128 | 33.96 | 2,546.35 | 18.22 | 272.53 |
| 256 | 23.86 | 2,914.77 | 30.75 | 298.88 |

Japan Central (Osaka)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 143.82 | 142.16 | 3.89 | 15.07 |
| 2 | 141.16 | 276.64 | 4.28 | 27.37 |
| 4 | 136.15 | 517.89 | 4.98 | 45.85 |
| 8 | 121.71 | 858.28 | 4.97 | 84.62 |
| 16 | 105.84 | 1,243.61 | 5.53 | 122.45 |
| 32 | 88.15 | 2,126.25 | 6.53 | 210.29 |
| 64 | 67.40 | 3,398.12 | 8.63 | 319.28 |
| 128 | 45.86 | 4,499.76 | 13.96 | 427.76 |
| 256 | 24.14 | 4,784.32 | 25.79 | 453.83 |

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 119.49 | 118.18 | 4.50 | 13.08 |
| 2 | 115.14 | 225.40 | 4.90 | 23.69 |
| 4 | 109.71 | 404.66 | 4.63 | 48.83 |
| 8 | 95.83 | 702.76 | 5.03 | 85.92 |
| 16 | 81.12 | 1,029.98 | 6.07 | 125.54 |
| 32 | 70.92 | 1,819.24 | 7.02 | 182.65 |
| 64 | 52.10 | 2,778.58 | 8.79 | 313.12 |
| 128 | 35.58 | 3,566.59 | 13.80 | 438.64 |
| 256 | 20.75 | 4,065.93 | 24.69 | 481.11 |

Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.62 | 52.71 | 5.43 | 10.97 |
| 2 | 52.65 | 102.99 | 5.48 | 21.65 |
| 4 | 52.06 | 205.56 | 5.58 | 42.61 |
| 8 | 51.06 | 393.93 | 5.68 | 82.31 |
| 16 | 46.755 | 715.89 | 6.08 | 152.11 |
| 32 | 39.55 | 1,152.97 | 7.80 | 228.80 |
| 64 | 31.22 | 1,663.88 | 9.36 | 353.91 |
| 128 | 23.00 | 2,055.51 | 13.94 | 433.91 |
| 256 | 17.44 | 1,873.44 | 22.85 | 427.95 |

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 48.75 | 47.98 | 6.37 | 9.40 |
| 2 | 47.28 | 92.89 | 6.63 | 18.00 |
| 4 | 45.10 | 176.53 | 6.65 | 35.80 |
| 8 | 42.53 | 333.45 | 7.04 | 67.80 |
| 16 | 38.39 | 597.84 | 7.95 | 119.70 |
| 32 | 29.86 | 929.18 | 10.12 | 187.40 |
| 64 | 30.00 | 933.09 | 20.11 | 187.20 |
| 128 | 30.03 | 934.30 | 39.85 | 186.00 |
| 256 | 30.05 | 932.61 | 76.19 | 187.79 |

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 105.74 | 104.30 | 2.75 | 21.70 |
| 2 | 103.21 | 204.22 | 2.82 | 42.40 |
| 4 | 99.41 | 393.69 | 3.10 | 77.10 |
| 8 | 93.98 | 745.29 | 3.26 | 146.70 |
| 16 | 81.62 | 1,294.14 | 3.64 | 262.60 |
| 32 | 60.55 | 1,924.74 | 4.97 | 384.40 |
| 64 | 60.54 | 1,928.70 | 10.03 | 379.40 |
| 128 | 62.57 | 1,912.53 | 19.68 | 383.09 |
| 256 | 60.00 | 1,911.45 | 38.36 | 386.14 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster

Important

You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.

The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the two tables to decide whether to host the model on the new unit.

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.44 | 26.84 | 11.66 | 5.10 |
| 2 | 26.56 | 51.93 | 11.44 | 10.39 |
| 4 | 25.66 | 100.31 | 11.97 | 19.89 |
| 8 | 24.98 | 193.34 | 11.96 | 39.48 |
| 16 | 20.73 | 322.99 | 14.86 | 63.76 |
| 32 | 18.39 | 562.55 | 16.50 | 114.21 |
| 64 | 15.05 | 877.61 | 20.42 | 180.76 |
| 128 | 10.79 | 1,210.61 | 29.53 | 241.73 |
| 256 | 8.67 | 1,301.65 | 47.22 | 282.78 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 32.66 | 25.79 | 10.78 | 5.56 |
| 2 | 31.36 | 50.81 | 10.06 | 11.68 |
| 4 | 29.86 | 96.01 | 10.87 | 21.52 |
| 8 | 27.89 | 170.45 | 10.87 | 34.09 |
| 16 | 24.74 | 282.52 | 13.51 | 60.35 |
| 32 | 21.51 | 457.24 | 16.73 | 91.42 |
| 64 | 17.68 | 676.90 | 18.29 | 152.47 |
| 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
| 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 95.50 | 51.58 | 6.12 | 9.78 |
| 2 | 92.25 | 98.89 | 6.44 | 18.53 |
| 4 | 90.51 | 184.54 | 7.37 | 30.67 |
| 8 | 83.38 | 326.71 | 7.64 | 57.06 |
| 16 | 71.45 | 509.03 | 8.77 | 90.02 |
| 32 | 58.48 | 724.23 | 10.00 | 138.82 |
| 64 | 44.74 | 1,146.92 | 14.07 | 206.58 |
| 128 | 27.00 | 1,434.57 | 22.48 | 268.58 |
| 256 | 18.03 | 1,635.95 | 41.06 | 309.97 |

Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 51.30 | 50.46 | 4.63 | 12.75 |
| 2 | 51.06 | 97.86 | 5.07 | 23.14 |
| 4 | 47.52 | 186.75 | 5.30 | 44.48 |
| 8 | 43.55 | 305.45 | 5.68 | 75.18 |
| 16 | 36.49 | 505.11 | 6.71 | 127.88 |
| 32 | 29.02 | 768.40 | 8.84 | 177.03 |
| 64 | 18.57 | 735.37 | 14.55 | 168.00 |
| 128 | 12.59 | 809.50 | 21.27 | 186.76 |
| 256 | 6.54 | 859.45 | 38.69 | 200.42 |

UK South (London)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 143.82 | 142.16 | 3.89 | 15.07 |
| 2 | 141.16 | 276.64 | 4.28 | 27.37 |
| 4 | 136.15 | 517.89 | 4.98 | 45.85 |
| 8 | 121.71 | 858.28 | 4.97 | 84.62 |
| 16 | 105.84 | 1,243.61 | 5.53 | 122.45 |
| 32 | 88.15 | 2,126.25 | 6.53 | 210.29 |
| 64 | 67.40 | 3,398.12 | 8.63 | 319.28 |
| 128 | 45.86 | 4,499.76 | 13.96 | 427.76 |
| 256 | 24.14 | 4,784.32 | 25.79 | 453.83 |

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 119.49 | 118.18 | 4.50 | 13.08 |
| 2 | 115.14 | 225.40 | 4.90 | 23.69 |
| 4 | 109.71 | 404.66 | 4.63 | 48.83 |
| 8 | 95.83 | 702.76 | 5.03 | 85.92 |
| 16 | 81.12 | 1,029.98 | 6.07 | 125.54 |
| 32 | 70.92 | 1,819.24 | 7.02 | 182.65 |
| 64 | 52.10 | 2,778.58 | 8.79 | 313.12 |
| 128 | 35.58 | 3,566.59 | 13.80 | 438.64 |
| 256 | 20.75 | 4,065.93 | 24.69 | 481.11 |

Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.62 | 52.71 | 5.43 | 10.97 |
| 2 | 52.65 | 102.99 | 5.48 | 21.65 |
| 4 | 52.06 | 205.56 | 5.58 | 42.61 |
| 8 | 51.06 | 393.93 | 5.68 | 82.31 |
| 16 | 46.755 | 715.89 | 6.08 | 152.11 |
| 32 | 39.55 | 1,152.97 | 7.80 | 228.80 |
| 64 | 31.22 | 1,663.88 | 9.36 | 353.91 |
| 128 | 23.00 | 2,055.51 | 13.94 | 433.91 |
| 256 | 17.44 | 1,873.44 | 22.85 | 427.95 |

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 48.75 | 47.98 | 6.37 | 9.40 |
| 2 | 47.28 | 92.89 | 6.63 | 18.00 |
| 4 | 45.10 | 176.53 | 6.65 | 35.80 |
| 8 | 42.53 | 333.45 | 7.04 | 67.80 |
| 16 | 38.39 | 597.84 | 7.95 | 119.70 |
| 32 | 29.86 | 929.18 | 10.12 | 187.40 |
| 64 | 30.00 | 933.09 | 20.11 | 187.20 |
| 128 | 30.03 | 934.30 | 39.85 | 186.00 |
| 256 | 30.05 | 932.61 | 76.19 | 187.79 |

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 105.74 | 104.30 | 2.75 | 21.70 |
| 2 | 103.21 | 204.22 | 2.82 | 42.40 |
| 4 | 99.41 | 393.69 | 3.10 | 77.10 |
| 8 | 93.98 | 745.29 | 3.26 | 146.70 |
| 16 | 81.62 | 1,294.14 | 3.64 | 262.60 |
| 32 | 60.55 | 1,924.74 | 4.97 | 384.40 |
| 64 | 60.54 | 1,928.70 | 10.03 | 379.40 |
| 128 | 62.57 | 1,912.53 | 19.68 | 383.09 |
| 256 | 60.00 | 1,911.45 | 38.36 | 386.14 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster

Important

You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.

The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the two tables to decide whether to host the model on the new unit.

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.44 | 26.84 | 11.66 | 5.10 |
| 2 | 26.56 | 51.93 | 11.44 | 10.39 |
| 4 | 25.66 | 100.31 | 11.97 | 19.89 |
| 8 | 24.98 | 193.34 | 11.96 | 39.48 |
| 16 | 20.73 | 322.99 | 14.86 | 63.76 |
| 32 | 18.39 | 562.55 | 16.50 | 114.21 |
| 64 | 15.05 | 877.61 | 20.42 | 180.76 |
| 128 | 10.79 | 1,210.61 | 29.53 | 241.73 |
| 256 | 8.67 | 1,301.65 | 47.22 | 282.78 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 32.66 | 25.79 | 10.78 | 5.56 |
| 2 | 31.36 | 50.81 | 10.06 | 11.68 |
| 4 | 29.86 | 96.01 | 10.87 | 21.52 |
| 8 | 27.89 | 170.45 | 10.87 | 34.09 |
| 16 | 24.74 | 282.52 | 13.51 | 60.35 |
| 32 | 21.51 | 457.24 | 16.73 | 91.42 |
| 64 | 17.68 | 676.90 | 18.29 | 152.47 |
| 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
| 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 95.50 | 51.58 | 6.12 | 9.78 |
| 2 | 92.25 | 98.89 | 6.44 | 18.53 |
| 4 | 90.51 | 184.54 | 7.37 | 30.67 |
| 8 | 83.38 | 326.71 | 7.64 | 57.06 |
| 16 | 71.45 | 509.03 | 8.77 | 90.02 |
| 32 | 58.48 | 724.23 | 10.00 | 138.82 |
| 64 | 44.74 | 1,146.92 | 14.07 | 206.58 |
| 128 | 27.00 | 1,434.57 | 22.48 | 268.58 |
| 256 | 18.03 | 1,635.95 | 41.06 | 309.97 |

Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 49.76 | 49.58 | 6.42 | 9.33 |
| 2 | 48.04 | 95.38 | 6.80 | 17.53 |
| 4 | 46.09 | 181.21 | 6.99 | 33.60 |
| 8 | 44.19 | 330.46 | 7.43 | 60.67 |
| 16 | 40.56 | 591.52 | 8.40 | 104.42 |
| 32 | 31.35 | 869.36 | 9.68 | 168.46 |
| 64 | 23.87 | 1,062.52 | 12.57 | 201.11 |
| 128 | 16.86 | 1,452.66 | 17.64 | 276.09 |
| 256 | 9.84 | 1,792.81 | 30.08 | 347.26 |

Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 51.30 | 50.46 | 4.63 | 12.75 |
| 2 | 51.06 | 97.86 | 5.07 | 23.14 |
| 4 | 47.52 | 186.75 | 5.30 | 44.48 |
| 8 | 43.55 | 305.45 | 5.68 | 75.18 |
| 16 | 36.49 | 505.11 | 6.71 | 127.88 |
| 32 | 29.02 | 768.40 | 8.84 | 177.03 |
| 64 | 18.57 | 735.37 | 14.55 | 168.00 |
| 128 | 12.59 | 809.50 | 21.27 | 186.76 |
| 256 | 6.54 | 859.45 | 38.69 | 200.42 |

Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 122.46 | 101.28 | 4.31 | 13.21 |
| 2 | 114.38 | 177.67 | 5.70 | 17.78 |
| 4 | 107.48 | 367.88 | 5.09 | 45.22 |
| 8 | 95.32 | 644.56 | 7.23 | 62.61 |
| 16 | 82.42 | 1,036.84 | 7.91 | 62.61 |
| 32 | 66.46 | 1,529.28 | 10.12 | 145.82 |
| 64 | 45.70 | 1,924.84 | 12.43 | 206.26 |
| 128 | 33.96 | 2,546.35 | 18.22 | 272.53 |
| 256 | 23.86 | 2,914.77 | 30.75 | 298.88 |

US Midwest (Chicago)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 143.82 | 142.16 | 3.89 | 15.07 |
| 2 | 141.16 | 276.64 | 4.28 | 27.37 |
| 4 | 136.15 | 517.89 | 4.98 | 45.85 |
| 8 | 121.71 | 858.28 | 4.97 | 84.62 |
| 16 | 105.84 | 1,243.61 | 5.53 | 122.45 |
| 32 | 88.15 | 2,126.25 | 6.53 | 210.29 |
| 64 | 67.40 | 3,398.12 | 8.63 | 319.28 |
| 128 | 45.86 | 4,499.76 | 13.96 | 427.76 |
| 256 | 24.14 | 4,784.32 | 25.79 | 453.83 |

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 119.49 | 118.18 | 4.50 | 13.08 |
| 2 | 115.14 | 225.40 | 4.90 | 23.69 |
| 4 | 109.71 | 404.66 | 4.63 | 48.83 |
| 8 | 95.83 | 702.76 | 5.03 | 85.92 |
| 16 | 81.12 | 1,029.98 | 6.07 | 125.54 |
| 32 | 70.92 | 1,819.24 | 7.02 | 182.65 |
| 64 | 52.10 | 2,778.58 | 8.79 | 313.12 |
| 128 | 35.58 | 3,566.59 | 13.80 | 438.64 |
| 256 | 20.75 | 4,065.93 | 24.69 | 481.11 |

Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.62 | 52.71 | 5.43 | 10.97 |
| 2 | 52.65 | 102.99 | 5.48 | 21.65 |
| 4 | 52.06 | 205.56 | 5.58 | 42.61 |
| 8 | 51.06 | 393.93 | 5.68 | 82.31 |
| 16 | 46.755 | 715.89 | 6.08 | 152.11 |
| 32 | 39.55 | 1,152.97 | 7.80 | 228.80 |
| 64 | 31.22 | 1,663.88 | 9.36 | 353.91 |
| 128 | 23.00 | 2,055.51 | 13.94 | 433.91 |
| 256 | 17.44 | 1,873.44 | 22.85 | 427.95 |

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 48.75 | 47.98 | 6.37 | 9.40 |
| 2 | 47.28 | 92.89 | 6.63 | 18.00 |
| 4 | 45.10 | 176.53 | 6.65 | 35.80 |
| 8 | 42.53 | 333.45 | 7.04 | 67.80 |
| 16 | 38.39 | 597.84 | 7.95 | 119.70 |
| 32 | 29.86 | 929.18 | 10.12 | 187.40 |
| 64 | 30.00 | 933.09 | 20.11 | 187.20 |
| 128 | 30.03 | 934.30 | 39.85 | 186.00 |
| 256 | 30.05 | 932.61 | 76.19 | 187.79 |

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 105.74 | 104.30 | 2.75 | 21.70 |
| 2 | 103.21 | 204.22 | 2.82 | 42.40 |
| 4 | 99.41 | 393.69 | 3.10 | 77.10 |
| 8 | 93.98 | 745.29 | 3.26 | 146.70 |
| 16 | 81.62 | 1,294.14 | 3.64 | 262.60 |
| 32 | 60.55 | 1,924.74 | 4.97 | 384.40 |
| 64 | 60.54 | 1,928.70 | 10.03 | 379.40 |
| 128 | 62.57 | 1,912.53 | 19.68 | 383.09 |
| 256 | 60.00 | 1,911.45 | 38.36 | 386.14 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster

Important

You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.

The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the two tables to decide whether to host the model on the new unit.

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.44 | 26.84 | 11.66 | 5.10 |
| 2 | 26.56 | 51.93 | 11.44 | 10.39 |
| 4 | 25.66 | 100.31 | 11.97 | 19.89 |
| 8 | 24.98 | 193.34 | 11.96 | 39.48 |
| 16 | 20.73 | 322.99 | 14.86 | 63.76 |
| 32 | 18.39 | 562.55 | 16.50 | 114.21 |
| 64 | 15.05 | 877.61 | 20.42 | 180.76 |
| 128 | 10.79 | 1,210.61 | 29.53 | 241.73 |
| 256 | 8.67 | 1,301.65 | 47.22 | 282.78 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 32.66 | 25.79 | 10.78 | 5.56 |
| 2 | 31.36 | 50.81 | 10.06 | 11.68 |
| 4 | 29.86 | 96.01 | 10.87 | 21.52 |
| 8 | 27.89 | 170.45 | 10.87 | 34.09 |
| 16 | 24.74 | 282.52 | 13.51 | 60.35 |
| 32 | 21.51 | 457.24 | 16.73 | 91.42 |
| 64 | 17.68 | 676.90 | 18.29 | 152.47 |
| 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
| 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 95.50 | 51.58 | 6.12 | 9.78 |
| 2 | 92.25 | 98.89 | 6.44 | 18.53 |
| 4 | 90.51 | 184.54 | 7.37 | 30.67 |
| 8 | 83.38 | 326.71 | 7.64 | 57.06 |
| 16 | 71.45 | 509.03 | 8.77 | 90.02 |
| 32 | 58.48 | 724.23 | 10.00 | 138.82 |
| 64 | 44.74 | 1,146.92 | 14.07 | 206.58 |
| 128 | 27.00 | 1,434.57 | 22.48 | 268.58 |
| 256 | 18.03 | 1,635.95 | 41.06 | 309.97 |

Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 30.51 | 30.36 | 10.47 | 5.73 |
| 2 | 28.85 | 57.37 | 11.09 | 10.68 |
| 4 | 27.99 | 108.49 | 11.13 | 21.08 |
| 8 | 25.61 | 196.68 | 13.27 | 34.65 |
| 16 | 21.97 | 318.82 | 15.36 | 56.37 |
| 32 | 16.01 | 428.45 | 18.55 | 82.88 |
| 64 | 11.60 | 563.70 | 24.31 | 108.58 |
| 128 | 7.50 | 650.40 | 40.64 | 40.64 |
| 256 | 4.58 | 927.31 | 67.42 | 172.42 |

Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 51.30 | 50.46 | 4.63 | 12.75 |
| 2 | 51.06 | 97.86 | 5.07 | 23.14 |
| 4 | 47.52 | 186.75 | 5.30 | 44.48 |
| 8 | 43.55 | 305.45 | 5.68 | 75.18 |
| 16 | 36.49 | 505.11 | 6.71 | 127.88 |
| 32 | 29.02 | 768.40 | 8.84 | 177.03 |
| 64 | 18.57 | 735.37 | 14.55 | 168.00 |
| 128 | 12.59 | 809.50 | 21.27 | 186.76 |
| 256 | 6.54 | 859.45 | 38.69 | 200.42 |

Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 122.46 | 101.28 | 4.31 | 13.21 |
| 2 | 114.38 | 177.67 | 5.70 | 17.78 |
| 4 | 107.48 | 367.88 | 5.09 | 45.22 |
| 8 | 95.32 | 644.56 | 7.23 | 62.61 |
| 16 | 82.42 | 1,036.84 | 7.91 | 62.61 |
| 32 | 66.46 | 1,529.28 | 10.12 | 145.82 |
| 64 | 45.70 | 1,924.84 | 12.43 | 206.26 |
| 128 | 33.96 | 2,546.35 | 18.22 | 272.53 |
| 256 | 23.86 | 2,914.77 | 30.75 | 298.88 |

Model: cohere.command (Cohere Command 52B) model hosted on one Large Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 36.32 | 31.29 | 8.15 | 7.12 |
| 8 | 30.15 | 106.03 | 13.19 | 23.86 |
| 32 | 23.94 | 204.41 | 23.90 | 45.84 |
| 128 | 14.36 | 254.54 | 65.26 | 56.58 |

Model: cohere.command-light (Cohere Command Light 6B) model hosted on one Small Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 69.17 | 69.19 | 3.57 | 15.69 |
| 8 | 38.75 | 208.22 | 6.54 | 45.08 |
| 32 | 17.98 | 337.35 | 13.49 | 75.50 |
| 128 | 4.01 | 397.36 | 37.69 | 92.17 |

Model: meta.llama-2-70b-chat (Llama 2 70B) model hosted on one Llama2 70 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|