Scenario 3: Generation-Heavy Benchmarks in Generative AI
The generation-heavy scenario covers use cases where the model's response dominates the request, for example a long job description generated from a short list of bullet points. The benchmarks for this scenario use the following token lengths:
- The prompt length is fixed at 100 tokens.
- The response length is fixed at 1,000 tokens.
Important
The performance (inference speed, throughput, and latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it hosts. A traffic scenario is characterized by factors such as the number of concurrent requests and the number of tokens in the prompt and in the response.
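To make the four columns in the tables below concrete, here is a minimal sketch of how such metrics are typically derived from per-request timings. It is an illustration under assumed definitions: the field names, the decode/prefill split, and the aggregation window are assumptions, not Oracle's published methodology.

```python
from dataclasses import dataclass

# Hypothetical per-request measurement for this scenario (assumed fields).
@dataclass
class RequestResult:
    output_tokens: int    # tokens generated; fixed at 1,000 in this scenario
    decode_s: float       # time spent generating tokens (after the first token)
    total_s: float        # end-to-end wall-clock time, including prefill

def summarize(results: list[RequestResult], window_s: float) -> dict:
    """Aggregate the four reported metrics over one measurement window."""
    n = len(results)
    return {
        # Token-level inference speed: mean per-request generation speed.
        "inference_speed_tps": sum(r.output_tokens / r.decode_s for r in results) / n,
        # Token-level throughput: tokens generated per second across all requests.
        "throughput_tps": sum(r.output_tokens for r in results) / window_s,
        # Request-level latency: mean end-to-end request time.
        "latency_s": sum(r.total_s for r in results) / n,
        # Request-level throughput: completed requests per minute (RPM).
        "throughput_rpm": n / (window_s / 60.0),
    }

# Illustrative call: one request at concurrency 1 that spends 6.76 s generating
# 1,000 tokens within an 8.18 s end-to-end window, similar to the first row of
# the first table below (~148 tokens/second, ~7.3 RPM).
print(summarize([RequestResult(1_000, 6.76, 8.18)], window_s=8.18))
```

Note how, at concurrency 1, token-level inference speed and token-level throughput nearly coincide (147.84 versus 148.54 tokens/second in the first table), while at high concurrency throughput keeps climbing as per-request speed falls.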
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 147.84 | 148.54 | 8.18 | 7.25 |
| 2 | 146.96 | 292.45 | 10.59 | 11.16 |
| 4 | 139.14 | 520.57 | 8.46 | 26.20 |
| 8 | 128.71 | 923.73 | 9.73 | 43.55 |
| 16 | 122.33 | 1,631.48 | 10.76 | 73.30 |
| 32 | 114.14 | 2,586.64 | 12.99 | 102.60 |
| 64 | 95.98 | 4,124.24 | 13.42 | 186.47 |
| 128 | 69.06 | 6,366.06 | 19.24 | 285.92 |
| 256 | 40.02 | 6,973.92 | 35.71 | 305.09 |
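One practical way to use a table like the one above (a usage sketch, not part of the source benchmarks) is to pick the highest tested concurrency whose request-level latency still fits a latency budget:

```python
# Rows copied from the Cohere Command R 08-2024 (Small Cohere V2) table above:
# (concurrency, request-level latency in seconds, request-level throughput in RPM)
ROWS = [
    (1, 8.18, 7.25), (2, 10.59, 11.16), (4, 8.46, 26.20), (8, 9.73, 43.55),
    (16, 10.76, 73.30), (32, 12.99, 102.60), (64, 13.42, 186.47),
    (128, 19.24, 285.92), (256, 35.71, 305.09),
]

def max_concurrency_under_slo(rows, max_latency_s: float):
    """Highest benchmarked concurrency whose latency meets the target, or None."""
    feasible = [r for r in rows if r[1] <= max_latency_s]
    return max(feasible, key=lambda r: r[0]) if feasible else None

# With a 15-second latency budget, concurrency 64 is the best fit here:
# 13.42 s latency at roughly 186 requests per minute.
print(max_concurrency_under_slo(ROWS, 15.0))  # -> (64, 13.42, 186.47)
```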
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 132.10 | 131.90 | 16.12 | 3.70 |
| 2 | 130.10 | 256.33 | 15.61 | 7.62 |
| 4 | 125.23 | 495.22 | 17.36 | 13.61 |
| 8 | 111.15 | 832.88 | 18.74 | 23.87 |
| 16 | 104.75 | 1,375.51 | 21.45 | 36.61 |
| 32 | 100.82 | 2,974.72 | 21.65 | 81.76 |
| 64 | 79.67 | 4,635.15 | 26.36 | 131.98 |
| 128 | 60.49 | 6,290.61 | 37.00 | 171.76 |
| 256 | 31.69 | 7,010.75 | 62.48 | 196.58 |
Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.55 | 53.21 | 18.70 | 3.19 |
| 2 | 52.83 | 103.10 | 18.97 | 6.19 |
| 4 | 53.40 | 206.18 | 18.77 | 12.37 |
| 8 | 53.25 | 412.36 | 18.85 | 24.74 |
| 16 | 51.53 | 812.24 | 19.48 | 48.73 |
| 32 | 45.99 | 1,447.02 | 21.861 | 86.82 |
| 64 | 45.99 | 2,599.88 | 23.81 | 156.00 |
| 128 | 34.76 | 4,216.35 | 29.32 | 252.98 |
| 256 | 23.72 | 3,826.77 | 44.02 | 229.61 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 49.15 | 48.33 | 20.37 | 2.90 |
| 2 | 48.73 | 96.67 | 20.57 | 2.90 |
| 4 | 48.17 | 186.67 | 20.85 | 11.20 |
| 8 | 47.53 | 373.33 | 21.20 | 22.40 |
| 16 | 46.69 | 720.00 | 21.75 | 43.20 |
| 32 | 41.65 | 1,279.99 | 24.54 | 76.80 |
| 64 | 41.92 | 1,279.98 | 47.75 | 76.80 |
| 128 | 41.93 | 1,279.96 | 91.49 | 76.80 |
| 256 | 41.88 | 1,279.93 | 166.93 | 76.80 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 106.36 | 105.00 | 9.41 | 6.30 |
| 2 | 104.89 | 206.67 | 9.55 | 12.40 |
| 4 | 101.93 | 400.00 | 9.84 | 24.00 |
| 8 | 98.89 | 773.33 | 10.17 | 46.40 |
| 16 | 91.20 | 1,439.99 | 11.07 | 86.40 |
| 32 | 72.13 | 2,239.98 | 14.03 | 134.40 |
| 64 | 72.29 | 2,293.30 | 27.49 | 137.60 |
| 128 | 72.30 | 2,239.89 | 53.75 | 134.39 |
| 256 | 72.27 | 2,239.84 | 102.37 | 134.39 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster
Important
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the following tables to decide whether to host the model on this new unit.
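To make that comparison concrete, the following sketch picks the unit with the higher request-level throughput at an expected concurrency. The helper is hypothetical; the RPM figures are copied from the two tables that follow.

```python
# Request-level throughput (RPM) at selected concurrency levels, copied from
# the two meta.llama-3.1-405b-instruct tables below.
RPM_LARGE_GENERIC_2 = {1: 1.60, 8: 11.99, 64: 70.66, 256: 121.48}
RPM_LARGE_GENERIC_4 = {1: 3.24, 8: 21.40, 64: 98.29, 256: 236.65}

def faster_unit(concurrency: int) -> str:
    """Return the unit with the higher raw RPM at the given concurrency.

    Note: this ignores price. Large Generic 2 uses less hardware, so a
    modest RPM deficit may still make it the cheaper choice per request.
    """
    g2 = RPM_LARGE_GENERIC_2[concurrency]
    g4 = RPM_LARGE_GENERIC_4[concurrency]
    return "Large Generic 2" if g2 >= g4 else "Large Generic 4"

print(faster_unit(64))  # -> Large Generic 4 (98.29 RPM vs 70.66 RPM)
```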
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.35 | 26.65 | 36.65 | 1.60 |
| 2 | 26.72 | 49.97 | 37.53 | 3.00 |
| 4 | 26.21 | 99.94 | 38.27 | 6.00 |
| 8 | 26.42 | 199.89 | 38.00 | 11.99 |
| 16 | 22.60 | 346.45 | 44.45 | 20.79 |
| 32 | 21.97 | 692.91 | 45.77 | 41.57 |
| 64 | 20.10 | 1,177.63 | 50.14 | 70.66 |
| 128 | 17.06 | 2,086.85 | 60.70 | 125.21 |
| 256 | 11.05 | 2,024.72 | 109.59 | 121.48 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |
Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 50.18 | 50.14 | 20.43 | 2.94 |
| 2 | 49.28 | 97.61 | 20.78 | 5.72 |
| 4 | 48.22 | 186.82 | 21.32 | 10.94 |
| 8 | 47.20 | 365.89 | 21.75 | 21.43 |
| 16 | 44.69 | 650.22 | 22.89 | 38.03 |
| 32 | 37.29 | 989.98 | 27.31 | 58.04 |
| 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
| 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
| 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |
Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |
Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
Germany Central (Frankfurt)
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 147.84 | 148.54 | 8.18 | 7.25 |
| 2 | 146.96 | 292.45 | 10.59 | 11.16 |
| 4 | 139.14 | 520.57 | 8.46 | 26.20 |
| 8 | 128.71 | 923.73 | 9.73 | 43.55 |
| 16 | 122.33 | 1,631.48 | 10.76 | 73.30 |
| 32 | 114.14 | 2,586.64 | 12.99 | 102.60 |
| 64 | 95.98 | 4,124.24 | 13.42 | 186.47 |
| 128 | 69.06 | 6,366.06 | 19.24 | 285.92 |
| 256 | 40.02 | 6,973.92 | 35.71 | 305.09 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 132.10 | 131.90 | 16.12 | 3.70 |
| 2 | 130.10 | 256.33 | 15.61 | 7.62 |
| 4 | 125.23 | 495.22 | 17.36 | 13.61 |
| 8 | 111.15 | 832.88 | 18.74 | 23.87 |
| 16 | 104.75 | 1,375.51 | 21.45 | 36.61 |
| 32 | 100.82 | 2,974.72 | 21.65 | 81.76 |
| 64 | 79.67 | 4,635.15 | 26.36 | 131.98 |
| 128 | 60.49 | 6,290.61 | 37.00 | 171.76 |
| 256 | 31.69 | 7,010.75 | 62.48 | 196.58 |
Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.55 | 53.21 | 18.70 | 3.19 |
| 2 | 52.83 | 103.10 | 18.97 | 6.19 |
| 4 | 53.40 | 206.18 | 18.77 | 12.37 |
| 8 | 53.25 | 412.36 | 18.85 | 24.74 |
| 16 | 51.53 | 812.24 | 19.48 | 48.73 |
| 32 | 45.99 | 1,447.02 | 21.861 | 86.82 |
| 64 | 45.99 | 2,599.88 | 23.81 | 156.00 |
| 128 | 34.76 | 4,216.35 | 29.32 | 252.98 |
| 256 | 23.72 | 3,826.77 | 44.02 | 229.61 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster
Important
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the following tables to decide whether to host the model on this new unit.
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.35 | 26.65 | 36.65 | 1.60 |
| 2 | 26.72 | 49.97 | 37.53 | 3.00 |
| 4 | 26.21 | 99.94 | 38.27 | 6.00 |
| 8 | 26.42 | 199.89 | 38.00 | 11.99 |
| 16 | 22.60 | 346.45 | 44.45 | 20.79 |
| 32 | 21.97 | 692.91 | 45.77 | 41.57 |
| 64 | 20.10 | 1,177.63 | 50.14 | 70.66 |
| 128 | 17.06 | 2,086.85 | 60.70 | 125.21 |
| 256 | 11.05 | 2,024.72 | 109.59 | 121.48 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |
Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 50.18 | 50.14 | 20.43 | 2.94 |
| 2 | 49.28 | 97.61 | 20.78 | 5.72 |
| 4 | 48.22 | 186.82 | 21.32 | 10.94 |
| 8 | 47.20 | 365.89 | 21.75 | 21.43 |
| 16 | 44.69 | 650.22 | 22.89 | 38.03 |
| 32 | 37.29 | 989.98 | 27.31 | 58.04 |
| 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
| 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
| 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |
Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |
Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
Japan Central (Osaka)
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 147.84 | 148.54 | 8.18 | 7.25 |
| 2 | 146.96 | 292.45 | 10.59 | 11.16 |
| 4 | 139.14 | 520.57 | 8.46 | 26.20 |
| 8 | 128.71 | 923.73 | 9.73 | 43.55 |
| 16 | 122.33 | 1,631.48 | 10.76 | 73.30 |
| 32 | 114.14 | 2,586.64 | 12.99 | 102.60 |
| 64 | 95.98 | 4,124.24 | 13.42 | 186.47 |
| 128 | 69.06 | 6,366.06 | 19.24 | 285.92 |
| 256 | 40.02 | 6,973.92 | 35.71 | 305.09 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 132.10 | 131.90 | 16.12 | 3.70 |
| 2 | 130.10 | 256.33 | 15.61 | 7.62 |
| 4 | 125.23 | 495.22 | 17.36 | 13.61 |
| 8 | 111.15 | 832.88 | 18.74 | 23.87 |
| 16 | 104.75 | 1,375.51 | 21.45 | 36.61 |
| 32 | 100.82 | 2,974.72 | 21.65 | 81.76 |
| 64 | 79.67 | 4,635.15 | 26.36 | 131.98 |
| 128 | 60.49 | 6,290.61 | 37.00 | 171.76 |
| 256 | 31.69 | 7,010.75 | 62.48 | 196.58 |
Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.55 | 53.21 | 18.70 | 3.19 |
| 2 | 52.83 | 103.10 | 18.97 | 6.19 |
| 4 | 53.40 | 206.18 | 18.77 | 12.37 |
| 8 | 53.25 | 412.36 | 18.85 | 24.74 |
| 16 | 51.53 | 812.24 | 19.48 | 48.73 |
| 32 | 45.99 | 1,447.02 | 21.861 | 86.82 |
| 64 | 45.99 | 2,599.88 | 23.81 | 156.00 |
| 128 | 34.76 | 4,216.35 | 29.32 | 252.98 |
| 256 | 23.72 | 3,826.77 | 44.02 | 229.61 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 49.15 | 48.33 | 20.37 | 2.90 |
| 2 | 48.73 | 96.67 | 20.57 | 2.90 |
| 4 | 48.17 | 186.67 | 20.85 | 11.20 |
| 8 | 47.53 | 373.33 | 21.20 | 22.40 |
| 16 | 46.69 | 720.00 | 21.75 | 43.20 |
| 32 | 41.65 | 1,279.99 | 24.54 | 76.80 |
| 64 | 41.92 | 1,279.98 | 47.75 | 76.80 |
| 128 | 41.93 | 1,279.96 | 91.49 | 76.80 |
| 256 | 41.88 | 1,279.93 | 166.93 | 76.80 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 106.36 | 105.00 | 9.41 | 6.30 |
| 2 | 104.89 | 206.67 | 9.55 | 12.40 |
| 4 | 101.93 | 400.00 | 9.84 | 24.00 |
| 8 | 98.89 | 773.33 | 10.17 | 46.40 |
| 16 | 91.20 | 1,439.99 | 11.07 | 86.40 |
| 32 | 72.13 | 2,239.98 | 14.03 | 134.40 |
| 64 | 72.29 | 2,293.30 | 27.49 | 137.60 |
| 128 | 72.30 | 2,239.89 | 53.75 | 134.39 |
| 256 | 72.27 | 2,239.84 | 102.37 | 134.39 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster
Important
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the following tables to decide whether to host the model on this new unit.
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.35 | 26.65 | 36.65 | 1.60 |
| 2 | 26.72 | 49.97 | 37.53 | 3.00 |
| 4 | 26.21 | 99.94 | 38.27 | 6.00 |
| 8 | 26.42 | 199.89 | 38.00 | 11.99 |
| 16 | 22.60 | 346.45 | 44.45 | 20.79 |
| 32 | 21.97 | 692.91 | 45.77 | 41.57 |
| 64 | 20.10 | 1,177.63 | 50.14 | 70.66 |
| 128 | 17.06 | 2,086.85 | 60.70 | 125.21 |
| 256 | 11.05 | 2,024.72 | 109.59 | 121.48 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |
Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |
Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
UK South (London)
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 147.84 | 148.54 | 8.18 | 7.25 |
| 2 | 146.96 | 292.45 | 10.59 | 11.16 |
| 4 | 139.14 | 520.57 | 8.46 | 26.20 |
| 8 | 128.71 | 923.73 | 9.73 | 43.55 |
| 16 | 122.33 | 1,631.48 | 10.76 | 73.30 |
| 32 | 114.14 | 2,586.64 | 12.99 | 102.60 |
| 64 | 95.98 | 4,124.24 | 13.42 | 186.47 |
| 128 | 69.06 | 6,366.06 | 19.24 | 285.92 |
| 256 | 40.02 | 6,973.92 | 35.71 | 305.09 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 132.10 | 131.90 | 16.12 | 3.70 |
| 2 | 130.10 | 256.33 | 15.61 | 7.62 |
| 4 | 125.23 | 495.22 | 17.36 | 13.61 |
| 8 | 111.15 | 832.88 | 18.74 | 23.87 |
| 16 | 104.75 | 1,375.51 | 21.45 | 36.61 |
| 32 | 100.82 | 2,974.72 | 21.65 | 81.76 |
| 64 | 79.67 | 4,635.15 | 26.36 | 131.98 |
| 128 | 60.49 | 6,290.61 | 37.00 | 171.76 |
| 256 | 31.69 | 7,010.75 | 62.48 | 196.58 |
Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.55 | 53.21 | 18.70 | 3.19 |
| 2 | 52.83 | 103.10 | 18.97 | 6.19 |
| 4 | 53.40 | 206.18 | 18.77 | 12.37 |
| 8 | 53.25 | 412.36 | 18.85 | 24.74 |
| 16 | 51.53 | 812.24 | 19.48 | 48.73 |
| 32 | 45.99 | 1,447.02 | 21.861 | 86.82 |
| 64 | 45.99 | 2,599.88 | 23.81 | 156.00 |
| 128 | 34.76 | 4,216.35 | 29.32 | 252.98 |
| 256 | 23.72 | 3,826.77 | 44.02 | 229.61 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 49.15 | 48.33 | 20.37 | 2.90 |
| 2 | 48.73 | 96.67 | 20.57 | 2.90 |
| 4 | 48.17 | 186.67 | 20.85 | 11.20 |
| 8 | 47.53 | 373.33 | 21.20 | 22.40 |
| 16 | 46.69 | 720.00 | 21.75 | 43.20 |
| 32 | 41.65 | 1,279.99 | 24.54 | 76.80 |
| 64 | 41.92 | 1,279.98 | 47.75 | 76.80 |
| 128 | 41.93 | 1,279.96 | 91.49 | 76.80 |
| 256 | 41.88 | 1,279.93 | 166.93 | 76.80 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 106.36 | 105.00 | 9.41 | 6.30 |
| 2 | 104.89 | 206.67 | 9.55 | 12.40 |
| 4 | 101.93 | 400.00 | 9.84 | 24.00 |
| 8 | 98.89 | 773.33 | 10.17 | 46.40 |
| 16 | 91.20 | 1,439.99 | 11.07 | 86.40 |
| 32 | 72.13 | 2,239.98 | 14.03 | 134.40 |
| 64 | 72.29 | 2,293.30 | 27.49 | 137.60 |
| 128 | 72.30 | 2,239.89 | 53.75 | 134.39 |
| 256 | 72.27 | 2,239.84 | 102.37 | 134.39 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster
Important
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the following tables to decide whether to host the model on this new unit.
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.35 | 26.65 | 36.65 | 1.60 |
| 2 | 26.72 | 49.97 | 37.53 | 3.00 |
| 4 | 26.21 | 99.94 | 38.27 | 6.00 |
| 8 | 26.42 | 199.89 | 38.00 | 11.99 |
| 16 | 22.60 | 346.45 | 44.45 | 20.79 |
| 32 | 21.97 | 692.91 | 45.77 | 41.57 |
| 64 | 20.10 | 1,177.63 | 50.14 | 70.66 |
| 128 | 17.06 | 2,086.85 | 60.70 | 125.21 |
| 256 | 11.05 | 2,024.72 | 109.59 | 121.48 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |
Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 50.18 | 50.14 | 20.43 | 2.94 |
| 2 | 49.28 | 97.61 | 20.78 | 5.72 |
| 4 | 48.22 | 186.82 | 21.32 | 10.94 |
| 8 | 47.20 | 365.89 | 21.75 | 21.43 |
| 16 | 44.69 | 650.22 | 22.89 | 38.03 |
| 32 | 37.29 | 989.98 | 27.31 | 58.04 |
| 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
| 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
| 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |
Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |
Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
US Midwest (Chicago)
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 147.84 | 148.54 | 8.18 | 7.25 |
| 2 | 146.96 | 292.45 | 10.59 | 11.16 |
| 4 | 139.14 | 520.57 | 8.46 | 26.20 |
| 8 | 128.71 | 923.73 | 9.73 | 43.55 |
| 16 | 122.33 | 1,631.48 | 10.76 | 73.30 |
| 32 | 114.14 | 2,586.64 | 12.99 | 102.60 |
| 64 | 95.98 | 4,124.24 | 13.42 | 186.47 |
| 128 | 69.06 | 6,366.06 | 19.24 | 285.92 |
| 256 | 40.02 | 6,973.92 | 35.71 | 305.09 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 132.10 | 131.90 | 16.12 | 3.70 |
| 2 | 130.10 | 256.33 | 15.61 | 7.62 |
| 4 | 125.23 | 495.22 | 17.36 | 13.61 |
| 8 | 111.15 | 832.88 | 18.74 | 23.87 |
| 16 | 104.75 | 1,375.51 | 21.45 | 36.61 |
| 32 | 100.82 | 2,974.72 | 21.65 | 81.76 |
| 64 | 79.67 | 4,635.15 | 26.36 | 131.98 |
| 128 | 60.49 | 6,290.61 | 37.00 | 171.76 |
| 256 | 31.69 | 7,010.75 | 62.48 | 196.58 |
Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.55 | 53.21 | 18.70 | 3.19 |
| 2 | 52.83 | 103.10 | 18.97 | 6.19 |
| 4 | 53.40 | 206.18 | 18.77 | 12.37 |
| 8 | 53.25 | 412.36 | 18.85 | 24.74 |
| 16 | 51.53 | 812.24 | 19.48 | 48.73 |
| 32 | 45.99 | 1,447.02 | 21.861 | 86.82 |
| 64 | 45.99 | 2,599.88 | 23.81 | 156.00 |
| 128 | 34.76 | 4,216.35 | 29.32 | 252.98 |
| 256 | 23.72 | 3,826.77 | 44.02 | 229.61 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 49.15 | 48.33 | 20.37 | 2.90 |
| 2 | 48.73 | 96.67 | 20.57 | 2.90 |
| 4 | 48.17 | 186.67 | 20.85 | 11.20 |
| 8 | 47.53 | 373.33 | 21.20 | 22.40 |
| 16 | 46.69 | 720.00 | 21.75 | 43.20 |
| 32 | 41.65 | 1,279.99 | 24.54 | 76.80 |
| 64 | 41.92 | 1,279.98 | 47.75 | 76.80 |
| 128 | 41.93 | 1,279.96 | 91.49 | 76.80 |
| 256 | 41.88 | 1,279.93 | 166.93 | 76.80 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 106.36 | 105.00 | 9.41 | 6.30 |
| 2 | 104.89 | 206.67 | 9.55 | 12.40 |
| 4 | 101.93 | 400.00 | 9.84 | 24.00 |
| 8 | 98.89 | 773.33 | 10.17 | 46.40 |
| 16 | 91.20 | 1,439.99 | 11.07 | 86.40 |
| 32 | 72.13 | 2,239.98 | 14.03 | 134.40 |
| 64 | 72.29 | 2,293.30 | 27.49 | 137.60 |
| 128 | 72.30 | 2,239.89 | 53.75 | 134.39 |
| 256 | 72.27 | 2,239.84 | 102.37 | 134.39 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster
Important
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the following tables to decide whether to host the model on this new unit.
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.35 | 26.65 | 36.65 | 1.60 |
| 2 | 26.72 | 49.97 | 37.53 | 3.00 |
| 4 | 26.21 | 99.94 | 38.27 | 6.00 |
| 8 | 26.42 | 199.89 | 38.00 | 11.99 |
| 16 | 22.60 | 346.45 | 44.45 | 20.79 |
| 32 | 21.97 | 692.91 | 45.77 | 41.57 |
| 64 | 20.10 | 1,177.63 | 50.14 | 70.66 |
| 128 | 17.06 | 2,086.85 | 60.70 | 125.21 |
| 256 | 11.05 | 2,024.72 | 109.59 | 121.48 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |
Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 30.53 | 30.51 | 33.58 | 1.79 |
| 2 | 29.78 | 59.01 | 34.42 | 3.45 |
| 4 | 28.88 | 112.35 | 35.48 | 6.58 |
| 8 | 27.67 | 215.18 | 36.99 | 12.61 |
| 16 | 24.85 | 364.06 | 40.99 | 21.34 |
| 32 | 20.51 | 552.34 | 49.60 | 32.35 |
| 64 | 16.12 | 900.39 | 59.36 | 52.72 |
| 128 | 10.17 | 980.45 | 100.27 | 57.43 |
| 256 | 6.30 | 1,334.59 | 162.08 | 78.19 |
Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |
Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
Model: cohere.command (Cohere Command 52 B) model hosted on one Large Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 35.78 | 33.43 | 10.98 | 5.33 |
| 8 | 31.41 | 99.67 | 13.87 | 16.61 |
| 32 | 28.49 | 237.10 | 19.48 | 40.24 |
| 128 | 23.01 | 326.93 | 53.13 | 54.89 |
Model: cohere.command-light (Cohere Command Light 6 B) model hosted on one Small Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 80.38 | 83.61 | 9.19 | 6.34 |
| 8 | 45.96 | 278.91 | 13.89 | 22.46 |
| 32 | 23.90 | 493.78 | 27.34 | 41.13 |
| 128 | 5.12 | 565.06 | 82.15 | 44.89 |
Model: meta.llama-2-70b-chat (Llama2 70 B) model hosted on one Llama2 70 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|