Scenario 3: Generation Heavy Benchmarks in Generative AI
The generation-heavy scenario covers use cases where the model's response is much longer than the prompt, for example a long job description generated from a short bullet list of requirements.
The generation-heavy scenario is performed with the following token lengths:
- The prompt length is fixed at 100 tokens.
- The response length is fixed at 1,000 tokens.
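As a rough illustration of this scenario's shape, the sketch below builds a benchmark-style request with a prompt padded to about 100 tokens and the response capped at 1,000 tokens. The payload field names (`prompt`, `max_tokens`) and the one-word-per-token estimate are assumptions for illustration, not the service's actual API:

```python
def build_generation_heavy_request(bullets):
    """Build a generation-heavy payload: ~100-token prompt, 1,000-token response cap.

    The field names (prompt, max_tokens) are illustrative assumptions, not the
    actual request schema of any specific inference service.
    """
    prompt = "Write a detailed job description based on these points:\n"
    prompt += "\n".join(f"- {b}" for b in bullets)
    # Very rough token estimate: ~1 token per whitespace-separated word.
    words = prompt.split()
    # Pad the prompt up to roughly 100 "tokens" to match the scenario spec.
    while len(words) < 100:
        words.append("detail")
    return {
        "prompt": " ".join(words),
        "max_tokens": 1000,  # fixed response length for this scenario
    }

payload = build_generation_heavy_request(["5+ years Python", "Remote", "Team lead"])
```

In a real harness, many such payloads would be issued in parallel at each concurrency level while per-request timings are recorded.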
Important: The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it hosts. Traffic scenarios depend on:
- The number of concurrent requests.
- The number of tokens in the prompt.
- The number of tokens in the response.
- The variance of prompt and response token counts across requests.
Review the terms used in the hosting dedicated AI cluster benchmarks. For a list of scenarios and their descriptions, see Chat and Text Generation Scenarios. The generation-heavy scenario is performed in the following regions.
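To make the four reported metrics concrete, the sketch below computes them from per-request measurements of a load test. The definitions used here are common conventions assumed for illustration, not the benchmark's published methodology:

```python
import statistics

def summarize_benchmark(requests):
    """Summarize per-request measurements into the four benchmark metrics.

    Each request is a dict with start_s, end_s (wall-clock seconds) and
    output_tokens. The metric definitions below are assumed conventions,
    not taken from the benchmark's stated methodology.
    """
    wall_s = max(r["end_s"] for r in requests) - min(r["start_s"] for r in requests)
    latencies = [r["end_s"] - r["start_s"] for r in requests]
    return {
        # Token-level inference speed: per-request generation rate, averaged.
        "speed": statistics.mean(
            r["output_tokens"] / lat for r, lat in zip(requests, latencies)
        ),
        # Token-level throughput: total generated tokens over the whole run.
        "throughput": sum(r["output_tokens"] for r in requests) / wall_s,
        # Request-level latency: mean end-to-end seconds per request.
        "latency_s": statistics.mean(latencies),
        # Request-level throughput: completed requests per minute (RPM).
        "rpm": len(requests) * 60 / wall_s,
    }
```

Note the distinction this makes visible: inference speed is a per-request rate (it falls as concurrency rises), while token-level throughput aggregates all concurrent streams (it rises with concurrency).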
Brazil East (Sao Paulo)
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 50.18 | 50.14 | 20.43 | 2.94 |
| 2 | 49.28 | 97.61 | 20.78 | 5.72 |
| 4 | 48.22 | 186.82 | 21.32 | 10.94 |
| 8 | 47.20 | 365.89 | 21.75 | 21.43 |
| 16 | 44.69 | 650.22 | 22.89 | 38.03 |
| 32 | 37.29 | 989.98 | 27.31 | 58.04 |
| 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
| 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
| 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
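A useful sanity check on these tables: in a closed-loop load test, request-level throughput roughly follows concurrency × 60 / latency. The helper below sketches that relation; the assumption that each concurrent stream issues its next request immediately after the previous one completes is mine, not stated by the benchmark:

```python
def approx_rpm(concurrency, latency_s):
    """Approximate requests/minute for a closed-loop load test.

    Assumes each of `concurrency` parallel streams issues its next request
    as soon as the previous one completes (an assumption about the test
    harness, not part of the published benchmark methodology).
    """
    return concurrency * 60 / latency_s

# Checked against the Meta Llama 3.1 (405B) row at concurrency 2:
# approx_rpm(2, 16.14) gives about 7.43 RPM, close to the reported 7.12 RPM.
```

The gap between the estimate and the reported value reflects harness overhead and variance across requests; the approximation degrades at high concurrency, where queueing dominates.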
Germany Central (Frankfurt)
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 50.18 | 50.14 | 20.43 | 2.94 |
| 2 | 49.28 | 97.61 | 20.78 | 5.72 |
| 4 | 48.22 | 186.82 | 21.32 | 10.94 |
| 8 | 47.20 | 365.89 | 21.75 | 21.43 |
| 16 | 44.69 | 650.22 | 22.89 | 38.03 |
| 32 | 37.29 | 989.98 | 27.31 | 58.04 |
| 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
| 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
| 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
UK South (London)
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 50.18 | 50.14 | 20.43 | 2.94 |
| 2 | 49.28 | 97.61 | 20.78 | 5.72 |
| 4 | 48.22 | 186.82 | 21.32 | 10.94 |
| 8 | 47.20 | 365.89 | 21.75 | 21.43 |
| 16 | 44.69 | 650.22 | 22.89 | 38.03 |
| 32 | 37.29 | 989.98 | 27.31 | 58.04 |
| 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
| 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
| 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
US Midwest (Chicago)
- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 31.28 | 26.55 | 18.50 | 3.24 |
| 2 | 30.79 | 50.88 | 16.14 | 7.12 |
| 4 | 29.46 | 93.36 | 18.15 | 12.09 |
| 8 | 28.20 | 170.20 | 19.40 | 21.40 |
| 16 | 26.37 | 271.80 | 17.73 | 40.56 |
| 32 | 25.24 | 419.13 | 21.06 | 55.06 |
| 64 | 22.19 | 755.43 | 24.38 | 98.29 |
| 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
| 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 95.37 | 52.01 | 19.56 | 3.07 |
| 2 | 92.77 | 101.29 | 20.04 | 5.98 |
| 4 | 91.60 | 191.83 | 20.34 | 11.32 |
| 8 | 86.83 | 338.87 | 21.51 | 19.97 |
| 16 | 78.12 | 547.34 | 23.92 | 32.23 |
| 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
| 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
| 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
| 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 30.53 | 30.51 | 33.58 | 1.79 |
| 2 | 29.78 | 59.01 | 34.42 | 3.45 |
| 4 | 28.88 | 112.35 | 35.48 | 6.58 |
| 8 | 27.67 | 215.18 | 36.99 | 12.61 |
| 16 | 24.85 | 364.06 | 40.99 | 21.34 |
| 32 | 20.51 | 552.34 | 49.60 | 32.35 |
| 64 | 16.12 | 900.39 | 59.36 | 52.72 |
| 128 | 10.17 | 980.45 | 100.27 | 57.43 |
| 256 | 6.30 | 1,334.59 | 162.08 | 78.19 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 47.20 | 50.32 | 3.53 | 16.65 |
| 2 | 45.06 | 98.42 | 3.61 | 32.48 |
| 4 | 43.85 | 165.60 | 3.26 | 63.91 |
| 8 | 40.56 | 292.22 | 3.04 | 133.20 |
| 16 | 38.35 | 416.13 | 3.61 | 171.22 |
| 32 | 28.68 | 557.50 | 4.64 | 219.01 |
| 64 | 15.19 | 613.72 | 9.65 | 171.83 |
| 128 | 10.74 | 664.11 | 11.67 | 233.87 |
| 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |

- Model: cohere.command (Cohere Command 52B), hosted on one Large Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 35.78 | 33.43 | 10.98 | 5.33 |
| 8 | 31.41 | 99.67 | 13.87 | 16.61 |
| 32 | 28.49 | 237.10 | 19.48 | 40.24 |
| 128 | 23.01 | 326.93 | 53.13 | 54.89 |

- Model: cohere.command-light (Cohere Command Light 6B), hosted on one Small Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 80.38 | 83.61 | 9.19 | 6.34 |
| 8 | 45.96 | 278.91 | 13.89 | 22.46 |
| 32 | 23.90 | 493.78 | 27.34 | 41.13 |
| 128 | 5.12 | 565.06 | 82.15 | 44.89 |

- Model: meta.llama-2-70b-chat (Llama 2 70B), hosted on one Llama2 70 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (RPM) |
| --- | --- | --- | --- | --- |
| 1 | 18.12 | 17.58 | 21.44 | 2.72 |
| 8 | 15.96 | 64.28 | 26.83 | 8.91 |
| 32 | 13.72 | 195.48 | 29.43 | 27.99 |
| 128 | 8.61 | 541.75 | 48.50 | 71.52 |