Review performance benchmarks for the cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2 unit of a dedicated AI cluster in OCI Generative AI.
See details for the model and review the following sections:
Random Length
This scenario mimics text generation use cases where the prompt and response lengths are unknown ahead of time. Because both lengths are unknown, we've used a stochastic approach in which the prompt and response lengths each follow a normal distribution: the prompt length has a mean of 480 tokens and a standard deviation of 240 tokens, and the response length has a mean of 300 tokens and a standard deviation of 150 tokens.
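As an illustration of this workload shape, here is a minimal sketch of how the prompt and response lengths could be sampled. The rounding and the clipping to a minimum of one token are assumptions, since the description doesn't specify how fractional or non-positive samples are handled:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
num_requests = 1000

# Prompt length ~ Normal(mean=480, std=240); response length ~ Normal(mean=300, std=150).
# Round to whole tokens and clip to at least 1 token (assumed handling).
prompt_lengths = np.clip(rng.normal(480, 240, num_requests).round().astype(int), 1, None)
response_lengths = np.clip(rng.normal(300, 150, num_requests).round().astype(int), 1, None)
```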
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 122.46 | 101.28 | 4.31 | 13.21 |
| 2 | 114.38 | 177.67 | 5.70 | 17.78 |
| 4 | 107.48 | 367.88 | 5.09 | 45.22 |
| 8 | 95.32 | 644.56 | 7.23 | 62.61 |
| 16 | 82.42 | 1,036.84 | 7.91 | 62.61 |
| 32 | 66.46 | 1,529.28 | 10.12 | 145.82 |
| 64 | 45.70 | 1,924.84 | 12.43 | 206.26 |
| 128 | 33.96 | 2,546.35 | 18.22 | 272.53 |
| 256 | 23.86 | 2,914.77 | 30.75 | 298.88 |
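The page doesn't describe the measurement harness behind these numbers. The sketch below shows one plausible way the four reported metrics could be computed at a single concurrency level; the send_request stub is hypothetical and only simulates a call to a generation endpoint.

```python
import asyncio
import time

async def send_request(prompt_tokens: int, max_output_tokens: int) -> int:
    """Hypothetical stand-in for a call to the model endpoint.

    Simulates a request and returns the number of generated tokens; swap in a
    real client call in practice.
    """
    await asyncio.sleep(0.01)
    return max_output_tokens

async def run_level(concurrency: int, requests_per_worker: int = 10,
                    prompt_tokens: int = 480, output_tokens: int = 300) -> dict:
    latencies: list[float] = []   # per-request end-to-end latency (seconds)
    speeds: list[float] = []      # per-request generation speed (tokens/second)
    total_tokens = 0

    async def worker() -> None:
        nonlocal total_tokens
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            generated = await send_request(prompt_tokens, output_tokens)
            elapsed = time.perf_counter() - start
            latencies.append(elapsed)
            speeds.append(generated / elapsed)
            total_tokens += generated

    wall_start = time.perf_counter()
    await asyncio.gather(*(worker() for _ in range(concurrency)))
    wall = time.perf_counter() - wall_start

    n = len(latencies)
    return {
        "token_level_inference_speed": sum(speeds) / n,  # avg tokens/second per request
        "token_level_throughput": total_tokens / wall,   # total tokens/second across requests
        "request_level_latency": sum(latencies) / n,     # avg seconds per request
        "request_level_throughput": n / wall * 60,       # completed requests per minute (RPM)
    }

# Example: measure one concurrency level
# print(asyncio.run(run_level(concurrency=8)))
```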
Chat
This scenario covers chat and dialog use cases where the prompt and response are short. The prompt and response lengths are each fixed at 100 tokens.
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |
Generation Heavy
This scenario is for generation-heavy use cases where the model response is long, for example, a long job description generated from a short bullet list of items. The prompt length is fixed at 100 tokens and the response length is fixed at 1,000 tokens.
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 126.40 | 110.90 | 13.07 | 4.57 |
| 2 | 122.93 | 213.92 | 13.33 | 8.87 |
| 4 | 117.03 | 403.27 | 15.32 | 15.26 |
| 8 | 106.11 | 707.45 | 16.86 | 26.78 |
| 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
| 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
| 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
| 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
| 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
RAG
The retrieval-augmented generation (RAG) scenario has a very long prompt and a short response, such as summarization use cases. The prompt length is fixed at 2,000 tokens and the response length is fixed at 200 tokens.
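For reference, the workload shapes of the four scenarios described in this topic can be summarized as follows; the dictionary keys are informal labels chosen here for illustration, not official scenario identifiers:

```python
# Prompt and response lengths (in tokens) taken from the scenario descriptions above.
SCENARIO_SHAPES = {
    "random_length":    {"prompt": ("normal", 480, 240), "response": ("normal", 300, 150)},
    "chat":             {"prompt": 100,  "response": 100},
    "generation_heavy": {"prompt": 100,  "response": 1000},
    "rag":              {"prompt": 2000, "response": 200},
}
```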
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|