Training Data in Generative AI

This topic gives guidelines for creating the training data used to fine-tune the pretrained models in OCI Generative AI. A custom model can be fine-tuned with only one dataset, which the service randomly splits into 80% training data and 20% validation data. The dataset must be a JSONL file containing at least 32 prompt/completion pairs, with each line formatted as: {"prompt": "<your prompt>", "completion": "<expected response>"}. Store the file in an OCI Object Storage bucket and reference it when you create the custom model.

Dataset Requirements

Datasets for training custom models have the following requirements:

A maximum of one fine-tuning dataset is allowed per custom model. This dataset is randomly split in an 80:20 ratio for training and validation.
Each file must have at least 32 prompt/completion pair examples.
The file format is JSONL.
Each line in the JSONL file has the following format:
{"prompt": "<a prompt>", "completion": "<expected response given the prompt>"}\n
The file must be stored in an OCI Object Storage bucket.
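
The size requirement and the split can be illustrated locally. The following is a minimal sketch, not the service's implementation: `MIN_EXAMPLES`, `check_size`, and `illustrate_split` are hypothetical names, and the way the service actually shuffles and rounds the 80:20 split is an assumption.

```python
import random

MIN_EXAMPLES = 32  # minimum number of prompt/completion pairs per file


def check_size(examples):
    """Raise ValueError if the dataset has fewer than MIN_EXAMPLES examples."""
    if len(examples) < MIN_EXAMPLES:
        raise ValueError(
            f"dataset has {len(examples)} examples; "
            f"at least {MIN_EXAMPLES} are required"
        )


def illustrate_split(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it at train_fraction.

    Assumption: this only mimics a random 80:20 split locally. The service
    performs its own split and may round differently.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up data, used only to exercise the two helpers.
examples = [
    {"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(40)
]
check_size(examples)
train, validation = illustrate_split(examples)
print(len(train), len(validation))
```

Because the validation share comes out of the same file, a dataset close to the 32-example minimum leaves only a handful of examples for validation.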

JSONL Format

About JSONL

A JSONL file contains one JSON value or object per line. Unlike a regular JSON file, the file isn't evaluated as a whole. Instead, each line is treated as if it were a separate JSON file. This format is well suited to storing a set of inputs in JSON format.
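
The difference between parsing the file as a whole and parsing it line by line can be seen with Python's standard `json` module. This is a small sketch; the variable names are illustrative only.

```python
import json

jsonl_text = (
    '{"prompt": "What is the capital of France?", '
    '"completion": "The capital of France is Paris."}\n'
    '{"prompt": "What is the smallest state in the USA?", '
    '"completion": "The smallest state in the USA is Rhode Island."}\n'
)

# Parsing the whole text as one JSON document fails, because it holds
# two top-level objects rather than one.
try:
    json.loads(jsonl_text)
    parsed_as_whole = True
except json.JSONDecodeError:
    parsed_as_whole = False

# Parsing line by line works: each non-empty line is its own JSON document.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(parsed_as_whole)
print(records[1]["completion"])
```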

The OCI Generative AI service accepts a JSONL file for fine-tuning custom models in the following format:

{"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}
{"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}
...

JSONL Example

{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
{"prompt": "What is the smallest state in the USA?", "completion": "The smallest state in the USA is Rhode Island."}
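
A file like this is easier to get right when it is generated with a JSON library than when it is typed by hand. The following sketch uses Python's standard `json` module; `write_jsonl` and the file name `train.jsonl` are hypothetical, and the second pair is made up to show escaping.

```python
import json
import os
import tempfile


def write_jsonl(pairs, path):
    """Write (prompt, completion) pairs to path, one JSON object per line."""
    # newline="\n" keeps line endings as plain \n on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for prompt, completion in pairs:
            record = {"prompt": prompt, "completion": completion}
            # json.dumps escapes quotes and embedded newlines, so each
            # record stays on a single line. ensure_ascii=False keeps
            # non-ASCII characters as UTF-8 text instead of \u escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ('Say "hello"\non two lines.', "hello\nhello"),
]
# A temporary directory is used here only to keep the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(pairs, path)
```

Serializing through `json.dumps` means a prompt that contains a quotation mark or a line break still ends up as one valid JSON object on one line.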

Note

Ensure that each JSONL dataset file that you create for Generative AI has the following properties:

The file is UTF-8 encoded.
Each line item contains a valid JSON object.
Each JSON object has two properties: "prompt" and "completion".
Each JSON object is entered on a new line or followed by a newline character (\n).
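
The properties listed above can be checked before the file is uploaded. The following is a minimal sketch that assumes each object must have exactly the two keys and that blank lines are skipped; neither detail is confirmed by the service, and `validate_jsonl_bytes` is a hypothetical helper.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}


def validate_jsonl_bytes(data):
    """Return a list of problems found in raw JSONL bytes (empty if none)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"file is not valid UTF-8: {exc}"]

    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            # Assumption: blank lines (including the empty string after
            # the final newline) are skipped rather than reported.
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(obj, dict):
            problems.append(f"line {number}: not a JSON object")
        elif set(obj) != REQUIRED_KEYS:
            problems.append(
                f"line {number}: keys are {sorted(obj)}, "
                f"expected {sorted(REQUIRED_KEYS)}"
            )
    return problems


good = b'{"prompt": "a", "completion": "b"}\n'
bad = b'{"prompt": "a"}\nnot json\n'
print(validate_jsonl_bytes(good))
print(validate_jsonl_bytes(bad))
```

Reading the file in binary mode (`open(path, "rb").read()`) and passing the bytes to the helper lets the UTF-8 check run on exactly what will be uploaded.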

After you create the JSONL file, add your dataset to an Object Storage bucket.
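
One way to add the file to a bucket is the OCI Python SDK. The sketch below is an illustration under stated assumptions: the `oci` package is installed, a default configuration file exists at `~/.oci/config`, and the bucket and object names are placeholders. `upload_dataset` and `main` are hypothetical helpers, not part of the SDK.

```python
def upload_dataset(client, namespace, bucket_name, object_name, local_path):
    """Upload a local JSONL file to an Object Storage bucket.

    `client` is expected to behave like
    oci.object_storage.ObjectStorageClient.
    """
    with open(local_path, "rb") as f:
        return client.put_object(namespace, bucket_name, object_name, f)


def main():
    # Imported here so that upload_dataset can be used without the SDK,
    # for example with a stand-in client in tests.
    import oci

    config = oci.config.from_file()  # reads the DEFAULT profile
    client = oci.object_storage.ObjectStorageClient(config)
    namespace = client.get_namespace().data
    # "my-training-bucket" and "train.jsonl" are placeholder names.
    upload_dataset(
        client, namespace, "my-training-bucket", "train.jsonl", "train.jsonl"
    )

# Call main() to run the upload against a real tenancy.
```

The same upload can also be done from the Console or the OCI CLI; the SDK route is shown only because it fits into a scripted data preparation step.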

Oracle Cloud Infrastructure Documentation
