LLaMA-Factory-训练数据默认使用-instructioninputoutput-三个-key

2025-03-14 约 1535 字预计阅读 4 分钟

https://bing.ee123.net/img/rand?artid=146253730

LLaMA-Factory 训练数据默认使用 instruction、input、output 三个 key

在 LLaMA-Factory 进行 SFT（Directive supervision fine-tuning 指令监督微调）时， 训练数据的格式 非常重要，因为大模型依赖标准化的数据结构来学习 指令-响应 模式。

identity.json 文件的数据采用了 “instruction”、“input”、“output” 这三个 key，它们的作用如下：

Key	作用	示例
“instruction”	代表用户给 AI 的指令（问题或任务描述）。	`"instruction": "Who are you?"`
“input”	可选，表示提供给 AI 的额外输入信息。	`"input": ""` （如果不需要额外输入，可以留空）
“output”	期望 AI 生成的回答（AI 的理想输出）。	`"output": "I am { {name}}, an AI assistant developed by { {author}}."`

1. LLaMA-Factory 训练数据的格式要求

如果你要换用自己的数据集，格式应该符合 以下 JSON 结构 ：

identity.json文件

[
  {
    "instruction": "请介绍一下你自己。",
    "input": "",
    "output": "我是Llama-Chinese，由LLaMA Factory开发。我可以帮助您回答问题、提供信息和进行对话。"
  },
  {
    "instruction": "翻译下面的句子：",
    "input": "Hello, how are you?",
    "output": "你好，你怎么样？"
  },
  {
    "instruction": "请用Python写一个计算斐波那契数列的函数。",
    "input": "",
    "output": "```python\ndef fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)\n```"
  }
]

格式要求

JSON 格式的数组 ，每个数据点是一个字典 {} 。
必须包含 instruction 和 output ， input 可选。
instruction 是给 AI 的指令， output 是期望的 AI 输出 。
input 只有在额外信息需要提供时才填 （例如机器翻译任务）。

2. 你可以使用哪些 key？如何让大模型正确学习？

(1) 基础的三元结构

最通用的格式 ：

{
  "instruction": "请翻译这句话：",
  "input": "The weather is great today.",
  "output": "今天天气很好。"
}

适用于： 对话数据、翻译、问答任务 。

(2) 只有 `instruction` 和 `output`

如果任务不需要额外输入，可以省略 input ：

{
  "instruction": "写一个Python函数，计算平方数。",
  "output": "```python\ndef square(x):\n    return x * x\n```"
}

适用于： 代码生成、开放性问题、一般问答任务 。

(3) 只有 `input` 和 `output` （极少数情况）

某些任务 instruction 可以省略：

{
  "input": "What is the capital of France?",
  "output": "Paris"
}

适用于： 部分问答任务、无明确指令的情况 。
但在 LLaMA-Factory 里， 通常 instruction 是必须的 ，所以不建议这么做。

3. 如何自定义数据集 key？

如果你想改 key 的名字，比如你想用 "query" 代替 "instruction" ，那你需要 在 LLaMA-Factory 里修改解析代码 ，否则模型不会正确识别你的数据。

但 最简单的做法 是 保持 instruction 、 input 、 output 这三个 key ，因为 LLaMA-Factory 默认支持这种格式，不需要额外修改代码。

4. 如何确认你的数据格式正确？

在 训练之前 ，可以用 Python 代码检查 JSON 格式：

import json

with open("your_dataset.json", "r", encoding="utf-8") as f:
    dataset = json.load(f)

for sample in dataset[:3]:  # 只检查前3条数据
    assert "instruction" in sample, "数据缺少 instruction 字段"
    assert "output" in sample, "数据缺少 output 字段"
    print(sample)

如果没有报错，说明格式正确。

5. 其他数据格式示例

(1) 翻译任务

[
  {
    "instruction": "请把以下文本翻译成法语。",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment ça va?"
  }
]

适用于 机器翻译微调 。

(2) 代码生成任务

[
  {
    "instruction": "写一个 Python 函数，判断一个数是否是素数。",
    "input": "",
    "output": "```python\ndef is_prime(n):\n    if n < 2:\n        return False\n    for i in range(2, int(n ** 0.5) + 1):\n        if n % i == 0:\n            return False\n    return True```"
  }
]

适用于 代码生成模型微调 。

(3) 文本摘要任务

[
  {
    "instruction": "请总结以下文章的主要内容。",
    "input": "最近人工智能技术取得了重大突破，许多公司开始投入大规模模型的研究。",
    "output": "人工智能技术发展迅速，公司纷纷加大对大模型的投入。"
  }
]

适用于 文本摘要微调 。

6. 总结

✅ LLaMA-Factory 训练数据默认使用 instruction 、 input 、 output 三个 key 。

✅ 确保 JSON 格式正确，否则模型无法正确学习 。

✅ 如果要自定义 key，建议修改 LLaMA-Factory 的代码（不推荐） ，或保持默认格式。

✅ 数据结构要根据任务类型调整 ，比如：

问答、闲聊： instruction + output
翻译任务： instruction + input + output
代码生成： instruction + output
文本摘要： instruction + input + output

💡 结论：如果你想换数据集，只需要用 instruction 、 input 、 output 结构组织数据，并替换内容，就可以直接用于 LLaMA-Factory 训练！ 🚀