【Llama-2】8bit+LoRAでRLHFファインチューニングを試す【学習できることは確認】

2023年7月20日2023年7月21日

Llama-2が出たのでRLHFを試してみました。

事前学習モデルでは教師ありファインチューニングをしてから行う必要がありますが、すでに調整されているモデルが公開されているのでそちらを使います。

前提

・モデルの申請などがすでに終わっていてモデルを使うことができる

・accelerateの設定(accelerate config)が終わっている

この記事で分かること

・Llama-2のRLHFファインチューニングの方法

・パラメータの変更方法

学習モデル

meta-llama/Llama-2-7b-chat-hfを使います。

とりあえず一番小さい（これでも大きい）モデルでお試しです。

マシンパワーさえあれば70bとかもいけると思います。

データセット

以下のデータセットのinstraction部分を入力として使っています。

ありがとうございます。

報酬モデル

報酬モデルについても自分で作るのは大変です。すでにあるものを使います。（とりあえず使うだけなので他のreward modelでも大丈夫です。)

日本語応答に対して適切に報酬を与えられるかわからないので、ここは他のモデルを選択する方がいい結果が得られると思います。

一応、OpenAssistantが公開している報酬モデルについては調査してみました。

環境

RTX3090

ライブラリ

pip install transformers
pip install accelerate
pip install peft
pip install trl
pip install datasets
pip install tqdm
pip install wandb # 必要な場合

コード全体

以下のコードをtrain.pyにコピペしてください。

学習はできますが、このままではしっかりとした学習ができるわけではないので調整が必要です。

また、トークナイザーなど（特殊トークンの関係とか)で間違えている場合もあるのでご了承ください。

# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import torch
from dataclasses import dataclass, field
from typing import Optional
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import (
    Adafactor,
    AutoTokenizer,
    HfArgumentParser,
    pipeline
)

from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, set_seed
from trl.core import LengthSampler

tqdm.pandas()


@dataclass
class ScriptArguments:
    """
    デフォルト値を変更するだけでいいかも
    """
    model_name: Optional[str] = field(default="meta-llama/Llama-2-7b-chat-hf", metadata={"help": "the model name"})
    tokenizer_name: Optional[str] = field(default="meta-llama/Llama-2-7b-chat-hf", metadata={"help": "the tokenizer name"})
    reward_model_name: Optional[str] = field(default="OpenAssistant/reward-model-deberta-v3-base", metadata={"help": "the reward model name"})
    dataset_name: Optional[str] = field(default="kunishou/hh-rlhf-49k-ja", metadata={"help": "the dataset name"})
    log_with: Optional[str] = field(default="wandb", metadata={"help": "use 'wandb' to log with wandb"})
    learning_rate: Optional[float] = field(default=1.41e-5, metadata={"help": "the learning rate"})
    max_length: Optional[int] = field(default=512, metadata={"help": "maximum length for input"})
    output_max_length: Optional[int] = field(default=128, metadata={"help": "maximum length for generation"})
    mini_batch_size: Optional[int] = field(default=1, metadata={"help": "the PPO minibatch size"})
    batch_size: Optional[int] = field(default=32, metadata={"help": "the batch size"})
    ppo_epochs: Optional[int] = field(default=4, metadata={"help": "the number of ppo epochs"})
    gradient_accumulation_steps: Optional[int] = field(
        default=4, metadata={"help": "the number of gradient accumulation steps"}
    )
    adafactor: Optional[bool] = field(default=False, metadata={"help": "whether to use the adafactor optimizer"})
    early_stopping: Optional[bool] = field(default=False, metadata={"help": "whether to early stop"})
    target_kl: Optional[float] = field(default=0.1, metadata={"help": "kl target for early stopping"})
    reward_baseline: Optional[float] = field(
        default=0.0,
        metadata={"help": "a baseline value that is subtracted from the reward"},
    )
    batched_gen: Optional[bool] = field(default=False, metadata={"help": "whether to use the batched text gen"})
    save_freq: Optional[int] = field(default=None, metadata={"help": "n steps to save the model"})
    output_dir: Optional[str] = field(default="./checkpoints/tuning_llama2_rl/",
                                      metadata={"help": "n steps to save the model"})
    seed: Optional[int] = field(default=0, metadata={"help": "the seed"})
    
    
def main():
    parser = HfArgumentParser(ScriptArguments)
    script_args: ScriptArguments = parser.parse_args_into_dataclasses()[0]

    set_seed(script_args.seed)

    # ここでデータセットを変換する
    def build_dataset(
            tokenizer, dataset_name
    ):

        train_dataset = load_dataset(dataset_name, split="train")
        original_columns = train_dataset.column_names
        num_proc = 24

        def preprocess_function(examples):
            new_examples = {
                "query": [],
                "input_ids": [],
            }
            for question in examples["instruction"]:
                query = "Question: " + question + "\n\nAnswer: "
                tokenized_question = tokenizer(query, truncation=True)
                new_examples["query"].append(query)
                new_examples["input_ids"].append(tokenized_question["input_ids"])

            return new_examples

        ds = train_dataset.map(
            preprocess_function,
            batched=True,
            num_proc=num_proc,
            remove_columns=original_columns,
        )
        ds = ds.filter(lambda x: len(x["input_ids"]) < script_args.max_length, batched=False)

        ds.set_format(type="torch")
        return ds


    def collator(data):
        return dict((key, [d[key] for d in data]) for key in data[0])

    reward_model_name = script_args.reward_model_name
    config = PPOConfig(
        model_name=script_args.model_name,
        learning_rate=script_args.learning_rate,
        log_with=script_args.log_with,
        batch_size=script_args.batch_size,
        mini_batch_size=script_args.mini_batch_size,
        gradient_accumulation_steps=script_args.gradient_accumulation_steps,
        optimize_cuda_cache=True,
        early_stopping=script_args.early_stopping,
        target_kl=script_args.target_kl,
        ppo_epochs=script_args.ppo_epochs,
        seed=script_args.seed,
    )

    rw_kwargs = {
        "return_all_scores": True,
        "function_to_apply": "none",
        "batch_size": 16,
        "truncation": True
    }

    tokenizer = AutoTokenizer.from_pretrained(script_args.model_name)
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.pad_token = tokenizer.eos_token

    dataset = build_dataset(tokenizer, script_args.dataset_name)

    current_device = Accelerator().local_process_index

    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = AutoModelForCausalLMWithValueHead.from_pretrained(
        config.model_name,
        load_in_8bit=True,
        device_map={"": current_device},
        peft_config=lora_config,
    )

    optimizer = None
    if script_args.adafactor:
        optimizer = Adafactor(
            filter(lambda p: p.requires_grad, model.parameters()),
            scale_parameter=False,
            relative_step=False,
            warmup_init=False,
            lr=config.learning_rate,
        )

    ppo_trainer = PPOTrainer(
        config,
        model,
        ref_model=None,
        tokenizer=tokenizer,
        dataset=dataset,
        data_collator=collator,
        optimizer=optimizer,
    )

    device = ppo_trainer.accelerator.device
    if ppo_trainer.accelerator.num_processes == 1:
        device = 0 if torch.cuda.is_available() else "cpu"

    reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name)
    reward_model = pipeline(
        "text-classification",
        model=reward_model_name,
        device_map={"": current_device},
        model_kwargs={"load_in_8bit": True},
        tokenizer=reward_tokenizer,
    )

    generation_kwargs = {
        # "min_length": -1,
        "top_k": 0.0,
        "top_p": 1.0,
        "do_sample": True,
        "pad_token_id": tokenizer.eos_token_id,
    }
    output_min_length = 32
    output_max_length = script_args.output_max_length
    output_length_sampler = LengthSampler(output_min_length, output_max_length)

    for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
        question_tensors = batch["input_ids"]

        response_tensors = ppo_trainer.generate(
            question_tensors,
            return_prompt=False,
            length_sampler=output_length_sampler,
            **generation_kwargs,
        )
        batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

        texts = [q + r for q, r in zip(batch["query"], batch["response"])]
        reward_outputs = reward_model(texts, **rw_kwargs)
        rewards = [torch.tensor(output[0]["score"] - script_args.reward_baseline) for output in reward_outputs]

        stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)

        if script_args.save_freq and epoch and epoch % script_args.save_freq == 0:
            ppo_trainer.save_pretrained(script_args.output_dir + f"step_{epoch}")

    ppo_trainer.save_pretrained(script_args.output_dir + f"step_{epoch}")

if __name__ == "__main__":
    main()

accelerateの設定が済んでいて、すぐ実行したい方は以下を実行してください。

accelerate launch train.py

コード解説

ScriptArguments

以下のように使ってください。

accelerate launch --multi_gpu --num_machines 1 \
    train.py \
    --log_with wandb \
    --save_freq 10 \

スクロールできます

引数名	デフォルト値	役割
model_name	“meta-llama/Llama-2-7b-chat-hf”	使用するCasual LMモデルの名前
tokenizer_name	“meta-llama/Llama-2-7b-chat-hf”	使用するトークナイザーの名前
reward_model_name	“OpenAssistant/reward-model-deberta-v3-base”	使用する報酬モデルの名前
dataset_name	“kunishou/hh-rlhf-49k-ja”	使用するデータセットの名前
log_with	“wandb”	ロギングに使用するツール（’wandb’であればWeights & Biases）
learning_rate	1.41e-5	学習率
max_length	512	入力の最大長
output_max_length	128	生成するテキストの最大長
mini_batch_size	1	PPOのミニバッチサイズ
batch_size	32	バッチサイズ
ppo_epochs	4	PPOのエポック数
gradient_accumulation_steps	4	勾配累積ステップ数
adafactor	False	Adafactorオプティマイザーの使用有無
early_stopping	False	早期停止の使用有無
target_kl	0.1	早期停止のためのKLダイバージェンスの目標値
reward_baseline	0.0	報酬から差し引かれるベースライン値
batched_gen	False	バッチ単位でのテキスト生成の使用有無
save_freq	None	モデルを保存するステップ数
output_dir	“./checkpoints/tuning_llama_rl/”	モデルを保存するディレクトリ
seed	0	乱数のシード

build_dataset

build_datasetでデータセットの形を整えています。

また、以下の部分で入力文章を調整できます。

        def preprocess_function(examples):
            new_examples = {
                "query": [],
                "input_ids": [],
            }
            for question in examples["instruction"]:
                query = "Question: " + question + "\n\nAnswer: "
                tokenized_question = tokenizer(query, truncation=True)
                new_examples["query"].append(query)
                new_examples["input_ids"].append(tokenized_question["input_ids"])

            return new_examples

下記のような感じになります。

Question:あなたの名前は？

Answer:

ここはHuman:～Assistant:のようにしても大丈夫です。

結果

wandbを使用すると学習状況が簡単に分かります。

他にも色々とみることができます。

感想

やっぱり難しいなという印象です。

報酬モデル自体の調整も必要で、今回使った報酬モデルは私の評価とあまり合致していないと感じました。

一方で、とりあえずは入力文だけ用意すればいいので、労力は教師ありファインチューニングよりは少なそうだと思ったので色々試していきます。

この記事が気に入ったら
フォローしてね！

Follow @note_zan_

よかったらシェアしてね！

URLをコピーしました！

マルチモーダルAIを学ぶ上で読んでほしい書籍『深層学習からマルチモーダル情報処理へ』

「BERT/GPT-3/DALL-E 自然言語処理・画像処理・音声処理人工知能プログラミング実践入門」を読んで

「Python実践100本ノックシリーズ」を読んで｜どんな人におすすめなのか

【Llama-2】8bit+LoRAでRLHFファインチューニングを試す【学習できることは確認】

学習モデル

データセット

報酬モデル

環境

ライブラリ

コード全体

コード解説

ScriptArguments

build_dataset

結果

感想

コメント

コメントするコメントをキャンセル

マルチモーダルAIを学ぶ上で読んでほしい書籍『深層学習からマルチモーダル情報処理へ』

「BERT/GPT-3/DALL-E 自然言語処理・画像処理・音声処理人工知能プログラミング実践入門」を読んで

「Python実践100本ノックシリーズ」を読んで｜どんな人におすすめなのか

マルチモーダルAIを学ぶ上で読んでほしい書籍『深層学習からマルチモーダル情報処理へ』

「BERT/GPT-3/DALL-E 自然言語処理・画像処理・音声処理 人工知能プログラミング実践入門」を読んで

「Python実践100本ノックシリーズ」を読んで｜どんな人におすすめなのか

【Llama-2】8bit+LoRAでRLHFファインチューニングを試す【学習できることは確認】

学習モデル

データセット

報酬モデル

環境

ライブラリ

コード全体

コード解説

ScriptArguments

build_dataset

結果

感想

コメント

コメントする コメントをキャンセル

マルチモーダルAIを学ぶ上で読んでほしい書籍『深層学習からマルチモーダル情報処理へ』

「BERT/GPT-3/DALL-E 自然言語処理・画像処理・音声処理 人工知能プログラミング実践入門」を読んで

「Python実践100本ノックシリーズ」を読んで｜どんな人におすすめなのか

「BERT/GPT-3/DALL-E 自然言語処理・画像処理・音声処理人工知能プログラミング実践入門」を読んで

コメントするコメントをキャンセル

「BERT/GPT-3/DALL-E 自然言語処理・画像処理・音声処理人工知能プログラミング実践入門」を読んで