
Table of Contents
FSDP Q-Lora background knowledge
Set up the development environment
Creating and loading data sets
Fine-tuning the LLM with PyTorch FSDP, Q-Lora, and SDPA
Optional step: merge the LoRA adapter into the original model
Test the model and perform inference

For only $250, Hugging Face's technical director teaches you how to fine-tune Llama 3 step by step

May 06, 2024 pm 03:52 PM


We are by now familiar with open-source large language models such as Meta's Llama 3, Mistral AI's Mistral and Mixtral models, and AI21 Labs' Jamba, which have become competitors to OpenAI.

In most cases, users need to fine-tune these open source models based on their own data to fully unleash the model's potential.

Fine-tuning a smaller large language model (such as Mistral) with Q-LoRA on a single GPU is not difficult, but efficiently fine-tuning a large model like Llama 3 70B or Mixtral has remained a challenge until now.

Therefore, Philipp Schmid, technical director at Hugging Face, explains how to fine-tune Llama 3 using PyTorch FSDP and Q-Lora, with the help of Hugging Face's TRL, Transformers, peft, and datasets libraries. In addition to FSDP, the author also uses Flash Attention v2 through PyTorch SDPA, available since the PyTorch 2.2 update.

The main steps for fine-tuning are as follows:

  • Set up the development environment
  • Create and load the data set
  • Fine-tune the large language model with PyTorch FSDP, Q-Lora, and SDPA
  • Test the model and perform inference

Please note: the experiments in this article were created and verified on NVIDIA H100 and NVIDIA A10G GPUs. The configuration files and code are optimized for 4x A10G GPUs, each with 24GB of memory. If you have more computing power, the configuration file (yaml file) mentioned in step 3 needs to be modified accordingly.

FSDP Q-Lora background knowledge

Building on a collaborative project between Answer.AI, Q-LoRA creator Tim Dettmers, and Hugging Face, the author summarizes the background of Q-LoRA and PyTorch FSDP (Fully Sharded Data Parallelism) and the technical support they provide.

The combination of FSDP and Q-LoRA lets users fine-tune Llama 2 70B or Mixtral 8x7B on 2 consumer-grade GPUs (24GB). For details, please refer to the article below. Hugging Face's PEFT library plays a vital role here.

Article address: https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html

PyTorch FSDP is a data/model parallelism technique that splits a model across GPUs, reducing memory requirements and enabling larger models to be trained more efficiently. Q-LoRA is a fine-tuning method that leverages quantization and low-rank adapters to efficiently reduce computational requirements and memory footprint.
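
To make these two ideas concrete, here is a minimal, hypothetical sketch of the two configuration objects a Q-LoRA setup typically combines: a 4-bit BitsAndBytesConfig for quantization and a LoraConfig for the low-rank adapters. The rank, alpha, and dropout values are illustrative assumptions, not the author's exact settings.

import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization config (the "Q" in Q-LoRA); storing quantized weights in bf16 lets FSDP shard them
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

# Low-rank adapter config (the "LoRA" part); r/alpha/dropout are illustrative values
peft_config = LoraConfig(
    r=16,
    lora_alpha=8,
    lora_dropout=0.05,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

Only the small adapter weights defined by peft_config are trained; the quantized base model stays frozen, which is what keeps memory and compute requirements low.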

Set up the development environment

The first step is to install the Hugging Face libraries and PyTorch, including trl, transformers, and datasets. trl is a new library built on top of transformers and datasets that makes fine-tuning, RLHF, and alignment of open-source large language models easier.

# Install Pytorch for FSDP and FA/SDPA
%pip install "torch==2.2.2" tensorboard

# Install Hugging Face libraries
%pip install --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"

Next, log in to Hugging Face to get the Llama 3 70b model.
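
Because the Meta-Llama-3 weights are gated, you need to authenticate with a Hugging Face access token before they can be downloaded. A minimal sketch (the token string is a placeholder):

from huggingface_hub import login

# Log in with your personal access token, or run `huggingface-cli login` in a terminal
login(token="hf_...", add_to_git_credential=True)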

Creating and loading data sets

After the environment is set up, we can start creating and preparing the dataset. The fine-tuning dataset should contain samples of the task the user wants to solve. Read How to fine-tune LLMs with Hugging Face in 2024 to learn more about creating the dataset.

Article address: https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#3-create-and-prepare-the-dataset

The author used the HuggingFaceH4/no_robots dataset, a high-quality dataset of 10,000 instructions and samples with high-quality annotations. This data can be used for supervised fine-tuning (SFT) to make language models follow human instructions better. The no_robots dataset is modeled after the human instruction dataset described in OpenAI's InstructGPT paper and consists primarily of single-turn instructions.

{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

The 10,000 samples in the no_robots dataset are split into 9,500 training samples and 500 test samples, some of which do not contain a system message. The author used the datasets library to load the dataset, added the missing system messages, and saved the splits to separate json files. The sample code looks like this:

from datasets import load_dataset

# Convert dataset to OAI messages
system_message = """You are Llama, an AI assistant created by Philipp to be helpful and honest. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

def create_conversation(sample):
    if sample["messages"][0]["role"] == "system":
        return sample
    else:
        sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
        return sample

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/no_robots")

# Add system message to each conversation
columns_to_remove = list(dataset["train"].features)
columns_to_remove.remove("messages")
dataset = dataset.map(create_conversation, remove_columns=columns_to_remove, batched=False)

# Filter out conversations which are corrupted with wrong turns, keep which have even number of turns after adding system message
dataset["train"] = dataset["train"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)
dataset["test"] = dataset["test"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)

# Save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records", force_ascii=False)
dataset["test"].to_json("test_dataset.json", orient="records", force_ascii=False)

Fine-tuning the LLM with PyTorch FSDP, Q-Lora, and SDPA

Next, the large language model is fine-tuned with PyTorch FSDP, Q-Lora, and SDPA. Because the model is trained across distributed devices, training has to be launched with torchrun and a Python script.

The author wrote the run_fsdp_qlora.py script, which loads the dataset from disk, initializes the model and tokenizer, and starts training. It uses the SFTTrainer from the trl library to fine-tune the model.

SFTTrainer makes supervised fine-tuning of open-source large language models much easier to get started with; specifically, it offers the following (a rough sketch of how these pieces fit together follows the list):

  • Dataset formatting, including formatting multi-turn conversations and instructions (used)
  • Training on completions only, ignoring prompt-only content (not used)
  • Packing the dataset for more efficient training (used)
  • Support for parameter-efficient fine-tuning techniques, including Q-LoRA (used)
  • Preparing the model and tokenizer for conversational fine-tuning (not used, see below)
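
The following is a condensed, hypothetical sketch of how a script like run_fsdp_qlora.py might wire these pieces together with SFTTrainer from trl 0.8. It is not the author's actual script: the LoRA hyperparameters are illustrative, chat-template handling is omitted, and the FSDP-specific settings come from the yaml config and torchrun launch shown later.

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-70b"

# Load the json files created in the previous section
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"

# 4-bit quantization (Q-LoRA); bf16 quant storage so FSDP can shard the weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="sdpa",   # Flash Attention through PyTorch SDPA
    torch_dtype=torch.bfloat16,
    use_cache=False,              # not compatible with gradient checkpointing
)

# LoRA adapter; rank/alpha/dropout are illustrative, not the author's exact values
peft_config = LoraConfig(
    r=16, lora_alpha=8, lora_dropout=0.05, bias="none",
    target_modules="all-linear", task_type="CAUSAL_LM",
)

# A few of the hyperparameters from the yaml config shown below
training_args = TrainingArguments(
    output_dir="./llama-3-70b-hf-no-robot",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    bf16=True,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=peft_config,          # parameter-efficient fine-tuning (Q-LoRA)
    max_seq_length=3072,
    tokenizer=tokenizer,
    packing=True,                     # pack short samples into full-length sequences
    dataset_kwargs={"add_special_tokens": False, "append_concat_token": False},
)
trainer.train()
trainer.save_model()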

Note: the author uses a chat template similar to Anthropic/Vicuna, with "user" and "assistant" roles. This is done because the special tokens in the base Llama 3 model (<|begin_of_text|> and <|reserved_special_token_XX|>) are not trained.

This means that if you want to use these tokens in the template, you also need to train them and update the embedding layer and lm_head, which creates additional memory requirements. If you have more compute available, you can modify the LLAMA_3_CHAT_TEMPLATE variable in the run_fsdp_qlora.py script.
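
For illustration, a template in that Anthropic/Vicuna style might look like the following Jinja string. This is a hypothetical example; the actual LLAMA_3_CHAT_TEMPLATE defined in run_fsdp_qlora.py may differ.

from transformers import AutoTokenizer

# Hypothetical Anthropic/Vicuna-style template that only relies on trained tokens
LLAMA_3_CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}{{ message['content'] }}"
    "{% elif message['role'] == 'user' %}{{ '\n\nHuman: ' + message['content'] + eos_token }}"
    "{% elif message['role'] == 'assistant' %}{{ '\n\nAssistant: ' + message['content'] + eos_token }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '\n\nAssistant: ' }}{% endif %}"
)

# The template is attached to the tokenizer so apply_chat_template can use it
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70b")
tokenizer.chat_template = LLAMA_3_CHAT_TEMPLATE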

For the configuration parameters, the author uses the new TrlParser, which lets us provide hyperparameters in a yaml file or override values in the configuration file by passing arguments explicitly on the CLI, for example --num_epochs 10. Below is the configuration file for fine-tuning Llama 3 70B on 4x A10G GPUs (4x 24GB).

%%writefile llama_3_70b_fsdp_qlora.yaml
# script parameters
model_id: "meta-llama/Meta-Llama-3-70b" # Hugging Face model id
dataset_path: "."                       # path to dataset
max_seq_len: 3072 # 2048                # max sequence length for model and packing of the dataset
# training parameters
output_dir: "./llama-3-70b-hf-no-robot" # Temporary output directory for model checkpoints
report_to: "tensorboard"                # report metrics to tensorboard
learning_rate: 0.0002                   # learning rate 2e-4
lr_scheduler_type: "constant"           # learning rate scheduler
num_train_epochs: 3                     # number of training epochs
per_device_train_batch_size: 1          # batch size per device during training
per_device_eval_batch_size: 1           # batch size for evaluation
gradient_accumulation_steps: 2          # number of steps before performing a backward/update pass
optim: adamw_torch                      # use torch adamw optimizer
logging_steps: 10                       # log every 10 steps
save_strategy: epoch                    # save checkpoint every epoch
evaluation_strategy: epoch              # evaluate every epoch
max_grad_norm: 0.3                      # max gradient norm
warmup_ratio: 0.03                      # warmup ratio
bf16: true                              # use bfloat16 precision
tf32: true                              # use tf32 precision
gradient_checkpointing: true            # use gradient checkpointing to save memory
# FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap offload"    # remove offload if enough GPU memory
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
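
For reference, parsing this yaml file together with CLI overrides could look roughly like the sketch below. The ScriptArguments dataclass and the import path of TrlParser are assumptions (the class lives in trl.commands.cli_utils in trl 0.8.x) and may not match the author's script exactly.

from dataclasses import dataclass, field

from transformers import TrainingArguments
from trl.commands.cli_utils import TrlParser

@dataclass
class ScriptArguments:
    # Hypothetical script-level parameters mirroring the yaml above
    model_id: str = field(default="meta-llama/Meta-Llama-3-70b", metadata={"help": "Hugging Face model id"})
    dataset_path: str = field(default=".", metadata={"help": "path to dataset"})
    max_seq_len: int = field(default=3072, metadata={"help": "max sequence length"})

if __name__ == "__main__":
    # Values come from --config llama_3_70b_fsdp_qlora.yaml and can be overridden
    # on the command line, e.g. --num_train_epochs 10
    parser = TrlParser((ScriptArguments, TrainingArguments))
    script_args, training_args = parser.parse_args_and_config()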

Note: at the end of training, GPU memory usage rises slightly (by about 10%) because of the overhead of saving the model. Make sure enough memory is left on the GPU to save the model.

To launch training, the author uses torchrun, which keeps the setup flexible and easy to adapt to environments such as Amazon SageMaker and Google Cloud Vertex AI.

For torchrun and FSDP, the environment variables ACCELERATE_USE_FSDP and FSDP_CPU_RAM_EFFICIENT_LOADING need to be set to tell transformers/accelerate to use FSDP and to load the model in a memory-efficient way.

Note: if you do not want to use CPU offloading, change the fsdp setting (for example, "full_shard auto_wrap" instead of "full_shard auto_wrap offload"). This only works on GPUs with more than 40GB of memory.

Training is launched with the following command:

!ACCELERATE_USE_FSDP=1 FSDP_CPU_RAM_EFFICIENT_LOADING=1 torchrun --nproc_per_node=4 ./scripts/run_fsdp_qlora.py --config llama_3_70b_fsdp_qlora.yaml

Expected memory usage:

  • Full fine-tuning with FSDP requires about 16 GPUs with 80GB of memory each
  • FSDP + LoRA requires about 8 GPUs with 80GB of memory each
  • FSDP + Q-Lora requires about 2 GPUs with 40GB of memory each
  • FSDP + Q-Lora + CPU offloading requires 4 GPUs with 24GB of memory each, using about 22GB per GPU and 127GB of CPU RAM, with a sequence length of 3072 and a batch size of 1

On a g5.12xlarge instance, using the dataset of 10,000 samples, training Llama 3 70B with Flash Attention for 3 epochs takes about 45 hours. At $5.67 per hour, the total cost comes to $255.15. That sounds expensive, but it lets you fine-tune Llama 3 70B on relatively small GPU resources.

If we scale the training up to 4x H100 GPUs, the training time shrinks to roughly 1.25 hours. Assuming an H100 costs $5-10 per hour, the total cost would be between $25 and $50.

There is a trade-off between accessibility and performance here. With access to more and better compute, training time and cost go down, but even with modest resources Llama 3 70B can be fine-tuned. With 4x A10G GPUs, the model has to be offloaded to the CPU, which lowers the overall FLOPS, so cost and performance differ accordingly.

Note: during the author's evaluation and testing, he noticed that roughly 40 max steps (packing 80 samples into sequences of length 3k) were enough to get first results. Training for 40 steps takes about 1 hour and costs roughly $5.

Optional step: merge the LoRA adapter into the original model

When using Q-LoRA, only the adapter is trained rather than the full model. This means that when the model is saved during training, only the adapter weights are stored, not the complete model.

If you want to save the full model, which makes it easier to use with serving stacks such as Text Generation Inference, you can merge the adapter weights into the model weights with the merge_and_unload method and then save the model with save_pretrained. This saves a standard model that can be used for inference.

Note: this requires more than 192GB of CPU memory.

#### COMMENT IN TO MERGE PEFT AND BASE MODEL ####
# from peft import AutoPeftModelForCausalLM

# # Load PEFT model on CPU
# model = AutoPeftModelForCausalLM.from_pretrained(
#     args.output_dir,
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
# )
# # Merge LoRA and base model and save
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained(args.output_dir, safe_serialization=True, max_shard_size="2GB")

Test the model and perform inference

After training, the model is evaluated and tested. The author loads different samples from the original dataset and evaluates the model on them manually. Evaluating generative AI models is not trivial, since one input can have multiple correct outputs. Read "Evaluate LLMs and RAG, a practical example using Langchain and Hugging Face" to learn more about evaluating generative models.

Article address: https://www.philschmid.de/evaluate-llm

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

peft_model_id = "./llama-3-70b-hf-no-robot"

# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_id,
    torch_dtype=torch.float16,
    quantization_config={"load_in_4bit": True},
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

Next, load the test dataset and try generating a response for a sample instruction.

from datasets import load_dataset
from random import randint

# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset))
messages = eval_dataset[rand_idx]["messages"][:2]

# Test on sample
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]

print(f"**Query:**\n{eval_dataset[rand_idx]['messages'][1]['content']}\n")
print(f"**Original Answer:**\n{eval_dataset[rand_idx]['messages'][2]['content']}\n")
print(f"**Generated Answer:**\n{tokenizer.decode(response, skip_special_tokens=True)}")

# **Query:**
# How long was the Revolutionary War?
# **Original Answer:**
# The American Revolutionary War lasted just over seven years. The war started on April 19, 1775, and ended on September 3, 1783.
# **Generated Answer:**
# The Revolutionary War, also known as the American Revolution, was an 18th-century war fought between the Kingdom of Great Britain and the Thirteen Colonies. The war lasted from 1775 to 1783.

That covers the main workflow. Now it is your turn: start with step one and try it yourself.
