Llama3 by hand, layer 1: Implementing Llama3 from scratch
Jun 01, 2024, 05:45 PM
1. Architecture of Llama3
In this series of articles, we implement llama3 from scratch.
The overall architecture of Llama3:
The model parameters of Llama3:
Let's take a look at the actual values of these parameters in the Llama 3 model.
[1] Context window (context-window)
When instantiating the LlaMa class, the variable max_seq_len defines the context window. There are other parameters in the class, but this one relates most directly to the transformer model. The max_seq_len here is 8K.
[2] Vocabulary-size and Attention Layers
The Transformer class is a model defined by its vocabulary and number of layers. Vocabulary here refers to the set of words (and tokens) that the model is able to recognize and process. Attention layers refer to the transformer blocks (a combination of attention and feed-forward layers) used in the model.
Based on these numbers, LlaMa 3 has a vocabulary of 128K, which is quite large. Furthermore, it has 32 transformer blocks.
[3] Feature-dimension and attention-heads
Feature-dimension and attention-heads come into play in the self-attention module. The feature dimension is the size of each token's vector in the embedding space (i.e. the dimensionality of the input data or embedding vector), while the attention heads contain the QK modules that drive the self-attention mechanism in the transformer.
[4] Hidden Dimensions
Hidden dimensions refer to the size of the hidden layer in the feed-forward network (Feed Forward). Feed-forward networks usually contain one or more hidden layers, and the dimensions of these hidden layers determine the capacity and complexity of the network. In the Transformer model, the hidden dimension of the feed-forward network is usually a multiple of the feature dimension, to increase the representational ability of the model. In Llama3, the hidden dimension is 1.3 times the feature dimension. Note that hidden layers and hidden dimensions are two different concepts.
A larger hidden dimension allows the network to create and manipulate richer representations internally before projecting them back down to the smaller output dimension.
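To make the numbers concrete, here is a minimal sketch of a feed-forward block using the sizes quoted above (17 tokens of width 4096, hidden size round(4096 × 1.3) ≈ 5325). The weights are random, and a plain ReLU MLP is used purely for illustration; the real Llama3 FFN is a gated (SwiGLU-style) variant.

import torch

dim, multiplier, seq_len = 4096, 1.3, 17
hidden_dim = round(dim * multiplier)   # 5325, the hidden size quoted above

w_up = torch.randn(hidden_dim, dim)    # project 4096 -> 5325
w_down = torch.randn(dim, hidden_dim)  # project 5325 -> back to 4096

x = torch.randn(seq_len, dim)          # 17 token vectors of width 4096
h = torch.relu(x @ w_up.T)             # [17, 5325]
y = h @ w_down.T                       # [17, 4096]
print(h.shape, y.shape)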
[5] Combine the above parameters into Transformer
The first matrix is the input feature matrix, which is processed by the attention layer to generate attention-weighted features. In the illustration the input feature matrix is only 5 x 3 in size, but in the real Llama 3 model it grows to 8K x 4096, which is huge.
Next are the hidden layers in the Feed-Forward Network, growing to 5325 and then falling back to 4096 in the last layer.
[6] Multi-layer Transformer block
Llama 3 stacks 32 of the above transformer blocks; the output of one block is passed to the next until the last one is reached.
[7] Putting it all together
Once all the above parts are in place, it is time to put them together and see how they produce the Llama effect.
Step 1: First we have our input matrix of size 8K (context window) x 128K (vocabulary size). This matrix undergoes an embedding process to convert this high-dimensional matrix into a low-dimensional one.
Step 2: In this case, this low-dimensional result becomes 4096, which is the specified dimension of the features in the LlaMa model we saw earlier.
In neural networks, dimensionality enhancement and dimensionality reduction are common operations, and they each have different purposes and effects.
Dimensionality increase is usually to increase the capacity of the model so that it can capture more complex features and patterns. When the input data is mapped into a higher dimensional space, different feature combinations can be more easily distinguished by the model. This is especially useful when dealing with non-linear problems, as it can help the model learn more complex decision boundaries.
Dimensionality reduction is to reduce the complexity of the model and the risk of overfitting. By reducing the dimensionality of the feature space, the model can be forced to learn more refined and generalized feature representations. In addition, dimensionality reduction can be used as a regularization method to help improve the generalization ability of the model. In some cases, dimensionality reduction can also reduce computational costs and improve model operating efficiency.
In practical applications, the strategy of dimensionality increase and then dimensionality reduction can be regarded as a process of feature extraction and transformation. In this process, the model first explores the intrinsic structure of the data by increasing the dimensionality, and then extracts the most useful features and patterns by reducing the dimensionality. This method can help the model avoid overfitting to the training data while maintaining sufficient complexity.
Step 3: This feature is processed through the Transformer block, first by the Attention layer, and then by the FFN layer. The Attention layer processes across features horizontally, while the FFN layer processes across dimensions vertically.
Step 4: Step 3 is repeated for all 32 layers of transformer blocks. The resulting matrix has the same feature dimension as before.
Step 5: Finally, this matrix is converted back to the original vocabulary size, 128K, so that the model can select and map the words available in the vocabulary.
This is how LlaMa 3 scores high on those benchmarks and creates the LlaMa 3 effect.
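The five steps above can be summarized in a short, shape-only sketch with random weights, where nn.Identity stands in for the 32 transformer blocks. This is an illustration of the tensor shapes only, not the real Llama3 code, and it allocates a few GB of random parameters (shrink the sizes if memory is tight).

import torch
import torch.nn as nn

vocab_size, dim, n_layers, seq_len = 128256, 4096, 32, 17

embedding = nn.Embedding(vocab_size, dim)                         # steps 1-2: tokens -> 4096-dim features
blocks = nn.ModuleList([nn.Identity() for _ in range(n_layers)])  # steps 3-4: 32 shape-preserving blocks
output_projection = nn.Linear(dim, vocab_size, bias=False)        # step 5: back to vocabulary size

tokens = torch.randint(0, vocab_size, (seq_len,))
x = embedding(tokens)              # [17, 4096]
for block in blocks:
    x = block(x)                   # the real blocks apply attention + feed-forward here
logits = output_projection(x)      # [17, 128256] -- one score per vocabulary entry per token
print(x.shape, logits.shape)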
Let us briefly summarize several easily confused terms:
1. max_seq_len (maximum sequence length)
This is the maximum number of tokens the model can accept in a single pass.
In the Llama 3-8B model, this parameter is set to 8,192 tokens, that is, Context Window Size = 8K. This means that the maximum number of tokens the model can consider in a single pass is 8,192. This is critical for understanding long texts or maintaining the context of long conversations.
2. Vocabulary-size (vocabulary)
This is the number of distinct tokens that the model can recognize, covering all possible words, punctuation, and special characters. The model's vocabulary is about 128K tokens (128,256 exactly, as the config file shows later), expressed as Vocabulary-size = 128K.
3. Attention Layers
A main component in the Transformer model. It is mainly responsible for processing input data by learning which parts of the input data are most important (i.e. which tokens are "attended"). A model may have multiple such layers, each trying to understand the input data from a different perspective.
The LlaMa 3-8B model contains 32 processing layers, that is, Number of Layers = 32. These layers include multiple Attention Layers and other types of network layers, each of which processes and understands the input data from a different perspective.
4. transformer block
A module containing multiple different layers, usually at least one Attention Layer and one Feed-Forward Network. A model can have multiple transformer blocks. These blocks are connected sequentially, and the output of each block is the input of the next. A transformer block can also be called a decoder layer.
In the context of the Transformer model, usually we say that the model has "32 layers", which can be equivalent to saying that the model has "32 Transformer blocks". Each Transformer block usually contains a self-attention layer and a feed-forward neural network layer. These two sub-layers together form a complete processing unit or "layer".
Therefore, when we say that the model has 32 Transformer blocks, we are describing a model composed of 32 such processing units, each of which can perform both self-attention and feed-forward processing of the data. This emphasizes the hierarchical structure of the model and its processing capability at each level.
In summary, "32 layers" and "32 Transformer blocks" are essentially synonymous when describing the Transformer model structure. Both mean that the model contains 32 independent data-processing stages, each of which includes self-attention and feed-forward network operations.
5. Feature-dimension (feature dimension)
This is the dimension of each vector when the input token is represented as a vector in the model.
Each token is converted into a vector containing 4096 features in the model, that is, Feature-dimension = 4096. This high dimension enables the model to capture richer semantic information and contextual relationships.
6. Attention-Heads
In each Attention Layer, there can be multiple Attention-Heads, and each head independently analyzes the input data from different perspectives.
Each Attention Layer contains 32 independent Attention Heads, that is, Number of Attention Heads = 32. These heads analyze input data from different aspects and jointly provide more comprehensive data analysis capabilities.
7. Hidden Dimensions
This usually refers to the width of the layer in the Feed-Forward Network, that is, the number of neurons in each layer. Typically, Hidden Dimensions will be larger than Feature-dimension, which allows the model to create a richer data representation internally.
In Feed-Forward Networks, the dimension of the hidden layer is 5325, that is, Hidden Dimensions = 5325. This is larger than the feature dimension, allowing the model to perform deeper feature translation and learning between internal layers.
Relationships and values:
Relationship between Attention Layers and Attention-Heads: Each Attention Layer can contain multiple Attention-Heads.
Numerical relationship: A model may have multiple transformer blocks, each block contains an Attention Layer and one or more other layers. Each Attention Layer may have multiple Attention-Heads. In this way, the entire model performs complex data processing in different layers and heads.
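For reference, the hyperparameters summarized above can be collected in one place. The following dataclass is a hypothetical container (not part of the official Llama3 code); the numeric values match the params.json printed in section 2, plus the 8K context window discussed above.

from dataclasses import dataclass

@dataclass
class Llama3Config:                    # hypothetical name, for illustration only
    max_seq_len: int = 8192            # context window (8K)
    vocab_size: int = 128256           # vocabulary size (~128K)
    n_layers: int = 32                 # transformer blocks
    n_heads: int = 32                  # attention heads per layer
    n_kv_heads: int = 8                # key/value heads (grouped-query attention)
    dim: int = 4096                    # feature dimension
    ffn_dim_multiplier: float = 1.3    # feed-forward hidden-size multiplier
    norm_eps: float = 1e-5             # epsilon used by RMS normalization
    rope_theta: float = 500000.0       # RoPE base frequency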
Official download link for the Llama3 model: https://llama.meta.com/llama-downloads/
2. View the model
The following code shows how to use the tiktoken library to load and use a Byte Pair Encoding (BPE) based tokenizer. This tokenizer is designed to process text data, especially for use in natural language processing and machine learning models.
We feed in "hello world!" and see how the tokenizer splits it.
from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe
import torch
import json
import matplotlib.pyplot as plt

tokenizer_path = "Meta-Llama-3-8B/tokenizer.model"
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",  # end of turn
] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]
mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)
tokenizer.decode(tokenizer.encode("hello world!"))
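As a quick sanity check (a hypothetical example using the tokenizer built above), you can round-trip a string and look up a special token; the <|begin_of_text|> id of 128000 follows from the len(mergeable_ranks) + i mapping used for special_tokens.

ids = tokenizer.encode("hello world!")
print(ids)                            # a short list of integer token ids
print(tokenizer.decode(ids))          # "hello world!"
print(tokenizer.encode("<|begin_of_text|>", allowed_special="all"))  # [128000]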
model = torch.load("Meta-Llama-3-8B/consolidated.00.pth")
print(json.dumps(list(model.keys())[:20], indent=4))
- "tok_embeddings.weight": This means that the model has a word embedding layer that is used to Input words (or more generally, tokens) are converted into fixed-dimensional vectors. This is the first step in most natural language processing models.
- "layers.0.attention..." and "layers.1.attention...": These parameters represent multiple layers, each layer containing an attention mechanism module. In this module, wq, wk, wv, and wo represent the weight matrices of query, key, value, and output respectively. This is the core component of the Transformer model and is used to capture the relationship between different parts of the input sequence.
- "layers.0.feed_forward..." and "layers.1.feed_forward...": These parameters indicate that each layer also contains a feed forward network (Feed Forward Network), which usually consists of two It consists of a linear transformation with a nonlinear activation function in the middle. w1, w2, and w3 may represent the weights of different linear layers in this feedforward network.
- "layers.0.attention_norm.weight" and "layers.1.attention_norm.weight": These parameters indicate that there is a normalization layer (possibly Layer Normalization) behind the attention module in each layer. , used to stabilize the training process.
- "layers.0.ffn_norm.weight" and "layers.1.ffn_norm.weight": These parameters indicate that there is also a normalization layer behind the feedforward network. The output content of the above code is the same as the picture below, which is a transformer block in Llama3.
with open("Meta-Llama-3-8B/params.json", "r") as f:
    config = json.load(f)
config
- 'dim': 4096 - the hidden-layer or feature dimension of the model, i.e. the size of each vector the model works with.
- 'n_layers': 32 - the number of layers in the model. In a Transformer-based model, this usually refers to the number of layers in the encoder and decoder.
- 'n_heads': 32 - the number of heads in the self-attention mechanism. Multi-head attention is one of the key features of the Transformer; it lets the model capture information in parallel in different representation subspaces.
- 'n_kv_heads': 8 - not a standard Transformer setting; it is the number of heads used for the keys (Key) and values (Value) in the attention mechanism (grouped-query attention).
- 'vocab_size': 128256 - the size of the vocabulary used by the model, i.e. the total number of distinct words or tokens the model can recognize.
- 'multiple_of': 1024 - likely means that certain model dimensions must be multiples of 1024, to keep the model structure aligned or optimized.
- 'ffn_dim_multiplier': 1.3 - the dimension multiplier for the feed-forward network (FFN). In the Transformer, the FFN follows each attention layer, and this multiplier adjusts its size.
- 'norm_eps': 1e-05 - the epsilon used in the normalization layers to prevent division by zero; a small numerical-stability trick.
- 'rope_theta': 500000.0 - not a standard Transformer setting; it is the base (theta) of the rotary position embedding (RoPE).
We use this configuration to infer details of the model, for example:
- the model has 32 Transformer layers
- each multi-head attention block has 32 heads
- the vocabulary size, and so on
dim = config["dim"]
n_layers = config["n_layers"]
n_heads = config["n_heads"]
n_kv_heads = config["n_kv_heads"]
vocab_size = config["vocab_size"]
multiple_of = config["multiple_of"]
ffn_dim_multiplier = config["ffn_dim_multiplier"]
norm_eps = config["norm_eps"]
rope_theta = torch.tensor(config["rope_theta"])
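A few quantities follow directly from these values (a sketch; the 1.3x feed-forward estimate is the rule of thumb used in this article, not necessarily the exact formula used to build the released checkpoints):

head_dim = dim // n_heads                            # 4096 // 32 = 128 dims per head
kv_group_size = n_heads // n_kv_heads                # 32 // 8 = 4 query heads share each kv head
approx_ffn_hidden = round(dim * ffn_dim_multiplier)  # ~5325, as quoted earlier
print(head_dim, kv_group_size, approx_ffn_hidden)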
Converting text to tokens
The code is as follows:
prompt = "the answer to the ultimate question of life, the universe, and everything is "tokens = [128000] + tokenizer.encode(prompt)print(tokens)tokens = torch.tensor(tokens)prompt_split_as_tokens = [tokenizer.decode([token.item()]) for token in tokens]print(prompt_split_as_tokens)
[128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]
['<|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']
Converting tokens to their embedding representations
At this point, our [17x1] tokens become [17x4096], i.e. 17 embeddings (one per token), each of length 4096.
The token list printed above confirms that the input sentence consists of 17 tokens.
The code is as follows:
embedding_layer = torch.nn.Embedding(vocab_size, dim)
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)
token_embeddings_unnormalized.shape
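An equivalent and slightly more direct way (a sketch assuming the same model dict and tokens tensor defined above) is to index the pretrained embedding matrix directly instead of copying it into an nn.Embedding:

# index the [128256, 4096] embedding matrix with the 17 token ids
token_embeddings_unnormalized = model["tok_embeddings.weight"][tokens].to(torch.bfloat16)
print(token_embeddings_unnormalized.shape)   # torch.Size([17, 4096])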
3. Building the first layer of the Transformer
We next normalize the embeddings with RMS normalization; in the architecture diagram this is the normalization step that precedes the attention block.
The formula used is as follows:
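In symbols (matching the rms_norm code below, where x is one token's vector, d = 4096 is its dimension, w is the norm_weights vector, and ε is norm_eps):

$$\mathrm{rms\_norm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}} \cdot w_i$$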
The code is as follows:
# def rms_norm(tensor, norm_weights):
#     rms = (tensor.pow(2).mean(-1, keepdim=True) + norm_eps)**0.5
#     return tensor * (norm_weights / rms)
def rms_norm(tensor, norm_weights):
    return (tensor * torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + norm_eps)) * norm_weights
This code defines a function named rms_norm, which applies RMS (Root Mean Square) normalization to the input tensor. The function takes two arguments: tensor and norm_weights. tensor is the input to be normalized, and norm_weights are the weights used during normalization.
The function works as follows:
- First, square every element of the input tensor (tensor.pow(2)).
- Then, take the mean of the squared tensor along the last dimension (-1), keeping the dimension (keepdim=True); this gives the mean square of each vector.
- Next, add a small positive constant norm_eps to the mean square (to avoid division by zero) and take the reciprocal of its square root (torch.rsqrt), giving the reciprocal of the RMS.
- Finally, multiply the input tensor by the reciprocal of the RMS and then by the normalization weights norm_weights, yielding the normalized tensor.
After normalization, our data still has shape [17x4096], the same as the embeddings, except that the values are now normalized.
token_embeddings = rms_norm(token_embeddings_unnormalized, model["layers.0.attention_norm.weight"])
token_embeddings.shape
Next, we implement the attention mechanism, i.e. the attention block of the transformer layer:
1. Input sentence
- Description: This is our input sentence.
- Explanation: The input sentence is represented as a matrix ( X ), in which each row is the embedding vector of one word.
2. Embed each word
- Description: We embed each word.
- Explanation: Each word in the input sentence is converted into a high-dimensional vector, and these vectors form the matrix ( X ).
3. Split into 8 heads
- Description: Split the matrix ( X ) into 8 heads. We multiply ( X ) by the weight matrices ( W^Q ), ( W^K ) and ( W^V ).
- Explanation: Multi-head attention splits the input matrix ( X ) into multiple heads (8 here), each with its own query (Query), key (Key) and value (Value) matrices. Concretely, the input matrix ( X ) is multiplied by the query weight matrix ( W^Q ), the key weight matrix ( W^K ) and the value weight matrix ( W^V ) to obtain the query matrix ( Q ), the key matrix ( K ) and the value matrix ( V ).
4. Compute attention
- Description: Compute attention with the resulting query, key and value matrices.
- Explanation: For each head, attention scores are computed from the query matrix ( Q ), the key matrix ( K ) and the value matrix ( V ). The steps are:
Compute the dot product of ( Q ) and ( K ).
Scale the dot-product result.
Apply the softmax function to obtain the attention weights.
Multiply the attention weights by the value matrix ( V ) to obtain the output matrix ( Z ).
5. Concatenate the result matrices
- Description: Concatenate the resulting ( Z ) matrices, then multiply the concatenation by the weight matrix ( W^O ) to obtain the layer output.
- Explanation: The output matrices ( Z ) of all heads are concatenated into one matrix, which is then multiplied by the output weight matrix ( W^O ) to produce the final output matrix ( Z ).
Additional notes
- Shapes of the query, key, value and output weights: when loading them, note that their shapes are [4096x4096], [1024x4096], [1024x4096] and [4096x4096] respectively.
- Parallelizing the per-head multiplications: bundling the heads together makes it easier to parallelize the attention-head multiplications.
This figure walks through the multi-head attention mechanism of the Transformer, from embedding the input sentence, through splitting into heads and computing attention, to concatenating the results and producing the output. Each step shows how the final output matrix ( Z ) is obtained from the input matrix ( X ).
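The five steps can be written out for a single head in a few lines. This is a toy-sized sketch with random weights and no causal mask, just to show the matmul -> scale -> softmax -> weighted-sum sequence described above:

import torch

seq_len, d_model, d_head = 5, 16, 8
X  = torch.randn(seq_len, d_model)      # one embedded sentence (steps 1-2)
Wq = torch.randn(d_head, d_model)       # per-head weight matrices (step 3)
Wk = torch.randn(d_head, d_model)
Wv = torch.randn(d_head, d_model)

Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T  # [5, 8] each
scores = (Q @ K.T) / d_head ** 0.5      # dot products, scaled (step 4)
weights = torch.softmax(scores, dim=-1) # attention weights, each row sums to 1
Z = weights @ V                         # [5, 8] attention-weighted values
print(Z.shape)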
When we load the query (query), key (key), value (value) and output (output) weights from the model, we notice that their shapes are [4096x4096], [1024x4096], [1024x4096] and [4096x4096] respectively.
At first glance this seems odd, because ideally we would want a separate q, k, v and o for each head.
print(
    model["layers.0.attention.wq.weight"].shape,
    model["layers.0.attention.wk.weight"].shape,
    model["layers.0.attention.wv.weight"].shape,
    model["layers.0.attention.wo.weight"].shape,
)
- The shape of the query weight matrix (wq.weight) is [4096, 4096].
- The shape of the key weight matrix (wk.weight) is [1024, 4096].
- The shape of the value (Value) weight matrix (wv.weight) is [1024, 4096].
- The shape of the output (Output) weight matrix (wo.weight) is [4096, 4096].
The output results show that:
- The shapes of the query (Q) and output (O) weight matrices are the same, both [4096, 4096]. This means that both the input feature and the output feature have dimensions of 4096 for both query and output.
- The shapes of the key (K) and value (V) weight matrices are also the same, both [1024, 4096]. This shows that the input feature dimension for keys and values is 4096, but the output feature dimension is compressed to 1024.
The shapes of these weight matrices reflect how the model designers sized the different parts of the attention mechanism. In particular, the key and value dimensions are probably reduced to cut computational cost and memory consumption, while queries and outputs are kept at the higher dimensionality to retain more information. This design choice depends on the specific model architecture and application scenario.
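The head arithmetic makes these shapes less surprising (a small sketch using the head counts from the config; sharing key/value heads across query heads is the grouped-query attention implied by n_kv_heads = 8):

n_heads, n_kv_heads, head_dim = 32, 8, 128
print(n_heads * head_dim)       # 4096 -> rows of wq: 32 query heads x 128 dims
print(n_kv_heads * head_dim)    # 1024 -> rows of wk and wv: only 8 key/value heads x 128 dims
print(n_heads // n_kv_heads)    # 4 query heads share each key/value head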
Let us use the sentence "I appreciate Li Hongzhang" as a simplified example to explain how the attention mechanism in this figure is implemented.
- Input sentence: First, we have the sentence "I appreciate Li Hongzhang". Before processing this sentence, we need to convert each word in the sentence into a mathematically processable form, that is, a word vector. This process is called word embedding.
- Word embedding: Each word, such as "I", "appreciation", "Li Hongzhang", will be converted into a fixed-size vector. These vectors contain the semantic information of the words.
- Split into multiple heads: In order to allow the model to understand the sentence from different perspectives, we split the vector of each word into multiple parts, here are 8 heads. Each head focuses on a different aspect of the sentence.
- Calculate attention: For each head, we calculate something called attention. This involves three ingredients, illustrated with "I appreciate Li Hongzhang": if we want to focus on the word "appreciate", then "appreciate" is the query, other words such as "I" and "Li Hongzhang" are the keys, and their vectors serve as the values.
Query (Q): This is the part where we want to find information.
Key (K): This is the part that contains information.
Value (V): This is the actual information content.
- Splicing and output: After calculating the attention of each head, we splice these results together and generate the final output through a weight matrix Wo. This output will be used in the next layer of processing or as part of the final result.
The shape problem mentioned in the comments to the figure is about how to store and process these vectors efficiently in a computer. In actual code implementation, in order to improve efficiency, developers may package the query, key, and value vectors of multiple headers together instead of processing each header individually. This can take advantage of the parallel processing capabilities of modern computers to speed up calculations.
We continue to use the sentence "I appreciate Li Hongzhang" to explain the role of the weight matrices WQ, WK, WV and WO.
In the Transformer model, each word is converted into a vector through word embedding. These vectors are then passed through a series of linear transformations to calculate attention scores. These linear transformations are implemented through the weight matrices WQ, WK, WV and WO.
- WQ (weight matrix Q): This matrix converts the vector of each word into a "query" vector. In our example, if we want to focus on the word "appreciate", we multiply the vector of "appreciate" by WQ to get its query vector.
- WK (weight matrix K): This matrix converts the vector of each word into a "key" vector. Similarly, we multiply the vector of every word, including "I" and "Li Hongzhang", by WK to get the key vectors.
- WV (weight matrix V): This matrix converts the vector of each word into a "value" vector. Multiplying each word vector by WV yields its value vector. These three matrices (WQ, WK, WV) generate distinct query, key and value vectors for each head, which lets each head focus on a different aspect of the sentence.
- WO (weight matrix O): This matrix combines the concatenated outputs of all heads into the final output of the attention layer, as in the splicing-and-output step described above.
Throughout this process, WQ, WK, WV and WO are learned through training. They determine how the model converts the input word vectors into different representations and how it combines these representations into the final output. These matrices are the core of the attention mechanism in the Transformer model; they enable it to capture the relationships between different words in a sentence.
WQ (weight matrix Q), WK (weight matrix K), WV (weight matrix V) and WO (weight matrix O) are parameters of the Transformer model that are learned during training through optimization methods such as the backpropagation algorithm and gradient descent.
Let’s take a look at how this learning process works:
- Initialization: before training begins, these matrices are usually randomly initialized. Their initial values are chosen at random, which breaks symmetry and lets the learning process start.
- Forward pass: during training, input data (such as the sentence "I appreciate Li Hongzhang") is propagated forward through the layers of the model. In the attention mechanism, the input word vectors are multiplied by the WQ, WK and WV matrices to produce the query, key and value vectors.
- Loss computation: the model's output is compared with the expected output (usually the labels in the training data) to compute a loss value. The loss measures the gap between the model's predictions and the ground truth.
- Backpropagation: the loss is propagated back through the model to compute the influence of each parameter (including WQ, WK, WV and WO) on the loss, i.e. their gradients.
- Parameter update: based on the computed gradients, gradient descent or another optimization algorithm updates the values of these matrices. This gradually reduces the loss and makes the model's predictions more accurate.
- Iteration: this cycle of forward pass, loss computation, backpropagation and parameter update is repeated over the training data until the model's performance reaches a certain standard or stops improving significantly.
Through this training process, WQ, WK, WV and WO gradually adjust their values so that the model can better understand and process the input data. After training, these matrices are fixed and used in the model's inference stage, i.e. for making predictions on new input data.
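The loop described above looks roughly like this in PyTorch. This is a generic, toy-sized sketch (random data and a placeholder loss, not the actual Llama training objective); W_q stands in for any of WQ/WK/WV/WO:

import torch

W_q = torch.randn(8, 16, requires_grad=True)   # randomly initialized parameter
optimizer = torch.optim.SGD([W_q], lr=1e-2)

for step in range(3):
    x = torch.randn(5, 16)                     # forward pass on a toy "sentence"
    q = x @ W_q.T                              # produce query vectors
    loss = q.pow(2).mean()                     # placeholder loss, not a real LM objective
    optimizer.zero_grad()
    loss.backward()                            # backpropagation: gradient of the loss w.r.t. W_q
    optimizer.step()                           # gradient-descent update of W_q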
4. Unrolling the query vectors
In this subsection we unroll the query vectors from the multiple attention heads; the resulting shape is [32x128x4096]. Here, 32 is the number of attention heads in llama3, 128 is the size of the query vector, and 4096 is the size of the token embedding.
q_layer0 = model["layers.0.attention.wq.weight"]
head_dim = q_layer0.shape[0] // n_heads
q_layer0 = q_layer0.view(n_heads, head_dim, dim)
q_layer0.shape
This code reshapes the query (Q) weight matrix of the first layer, decomposing it into the per-head form, which is where the dimensions 32 and 128 come from.
- q_layer0 = model["layers.0.attention.wq.weight"]: extracts the query (Q) weight matrix of the first layer from the model.
- head_dim = q_layer0.shape[0] // n_heads: computes the dimension of each attention head by dividing the first dimension of the query weight matrix (originally 4096) by the number of attention heads (n_heads). With n_heads = 32, head_dim is 4096 // 32 = 128.
- q_layer0 = q_layer0.view(n_heads, head_dim, dim): reshapes the query weight matrix with .view() into shape [n_heads, head_dim, dim]. Here dim is the original feature dimension 4096, n_heads is 32 and head_dim is 128, so the reshaped matrix has shape [32, 128, 4096].
- q_layer0.shape outputs torch.Size([32, 128, 4096]): printing the shape of the reshaped query weight matrix confirms that it is [32, 128, 4096].
The reason the dimensions 32 and 128 appear here but not in the previous code is that this reshape explicitly decomposes the query weight matrix into multiple attention heads, each with its own dimension. 32 is the number of attention heads in the model, and 128 is the feature dimension allocated to each head. This decomposition implements multi-head attention, where each head can independently attend to different parts of the input, and the heads' outputs are combined to improve the model's expressive power.
Implementing the first head of the first layer
We access the query weight matrix of the first head of the first layer; its size is [128x4096].
q_layer0_head0 = q_layer0[0]
q_layer0_head0.shape
We now multiply the query weights by the token embeddings to obtain a query for each token.
Here you can see that the result has shape [17x128]: we have 17 tokens, and each token has a query of length 128 (one query per token for this head).
q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)
q_per_token.shape
This code performs a matrix multiplication that multiplies the token embeddings (token_embeddings) by the transpose (.T) of the query weight matrix of the first head of the first layer (q_layer0_head0), producing the per-token query vectors (q_per_token).
- q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T):
torch.matmul is PyTorch's matrix multiplication function; it multiplies two tensors.
token_embeddings should be a tensor of shape [17, 4096], indicating that there are 17 tokens, each token is represented by a 4096-dimensional embedding vector.
q_layer0_head0 is the query weight matrix of the first head of the first layer, and its original shape is [128, 4096]. .T is the transpose operation in PyTorch, which transposes the shape of q_layer0_head0 to [4096, 128].
In this way, the matrix multiplication of token_embeddings and q_layer0_head0.T is the multiplication of [17, 4096] and [4096, 128], and the result is a tensor with shape [17, 128].
- q_per_token.shape and output: torch.Size([17, 128]):
This line of code prints the shape of the q_per_token tensor, confirming that it is [17, 128].
This means that for every token entered (17 in total), we now have a 128-dimensional query vector. This 128-dimensional query vector is obtained by multiplying the token embedding and the query weight matrix and can be used for subsequent attention mechanism calculations.
In short, this code converts the embedding vector of each token into a query vector through matrix multiplication, preparing for the next step of implementing the attention mechanism. Each token now has a query vector corresponding to it, and these query vectors will be used to calculate attention scores with other tokens.
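For completeness, the same queries can be produced for all 32 heads in one bundled matmul (a sketch using the tensors defined above); this is the parallelization that the earlier shape discussion alluded to:

wq = model["layers.0.attention.wq.weight"]                                  # [4096, 4096]
q_per_token_all_heads = torch.matmul(token_embeddings, wq.T)                # [17, 4096]
q_per_token_all_heads = q_per_token_all_heads.view(17, n_heads, head_dim)   # [17, 32, 128]
print(q_per_token_all_heads.shape)
# q_per_token_all_heads[:, 0, :] matches the q_per_token computed above for head 0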