
Implementing GPT-2 in Rust

2024/5/31 · Rust Projects · Rust

This post implements a variant of the Transformer in Rust: the GPT-2 model. GPT-2 is a generative pre-trained language model based on the Transformer architecture.

A learning project.

Original project: code-cp/pico-gpt (github.com)

Preface

Implementing a complete Transformer model usually requires the following key components and steps:

1. Input Embeddings

Each token in the input sequence is converted to a vector representation. This involves both token embeddings and position embeddings.

  • Token embeddings: map each word or token to a fixed-dimensional vector.
  • Position embeddings: inject position information; the Transformer itself has no notion of sequence order, so positions must be encoded explicitly.

2. Self-Attention Mechanism

The core component; it captures dependencies between different positions in the sequence.

  • Compute queries (Q), keys (K), and values (V): obtain the query, key, and value matrices through linear transformations of the input.
  • Compute attention weights: take the dot product of queries and keys and apply a softmax (see the formula after this list).
  • Weighted sum: use the attention weights to form a weighted sum of the values.
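
In formula form, scaled dot-product attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $d_k$ is the key dimension; dividing by $\sqrt{d_k}$ keeps the dot products from growing so large that the softmax saturates.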

3. Multi-Head Attention Mechanism

Run several self-attention heads in parallel, each capturing information in a different subspace; concatenate the results and apply a final linear transformation.
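
In the notation of the original Transformer paper, with per-head projections and a final output projection $W^O$:

$$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q,\ XW_i^K,\ XW_i^V), \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$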

4. Feedforward Neural Network

Applies an independent non-linear transformation to the representation at each position.

5. Add & Normalize

After the self-attention layer and the feedforward layer, apply a residual connection followed by layer normalization.
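
That is:

$$\mathrm{output} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$

(GPT-2 itself uses the pre-norm variant, applying layer normalization before each sublayer; the Block code later in this post follows that convention.)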

6. Stacked Encoders and Decoders

A Transformer is typically built from a stack of encoder layers and decoder layers: the encoder processes the input sequence, and the decoder generates the output sequence. (GPT-2 itself is decoder-only: a single stack of masked self-attention blocks.)

7. Output Layer

Maps the decoder output to a probability distribution over the target vocabulary, usually via a linear layer followed by a softmax, as sketched below.
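
A minimal sketch of this projection, assuming (as GPT-2 does) that the output layer reuses the token embedding matrix (weight tying); the function name is illustrative:

use burn::tensor::{activation, backend::Backend, Tensor};

/// Map final hidden states [seq_len, n_embd] to a probability
/// distribution over the vocabulary [seq_len, n_vocab] by reusing
/// the transposed token embedding matrix.
pub fn output_probs<B: Backend>(
    hidden: Tensor<B, 2>,          // [seq_len, n_embd]
    token_embedding: Tensor<B, 2>, // [n_vocab, n_embd]
) -> Tensor<B, 2> {
    let logits = hidden.matmul(token_embedding.transpose()); // [seq_len, n_vocab]
    activation::softmax(logits, 1)
}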

Example Implementations

Below are example implementations of each component, written in Rust on top of the Burn tensor library:

Input Embeddings

// Shared imports for the examples in this section.
use burn::tensor::{activation, backend::Backend, Int, Tensor};

pub struct Embeddings<B: Backend> {
    pub token_embedding: Tensor<B, 2>,    // [n_vocab, n_embd]
    pub position_embedding: Tensor<B, 2>, // [n_ctx, n_embd]
}

impl<B: Backend> Embeddings<B> {
    pub fn new(token_embedding: Tensor<B, 2>, position_embedding: Tensor<B, 2>) -> Self {
        Self {
            token_embedding,
            position_embedding,
        }
    }

    /// Look up one embedding row per token id, then add the position embeddings.
    pub fn forward(&self, input_ids: Tensor<B, 1, Int>) -> Tensor<B, 2> {
        let seq_len = input_ids.dims()[0];
        let device = input_ids.device();
        let token_embeddings = self.token_embedding.clone().select(0, input_ids);
        // Positions are simply 0..seq_len.
        let position_ids = Tensor::arange(0..seq_len as i64, &device);
        let position_embeddings = self.position_embedding.clone().select(0, position_ids);
        token_embeddings + position_embeddings
    }
}

Self-Attention

pub struct SelfAttention<B: Backend> {
    pub query_weight: Tensor<B, 2>,
    pub key_weight: Tensor<B, 2>,
    pub value_weight: Tensor<B, 2>,
    pub output_weight: Tensor<B, 2>,
}

impl<B: Backend> SelfAttention<B> {
    pub fn new(
        query_weight: Tensor<B, 2>,
        key_weight: Tensor<B, 2>,
        value_weight: Tensor<B, 2>,
        output_weight: Tensor<B, 2>,
    ) -> Self {
        Self {
            query_weight,
            key_weight,
            value_weight,
            output_weight,
        }
    }

    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Project the input into queries, keys, and values.
        let q = x.clone().matmul(self.query_weight.clone());
        let k = x.clone().matmul(self.key_weight.clone());
        let v = x.matmul(self.value_weight.clone());

        // Scaled dot-product attention; dim 1 is the key axis of the
        // [seq_len, seq_len] score matrix.
        let d_k = k.dims()[1] as f32;
        let scores = q.matmul(k.transpose()) / d_k.sqrt();
        let attention_weights = activation::softmax(scores, 1);

        let context = attention_weights.matmul(v);
        context.matmul(self.output_weight.clone())
    }
}
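
Note that this sketch is bidirectional: every position can attend to every other position. The full GPT-2 attention layer later in this post adds a causal mask so that each position attends only to itself and earlier ones.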

Multi-Head Attention

pub struct MultiHeadAttention<B: Backend> {
    pub heads: Vec<SelfAttention<B>>,
    pub output_weight: Tensor<B, 2>,
}

impl<B: Backend> MultiHeadAttention<B> {
    pub fn new(heads: Vec<SelfAttention<B>>, output_weight: Tensor<B, 2>) -> Self {
        Self { heads, output_weight }
    }

    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Run every head on the same input, concatenate along the feature
        // axis, then apply the final output projection.
        let head_outputs: Vec<_> = self.heads.iter().map(|head| head.forward(x.clone())).collect();
        let concatenated = Tensor::cat(head_outputs, 1);
        concatenated.matmul(self.output_weight.clone())
    }
}

Feedforward Network

pub struct FeedForward<B: Backend> {
    pub linear1: Tensor<B, 2>, // expands the embedding dimension
    pub linear2: Tensor<B, 2>, // contracts it back
}

impl<B: Backend> FeedForward<B> {
    pub fn new(linear1: Tensor<B, 2>, linear2: Tensor<B, 2>) -> Self {
        Self { linear1, linear2 }
    }

    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Two linear maps with a GELU non-linearity in between.
        let x = activation::gelu(x.matmul(self.linear1.clone()));
        x.matmul(self.linear2.clone())
    }
}
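
In GPT-2 the hidden dimension of the feedforward network is four times the embedding size, so linear1 expands from n_embd to 4·n_embd and linear2 projects back down.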

Add & Normalize

pub struct AddNorm<B: Backend> {
    pub layer_norm: LayerNorm<B>,
}

impl<B: Backend> AddNorm<B> {
    pub fn new(layer_norm: LayerNorm<B>) -> Self {
        Self { layer_norm }
    }

    pub fn forward(&self, x: Tensor<B, 2>, sublayer_output: Tensor<B, 2>) -> Tensor<B, 2> {
        // Residual connection followed by layer normalization.
        self.layer_norm.forward(x + sublayer_output)
    }
}

Encoder Layer

pub struct EncoderLayer<B: Backend> {
    pub self_attention: MultiHeadAttention<B>,
    pub feed_forward: FeedForward<B>,
    pub add_norm1: AddNorm<B>,
    pub add_norm2: AddNorm<B>,
}

impl<B: Backend> EncoderLayer<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Attention sublayer, then feedforward sublayer, each followed
        // by add & normalize.
        let attn_output = self.self_attention.forward(x.clone());
        let x = self.add_norm1.forward(x, attn_output);
        let ff_output = self.feed_forward.forward(x.clone());
        self.add_norm2.forward(x, ff_output)
    }
}

Encoder

pub struct Encoder<B: Backend> {
    pub layers: Vec<EncoderLayer<B>>,
}

impl<B: Backend> Encoder<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Thread the input through each layer in order.
        self.layers.iter().fold(x, |x, layer| layer.forward(x))
    }
}

Code Structure and Key Components

1. Model Configuration (ModelConfig)

This defines the model's configuration parameters: vocabulary size, context length, number of attention heads, embedding dimension, and number of layers.

#[derive(Config)]
pub struct ModelConfig {
    pub n_vocab: usize, // vocabulary size
    pub n_ctx: usize,   // context length
    pub n_head: usize,  // number of attention heads
    pub n_embd: usize,  // embedding dimension
    pub n_layer: usize, // number of Transformer blocks
}
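
For reference, GPT-2 small corresponds to n_vocab = 50257, n_ctx = 1024, n_head = 12, n_embd = 768, and n_layer = 12.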

2. Model Initialization (the init Method)

This method loads the pretrained weights from disk and initializes each part of the model, including the embedding layers and the stack of Transformer blocks.

impl ModelConfig {
    pub fn init<B: Backend>(&self, model_dir: PathBuf, device: &B::Device) -> Model<B> {
        // Load the token (wte) and position (wpe) embedding matrices,
        // exported as .npy files.
        let token_embedding_arr: Array2<f32> =
            ndarray_npy::read_npy(model_dir.join("wte.npy")).expect("should load wte");
        let token_embedding_vec: Vec<f32> = token_embedding_arr.iter().copied().collect();

        let position_embedding_arr: Array2<f32> =
            ndarray_npy::read_npy(model_dir.join("wpe.npy")).expect("should load wpe");
        let position_embedding_vec: Vec<f32> = position_embedding_arr.iter().copied().collect();

        let token_embedding: Tensor<B, 2> = Tensor::<B, 2>::from_data(
            Data::new(
                token_embedding_vec,
                Shape::new([
                    token_embedding_arr.shape()[0],
                    token_embedding_arr.shape()[1],
                ]),
            )
            .convert(),
            device,
        );

        let position_embedding: Tensor<B, 2> = Tensor::<B, 2>::from_data(
            Data::new(
                position_embedding_vec,
                Shape::new([
                    position_embedding_arr.shape()[0],
                    position_embedding_arr.shape()[1],
                ]),
            )
            .convert(),
            device,
        );

        // Final layer norm (ln_f) and the stack of Transformer blocks.
        let layer_norm_config = Gpt2LayerNormConfig {
            layer_norm_dir: model_dir.join("ln_f"),
        };

        let block_config = BlockConfig {
            model_dir: model_dir.to_owned(),
            num_heads: self.n_head,
            depth: self.n_layer,
        };

        Model {
            token_embedding,
            position_embedding,
            blocks: block_config.init(device),
            layer_norm: layer_norm_config.init(device),
        }
    }
}
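
A hypothetical call site (the backend choice and weight directory are illustrative; the .npy files come from exporting the original GPT-2 checkpoint):

use std::path::PathBuf;
use burn::backend::NdArray;

fn main() {
    let config = ModelConfig {
        n_vocab: 50257,
        n_ctx: 1024,
        n_head: 12,
        n_embd: 768,
        n_layer: 12,
    };
    let device = Default::default();
    // "models/gpt2" stands in for wherever the exported weights live.
    let model: Model<NdArray> = config.init(PathBuf::from("models/gpt2"), &device);
    let _ = model;
}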

3. Transformer Block (Block)

A block contains an attention layer and a feedforward layer, each wrapped in a residual connection; its forward function processes the input and produces the output.

#[derive(Module, Debug)]
pub struct Block<B: Backend> {
    pub attention: Attention<B>,
    pub feedforward: FeedForward<B>,
}

impl<B: Backend> Block<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Pre-norm residual blocks: each sublayer normalizes its own
        // input, and the residual connections happen out here.
        let x = x.clone() + self.attention.forward(x);
        let x = x.clone() + self.feedforward.forward(x);
        x
    }
}

4. Attention Layer (Attention)

The attention layer implements masked multi-head self-attention: it computes queries (Q), keys (K), and values (V), applies a causal mask, and uses a softmax to obtain the attention weights.

#[derive(Module, Debug)]
pub struct Attention<B: Backend> {
    pub layer_norm: Gpt2LayerNorm<B>,
    pub expand: Gpt2LinearLayer<B>,   // fused Q, K, V projection
    pub contract: Gpt2LinearLayer<B>, // projects concatenated heads back down
    pub num_heads: usize,
}

impl<B: Backend> Attention<B> {
    /// Scaled dot-product attention for a single head.
    fn attention(
        q: &Tensor<B, 2>,
        k: &Tensor<B, 2>,
        v: &Tensor<B, 2>,
        causal_mask: &Tensor<B, 2>,
    ) -> Tensor<B, 2> {
        let d = (k.dims()[1] as f32).sqrt();
        let kt = k.clone().transpose();
        // Scores plus the mask: future positions receive a large negative
        // value, so the softmax sends their weights toward zero.
        let qk = q.clone().matmul(kt) / d + causal_mask.clone();
        let probs = activation::softmax(qk, 1);
        probs.matmul(v.clone())
    }

    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        let x = self.layer_norm.forward(x);
        // One fused projection, then split into Q, K, V, and split each
        // of those into per-head chunks.
        let x = self.expand.forward(x.clone());
        let qkv = x.clone().chunk(3, 1);
        let qkv_heads = qkv
            .iter()
            .map(|v| v.clone().chunk(self.num_heads, 1))
            .collect::<Vec<_>>();

        // Strictly upper-triangular [seq_len, seq_len] mask of -1e4.
        let device = B::Device::default();
        let seq_len = x.dims()[0];
        let causal_mask = (Tensor::ones(Shape::new([seq_len, seq_len]), &device)
            - Tensor::ones(Shape::new([seq_len, seq_len]), &device).tril(0))
            * -1.0e4;

        // Attend per head, concatenate, and project back down.
        let out_heads = std::iter::zip(std::iter::zip(&qkv_heads[0], &qkv_heads[1]), &qkv_heads[2])
            .map(|((q, k), v)| Self::attention(q, k, v, &causal_mask))
            .collect();
        let out_heads_concat = Tensor::cat(out_heads, 1);
        self.contract.forward(out_heads_concat)
    }
}
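
For a sequence of length 3, the mask built above is:

$$M = \begin{pmatrix} 0 & -10^4 & -10^4 \\ 0 & 0 & -10^4 \\ 0 & 0 & 0 \end{pmatrix}$$

so after the softmax, each query position can only place weight on itself and earlier positions.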

5. Feedforward Layer (FeedForward)

The feedforward layer consists of two linear transformations with a GELU activation in between (and, in this pre-norm design, a layer norm in front).

#[derive(Module, Debug)]
pub struct FeedForward<B: Backend> {
    pub layer_norm: Gpt2LayerNorm<B>,
    pub expand: Gpt2LinearLayer<B>,   // n_embd -> 4 * n_embd
    pub contract: Gpt2LinearLayer<B>, // 4 * n_embd -> n_embd
    pub activation: Gelu,
}

impl<B: Backend> FeedForward<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        let x = self.layer_norm.forward(x);
        let x = self.expand.forward(x);
        let x = self.activation.forward(x);
        self.contract.forward(x)
    }
}

6. Layer Normalization

Normalizes the input to zero mean and unit variance, then applies a learned scale (gamma) and shift (beta).

#[derive(Module, Debug)]
pub struct Gpt2LayerNorm<B: Backend> {
    pub beta: Tensor<B, 1>,  // learned shift
    pub gamma: Tensor<B, 1>, // learned scale
}

impl<B: Backend> Gpt2LayerNorm<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        let eps = 1e-5;
        // Statistics over the embedding dimension of each row.
        let mean = x.clone().mean_dim(1);
        let var = x.clone().var(1);
        let x = (x - mean) / (var + eps).sqrt();
        // Broadcast gamma and beta across the sequence dimension.
        let gamma = self.gamma.clone().unsqueeze::<2>();
        let gamma = gamma.repeat(0, x.dims()[0]);
        let beta: Tensor<B, 2> = self.beta.clone().unsqueeze::<2>();
        let beta = beta.repeat(0, x.dims()[0]);
        gamma * x + beta
    }
}
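
In formula form, with the mean $\mu$ and variance $\sigma^2$ taken over the embedding dimension of each row:

$$y = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \epsilon = 10^{-5}$$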

This implements the main components of the GPT-2 model in Rust: loading model parameters from files, initializing the model, running the forward pass, and generating text.
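
Text generation itself is just a loop around the forward pass. Below is a minimal sketch of greedy autoregressive decoding, with a hypothetical forward closure standing in for the model's forward pass (the names are illustrative, not the project's actual API):

/// Greedy decoding: repeatedly run the model on the current sequence
/// and append the most likely next token.
fn generate(
    forward: impl Fn(&[usize]) -> Vec<f32>, // logits for the next token
    prompt: &[usize],
    n_tokens: usize,
) -> Vec<usize> {
    let mut ids = prompt.to_vec();
    for _ in 0..n_tokens {
        let logits = forward(&ids);
        // Argmax over the vocabulary.
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap();
        ids.push(next);
    }
    ids
}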