This project implements GPT-2, a variant of the Transformer model, in Rust. GPT-2 is a generative pre-trained language model built on the Transformer architecture.
A learning project.
Original project: code-cp/pico-gpt (github.com)
Preface
Implementing a complete Transformer model usually requires the following key components and steps:
1. Input Embeddings
Convert each token of the input sequence into a vector representation. This combines token embeddings and position embeddings.
- Token embeddings: map each word or token to a fixed-dimensional vector.
- Position embeddings: add positional information; the Transformer itself is order-agnostic, so position must be injected explicitly.
2. Self-Attention Mechanism
The core component, used to capture dependencies between different positions in the sequence.
- Compute queries (Q), keys (K), and values (V): obtain the query, key, and value matrices via linear transformations.
- Compute attention weights: take the dot product of queries and keys, then apply softmax to get the attention weights.
- Weighted sum: use the attention weights to take a weighted sum over the values.
3. Multi-Head Attention Mechanism
Run several self-attention heads in parallel, each capturing information in a different subspace, then concatenate the results and apply a linear transformation.
4. Feed-Forward Network
Apply an independent non-linear transformation to the representation at each position.
5. Add & Normalize
After the self-attention and feed-forward sublayers, apply a residual connection followed by layer normalization.
6. Stacked Encoders and Decoders
A Transformer is typically built by stacking multiple encoder and decoder layers: the encoder processes the input sequence, and the decoder generates the output sequence. GPT-2 itself uses only a decoder-style stack.
7. Output Layer
Map the decoder output to a probability distribution over the target vocabulary, usually via a linear layer followed by softmax (a minimal sketch follows below).
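The post shows no code for this last component, so here is a minimal sketch in the same Burn-style tensor API used in the code sections below. The helper name project_to_vocab is hypothetical, and the weight tying (reusing the token-embedding matrix as the output projection, as GPT-2 does) is an illustration, not code from the repo:

use burn::tensor::{activation, backend::Backend, Tensor};

/// Hypothetical helper: map final hidden states [seq_len, n_embd] to a
/// probability distribution over the vocabulary [seq_len, n_vocab].
/// GPT-2 ties this projection to the token embedding matrix (wte).
pub fn project_to_vocab<B: Backend>(
    hidden: Tensor<B, 2>,
    token_embedding: Tensor<B, 2>, // [n_vocab, n_embd]
) -> Tensor<B, 2> {
    let logits = hidden.matmul(token_embedding.transpose()); // [seq_len, n_vocab]
    activation::softmax(logits, 1)
}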
Code examples
Below are example implementations of each component in Rust (the tensor types follow the Burn framework's API):
Input embeddings
use burn::tensor::{backend::Backend, Int, Tensor};

pub struct Embeddings<B: Backend> {
    /// Token embedding table, shape [n_vocab, n_embd].
    pub token_embedding: Tensor<B, 2>,
    /// Position embedding table, shape [n_ctx, n_embd].
    pub position_embedding: Tensor<B, 2>,
}

impl<B: Backend> Embeddings<B> {
    pub fn new(token_embedding: Tensor<B, 2>, position_embedding: Tensor<B, 2>) -> Self {
        Self {
            token_embedding,
            position_embedding,
        }
    }

    /// Map a sequence of token ids to embeddings of shape [seq_len, n_embd].
    pub fn forward(&self, input_ids: Tensor<B, 1, Int>) -> Tensor<B, 2> {
        let seq_len = input_ids.dims()[0];
        let device = input_ids.device();
        // One embedding row per token id.
        let token_embeddings = self.token_embedding.clone().select(0, input_ids);
        // Positions are simply 0..seq_len.
        let position_ids = Tensor::<B, 1, Int>::arange(0..seq_len as i64, &device);
        let position_embeddings = self.position_embedding.clone().select(0, position_ids);
        token_embeddings + position_embeddings
    }
}
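A quick usage sketch, assuming the burn-ndarray backend (the crate and type names are the standard Burn ones, but treat the exact constructor signatures as version-dependent):

use burn::tensor::{backend::Backend, Int, Shape, Tensor};
use burn_ndarray::NdArray;

fn main() {
    type B = NdArray<f32>;
    let device = Default::default();
    // Toy tables: vocab of 10 tokens, context of 8 positions, 4-dim embeddings.
    let wte = Tensor::<B, 2>::zeros(Shape::new([10, 4]), &device);
    let wpe = Tensor::<B, 2>::zeros(Shape::new([8, 4]), &device);
    let embeddings = Embeddings::new(wte, wpe);
    // A three-token input (ids 0, 1, 2, built with arange for brevity).
    let input_ids = Tensor::<B, 1, Int>::arange(0..3, &device);
    let out = embeddings.forward(input_ids);
    assert_eq!(out.dims(), [3, 4]); // [seq_len, n_embd]
}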
Self-attention
pub struct SelfAttention<B: Backend> {
    pub query_weight: Tensor<B, 2>,
    pub key_weight: Tensor<B, 2>,
    pub value_weight: Tensor<B, 2>,
    pub output_weight: Tensor<B, 2>,
}

impl<B: Backend> SelfAttention<B> {
    pub fn new(
        query_weight: Tensor<B, 2>,
        key_weight: Tensor<B, 2>,
        value_weight: Tensor<B, 2>,
        output_weight: Tensor<B, 2>,
    ) -> Self {
        Self {
            query_weight,
            key_weight,
            value_weight,
            output_weight,
        }
    }

    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Project the input into query, key, and value spaces.
        let q = x.clone().matmul(self.query_weight.clone());
        let k = x.clone().matmul(self.key_weight.clone());
        let v = x.matmul(self.value_weight.clone());
        // Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
        let d_k = k.dims()[1] as f32;
        let scores = q.matmul(k.transpose()) / d_k.sqrt();
        let attention_weights = activation::softmax(scores, 1);
        let context = attention_weights.matmul(v);
        context.matmul(self.output_weight.clone())
    }
}
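For reference, the forward pass above computes the standard scaled dot-product attention,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the key dimension (k.dims()[1] in the code).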
Multi-head attention
pub struct MultiHeadAttention<B: Backend> {
    pub heads: Vec<SelfAttention<B>>,
    pub output_weight: Tensor<B, 2>,
}

impl<B: Backend> MultiHeadAttention<B> {
    pub fn new(heads: Vec<SelfAttention<B>>, output_weight: Tensor<B, 2>) -> Self {
        Self { heads, output_weight }
    }

    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Run every head on the same input.
        let head_outputs: Vec<_> = self.heads.iter().map(|head| head.forward(x.clone())).collect();
        // Concatenate along the feature dimension, then mix with a final linear layer.
        let concatenated = Tensor::cat(head_outputs, 1);
        concatenated.matmul(self.output_weight.clone())
    }
}
Feed-forward network
pub struct FeedForward<B: Backend> {
    pub linear1: Tensor<B, 2>,
    pub linear2: Tensor<B, 2>,
}

impl<B: Backend> FeedForward<B> {
    pub fn new(linear1: Tensor<B, 2>, linear2: Tensor<B, 2>) -> Self {
        Self { linear1, linear2 }
    }

    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Expand, apply GELU, then contract back.
        let x = activation::gelu(x.matmul(self.linear1.clone()));
        x.matmul(self.linear2.clone())
    }
}
Add & normalize
pub struct AddNorm<B: Backend> {
    pub layer_norm: LayerNorm<B>,
}

impl<B: Backend> AddNorm<B> {
    pub fn new(layer_norm: LayerNorm<B>) -> Self {
        Self { layer_norm }
    }

    pub fn forward(&self, x: Tensor<B, 2>, sublayer_output: Tensor<B, 2>) -> Tensor<B, 2> {
        self.layer_norm.forward(x + sublayer_output)
    }
}
Encoder layer
pub struct EncoderLayer<B: Backend> {
    pub self_attention: MultiHeadAttention<B>,
    pub feed_forward: FeedForward<B>,
    pub add_norm1: AddNorm<B>,
    pub add_norm2: AddNorm<B>,
}

impl<B: Backend> EncoderLayer<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        let attn_output = self.self_attention.forward(x.clone());
        let x = self.add_norm1.forward(x, attn_output);
        let ff_output = self.feed_forward.forward(x.clone());
        self.add_norm2.forward(x, ff_output)
    }
}
Encoder
pub struct Encoder<B: Backend> {
    pub layers: Vec<EncoderLayer<B>>,
}

impl<B: Backend> Encoder<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Thread the input through each layer in order.
        self.layers.iter().fold(x, |x, layer| layer.forward(x))
    }
}
Code structure and key components
1. Model configuration (ModelConfig)
This defines the model's configuration parameters: vocabulary size, context length, number of attention heads, embedding dimension, and number of layers.
#[derive(Config)]
pub struct ModelConfig {
    pub n_vocab: usize, // vocabulary size
    pub n_ctx: usize,   // context length (maximum sequence length)
    pub n_head: usize,  // number of attention heads
    pub n_embd: usize,  // embedding dimension
    pub n_layer: usize, // number of Transformer blocks
}
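For reference, GPT-2 small (the 124M-parameter checkpoint) corresponds to the following values; the struct literal below is just an illustration, not code from the repo:

let config = ModelConfig {
    n_vocab: 50257, // BPE vocabulary size
    n_ctx: 1024,    // maximum context length
    n_head: 12,     // attention heads per block
    n_embd: 768,    // embedding / hidden dimension
    n_layer: 12,    // number of Transformer blocks
};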
2. Model initialization (the init method)
This method loads pretrained weights from the file system and initializes each part of the model, including the embedding tables and the stack of Transformer blocks.
impl ModelConfig {
    pub fn init<B: Backend>(&self, model_dir: PathBuf, device: &B::Device) -> Model<B> {
        // Load the pretrained token (wte) and position (wpe) embedding tables
        // from .npy files exported from the original GPT-2 checkpoint.
        let token_embedding_arr: Array2<f32> =
            ndarray_npy::read_npy(model_dir.join("wte.npy")).expect("should load wte");
        let token_embedding_vec: Vec<f32> = token_embedding_arr.iter().copied().collect();
        let position_embedding_arr: Array2<f32> =
            ndarray_npy::read_npy(model_dir.join("wpe.npy")).expect("should load wpe");
        let position_embedding_vec: Vec<f32> = position_embedding_arr.iter().copied().collect();

        // Copy the raw f32 buffers into backend tensors of the same shape.
        let token_embedding: Tensor<B, 2> = Tensor::<B, 2>::from_data(
            Data::new(
                token_embedding_vec,
                Shape::new([
                    token_embedding_arr.shape()[0],
                    token_embedding_arr.shape()[1],
                ]),
            )
            .convert(),
            device,
        );
        let position_embedding: Tensor<B, 2> = Tensor::<B, 2>::from_data(
            Data::new(
                position_embedding_vec,
                Shape::new([
                    position_embedding_arr.shape()[0],
                    position_embedding_arr.shape()[1],
                ]),
            )
            .convert(),
            device,
        );

        // Configs for the final layer norm and the stack of Transformer blocks;
        // each loads its own weights from the model directory.
        let layer_norm_config = Gpt2LayerNormConfig {
            layer_norm_dir: model_dir.join("ln_f"),
        };
        let block_config = BlockConfig {
            model_dir: model_dir.to_owned(),
            num_heads: self.n_head,
            depth: self.n_layer,
        };

        Model {
            token_embedding,
            position_embedding,
            blocks: block_config.init(device),
            layer_norm: layer_norm_config.init(device),
        }
    }
}
3. Transformer block (Block)
A block contains an attention sublayer and a feed-forward sublayer, each wrapped in a residual connection. The block's forward function processes the input and produces its output.
#[derive(Module, Debug)]
pub struct Block<B: Backend> {
    pub attention: Attention<B>,
    pub feedforward: FeedForward<B>,
}

impl<B: Backend> Block<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Residual connections around each sublayer; the layer norms live
        // inside Attention and FeedForward (GPT-2's pre-norm layout).
        let x = x.clone() + self.attention.forward(x);
        x.clone() + self.feedforward.forward(x)
    }
}
4. Attention layer (Attention)
The attention layer implements multi-head causal self-attention: it computes queries (Q), keys (K), and values (V), applies a causal mask, and computes the attention weights with softmax.
#[derive(Module, Debug)]
pub struct Attention<B: Backend> {
    pub layer_norm: Gpt2LayerNorm<B>,
    pub expand: Gpt2LinearLayer<B>,   // projects n_embd -> 3 * n_embd (Q, K, V fused)
    pub contract: Gpt2LinearLayer<B>, // projects the concatenated heads back to n_embd
    pub num_heads: usize,
}

impl<B: Backend> Attention<B> {
    /// Scaled dot-product attention for a single head, with a causal mask
    /// added to the scores before the softmax.
    fn attention(
        q: &Tensor<B, 2>,
        k: &Tensor<B, 2>,
        v: &Tensor<B, 2>,
        causal_mask: &Tensor<B, 2>,
    ) -> Tensor<B, 2> {
        let d = (k.dims()[1] as f32).sqrt();
        let kt = k.clone().transpose();
        let qk = q.clone().matmul(kt) / d + causal_mask.clone();
        let probs = activation::softmax(qk, 1);
        probs.matmul(v.clone())
    }

    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Pre-norm, then one fused projection producing Q, K, and V.
        let x = self.layer_norm.forward(x);
        let x = self.expand.forward(x);
        // Split into Q, K, V, then split each of those into per-head chunks.
        let qkv = x.clone().chunk(3, 1);
        let qkv_heads = qkv
            .iter()
            .map(|v| v.clone().chunk(self.num_heads, 1))
            .collect::<Vec<_>>();
        // Causal mask: -1e4 above the diagonal, so each position can only
        // attend to itself and earlier positions.
        let device = B::Device::default();
        let seq_len = x.dims()[0];
        let causal_mask = (Tensor::ones(Shape::new([seq_len, seq_len]), &device)
            - Tensor::ones(Shape::new([seq_len, seq_len]), &device).tril(0))
            * -1.0e4;
        // Run attention per head, concatenate the heads, then project back.
        let out_heads = std::iter::zip(std::iter::zip(&qkv_heads[0], &qkv_heads[1]), &qkv_heads[2])
            .map(|((q, k), v)| Self::attention(q, k, v, &causal_mask))
            .collect();
        let out_heads_concat = Tensor::cat(out_heads, 1);
        self.contract.forward(out_heads_concat)
    }
}
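To make the causal mask concrete: for seq_len = 3, ones - tril(ones) keeps only the strictly upper-triangular entries, so the additive mask is

$$\begin{pmatrix} 0 & -10^{4} & -10^{4} \\ 0 & 0 & -10^{4} \\ 0 & 0 & 0 \end{pmatrix}$$

Added to the attention scores before the softmax, the large negative values drive the weights on future positions to effectively zero, which is what makes the model autoregressive.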
5. Feed-forward layer (FeedForward)
The feed-forward layer applies a layer norm followed by two linear transformations with a GELU activation in between.
#[derive(Module, Debug)]
pub struct FeedForward<B: Backend> {
    pub layer_norm: Gpt2LayerNorm<B>,
    pub expand: Gpt2LinearLayer<B>,   // n_embd -> 4 * n_embd in GPT-2
    pub contract: Gpt2LinearLayer<B>, // back to n_embd
    pub activation: Gelu,
}

impl<B: Backend> FeedForward<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        // Pre-norm, expand, GELU, contract.
        let x = self.layer_norm.forward(x);
        let x = self.expand.forward(x);
        let x = self.activation.forward(x);
        self.contract.forward(x)
    }
}
6. Layer normalization (Gpt2LayerNorm)
Normalizes the input to zero mean and unit variance, then applies a learned scale (gamma) and shift (beta).
#[derive(Module, Debug)]
pub struct Gpt2LayerNorm<B: Backend> {
    pub beta: Tensor<B, 1>,  // learned shift
    pub gamma: Tensor<B, 1>, // learned scale
}

impl<B: Backend> Gpt2LayerNorm<B> {
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        let eps = 1e-5;
        // Normalize each row (token) to zero mean and unit variance.
        let mean = x.clone().mean_dim(1);
        let var = x.clone().var(1);
        let x = (x - mean) / (var + eps).sqrt();
        // Broadcast gamma and beta across the sequence dimension.
        let gamma = self.gamma.clone().unsqueeze::<2>();
        let gamma = gamma.repeat(0, x.dims()[0]);
        let beta: Tensor<B, 2> = self.beta.clone().unsqueeze::<2>();
        let beta = beta.repeat(0, x.dims()[0]);
        gamma * x + beta
    }
}
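In formula form, the forward pass computes, for each row $x$ of the input,

$$y = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta$$

where $\mu$ and $\sigma^{2}$ are the mean and variance over the feature dimension and $\epsilon = 10^{-5}$ guards against division by zero.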
In summary, this implements the main components of a GPT-2 model in Rust, covering loading model parameters from files, initializing the model, running the forward pass, and generating text.
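The generation loop itself is not shown above. Below is a minimal sketch of greedy autoregressive decoding, written against an abstract forward function (an assumed interface for illustration, not the repo's exact API) that maps the current token ids to the logits of the next token:

/// Greedy autoregressive decoding (sketch). `forward` is an assumed
/// interface: given the token ids so far, it returns the [n_vocab]
/// logits for the next token; any of the models above could sit behind it.
pub fn generate(
    forward: impl Fn(&[usize]) -> Vec<f32>,
    mut token_ids: Vec<usize>,
    num_tokens: usize,
) -> Vec<usize> {
    for _ in 0..num_tokens {
        let logits = forward(&token_ids);
        // Softmax is monotonic, so the argmax over logits equals the
        // argmax over probabilities; greedy decoding picks that id.
        let next_id = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i)
            .expect("vocabulary must be non-empty");
        token_ids.push(next_id);
    }
    token_ids
}

Note that each step re-runs the entire prompt through the model; production implementations cache per-layer keys and values (a KV cache) to avoid the repeated work.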