
LM Loss in Multimodal Text-Image Models (BLIP)


See the full code on GitHub.

A related question, discussed in detail on Zhihu: when fine-tuning a generative model, how is the loss function computed, and is it similar to how a Transformer is pre-trained?

In essence it is the same token-level cross-entropy as BERT's MLM loss, except that here it is computed as a left-to-right (next-token) language modeling loss.
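For intuition, here is a minimal, self-contained sketch of that shifted next-token cross-entropy, the same computation `BertLMHeadModel.forward` performs further below. The shapes and values are dummies, not taken from the BLIP code:

    import torch
    from torch.nn import CrossEntropyLoss

    # Minimal sketch of the shifted next-token cross-entropy (dummy shapes/values).
    vocab_size = 30524
    prediction_scores = torch.randn(2, 10, vocab_size)   # [batch, seq_len, vocab] logits from the LM head
    labels = torch.randint(0, vocab_size, (2, 10))        # target token ids
    labels[:, :3] = -100                                  # -100 marks positions ignored by the loss (prompt/pad)

    # Position t predicts token t+1, so drop the last logit and the first label
    shifted_scores = prediction_scores[:, :-1, :].contiguous()
    shifted_labels = labels[:, 1:].contiguous()

    loss_fct = CrossEntropyLoss(label_smoothing=0.1)      # ignore_index defaults to -100
    loss = loss_fct(shifted_scores.view(-1, vocab_size), shifted_labels.view(-1))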

Looking back at BLIP_Decoder, the module that decodes image features and generates a text caption.

Main code (part 1):

    class BLIP_Decoder(nn.Module):
        def __init__(self,
                     med_config = 'configs/med_config.json',
                     image_size = 384,
                     vit = 'base',
                     vit_grad_ckpt = False,
                     vit_ckpt_layer = 0,
                     prompt = 'a picture of ',
                     ):
            """
            Args:
                med_config (str): path for the mixture of encoder-decoder model's configuration file
                image_size (int): input image size
                vit (str): model size of vision transformer
            """
            super().__init__()

            self.visual_encoder, vision_width = create_vit(vit, image_size, vit_grad_ckpt, vit_ckpt_layer)
            self.tokenizer = init_tokenizer()
            med_config = BertConfig.from_json_file(med_config)
            med_config.encoder_width = vision_width
            self.text_decoder = BertLMHeadModel(config=med_config)

            self.prompt = prompt
            self.prompt_length = len(self.tokenizer(self.prompt).input_ids) - 1

        def forward(self, image, caption):
            image_embeds = self.visual_encoder(image)

            # image_atts has the same shape as image_embeds minus the hidden dimension.
            # image_embeds is the encoded image, a tensor of shape
            # [batch_size, sequence_length, hidden_size].
            # Filling it with ones means every visual position is attended to and none is ignored.
            image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)

            text = self.tokenizer(caption, padding='longest', truncation=True, max_length=40,
                                  return_tensors="pt").to(image.device)

            text.input_ids[:, 0] = self.tokenizer.bos_token_id

            # Positions equal to pad_token_id are replaced with -100
            decoder_targets = text.input_ids.masked_fill(text.input_ids == self.tokenizer.pad_token_id, -100)
            # The first self.prompt_length columns are also set to -100, so the prompt tokens are ignored by the loss
            decoder_targets[:, :self.prompt_length] = -100

            # Core step: pass both input_ids and image_embeds to the text decoder
            # (the image features enter through the cross-attention of every block)
            decoder_output = self.text_decoder(text.input_ids,
                                               attention_mask=text.attention_mask,
                                               encoder_hidden_states=image_embeds,
                                               encoder_attention_mask=image_atts,
                                               labels=decoder_targets,
                                               return_dict=True,
                                               )   # predicted sequence
            loss_lm = decoder_output.loss

            return loss_lm

The __init__ and forward methods:

  1. In __init__, the visual encoder self.visual_encoder and the tokenizer self.tokenizer are created first.
  2. The text decoder self.text_decoder is then built from the configuration passed in.
  3. In forward, the encoder turns the image into the embedding image_embeds.
  4. The caption is tokenized and padded, and the decoder target tensor decoder_targets is built with pad positions and the prompt prefix replaced by -100; see the sketch after this list.
  5. The text decoder is then run forward and the loss loss_lm is computed.
  6. Finally, loss_lm is returned and used to train the model.
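A minimal sketch of step 4; the token ids, pad_token_id, and prompt_length below are made up purely for illustration:

    import torch

    pad_token_id = 0        # illustrative; BLIP uses the BERT tokenizer's actual pad id
    prompt_length = 4       # illustrative length of the tokenized prompt prefix

    # [BOS] + prompt tokens + caption tokens + [SEP] + padding (made-up ids)
    input_ids = torch.tensor([[101, 1037, 3861, 1997, 1037, 3899, 102, 0, 0]])

    decoder_targets = input_ids.masked_fill(input_ids == pad_token_id, -100)
    decoder_targets[:, :prompt_length] = -100
    print(decoder_targets)
    # tensor([[-100, -100, -100, -100, 1037, 3899,  102, -100, -100]])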

The generation part:

    def generate(self, image, sample=False, num_beams=3, max_length=30, min_length=10,
                 top_p=0.9, repetition_penalty=1.0):
        image_embeds = self.visual_encoder(image)

        if not sample:
            # Beam search: each image needs num_beams copies of its encoder states
            image_embeds = image_embeds.repeat_interleave(num_beams, dim=0)

        image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)
        model_kwargs = {"encoder_hidden_states": image_embeds, "encoder_attention_mask": image_atts}

        prompt = [self.prompt] * image.size(0)
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(image.device)
        input_ids[:, 0] = self.tokenizer.bos_token_id
        input_ids = input_ids[:, :-1]   # drop the trailing [SEP] so generation continues after the prompt

        if sample:
            # nucleus sampling
            outputs = self.text_decoder.generate(input_ids=input_ids,
                                                 max_length=max_length,
                                                 min_length=min_length,
                                                 do_sample=True,
                                                 top_p=top_p,
                                                 num_return_sequences=1,
                                                 eos_token_id=self.tokenizer.sep_token_id,
                                                 pad_token_id=self.tokenizer.pad_token_id,
                                                 repetition_penalty=1.1,
                                                 **model_kwargs)
        else:
            # beam search
            outputs = self.text_decoder.generate(input_ids=input_ids,
                                                 max_length=max_length,
                                                 min_length=min_length,
                                                 num_beams=num_beams,
                                                 eos_token_id=self.tokenizer.sep_token_id,
                                                 pad_token_id=self.tokenizer.pad_token_id,
                                                 repetition_penalty=repetition_penalty,
                                                 **model_kwargs)

        captions = []
        for output in outputs:
            caption = self.tokenizer.decode(output, skip_special_tokens=True)
            captions.append(caption[len(self.prompt):])   # strip the prompt prefix
        return captions

The generate method:

This method generates a text caption for an input image. It first encodes the image into the embedding image_embeds. Depending on whether sampling is used, the image embeddings are repeated (repeat_interleave for beam search) and packed, together with the attention mask, into model_kwargs. The prompt is then tokenized into the initial input sequence, and the internal text decoder self.text_decoder is called to generate the remaining tokens. Depending on the sample flag, decoding uses either nucleus sampling or beam search. Finally, the prompt prefix is stripped from each decoded sequence and the resulting captions are returned.
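A hypothetical usage sketch of generate. It assumes the BLIP repo's configs/med_config.json and its create_vit / init_tokenizer helpers are available, and that the image tensor is already resized and normalized; the dummy input and printed output are illustrative only:

    import torch

    model = BLIP_Decoder(image_size=384, vit='base')   # untrained weights; load a checkpoint in practice
    model.eval()

    image = torch.randn(1, 3, 384, 384)                # dummy preprocessed image batch
    with torch.no_grad():
        captions = model.generate(image, sample=False, num_beams=3,
                                  max_length=30, min_length=10)
    print(captions)   # e.g. ['a dog lying on the grass'] (illustrative)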

These two methods implement the forward pass and the generation process respectively, and are used for model training and inference.
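Continuing from the sketch above, a minimal training-step sketch; the batch, optimizer settings, and caption strings are hypothetical (captions include the prompt prefix so the prompt masking in forward lines up):

    import torch

    # hypothetical tiny batch; in practice this comes from an image-caption dataloader
    batches = [(torch.randn(2, 3, 384, 384),
                ['a picture of a dog', 'a picture of a cat'])]

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model.train()
    for image, caption in batches:
        loss_lm = model(image, caption)    # forward() returns the LM loss directly
        optimizer.zero_grad()
        loss_lm.backward()
        optimizer.step()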

Main code of the BERT model (BertLMHeadModel):

    class BertLMHeadModel(BertPreTrainedModel):

        _keys_to_ignore_on_load_unexpected = [r"pooler"]
        _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]

        def __init__(self, config):
            super().__init__(config)

            self.bert = BertModel(config, add_pooling_layer=False)
            self.cls = BertOnlyMLMHead(config)

            self.init_weights()

        def get_output_embeddings(self):
            return self.cls.predictions.decoder

        def set_output_embeddings(self, new_embeddings):
            self.cls.predictions.decoder = new_embeddings

        def forward(
            self,
            input_ids=None,
            attention_mask=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            labels=None,
            past_key_values=None,
            use_cache=None,
            output_attentions=None,
            output_hidden_states=None,
            return_dict=None,
            return_logits=False,
            is_decoder=True,
            reduction='mean',
            mode='multimodal',
        ):
            r"""
            encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
                the model is configured as a decoder.
            encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
                the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.
            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
                Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
                ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are
                ignored (masked); the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``.
            past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
                Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
                If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
                (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
                instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
            use_cache (:obj:`bool`, `optional`):
                If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
                decoding (see :obj:`past_key_values`).
            Returns:
            Example::
                >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig
                >>> import torch
                >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
                >>> config = BertConfig.from_pretrained("bert-base-cased")
                >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
                >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
                >>> outputs = model(**inputs)
                >>> prediction_logits = outputs.logits
            """
            return_dict = return_dict if return_dict is not None else self.config.use_return_dict
            if labels is not None:
                use_cache = False

            outputs = self.bert(
                input_ids,
                attention_mask=attention_mask,
                position_ids=position_ids,
                head_mask=head_mask,
                inputs_embeds=inputs_embeds,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                past_key_values=past_key_values,
                use_cache=use_cache,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
                is_decoder=is_decoder,
                mode=mode,
            )

            sequence_output = outputs[0]
            prediction_scores = self.cls(sequence_output)

            if return_logits:
                return prediction_scores[:, :-1, :].contiguous()

            lm_loss = None
            if labels is not None:
                # we are doing next-token prediction; shift prediction scores and input ids by one
                shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
                labels = labels[:, 1:].contiguous()
                loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=0.1)
                lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
                if reduction == 'none':
                    lm_loss = lm_loss.view(prediction_scores.size(0), -1).sum(1)

            if not return_dict:
                output = (prediction_scores,) + outputs[2:]
                return ((lm_loss,) + output) if lm_loss is not None else output

            return CausalLMOutputWithCrossAttentions(
                loss=lm_loss,
                logits=prediction_scores,
                past_key_values=outputs.past_key_values,
                hidden_states=outputs.hidden_states,
                attentions=outputs.attentions,
                cross_attentions=outputs.cross_attentions,
            )

        def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):
            input_shape = input_ids.shape
            # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
            if attention_mask is None:
                attention_mask = input_ids.new_ones(input_shape)

            # cut decoder_input_ids if past is used
            if past is not None:
                input_ids = input_ids[:, -1:]

            return {
                "input_ids": input_ids,
                "attention_mask": attention_mask,
                "past_key_values": past,
                "encoder_hidden_states": model_kwargs.get("encoder_hidden_states", None),
                "encoder_attention_mask": model_kwargs.get("encoder_attention_mask", None),
                "is_decoder": True,
            }
