
LM Loss in Multimodal Text-Image Models (BLIP)


See the full code on GitHub.

A related question, discussed in detail on Zhihu: when fine-tuning a generative model, how is the loss function computed, and is it similar to how a Transformer is pre-trained?

In essence it is the same token-level cross-entropy as BERT's MLM loss, except that here it is computed as a left-to-right (next-token) language modeling loss.
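For intuition, here is a minimal, self-contained sketch of that shifted next-token cross-entropy, the same computation `BertLMHeadModel.forward` performs further below. The shapes and values are dummies, not taken from the BLIP code:

    import torch
    from torch.nn import CrossEntropyLoss

    # Minimal sketch of the shifted next-token cross-entropy (dummy shapes/values).
    vocab_size = 30524
    prediction_scores = torch.randn(2, 10, vocab_size)   # [batch, seq_len, vocab] logits from the LM head
    labels = torch.randint(0, vocab_size, (2, 10))        # target token ids
    labels[:, :3] = -100                                  # -100 marks positions ignored by the loss (prompt/pad)

    # Position t predicts token t+1, so drop the last logit and the first label
    shifted_scores = prediction_scores[:, :-1, :].contiguous()
    shifted_labels = labels[:, 1:].contiguous()

    loss_fct = CrossEntropyLoss(label_smoothing=0.1)      # ignore_index defaults to -100
    loss = loss_fct(shifted_scores.view(-1, vocab_size), shifted_labels.view(-1))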

Looking back at BLIP_Decoder, the module that decodes image features and generates a text caption.

Main code (part 1):

    class BLIP_Decoder(nn.Module):
        def __init__(self,
                     med_config = 'configs/med_config.json',
                     image_size = 384,
                     vit = 'base',
                     vit_grad_ckpt = False,
                     vit_ckpt_layer = 0,
                     prompt = 'a picture of ',
                     ):
            """
            Args:
                med_config (str): path for the mixture of encoder-decoder model's configuration file
                image_size (int): input image size
                vit (str): model size of vision transformer
            """
            super().__init__()

            self.visual_encoder, vision_width = create_vit(vit, image_size, vit_grad_ckpt, vit_ckpt_layer)
            self.tokenizer = init_tokenizer()
            med_config = BertConfig.from_json_file(med_config)
            med_config.encoder_width = vision_width
            self.text_decoder = BertLMHeadModel(config=med_config)

            self.prompt = prompt
            self.prompt_length = len(self.tokenizer(self.prompt).input_ids) - 1

        def forward(self, image, caption):
            image_embeds = self.visual_encoder(image)

            # image_atts has the same shape as image_embeds minus the hidden dimension.
            # image_embeds is the encoded image, a tensor of shape
            # [batch_size, sequence_length, hidden_size].
            # Filling it with ones means every visual position is attended to and none is ignored.
            image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)

            text = self.tokenizer(caption, padding='longest', truncation=True, max_length=40,
                                  return_tensors="pt").to(image.device)

            text.input_ids[:, 0] = self.tokenizer.bos_token_id

            # Positions equal to pad_token_id are replaced with -100
            decoder_targets = text.input_ids.masked_fill(text.input_ids == self.tokenizer.pad_token_id, -100)
            # The first self.prompt_length columns are also set to -100, so the prompt tokens are ignored by the loss
            decoder_targets[:, :self.prompt_length] = -100

            # Core step: pass both input_ids and image_embeds to the text decoder
            # (the image features enter through the cross-attention of every block)
            decoder_output = self.text_decoder(text.input_ids,
                                               attention_mask=text.attention_mask,
                                               encoder_hidden_states=image_embeds,
                                               encoder_attention_mask=image_atts,
                                               labels=decoder_targets,
                                               return_dict=True,
                                               )   # predicted sequence
            loss_lm = decoder_output.loss

            return loss_lm

The __init__ and forward methods:

  1. In __init__, the visual encoder self.visual_encoder and the tokenizer self.tokenizer are created first.
  2. The text decoder self.text_decoder is then built from the configuration passed in.
  3. In forward, the encoder turns the image into the embedding image_embeds.
  4. The caption is tokenized and padded, and the decoder target tensor decoder_targets is built with pad positions and the prompt prefix replaced by -100; see the sketch after this list.
  5. The text decoder is then run forward and the loss loss_lm is computed.
  6. Finally, loss_lm is returned and used to train the model.
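A minimal sketch of step 4; the token ids, pad_token_id, and prompt_length below are made up purely for illustration:

    import torch

    pad_token_id = 0        # illustrative; BLIP uses the BERT tokenizer's actual pad id
    prompt_length = 4       # illustrative length of the tokenized prompt prefix

    # [BOS] + prompt tokens + caption tokens + [SEP] + padding (made-up ids)
    input_ids = torch.tensor([[101, 1037, 3861, 1997, 1037, 3899, 102, 0, 0]])

    decoder_targets = input_ids.masked_fill(input_ids == pad_token_id, -100)
    decoder_targets[:, :prompt_length] = -100
    print(decoder_targets)
    # tensor([[-100, -100, -100, -100, 1037, 3899,  102, -100, -100]])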

The generation part:

    def generate(self, image, sample=False, num_beams=3, max_length=30, min_length=10,
                 top_p=0.9, repetition_penalty=1.0):
        image_embeds = self.visual_encoder(image)

        if not sample:
            # Beam search: each image needs num_beams copies of its encoder states
            image_embeds = image_embeds.repeat_interleave(num_beams, dim=0)

        image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)
        model_kwargs = {"encoder_hidden_states": image_embeds, "encoder_attention_mask": image_atts}

        prompt = [self.prompt] * image.size(0)
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(image.device)
        input_ids[:, 0] = self.tokenizer.bos_token_id
        input_ids = input_ids[:, :-1]   # drop the trailing [SEP] so generation continues after the prompt

        if sample:
            # nucleus sampling
            outputs = self.text_decoder.generate(input_ids=input_ids,
                                                 max_length=max_length,
                                                 min_length=min_length,
                                                 do_sample=True,
                                                 top_p=top_p,
                                                 num_return_sequences=1,
                                                 eos_token_id=self.tokenizer.sep_token_id,
                                                 pad_token_id=self.tokenizer.pad_token_id,
                                                 repetition_penalty=1.1,
                                                 **model_kwargs)
        else:
            # beam search
            outputs = self.text_decoder.generate(input_ids=input_ids,
                                                 max_length=max_length,
                                                 min_length=min_length,
                                                 num_beams=num_beams,
                                                 eos_token_id=self.tokenizer.sep_token_id,
                                                 pad_token_id=self.tokenizer.pad_token_id,
                                                 repetition_penalty=repetition_penalty,
                                                 **model_kwargs)

        captions = []
        for output in outputs:
            caption = self.tokenizer.decode(output, skip_special_tokens=True)
            captions.append(caption[len(self.prompt):])   # strip the prompt prefix
        return captions

The generate method:

This method generates a text caption for an input image. It first encodes the image into the embedding image_embeds. Depending on whether sampling is used, the image embeddings are repeated (repeat_interleave for beam search) and packed, together with the attention mask, into model_kwargs. The prompt is then tokenized into the initial input sequence, and the internal text decoder self.text_decoder is called to generate the remaining tokens. Depending on the sample flag, decoding uses either nucleus sampling or beam search. Finally, the prompt prefix is stripped from each decoded sequence and the resulting captions are returned.
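A hypothetical usage sketch of generate. It assumes the BLIP repo's configs/med_config.json and its create_vit / init_tokenizer helpers are available, and that the image tensor is already resized and normalized; the dummy input and printed output are illustrative only:

    import torch

    model = BLIP_Decoder(image_size=384, vit='base')   # untrained weights; load a checkpoint in practice
    model.eval()

    image = torch.randn(1, 3, 384, 384)                # dummy preprocessed image batch
    with torch.no_grad():
        captions = model.generate(image, sample=False, num_beams=3,
                                  max_length=30, min_length=10)
    print(captions)   # e.g. ['a dog lying on the grass'] (illustrative)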

These two methods implement the forward pass and the generation process respectively, and are used for model training and inference.
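Continuing from the sketch above, a minimal training-step sketch; the batch, optimizer settings, and caption strings are hypothetical (captions include the prompt prefix so the prompt masking in forward lines up):

    import torch

    # hypothetical tiny batch; in practice this comes from an image-caption dataloader
    batches = [(torch.randn(2, 3, 384, 384),
                ['a picture of a dog', 'a picture of a cat'])]

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model.train()
    for image, caption in batches:
        loss_lm = model(image, caption)    # forward() returns the LM loss directly
        optimizer.zero_grad()
        loss_lm.backward()
        optimizer.step()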

Main code of the BERT model (BertLMHeadModel):

    class BertLMHeadModel(BertPreTrainedModel):

        _keys_to_ignore_on_load_unexpected = [r"pooler"]
        _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]

        def __init__(self, config):
            super().__init__(config)

            self.bert = BertModel(config, add_pooling_layer=False)
            self.cls = BertOnlyMLMHead(config)

            self.init_weights()

        def get_output_embeddings(self):
            return self.cls.predictions.decoder

        def set_output_embeddings(self, new_embeddings):
            self.cls.predictions.decoder = new_embeddings

        def forward(
            self,
            input_ids=None,
            attention_mask=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            labels=None,
            past_key_values=None,
            use_cache=None,
            output_attentions=None,
            output_hidden_states=None,
            return_dict=None,
            return_logits=False,
            is_decoder=True,
            reduction='mean',
            mode='multimodal',
        ):
            r"""
            encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
                the model is configured as a decoder.
            encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
                the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.
            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
                Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
                ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are
                ignored (masked); the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``.
            past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
                Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
                If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
                (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
                instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
            use_cache (:obj:`bool`, `optional`):
                If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
                decoding (see :obj:`past_key_values`).
            Returns:
            Example::
                >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig
                >>> import torch
                >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
                >>> config = BertConfig.from_pretrained("bert-base-cased")
                >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
                >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
                >>> outputs = model(**inputs)
                >>> prediction_logits = outputs.logits
            """
            return_dict = return_dict if return_dict is not None else self.config.use_return_dict
            if labels is not None:
                use_cache = False

            outputs = self.bert(
                input_ids,
                attention_mask=attention_mask,
                position_ids=position_ids,
                head_mask=head_mask,
                inputs_embeds=inputs_embeds,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                past_key_values=past_key_values,
                use_cache=use_cache,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
                is_decoder=is_decoder,
                mode=mode,
            )

            sequence_output = outputs[0]
            prediction_scores = self.cls(sequence_output)

            if return_logits:
                return prediction_scores[:, :-1, :].contiguous()

            lm_loss = None
            if labels is not None:
                # we are doing next-token prediction; shift prediction scores and input ids by one
                shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
                labels = labels[:, 1:].contiguous()
                loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=0.1)
                lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
                if reduction == 'none':
                    lm_loss = lm_loss.view(prediction_scores.size(0), -1).sum(1)

            if not return_dict:
                output = (prediction_scores,) + outputs[2:]
                return ((lm_loss,) + output) if lm_loss is not None else output

            return CausalLMOutputWithCrossAttentions(
                loss=lm_loss,
                logits=prediction_scores,
                past_key_values=outputs.past_key_values,
                hidden_states=outputs.hidden_states,
                attentions=outputs.attentions,
                cross_attentions=outputs.cross_attentions,
            )

        def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):
            input_shape = input_ids.shape
            # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
            if attention_mask is None:
                attention_mask = input_ids.new_ones(input_shape)

            # cut decoder_input_ids if past is used
            if past is not None:
                input_ids = input_ids[:, -1:]

            return {
                "input_ids": input_ids,
                "attention_mask": attention_mask,
                "past_key_values": past,
                "encoder_hidden_states": model_kwargs.get("encoder_hidden_states", None),
                "encoder_attention_mask": model_kwargs.get("encoder_attention_mask", None),
                "is_decoder": True,
            }
