Multimodal text-image models: the ITM loss (BLIP)


Main code:


    # Excerpt from the model's forward pass; assumes `import torch` and
    # `import torch.nn.functional as F` at module level.

    # Forward the positive (matched) image-text pairs through the fusion layers.
    output_pos = self.text_encoder.bert(encoder_embeds=text_embeds,
                                        attention_mask=text.attention_mask,
                                        encoder_hidden_states=image_embeds,
                                        encoder_attention_mask=image_atts,
                                        return_dict=True,
                                        mode='fusion',
                                        )

    with torch.no_grad():
        bs = image.size(0)  # batch size

        # Turn the in-batch similarities into sampling weights (softmax over dim 1).
        weights_i2t = F.softmax(sim_i2t[:, :bs], dim=1)
        weights_t2i = F.softmax(sim_t2i[:, :bs], dim=1)

        # Zero the diagonal so a sample is never drawn as its own negative.
        weights_i2t.fill_diagonal_(0)
        weights_t2i.fill_diagonal_(0)

    # Select a negative image for each text; more similar images are more likely.
    image_embeds_neg = []
    for b in range(bs):
        neg_idx = torch.multinomial(weights_t2i[b], 1).item()  # index of the sampled negative image
        image_embeds_neg.append(image_embeds[neg_idx])
    image_embeds_neg = torch.stack(image_embeds_neg, dim=0)

    # Select a negative text for each image; more similar texts are more likely.
    text_embeds_neg = []
    text_atts_neg = []
    for b in range(bs):
        neg_idx = torch.multinomial(weights_i2t[b], 1).item()  # index of the sampled negative text
        text_embeds_neg.append(text_embeds[neg_idx])
        text_atts_neg.append(text.attention_mask[neg_idx])  # keep the matching attention mask
    text_embeds_neg = torch.stack(text_embeds_neg, dim=0)
    text_atts_neg = torch.stack(text_atts_neg, dim=0)

    # Build 2 * bs negative pairs:
    #   rows [0, bs):      original text b    + negative image for text b
    #   rows [bs, 2 * bs): negative text for image b + original image b
    text_embeds_all = torch.cat([text_embeds, text_embeds_neg], dim=0)
    text_atts_all = torch.cat([text.attention_mask, text_atts_neg], dim=0)

    image_embeds_all = torch.cat([image_embeds_neg, image_embeds], dim=0)
    image_atts_all = torch.cat([image_atts, image_atts], dim=0)

    # Forward the negative (mismatched) pairs through the same fusion layers.
    output_neg = self.text_encoder.bert(encoder_embeds=text_embeds_all,
                                        attention_mask=text_atts_all,
                                        encoder_hidden_states=image_embeds_all,
                                        encoder_attention_mask=image_atts_all,
                                        return_dict=True,
                                        mode='fusion',
                                        )

    # Take the [CLS] position of every pair: bs positives followed by 2 * bs negatives.
    vl_embeddings = torch.cat([output_pos.last_hidden_state[:, 0, :],
                               output_neg.last_hidden_state[:, 0, :]], dim=0)
    vl_output = self.itm_head(vl_embeddings)  # ITM (image-text matching) head -> logits

    # ITM labels: 1 for the bs matched pairs, 0 for the 2 * bs mismatched pairs,
    # moved to the same device as the inputs.
    itm_labels = torch.cat([torch.ones(bs, dtype=torch.long),
                            torch.zeros(2 * bs, dtype=torch.long)],
                           dim=0).to(image.device)
    loss_itm = F.cross_entropy(vl_output, itm_labels)  # ITM loss
![](https://ad.itadn.com/c/weblog/blog-img/images/2025-08-18/ZJnRm4sUOuKyEqDxle25WtI3LVaN.png)

Reference: [Multimodal text-image models: the ITM loss (blog)]( "Multimodal text-image models: the ITM loss (blog)")
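
The negative-sampling step in the excerpt above (softmax, zeroed diagonal, one multinomial draw per row) can be sketched without PyTorch. This is a minimal illustration, not the model's own code: `softmax`, `sample_hard_negatives`, and the 3x3 matrix `sim_t2i` are hypothetical names and values, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sample_hard_negatives(sim, rng):
    """For each row b of a square similarity matrix, draw one column index != b.

    Mirrors softmax -> zero the diagonal -> one weighted draw per row.
    """
    negatives = []
    for b, row in enumerate(sim):
        weights = softmax(row)
        weights[b] = 0.0  # the matched partner can never be its own negative
        neg_idx = rng.choices(range(len(row)), weights=weights, k=1)[0]
        negatives.append(neg_idx)
    return negatives


# Hypothetical similarities: rows are texts, columns are images.
sim_t2i = [
    [5.0, 4.0, -3.0],
    [0.5, 6.0, 0.2],
    [-2.0, 3.5, 4.0],
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
neg_indices = sample_hard_negatives(sim_t2i, rng)
print(neg_indices)
```

Because the weights come from the similarity scores, the images that look most like a text's true partner are drawn most often, which is what makes these negatives "hard".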

Code that computes the loss:


    loss_itm = F.cross_entropy(vl_output, itm_labels)

vl_output holds the classification scores (logits) produced by the model, and itm_labels holds the ground-truth label of each sample.

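vl_output: the scores produced by the ITM (image-text matching) head, self.itm_head. The head is a fully connected layer that maps the fused feature of each image-text pair (the first position of last_hidden_state) to two scores, one for the positive (matched) class and one for the negative (mismatched) class.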
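
As a rough sketch of what such a head computes for one pair, the snippet below applies y = Wx + b with one weight row per class. The function `linear_head`, the 4-dimensional feature vector, and all weight values are made up for illustration; the real head operates on the model's hidden size and learned parameters.

```python
def linear_head(features, weight, bias):
    """Apply y = W x + b to one feature vector; W has one row per class."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]


# Hypothetical fused feature of one image-text pair (4 dims for readability).
cls_feature = [1.0, -2.0, 0.5, 3.0]

# Hypothetical parameters: row 0 scores "mismatched", row 1 scores "matched".
weight = [
    [0.5, 0.0, -1.0, 0.5],
    [-0.5, 1.0, 0.0, 0.25],
]
bias = [0.0, 1.0]

logits = linear_head(cls_feature, weight, bias)
predicted_class = max(range(len(logits)), key=lambda k: logits[k])
print(logits, predicted_class)
```

The index of the larger score is the predicted class, with 1 meaning "matched" and 0 meaning "mismatched", the same convention as the labels.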

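itm_labels: the ground-truth label of each sample. torch.ones(bs, dtype=torch.long) labels the bs positive pairs as 1, and torch.zeros(2 * bs, dtype=torch.long) labels the 2 * bs negative pairs as 0. torch.cat then joins them into a single label tensor of length 3 * bs, in the same order as the rows of vl_embeddings: positives first, negatives after.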
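
The ordering matters, because each label must line up with the pair in the same row. The bookkeeping can be sketched in plain Python as follows; `build_itm_pairs` and the sampled indices passed to it are hypothetical, chosen only to show which text is paired with which image in each of the three blocks.

```python
def build_itm_pairs(bs, neg_image_for_text, neg_text_for_image):
    """Return (text_idx, image_idx, label) triples in the row order of the main code."""
    # Block 1: matched pairs, label 1.
    pairs = [(b, b, 1) for b in range(bs)]
    # Block 2: each text with the negative image sampled for it, label 0.
    pairs += [(b, neg_image_for_text[b], 0) for b in range(bs)]
    # Block 3: each image with the negative text sampled for it, label 0.
    pairs += [(neg_text_for_image[b], b, 0) for b in range(bs)]
    return pairs


# Hypothetical sampled negatives for a batch of 3.
pairs = build_itm_pairs(3, neg_image_for_text=[1, 0, 1], neg_text_for_image=[2, 2, 0])
itm_labels = [label for _, _, label in pairs]

for text_idx, image_idx, label in pairs:
    print(f"text {text_idx} + image {image_idx} -> label {label}")
```

Blocks 2 and 3 correspond to concatenating the original texts with the negative images, and the negative texts with the original images, so every text and every image appears in exactly one negative pair as the "anchor".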

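loss_itm: the cross-entropy loss between the model output and the ground-truth labels, computed with F.cross_entropy. It measures how far the predictions are from the labels and guides the parameter updates so that the model gets better at telling positive pairs from negative ones.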
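
To make the loss concrete, here is a plain-Python sketch of mean cross-entropy over two-class logits: for each row, the log-sum-exp of the scores minus the score of the true class, averaged over all rows. The function name and the example logits are assumptions for illustration, with bs = 1 so that there is one positive pair and two negative pairs.

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy over rows of raw scores (no softmax applied beforehand)."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[y]  # equals -log softmax(row)[y]
    return total / len(labels)


# Hypothetical logits for bs = 1; column 0 = mismatched, column 1 = matched.
vl_output = [
    [0.0, 2.0],   # positive pair, confidently scored as matched
    [1.5, -0.5],  # negative pair, confidently scored as mismatched
    [0.0, 0.0],   # negative pair, model is undecided
]
itm_labels = [1, 0, 0]

loss_itm = cross_entropy(vl_output, itm_labels)
print(round(loss_itm, 4))
```

The undecided third row contributes log 2 to the sum, far more than the two confident rows, so it dominates the gradient; this is the sense in which the loss pushes the model to separate matched from mismatched pairs.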
