CVPR 2024 Vision-Language Collection (Visual Understanding, Visual Grounding, Large Models, Multimodal Models, etc.)
1. Vision-Language
- A Vision Check-up for Language Models
- The Neglected Tails in Vision-Language Models
- Beyond Average: Individualized Visual Scanpath Prediction
- ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models
- Language Models as Black-Box Optimizers for Vision-Language Models
- Distilling Vision-Language Models on Millions of Videos
- SonicVisionLM: Playing Sound with Vision Language Models
- Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model
- Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models
- MMA: Multi-Modal Adapter for Vision-Language Models
- Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment
- Building Vision-Language Models on Solid Foundations with Masked Distillation
- TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model
⭐[code](https://github.com/htyao89/Textual-based_Class-aware_prompt_tuning/ "code")
* [On Scaling Up a Multilingual Vision and Language Model](http://arxiv.org/abs/2305.18565 "On Scaling Up a Multilingual Vision and Language Model")
* [CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914 "CogAgent: A Visual Language Model for GUI Agents")
⭐[code](https://github.com/THUDM/CogVLM "code")
- SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training
- Sequential Modeling Enables Scalable Learning for Large Vision Models
🏠project (large vision models)
- Seeing the Unseen: Visual Common Sense for Semantic Placement
- Efficient Vision-Language Pre-training by Cluster Masking
⭐[code](https://github.com/Zi-hao-Wei/Efficient-Vision-Language-Pre-training-by-Cluster-Masking "code")
🏠[project](https://zxp46.github.io/cluster-masking/ "project")
* [VILA: On Pre-training for Visual Language Models](https://arxiv.org/abs/2312.07533 "VILA: On Pre-training for Visual Language Models")
* [EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models](https://arxiv.org/pdf/2311.15596.pdf "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models")
⭐[code](https://github.com/AdaCheng/EgoThink "code")
🏠project
- SPIN: Simultaneous Perception Interaction and Navigation
- MAFA: Managing False Negatives for Vision-Language Pre-training
- Visual In-Context Prompting
⭐[code](https://github.com/UX-Decoder/DINOv "code")
* [Semantics-aware Motion Retargeting with Vision-Language Models](https://arxiv.org/abs/2312.01964 "Semantics-aware Motion Retargeting with Vision-Language Models")
* [DePT: Decoupled Prompt Tuning](https://arxiv.org/abs/2309.07439 "DePT: Decoupled Prompt Tuning")
⭐[code](https://github.com/Koorye/DePT "code")
- Osprey: Pixel Understanding with Visual Instruction Tuning
⭐[code](https://github.com/CircleRadon/Osprey "code")
* [FairCLIP: Harnessing Fairness in Vision-Language Learning](http://arxiv.org/abs/2403.19949v1 "FairCLIP: Harnessing Fairness in Vision-Language Learning")
🏠[project](https://ophai.hms.harvard.edu/datasets/fairvlmed10k "project")
* [Efficient Test-Time Adaptation of Vision-Language Models](http://arxiv.org/abs/2403.18293v1 "Efficient Test-Time Adaptation of Vision-Language Models")
⭐[code](https://kdiaaa.github.io/tda/ "code")
- BioCLIP: A Vision Foundation Model for the Tree of Life
⭐[code](https://github.com/Imageomics/bioclip "code")
* [InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks](https://arxiv.org/abs/2312.14238 "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks")
⭐[code](https://github.com/OpenGVLab/InternVL "code")
- Anchor-based Robust Finetuning of Vision-Language Models
- Multi-Modal Hallucination Control by Visual Information Grounding
- Do Vision and Language Encoders Represent the World Similarly?
- Dual-View Visual Contextualization for Web Navigation
- Any-Shift Prompting for Generalization over Distributions
- Non-autoregressive Sequence-to-Sequence Vision-Language Models
- One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
⭐[code](https://github.com/TreeLLi/APT "code")
* [SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models](http://arxiv.org/abs/2403.13263v1 "SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models")
⭐[code](https://github.com/ivattyue/SC-Tune "code")
- RegionGPT: Towards Region Understanding Vision Language Model
- Enhancing Vision-Language Pre-training with Rich Supervisions
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
⭐[code](https://github.com/WalBouss/GEM "code")
* [Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples](https://arxiv.org/abs/2312.00825 "Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples")
* [Beyond Text: Frozen Large Language Models in Visual Signal Comprehension](http://arxiv.org/abs/2403.07874v1 "Beyond Text: Frozen Large Language Models in Visual Signal Comprehension")
⭐[code](https://github.com/zh460045050/V2L-Tokenizer "code")
⭐[code](https://github.com/lezhang7/Enhance-FineGrained "code") (vision-language compositional understanding)
* [FFF: Fixing Flawed Foundations in Contrastive Pre-Training Results in Very Strong Vision-Language Models](http://arxiv.org/abs/2405.10286 "FFF: Fixing Flawed Foundations in Contrastive Pre-Training Results in Very Strong Vision-Language Models")
* [Improved Baselines with Visual Instruction Tuning](https://arxiv.org/abs/2310.03744 "Improved Baselines with Visual Instruction Tuning")
🏠[project](https://llava-vl.github.io/ "project")
* [Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment](https://openaccess.thecvf.com/content/CVPR2024/papers/Ishmam_Semantic_Shield_Defending_Vision-Language_Models_Against_Backdooring_and_Poisoning_via_CVPR_2024_paper.pdf "Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment")
* [Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models](http://arxiv.org/abs/2403.17589v1 "Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models")
⭐[code](https://github.com/YBZh/DMN "code")
⭐[code](https://github.com/jusiro/CLAP "code")
* [Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding](https://arxiv.org/abs/2312.00081 "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding")
⭐[code](https://github.com/wjpoom/SPEC "code")
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
- Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
⭐[code](https://github.com/kwonjunn01/Hi-Mapper "code")
* [Iterated Learning Improves Compositionality in Large Vision-Language Models](http://arxiv.org/abs/2404.02145v1 "Iterated Learning Improves Compositionality in Large Vision-Language Models")
* [ViTamin: Designing Scalable Vision Models in the Vision-Language Era](http://arxiv.org/abs/2404.02132v1 "ViTamin: Designing Scalable Vision Models in the Vision-Language Era")
⭐[code](https://github.com/Beckschen/ViTamin "code")
⭐[code](https://github.com/KHU-AGI/PriViLege "code")
* [Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models](https://arxiv.org/abs/2404.02233 "Visual Concept Connectome \(VCC\): Open World Concept Discovery and their Interlayer Connections in Deep Models")
🏠[project](https://yorkucvil.github.io/VCC "project")
* [Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning](https://arxiv.org/abs/2404.03658 "Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning")
🏠[project](https://ruili3.github.io/kyn "project")
* [HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models](https://arxiv.org/abs/2310.14566 "HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models")
⭐[code](https://github.com/tianyi-lab/HallusionBench "code")
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models
- Learning Vision from Models Rivals Learning Vision from Data
⭐[code](https://github.com/google-research/syn-rep-learn "code")
* [Probing the 3D Awareness of Visual Foundation Models](http://arxiv.org/abs/2404.08636v1 "Probing the 3D Awareness of Visual Foundation Models")
⭐[code](https://github.com/mbanani/probe3d "code")
- LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning
🏠project

Visual Understanding

LLM
- PixelLM: Pixel Reasoning with Large Multimodal Model
🏠project
- Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld
- Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
- See Say and Segment: Teaching LMMs to Overcome False Premises
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
- Driving Everywhere with Large Language Model Policy Adaptation
🏠project
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
🏠project
- Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
🏠project
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
⭐[code](https://github.com/penghao-wu/vstar "code")
* [Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding](https://arxiv.org/abs/2311.16922 "Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding")
* [Pixel Aligned Language Models](https://arxiv.org/abs/2312.09237 "Pixel Aligned Language Models")
🏠[project](https://jerryxu.net/PixelLLM/ "project")
* [SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection](http://arxiv.org/abs/2403.03170v1 "SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection")
* [OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation](https://arxiv.org/abs/2311.17911 "OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation")
⭐[code](https://github.com/shikiw/OPERA "code") (multimodal LLMs)
* [Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs](https://openaccess.thecvf.com/content/CVPR2024/papers/Song_Low-Rank_Approximation_for_Sparse_Attention_in_Multi-Modal_LLMs_CVPR_2024_paper.pdf "Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs")
* [LISA: Reasoning Segmentation via Large Language Model](https://arxiv.org/abs/2308.00692 "LISA: Reasoning Segmentation via Large Language Model")
⭐[code](https://github.com/dvlab-research/LISA "code")
* [Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model](https://openaccess.thecvf.com/content/CVPR2024/papers/Liang_Querying_as_Prompt_Parameter-Efficient_Learning_for_Multimodal_Language_Model_CVPR_2024_paper.pdf "Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model")
* [Compositional Chain-of-Thought Prompting for Large Multimodal Models](https://arxiv.org/abs/2311.17076 "Compositional Chain-of-Thought Prompting for Large Multimodal Models")
⭐[code](https://github.com/chancharikmitra/CCoT "code")
* [Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs](https://arxiv.org/abs/2401.06209 "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs")
* [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://arxiv.org/abs/2312.06742 "Honeybee: Locality-enhanced Projector for Multimodal LLM")
⭐[code](https://github.com/kakaobrain/honeybee "code")
* [HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data](https://arxiv.org/abs/2311.13614 "HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data")
⭐[code](https://github.com/Yuqifan1117/HalluciDoctor "code")
* [SEED-Bench: Benchmarking Multimodal Large Language Models](https://arxiv.org/abs/2404.16790 "SEED-Bench: Benchmarking Multimodal Large Language Models")
⭐[code](https://github.com/AILab-CVC/SEED-Bench "code")
* [PerceptionGPT: Effectively Fusing Visual Perception into LLM](https://arxiv.org/abs/2311.06612 "PerceptionGPT: Effectively Fusing Visual Perception into LLM")
* [UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All](http://arxiv.org/abs/2403.12532v1 "UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All")
* [ModaVerse: Efficiently Transforming Modalities with LLMs](https://arxiv.org/abs/2401.06395 "ModaVerse: Efficiently Transforming Modalities with LLMs")
* [VCoder: Versatile Vision Encoders for Multimodal Large Language Models](https://arxiv.org/abs/2312.14233 "VCoder: Versatile Vision Encoders for Multimodal Large Language Models")
⭐[code](https://github.com/SHI-Labs/VCoder "code")
🏠project
* [mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration](https://arxiv.org/abs/2311.04257 "mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration")
* [MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World](https://arxiv.org/abs/2401.08577 "MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World")
🏠project
* [RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback](https://arxiv.org/abs/2312.00849 "RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback")
⭐[code](https://github.com/RLHF-V/RLHF-V "code")
* [DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model](https://arxiv.org/abs/2404.01342 "DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model")
⭐[code](https://github.com/OpenGVLab/DiffAgent "code")
* [Prompt Highlighter: Interactive Control for Multi-Modal LLMs](https://arxiv.org/abs/2312.04302 "Prompt Highlighter: Interactive Control for Multi-Modal LLMs")
⭐[code](https://github.com/dvlab-research/Prompt-Highlighter "code")
🏠[project](https://julianjuaner.github.io/projects/PromptHighlighter/ "project")
* [Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft](https://arxiv.org/abs/2312.09238 "Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft")
🏠[project](https://yangxue0827.github.io/auto_mc-reward.html "project")
* [General Object Foundation Model for Images and Videos at Scale](https://arxiv.org/abs/2312.09158 "General Object Foundation Model for Images and Videos at Scale")
⭐[code](https://github.com/FoundationVision/GLEE "code")
🏠project
👍GLEE: a general-purpose object perception foundation model jointly built by HUST and ByteDance
* [Link-Context Learning for Multimodal LLMs](https://arxiv.org/abs/2308.07891 "Link-Context Learning for Multimodal LLMs")
⭐[code](https://github.com/isekai-portal/Link-Context-Learning "code")
* [Cloud-Device Collaborative Learning for Multimodal Large Language Models](https://arxiv.org/abs/2312.16279 "Cloud-Device Collaborative Learning for Multimodal Large Language Models")
* [LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model](https://arxiv.org/abs/2406.04659 "LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model")
⭐[code](https://github.com/kennethwdk/LocLLM "code")
👍[Highlights: CVPR 2024 fine-grained visual perception MLLMs Pink and LocLLM](https://idm.pku.edu.cn/info/1012/1839.htm "Highlights: CVPR 2024 fine-grained visual perception MLLMs Pink and LocLLM")
* Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
⭐[code](https://github.com/SY-Xuan/Pink "code")
👍[Highlights: CVPR 2024 fine-grained visual perception MLLMs Pink and LocLLM](https://idm.pku.edu.cn/info/1012/1839.htm "Highlights: CVPR 2024 fine-grained visual perception MLLMs Pink and LocLLM")
* [LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge](https://arxiv.org/abs/2311.11860 "LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge")
⭐[code](https://github.com/rshaojimmy/JiuTian "code")
🏠project
* [GSVA: Generalized Segmentation via Multimodal Large Language Models](https://arxiv.org/abs/2312.10103 "GSVA: Generalized Segmentation via Multimodal Large Language Models")

VLN (Vision-and-Language Navigation)
⭐[code](https://github.com/MrZihan/HNR-VLN "code")
👍[VILP](https://vipl.ict.ac.cn/news/research/202403/t20240315_207758.html "VILP")
* [Volumetric Environment Representation for Vision-Language Navigation](http://arxiv.org/abs/2403.14158v1 "Volumetric Environment Representation for Vision-Language Navigation")
* [OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation](http://arxiv.org/abs/2403.17334v1 "OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation")
* [Vision-and-Language Navigation via Causal Learning](https://arxiv.org/abs/2404.10241 "Vision-and-Language Navigation via Causal Learning")
⭐[code](https://github.com/CrystalSixone/VLN-GOAT "code")

Video-Language
- VidLA: Video-Language Alignment at Scale
- SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling
- VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens
- VideoLLM-online: Online Video Large Language Model for Streaming Video
🏠project

Visual Grounding
- Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
- MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
- Improved Visual Grounding through Self-Consistent Explanations
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
🏠project
- Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency
- Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding
- Investigating Compositional Challenges in Vision-Language Models for Visual Grounding
- [Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language](https://openaccess.thecvf.com/content/CVPR2024/papers/Hamilton_Separating_the_Chirp_from_the_Chat_Self-supervised_Visual_Grounding_of_CVPR_2024_paper.pdf 'Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language')

Multimodal Models
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
⭐[code](https://github.com/Yuliang-Liu/Monkey "code")
* [What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models](https://arxiv.org/abs/2310.06627 "What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models")
🏠[project](https://bzhao.me/C-VQA/ "project")
* [Multi-modal Learning for Geospatial Vegetation Forecasting](http://arxiv.org/abs/2303.16198 "Multi-modal Learning for Geospatial Vegetation Forecasting")
* [Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception](https://arxiv.org/abs/2403.02969 "Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception")
* [MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception](https://arxiv.org/abs/2312.07472 "MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception")
* [TRINS: Towards Multimodal Language Models that Can Read](https://arxiv.org/abs/2406.06730 "TRINS: Towards Multimodal Language Models that Can Read")
* [Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations](http://arxiv.org/abs/2403.07241v1 "Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations")

Vision Foundation Models
* [Three Pillars improving Vision Foundation Model Distillation for Lidar](https://arxiv.org/abs/2310.17504 "Three Pillars improving Vision Foundation Model Distillation for Lidar")
⭐[code](https://github.com/valeoai/ScaLR "code")
* [Bridging Remote Sensors with Multisensor Geospatial Foundation Models](http://arxiv.org/abs/2404.01260 "Bridging Remote Sensors with Multisensor Geospatial Foundation Models")
* [Low-Resource Vision Challenges for Foundation Models](http://arxiv.org/abs/2401.04716 "Low-Resource Vision Challenges for Foundation Models")
⭐[code](https://github.com/hou-yz/MVSelect "code")

Visual Localization
* [Learning to Produce Semi-dense Correspondences for Visual Localization](https://arxiv.org/abs/2402.08359 "Learning to Produce Semi-dense Correspondences for Visual Localization")
⭐[code](https://github.com/TruongKhang/DeViLoc "code")
* [PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs](https://arxiv.org/abs/2402.08657 "PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs")
* [Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds](https://openaccess.thecvf.com/content/CVPR2024/papers/Moon_Efficient_Privacy-Preserving_Visual_Localization_Using_3D_Ray_Clouds_CVPR_2024_paper.pdf "Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds")
