CVPR 2024 Vision-Language Collection (Visual Understanding, Visual Grounding, Large Models, Multimodal Models, etc.)
1. Vision-Language
- A Vision Check-up for Language Models
- The Neglected Tails in Vision-Language Models
- Beyond Average: Individualized Visual Scanpath Prediction
- ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models
- Language Models as Black-Box Optimizers for Vision-Language Models
- Distilling Vision-Language Models on Millions of Videos
- SonicVisionLM: Playing Sound with Vision Language Models
- Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model
- Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models
- MMA: Multi-Modal Adapter for Vision-Language Models
- Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment
- Building Vision-Language Models on Solid Foundations with Masked Distillation
- TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model
⭐[code](https://github.com/htyao89/Textual-based_Class-aware_prompt_tuning/ "code")
* [On Scaling Up a Multilingual Vision and Language Model](http://arxiv.org/abs/2305.18565 "On Scaling Up a Multilingual Vision and Language Model")
* [CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914 "CogAgent: A Visual Language Model for GUI Agents")
⭐[code](https://github.com/THUDM/CogVLM "code")
- SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training
- Sequential Modeling Enables Scalable Learning for Large Vision Models
🏠project (large vision models)
- Seeing the Unseen: Visual Common Sense for Semantic Placement
- Efficient Vision-Language Pre-training by Cluster Masking
⭐[code](https://github.com/Zi-hao-Wei/Efficient-Vision-Language-Pre-training-by-Cluster-Masking "code")
🏠[project](https://zxp46.github.io/cluster-masking/ "project")
* [VILA: On Pre-training for Visual Language Models](https://arxiv.org/abs/2312.07533 "VILA: On Pre-training for Visual Language Models")
* [EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models](https://arxiv.org/pdf/2311.15596.pdf "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models")
⭐[code](https://github.com/AdaCheng/EgoThink "code")
🏠project
- SPIN: Simultaneous Perception Interaction and Navigation
- MAFA: Managing False Negatives for Vision-Language Pre-training
- Visual In-Context Prompting
⭐[code](https://github.com/UX-Decoder/DINOv "code")
* [Semantics-aware Motion Retargeting with Vision-Language Models](https://arxiv.org/abs/2312.01964 "Semantics-aware Motion Retargeting with Vision-Language Models")
* [DePT: Decoupled Prompt Tuning](https://arxiv.org/abs/2309.07439 "DePT: Decoupled Prompt Tuning")
⭐[code](https://github.com/Koorye/DePT "code")
- Osprey: Pixel Understanding with Visual Instruction Tuning
⭐[code](https://github.com/CircleRadon/Osprey "code")
* [FairCLIP: Harnessing Fairness in Vision-Language Learning](http://arxiv.org/abs/2403.19949v1 "FairCLIP: Harnessing Fairness in Vision-Language Learning")
🏠[project](https://ophai.hms.harvard.edu/datasets/fairvlmed10k "project")
* [Efficient Test-Time Adaptation of Vision-Language Models](http://arxiv.org/abs/2403.18293v1 "Efficient Test-Time Adaptation of Vision-Language Models")
⭐[code](https://kdiaaa.github.io/tda/ "code")
- BioCLIP: A Vision Foundation Model for the Tree of Life
⭐[code](https://github.com/Imageomics/bioclip "code")
* [InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks](https://arxiv.org/abs/2312.14238 "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks")
⭐[code](https://github.com/OpenGVLab/InternVL "code")
- Anchor-based Robust Finetuning of Vision-Language Models
- Multi-Modal Hallucination Control by Visual Information Grounding
- Do Vision and Language Encoders Represent the World Similarly?
- Dual-View Visual Contextualization for Web Navigation
- Any-Shift Prompting for Generalization over Distributions
- Non-autoregressive Sequence-to-Sequence Vision-Language Models
- One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
⭐[code](https://github.com/TreeLLi/APT "code")
* [SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models](http://arxiv.org/abs/2403.13263v1 "SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models")
⭐[code](https://github.com/ivattyue/SC-Tune "code")
- RegionGPT: Towards Region Understanding Vision Language Model
- Enhancing Vision-Language Pre-training with Rich Supervisions
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
⭐[code](https://github.com/WalBouss/GEM "code")
* [Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples](https://arxiv.org/abs/2312.00825 "Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples")
* [Beyond Text: Frozen Large Language Models in Visual Signal Comprehension](http://arxiv.org/abs/2403.07874v1 "Beyond Text: Frozen Large Language Models in Visual Signal Comprehension")
⭐[code](https://github.com/zh460045050/V2L-Tokenizer "code")
⭐[code](https://github.com/lezhang7/Enhance-FineGrained "code") (vision-language compositional understanding)
* [FFF: Fixing Flawed Foundations in Contrastive Pre-Training Results in Very Strong Vision-Language Models](http://arxiv.org/abs/2405.10286 "FFF: Fixing Flawed Foundations in Contrastive Pre-Training Results in Very Strong Vision-Language Models")
* [Improved Baselines with Visual Instruction Tuning](https://arxiv.org/abs/2310.03744 "Improved Baselines with Visual Instruction Tuning")
🏠[project](https://llava-vl.github.io/ "project")
* [Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment](https://openaccess.thecvf.com/content/CVPR2024/papers/Ishmam_Semantic_Shield_Defending_Vision-Language_Models_Against_Backdooring_and_Poisoning_via_CVPR_2024_paper.pdf "Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment")
* [Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models](http://arxiv.org/abs/2403.17589v1 "Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models")
⭐[code](https://github.com/YBZh/DMN "code")
⭐[code](https://github.com/jusiro/CLAP "code")
* [Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding](https://arxiv.org/abs/2312.00081 "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding")
⭐[code](https://github.com/wjpoom/SPEC "code")
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
- Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
⭐[code](https://github.com/kwonjunn01/Hi-Mapper "code")
* [Iterated Learning Improves Compositionality in Large Vision-Language Models](http://arxiv.org/abs/2404.02145v1 "Iterated Learning Improves Compositionality in Large Vision-Language Models")
* [ViTamin: Designing Scalable Vision Models in the Vision-Language Era](http://arxiv.org/abs/2404.02132v1 "ViTamin: Designing Scalable Vision Models in the Vision-Language Era")
⭐[code](https://github.com/Beckschen/ViTamin "code")
⭐[code](https://github.com/KHU-AGI/PriViLege "code")
* [Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models](https://arxiv.org/abs/2404.02233 "Visual Concept Connectome \(VCC\): Open World Concept Discovery and their Interlayer Connections in Deep Models")
🏠[project](https://yorkucvil.github.io/VCC "project")
* [Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning](https://arxiv.org/abs/2404.03658 "Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning")
🏠[project](https://ruili3.github.io/kyn "project")
* [HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models](https://arxiv.org/abs/2310.14566 "HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models")
⭐[code](https://github.com/tianyi-lab/HallusionBench "code")
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models
- Learning Vision from Models Rivals Learning Vision from Data
⭐[code](https://github.com/google-research/syn-rep-learn "code")
* [Probing the 3D Awareness of Visual Foundation Models](http://arxiv.org/abs/2404.08636v1 "Probing the 3D Awareness of Visual Foundation Models")
⭐[code](https://github.com/mbanani/probe3d "code")
- LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning
🏠project

Visual Understanding

LLM
- PixelLM: Pixel Reasoning with Large Multimodal Model
🏠project
- Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld
- Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
- See Say and Segment: Teaching LMMs to Overcome False Premises
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
- Driving Everywhere with Large Language Model Policy Adaptation
🏠project
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
🏠project
- Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
🏠project
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
⭐[code](https://github.com/penghao-wu/vstar "code")
* [Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding](https://arxiv.org/abs/2311.16922 "Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding")
* [Pixel Aligned Language Models](https://arxiv.org/abs/2312.09237 "Pixel Aligned Language Models")
🏠[project](https://jerryxu.net/PixelLLM/ "project")
* [SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection](http://arxiv.org/abs/2403.03170v1 "SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection")
* [OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation](https://arxiv.org/abs/2311.17911 "OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation")
⭐[code](https://github.com/shikiw/OPERA "code") (multimodal LLMs)
* [Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs](https://openaccess.thecvf.com/content/CVPR2024/papers/Song_Low-Rank_Approximation_for_Sparse_Attention_in_Multi-Modal_LLMs_CVPR_2024_paper.pdf "Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs")
* [LISA: Reasoning Segmentation via Large Language Model](https://arxiv.org/abs/2308.00692 "LISA: Reasoning Segmentation via Large Language Model")
⭐[code](https://github.com/dvlab-research/LISA "code")
* [Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model](https://openaccess.thecvf.com/content/CVPR2024/papers/Liang_Querying_as_Prompt_Parameter-Efficient_Learning_for_Multimodal_Language_Model_CVPR_2024_paper.pdf "Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model")
* [Compositional Chain-of-Thought Prompting for Large Multimodal Models](https://arxiv.org/abs/2311.17076 "Compositional Chain-of-Thought Prompting for Large Multimodal Models")
⭐[code](https://github.com/chancharikmitra/CCoT "code")
* [Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs](https://arxiv.org/abs/2401.06209 "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs")
* [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://arxiv.org/abs/2312.06742 "Honeybee: Locality-enhanced Projector for Multimodal LLM")
⭐[code](https://github.com/kakaobrain/honeybee "code")
* [HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data](https://arxiv.org/abs/2311.13614 "HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data")
⭐[code](https://github.com/Yuqifan1117/HalluciDoctor "code")
* [SEED-Bench: Benchmarking Multimodal Large Language Models](https://arxiv.org/abs/2404.16790 "SEED-Bench: Benchmarking Multimodal Large Language Models")
⭐[code](https://github.com/AILab-CVC/SEED-Bench "code")
* [PerceptionGPT: Effectively Fusing Visual Perception into LLM](https://arxiv.org/abs/2311.06612 "PerceptionGPT: Effectively Fusing Visual Perception into LLM")
* [UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All](http://arxiv.org/abs/2403.12532v1 "UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All")
* [ModaVerse: Efficiently Transforming Modalities with LLMs](https://arxiv.org/abs/2401.06395 "ModaVerse: Efficiently Transforming Modalities with LLMs")
* [VCoder: Versatile Vision Encoders for Multimodal Large Language Models](https://arxiv.org/abs/2312.14233 "VCoder: Versatile Vision Encoders for Multimodal Large Language Models")
⭐[code](https://github.com/SHI-Labs/VCoder "code")
🏠project
* [mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration](https://arxiv.org/abs/2311.04257 "mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration")
* [MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World](https://arxiv.org/abs/2401.08577 "MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World")
🏠project
* [RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback](https://arxiv.org/abs/2312.00849 "RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback")
⭐[code](https://github.com/RLHF-V/RLHF-V "code")
* [DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model](https://arxiv.org/abs/2404.01342 "DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model")
⭐[code](https://github.com/OpenGVLab/DiffAgent "code")
* [Prompt Highlighter: Interactive Control for Multi-Modal LLMs](https://arxiv.org/abs/2312.04302 "Prompt Highlighter: Interactive Control for Multi-Modal LLMs")
⭐[code](https://github.com/dvlab-research/Prompt-Highlighter "code")
🏠[project](https://julianjuaner.github.io/projects/PromptHighlighter/ "project")
* [Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft](https://arxiv.org/abs/2312.09238 "Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft")
🏠[project](https://yangxue0827.github.io/auto_mc-reward.html "project")
* [General Object Foundation Model for Images and Videos at Scale](https://arxiv.org/abs/2312.09158 "General Object Foundation Model for Images and Videos at Scale")
⭐[code](https://github.com/FoundationVision/GLEE "code")
🏠project
👍GLEE: a general-purpose object perception foundation model jointly built by HUST and ByteDance
* [Link-Context Learning for Multimodal LLMs](https://arxiv.org/abs/2308.07891 "Link-Context Learning for Multimodal LLMs")
⭐[code](https://github.com/isekai-portal/Link-Context-Learning "code")
* [Cloud-Device Collaborative Learning for Multimodal Large Language Models](https://arxiv.org/abs/2312.16279 "Cloud-Device Collaborative Learning for Multimodal Large Language Models")
* [LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model](https://arxiv.org/abs/2406.04659 "LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model")
⭐[code](https://github.com/kennethwdk/LocLLM "code")
👍[Highlights: CVPR 2024 fine-grained visual perception MLLMs Pink and LocLLM](https://idm.pku.edu.cn/info/1012/1839.htm "Highlights: CVPR 2024 fine-grained visual perception MLLMs Pink and LocLLM")
* Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
⭐[code](https://github.com/SY-Xuan/Pink "code")
👍[Highlights: CVPR 2024 fine-grained visual perception MLLMs Pink and LocLLM](https://idm.pku.edu.cn/info/1012/1839.htm "Highlights: CVPR 2024 fine-grained visual perception MLLMs Pink and LocLLM")
* [LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge](https://arxiv.org/abs/2311.11860 "LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge")
⭐[code](https://github.com/rshaojimmy/JiuTian "code")
🏠project
* [GSVA: Generalized Segmentation via Multimodal Large Language Models](https://arxiv.org/abs/2312.10103 "GSVA: Generalized Segmentation via Multimodal Large Language Models")

VLN (Vision-and-Language Navigation)
⭐[code](https://github.com/MrZihan/HNR-VLN "code")
👍[VILP](https://vipl.ict.ac.cn/news/research/202403/t20240315_207758.html "VILP")
* [Volumetric Environment Representation for Vision-Language Navigation](http://arxiv.org/abs/2403.14158v1 "Volumetric Environment Representation for Vision-Language Navigation")
* [OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation](http://arxiv.org/abs/2403.17334v1 "OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation")
* [Vision-and-Language Navigation via Causal Learning](https://arxiv.org/abs/2404.10241 "Vision-and-Language Navigation via Causal Learning")
⭐[code](https://github.com/CrystalSixone/VLN-GOAT "code")

Video-Language
- VidLA: Video-Language Alignment at Scale
- SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling
- VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens
- VideoLLM-online: Online Video Large Language Model for Streaming Video
🏠project

Visual Grounding
- Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
- MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
- Improved Visual Grounding through Self-Consistent Explanations
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
🏠project
- Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency
- Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding
- Investigating Compositional Challenges in Vision-Language Models for Visual Grounding
- [Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language](https://openaccess.thecvf.com/content/CVPR2024/papers/Hamilton_Separating_the_Chirp_from_the_Chat_Self-supervised_Visual_Grounding_of_CVPR_2024_paper.pdf 'Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language')

Multimodal Models
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
⭐[code](https://github.com/Yuliang-Liu/Monkey "code")
* [What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models](https://arxiv.org/abs/2310.06627 "What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models")
🏠[project](https://bzhao.me/C-VQA/ "project")
* [Multi-modal Learning for Geospatial Vegetation Forecasting](http://arxiv.org/abs/2303.16198 "Multi-modal Learning for Geospatial Vegetation Forecasting")
* [Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception](https://arxiv.org/abs/2403.02969 "Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception")
* [MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception](https://arxiv.org/abs/2312.07472 "MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception")
* [TRINS: Towards Multimodal Language Models that Can Read](https://arxiv.org/abs/2406.06730 "TRINS: Towards Multimodal Language Models that Can Read")
* [Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations](http://arxiv.org/abs/2403.07241v1 "Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations")

Vision Foundation Models
* [Three Pillars improving Vision Foundation Model Distillation for Lidar](https://arxiv.org/abs/2310.17504 "Three Pillars improving Vision Foundation Model Distillation for Lidar")
⭐[code](https://github.com/valeoai/ScaLR "code")
* [Bridging Remote Sensors with Multisensor Geospatial Foundation Models](http://arxiv.org/abs/2404.01260 "Bridging Remote Sensors with Multisensor Geospatial Foundation Models")
* [Low-Resource Vision Challenges for Foundation Models](http://arxiv.org/abs/2401.04716 "Low-Resource Vision Challenges for Foundation Models")
⭐[code](https://github.com/hou-yz/MVSelect "code")

Visual Localization
* [Learning to Produce Semi-dense Correspondences for Visual Localization](https://arxiv.org/abs/2402.08359 "Learning to Produce Semi-dense Correspondences for Visual Localization")
⭐[code](https://github.com/TruongKhang/DeViLoc "code")
* [PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs](https://arxiv.org/abs/2402.08657 "PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs")
* [Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds](https://openaccess.thecvf.com/content/CVPR2024/papers/Moon_Efficient_Privacy-Preserving_Visual_Localization_Using_3D_Ray_Clouds_CVPR_2024_paper.pdf "Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds")
