
CVPR 2024 Vision-Language Papers Roundup (Visual Understanding, Visual Localization, Large Models, Multimodal Models, etc.)


1. Vision-Language

  * ⭐[code](https://github.com/htyao89/Textual-based_Class-aware_prompt_tuning/ "code")

  * [On Scaling Up a Multilingual Vision and Language Model](http://arxiv.org/abs/2305.18565 "On Scaling Up a Multilingual Vision and Language Model")
  * [CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914 "CogAgent: A Visual Language Model for GUI Agents")  
⭐[code](https://github.com/THUDM/CogVLM "code")

  * ⭐[code](https://github.com/Zi-hao-Wei/Efficient-Vision-Language-Pre-training-by-Cluster-Masking "code")  
🏠[project](https://zxp46.github.io/cluster-masking/ "project")

  * [VILA: On Pre-training for Visual Language Models](https://arxiv.org/abs/2312.07533 "VILA: On Pre-training for Visual Language Models")
  * [EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models](https://arxiv.org/pdf/2311.15596.pdf "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models")  
⭐[code](https://github.com/AdaCheng/EgoThink "code")  
🏠project

  * ⭐[code](https://github.com/UX-Decoder/DINOv "code")

  * [Semantics-aware Motion Retargeting with Vision-Language Models](https://arxiv.org/abs/2312.01964 "Semantics-aware Motion Retargeting with Vision-Language Models")
  * [DePT: Decoupled Prompt Tuning](https://arxiv.org/abs/2309.07439 "DePT: Decoupled Prompt Tuning")  
⭐[code](https://github.com/Koorye/DePT "code")

  * ⭐[code](https://github.com/CircleRadon/Osprey "code")

  * [FairCLIP: Harnessing Fairness in Vision-Language Learning](http://arxiv.org/abs/2403.19949v1 "FairCLIP: Harnessing Fairness in Vision-Language Learning")  
🏠[project](https://ophai.hms.harvard.edu/datasets/fairvlmed10k "project")

  * [Efficient Test-Time Adaptation of Vision-Language Models](http://arxiv.org/abs/2403.18293v1 "Efficient Test-Time Adaptation of Vision-Language Models")  
⭐[code](https://kdiaaa.github.io/tda/ "code")

  * ⭐[code](https://github.com/Imageomics/bioclip "code")

  * [InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks](https://arxiv.org/abs/2312.14238 "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks")  
⭐[code](https://github.com/OpenGVLab/InternVL "code")

  * ⭐[code](https://github.com/TreeLLi/APT "code")

  * [SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models](http://arxiv.org/abs/2403.13263v1 "SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models")  
⭐[code](https://github.com/ivattyue/SC-Tune "code")

  * ⭐[code](https://github.com/WalBouss/GEM "code")

  * [Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples](https://arxiv.org/abs/2312.00825 "Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples")
  * [Beyond Text: Frozen Large Language Models in Visual Signal Comprehension](http://arxiv.org/abs/2403.07874v1 "Beyond Text: Frozen Large Language Models in Visual Signal Comprehension")  
⭐[code](https://github.com/zh460045050/V2L-Tokenizer "code")

  * ⭐[code](https://github.com/lezhang7/Enhance-FineGrained "code") vision-language compositional understanding

  * [FFF: Fixing Flawed Foundations in Contrastive Pre-Training Results in Very Strong Vision-Language Models](http://arxiv.org/abs/2405.10286 "FFF: Fixing Flawed Foundations in Contrastive Pre-Training Results in Very Strong Vision-Language Models")
  * [Enhancing Vision-Language Pretraining with Rich Supervisions]
  * [Improved Baselines with Visual Instruction Tuning](https://arxiv.org/abs/2310.03744 "Improved Baselines with Visual Instruction Tuning")  
🏠[project](https://llava-vl.github.io/ "project")

  * [Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment](https://openaccess.thecvf.com/content/CVPR2024/papers/Ishmam_Semantic_Shield_Defending_Vision-Language_Models_Against_Backdooring_and_Poisoning_via_CVPR_2024_paper.pdf "Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment")
  * [Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models](http://arxiv.org/abs/2403.17589v1 "Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models")  
⭐[code](https://github.com/YBZh/DMN "code")

  * ⭐[code](https://github.com/jusiro/CLAP "code")

  * [Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding](https://arxiv.org/abs/2312.00081 "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding")  
⭐[code](https://github.com/wjpoom/SPEC "code")

  * ⭐[code](https://github.com/kwonjunn01/Hi-Mapper "code")

  * [Iterated Learning Improves Compositionality in Large Vision-Language Models](http://arxiv.org/abs/2404.02145v1 "Iterated Learning Improves Compositionality in Large Vision-Language Models")
  * [ViTamin: Designing Scalable Vision Models in the Vision-Language Era](http://arxiv.org/abs/2404.02132v1 "ViTamin: Designing Scalable Vision Models in the Vision-Language Era")  
⭐[code](https://github.com/Beckschen/ViTamin "code")

  * ⭐[code](https://github.com/KHU-AGI/PriViLege "code")

  * [Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models](https://arxiv.org/abs/2404.02233 "Visual Concept Connectome \(VCC\): Open World Concept Discovery and their Interlayer Connections in Deep Models")  
🏠[project](https://yorkucvil.github.io/VCC "project")

  * [Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning](https://arxiv.org/abs/2404.03658 "Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning")  
🏠[project](https://ruili3.github.io/kyn "project")

  * [HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models](https://arxiv.org/abs/2310.14566 "HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models")  
⭐[code](https://github.com/tianyi-lab/HallusionBench "code")

  * ⭐[code](https://github.com/google-research/syn-rep-learn "code")

  * [Probing the 3D Awareness of Visual Foundation Models](http://arxiv.org/abs/2404.08636v1 "Probing the 3D Awareness of Visual Foundation Models")  
⭐[code](https://github.com/mbanani/probe3d "code")

    * ⭐[code](https://github.com/penghao-wu/vstar "code")

    * [Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding](https://arxiv.org/abs/2311.16922 "Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding")
    * [Pixel Aligned Language Models](https://arxiv.org/abs/2312.09237 "Pixel Aligned Language Models")  
🏠[project](https://jerryxu.net/PixelLLM/ "project")

    * [SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection](http://arxiv.org/abs/2403.03170v1 "SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection")
    * [OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation](https://arxiv.org/abs/2311.17911 "OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation")  
⭐[code](https://github.com/shikiw/OPERA "code") multimodal large language models

    * [Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs](https://openaccess.thecvf.com/content/CVPR2024/papers/Song_Low-Rank_Approximation_for_Sparse_Attention_in_Multi-Modal_LLMs_CVPR_2024_paper.pdf "Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs") LLMs
    * [LISA: Reasoning Segmentation via Large Language Model](https://arxiv.org/abs/2308.00692 "LISA: Reasoning Segmentation via Large Language Model")  
⭐[code](https://github.com/dvlab-research/LISA "code")

    * [Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model](https://openaccess.thecvf.com/content/CVPR2024/papers/Liang_Querying_as_Prompt_Parameter-Efficient_Learning_for_Multimodal_Language_Model_CVPR_2024_paper.pdf "Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model")
    * [Compositional Chain-of-Thought Prompting for Large Multimodal Models](https://arxiv.org/abs/2311.17076 "Compositional Chain-of-Thought Prompting for Large Multimodal Models")  
⭐[code](https://github.com/chancharikmitra/CCoT "code")

    * [Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs](https://arxiv.org/abs/2401.06209 "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs")  
🏠project

    * [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://arxiv.org/abs/2312.06742 "Honeybee: Locality-enhanced Projector for Multimodal LLM")  
⭐[code](https://github.com/kakaobrain/honeybee "code") LLM

    * [HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data](https://arxiv.org/abs/2311.13614 "HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data")  
⭐[code](https://github.com/Yuqifan1117/HalluciDoctor "code")

    * [SEED-Bench: Benchmarking Multimodal Large Language Models](https://arxiv.org/abs/2404.16790 "SEED-Bench: Benchmarking Multimodal Large Language Models")  
⭐[code](https://github.com/AILab-CVC/SEED-Bench "code")

    * [PerceptionGPT: Effectively Fusing Visual Perception into LLM](https://arxiv.org/abs/2311.06612 "PerceptionGPT: Effectively Fusing Visual Perception into LLM")
    * [UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All](http://arxiv.org/abs/2403.12532v1 "UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All")
    * [ModaVerse: Efficiently Transforming Modalities with LLMs](https://arxiv.org/abs/2401.06395 "ModaVerse: Efficiently Transforming Modalities with LLMs")
    * [VCoder: Versatile Vision Encoders for Multimodal Large Language Models](https://arxiv.org/abs/2312.14233 "VCoder: Versatile Vision Encoders for Multimodal Large Language Models")  
⭐[code](https://github.com/SHI-Labs/VCoder "code")  
🏠project

    * [mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration](https://arxiv.org/abs/2311.04257 "mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration")
    * [MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World](https://arxiv.org/abs/2401.08577 "MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World")  
🏠project large language models

    * [RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback](https://arxiv.org/abs/2312.00849 "RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback")  
⭐[code](https://github.com/RLHF-V/RLHF-V "code")

    * [DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model](https://arxiv.org/abs/2404.01342 "DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model")  
⭐[code](https://github.com/OpenGVLab/DiffAgent "code")  
👍summary

    * [Prompt Highlighter: Interactive Control for Multi-Modal LLMs](https://arxiv.org/abs/2312.04302 "Prompt Highlighter: Interactive Control for Multi-Modal LLMs")  
⭐[code](https://github.com/dvlab-research/Prompt-Highlighter "code")  
🏠[project](https://julianjuaner.github.io/projects/PromptHighlighter/ "project")

    * [Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft](https://arxiv.org/abs/2312.09238 "Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft")  
🏠[project](https://yangxue0827.github.io/auto_mc-reward.html "project")

    * [General Object Foundation Model for Images and Videos at Scale](https://arxiv.org/abs/2312.09158 "General Object Foundation Model for Images and Videos at Scale")  
⭐[code](https://github.com/FoundationVision/GLEE "code")  
🏠project  
👍GLEE: a general-purpose object perception foundation model built jointly by HUST and ByteDance

    * [Link-Context Learning for Multimodal LLMs](https://arxiv.org/abs/2308.07891 "Link-Context Learning for Multimodal LLMs")  
⭐[code](https://github.com/isekai-portal/Link-Context-Learning "code") LLMs

    * [Cloud-Device Collaborative Learning for Multimodal Large Language Models](https://arxiv.org/abs/2312.16279 "Cloud-Device Collaborative Learning for Multimodal Large Language Models")
    * [LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model](https://arxiv.org/abs/2406.04659 "LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model")  
⭐[code](https://github.com/kennethwdk/LocLLM "code")  
👍Highlights | CVPR 2024 fine-grained visual perception multimodal LLMs Pink and LocLLM
    * Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs  
⭐[code](https://github.com/SY-Xuan/Pink "code")  
👍[Highlights | CVPR 2024 fine-grained visual perception multimodal LLMs Pink and LocLLM](https://idm.pku.edu.cn/info/1012/1839.htm "成果速览 | CVPR2024细粒度视觉感知多模态大模型Pink、LocLLM")
    * [LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge](https://arxiv.org/abs/2311.11860 "LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge")  
⭐[code](https://github.com/rshaojimmy/JiuTian "code")  
🏠project MLLMs

    * [GSVA: Generalized Segmentation via Multimodal Large Language Models](https://arxiv.org/abs/2312.10103 "GSVA: Generalized Segmentation via Multimodal Large Language Models")
    * ⭐[code](https://github.com/MrZihan/HNR-VLN "code")  
👍[VILP](https://vipl.ict.ac.cn/news/research/202403/t20240315_207758.html "VILP")

    * [Volumetric Environment Representation for Vision-Language Navigation](http://arxiv.org/abs/2403.14158v1 "Volumetric Environment Representation for Vision-Language Navigation")
    * [OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation](http://arxiv.org/abs/2403.17334v1 "OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation")
    * [Vision-and-Language Navigation via Causal Learning](https://arxiv.org/abs/2404.10241 "Vision-and-Language Navigation via Causal Learning")  
⭐[code](https://github.com/CrystalSixone/VLN-GOAT "code") vision-and-language navigation

    * ⭐[code](https://github.com/Yuliang-Liu/Monkey "code")

    * [What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models](https://arxiv.org/abs/2310.06627 "What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models")  
🏠[project](https://bzhao.me/C-VQA/ "project")

    * [Multi-modal Learning for Geospatial Vegetation Forecasting](http://arxiv.org/abs/2303.16198 "Multi-modal Learning for Geospatial Vegetation Forecasting")
    * [Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception](https://arxiv.org/abs/2403.02969 "Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception")
    * [MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception](https://arxiv.org/abs/2312.07472 "MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception")
    * [TRINS: Towards Multimodal Language Models that Can Read](https://arxiv.org/abs/2406.06730 "TRINS: Towards Multimodal Language Models that Can Read")
    * [Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations](http://arxiv.org/abs/2403.07241v1 "Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations")

  * Vision Foundation Models
    * [Three Pillars improving Vision Foundation Model Distillation for Lidar](https://arxiv.org/abs/2310.17504 "Three Pillars improving Vision Foundation Model Distillation for Lidar")  
⭐[code](https://github.com/valeoai/ScaLR "code")

    * [Bridging Remote Sensors with Multisensor Geospatial Foundation Models](http://arxiv.org/abs/2404.01260 "Bridging Remote Sensors with Multisensor Geospatial Foundation Models")
    * [Low-Resource Vision Challenges for Foundation Models](http://arxiv.org/abs/2401.04716 "Low-Resource Vision Challenges for Foundation Models")
    * ⭐[code](https://github.com/hou-yz/MVSelect "code")

  * Visual Localization
    * [Learning to Produce Semi-dense Correspondences for Visual Localization](https://arxiv.org/abs/2402.08359 "Learning to Produce Semi-dense Correspondences for Visual Localization")  
⭐[code](https://github.com/TruongKhang/DeViLoc "code")

    * [PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs](https://arxiv.org/abs/2402.08657 "PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs") localization
    * [Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds](https://openaccess.thecvf.com/content/CVPR2024/papers/Moon_Efficient_Privacy-Preserving_Visual_Localization_Using_3D_Ray_Clouds_CVPR_2024_paper.pdf "Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds")
