PEFT Landscape Summary - Adaptive Rank Allocation for Multimodal Models
Current State of the Field
1. Dynamic Rank Allocation Methods
Recent Work (2023-2025):
- AdaLoRA (ICLR 2023): Adaptive budget allocation using importance scoring, but single-modality focused (the scoring idea is sketched after this section)
- ALoRA (arXiv 2403.16187): Dynamic rank adjustment during training for LLMs
- Adaptive Rank Allocation via Integrated Gradients (arXiv 2603.13792): Uncertainty-aware scoring for rank selection
- Dynamic Rank Allocation for Foundation Models (arXiv 2506.18267): Heterogeneous adaptation needs across layers
Key Finding: All existing methods treat models as single-modality systems. No work explicitly addresses modality-specific rank requirements.
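The budget-allocation idea behind AdaLoRA-style methods is easy to state in code: score each rank-one LoRA component by a first-order sensitivity (roughly |weight × gradient|) and keep only the globally top-scoring components. The following is a minimal sketch of that scoring step, not AdaLoRA's actual implementation (which also smooths scores with moving averages and an uncertainty term); the function names and the per-column scoring convention are our own assumptions.

```python
import torch

def sensitivity(param: torch.Tensor) -> torch.Tensor:
    """First-order importance proxy: |w * dL/dw|, elementwise."""
    assert param.grad is not None, "call backward() before scoring"
    return (param * param.grad).abs()

def allocate_ranks(lora_B: dict[str, torch.Tensor], total_budget: int) -> dict[str, int]:
    """Keep the globally top-scoring rank-one components across all layers.

    lora_B maps layer names to LoRA up-projection factors of shape (out_dim, r);
    each column corresponds to one rank-one update direction.
    """
    scores = {}
    for name, B in lora_B.items():
        col_scores = sensitivity(B).sum(dim=0)  # one score per rank-one component
        for j, s in enumerate(col_scores.tolist()):
            scores[(name, j)] = s
    kept = sorted(scores, key=scores.get, reverse=True)[:total_budget]
    ranks = {name: 0 for name in lora_B}
    for name, _ in kept:
        ranks[name] += 1
    return ranks
```

Nothing in this scoring is modality-aware: visual, textual, and cross-modal layers compete for the same budget under the same criterion, which is exactly the gap noted above.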
2. Multimodal PEFT
Vision-Language Models:
- Gated Modality LoRA (arXiv 2510.26721): Addresses key-space alignment but uses fixed rank per modality
- Cross-Modal Low-rank Adaptation (arXiv 2604.03314): Captures cross-modal interactions but uniform rank allocation
- MM-LoRA (arXiv 2410.13733): Parallel LoRAs for vision and language, but a fixed-rank design (illustrated in the sketch after this section)
- Adaptive Capacity Allocation for VLA (arXiv 2603.07404): Shows robotics tasks need higher rank than language tasks
Key Finding: Existing multimodal PEFT methods recognize modality differences but don't dynamically allocate rank based on modality-specific needs.
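To make "fixed rank per modality" concrete, the sketch below wraps a frozen linear layer with two parallel low-rank branches, one for vision tokens and one for text tokens, each with its own statically chosen rank. This is an illustrative simplification in the spirit of the parallel-LoRA designs above, not any paper's actual implementation; the class name, rank values, and modality routing are hypothetical.

```python
import torch
import torch.nn as nn

class PerModalityLoRALinear(nn.Module):
    """Frozen base linear plus one LoRA branch per modality, each with a fixed rank."""

    def __init__(self, base: nn.Linear, r_vision: int = 16, r_text: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone stays frozen

        d_in, d_out = base.in_features, base.out_features
        self.lora = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(d_in, r_vision, bias=False),
                                    nn.Linear(r_vision, d_out, bias=False)),
            "text":   nn.Sequential(nn.Linear(d_in, r_text, bias=False),
                                    nn.Linear(r_text, d_out, bias=False)),
        })
        self.scaling = {"vision": alpha / r_vision, "text": alpha / r_text}
        for branch in self.lora.values():
            nn.init.zeros_(branch[1].weight)  # start as a no-op, like standard LoRA

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        return self.base(x) + self.scaling[modality] * self.lora[modality](x)
```

The ranks (16 for vision, 4 for text) are placeholders; the limitation flagged above is precisely that such numbers are set by hand per modality rather than allocated dynamically.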
3. Layer-wise Heterogeneity
Empirical Observations:
- Vision encoders (early layers): High-dimensional feature extraction, potentially need higher rank
- Cross-modal alignment layers (Q-Former, projectors): Critical for modality fusion, rank requirements unclear
- Language decoder (later layers): Strong pre-trained priors, potentially need lower rank
- Gap: No systematic study of rank requirements across modality boundaries (a per-stage rank schedule is sketched below)
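A simple way to encode this presumed heterogeneity is a static per-stage rank schedule keyed on module names. The sketch below builds one for a generic VLM whose submodule names contain "vision_tower", "mm_projector", and "language_model" (LLaVA-style naming used here purely as an assumption, as are the rank values).

```python
import torch.nn as nn

# Illustrative per-stage ranks reflecting the intuition above:
# vision encoder high, cross-modal projector highest, language decoder low.
STAGE_RANKS = {"vision_tower": 16, "mm_projector": 32, "language_model": 4}

def build_rank_schedule(model: nn.Module, default_rank: int = 8) -> dict[str, int]:
    """Map every linear submodule name to a LoRA rank based on the stage it belongs to."""
    schedule = {}
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        rank = default_rank
        for stage_prefix, stage_rank in STAGE_RANKS.items():
            if stage_prefix in name:
                rank = stage_rank
                break
        schedule[name] = rank
    return schedule
```

A dict like this could back a per-layer rank override (for example, the rank_pattern field in recent versions of HuggingFace peft's LoraConfig, if your version supports it), but whether the projector really deserves the largest rank is exactly the unanswered question; the schedule only makes the assumption explicit.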
4. Identified Structural Gaps
- Modality-Aware Rank Allocation: No method dynamically adjusts rank based on whether a layer processes visual, textual, or cross-modal information
- Cross-Modal Alignment Budget: Unclear how much parameter budget should go to alignment layers vs. within-modality layers (the budget arithmetic is sketched after this list)
- Training Dynamics: Existing adaptive methods (AdaLoRA) use gradient-based importance, but don't account for modality-specific learning rates
- Evaluation: No benchmark comparing fixed vs. adaptive rank allocation specifically for multimodal models
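The alignment-budget gap is ultimately an accounting question: a rank-r LoRA adapter on a d_out × d_in linear layer adds r * (d_in + d_out) trainable parameters, so any split of a global budget between alignment and within-modality layers implies concrete per-layer ranks. The helper below works through that arithmetic for a hypothetical two-group split; the layer counts, dimensions, and the 30/70 split are made-up illustrations, not recommendations.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one rank-r LoRA adapter on a d_out x d_in linear layer."""
    return r * (d_in + d_out)

def rank_for_budget(layers: list[tuple[int, int]], budget: int) -> int:
    """Largest uniform rank r such that adapting all given (d_in, d_out) layers fits the budget."""
    r = 0
    while sum(lora_params(d_in, d_out, r + 1) for d_in, d_out in layers) <= budget:
        r += 1
    return r

# Hypothetical example: give 30% of a 10M-parameter budget to two 4096x4096 alignment
# layers and the remaining 70% to thirty-two 4096x4096 within-modality layers.
alignment_layers = [(4096, 4096)] * 2
backbone_layers = [(4096, 4096)] * 32
budget = 10_000_000
print(rank_for_budget(alignment_layers, int(0.3 * budget)))  # r = 183: high rank per alignment layer
print(rank_for_budget(backbone_layers, int(0.7 * budget)))   # r = 26: low rank per backbone layer
```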
Recurring Limitations in Literature
- "Fixed rank across all layers is suboptimal" (mentioned in 8+ papers)
- "Cross-modal alignment is critical but under-parameterized" (Gated Modality LoRA, EMMA)
- "Vision and language have different adaptation needs" (MM-LoRA, VLA paper)
- "Existing adaptive methods are computationally expensive" (AdaLoRA requires SVD decomposition)
Open Problems
- Theoretical: Why do different modalities need different ranks? Is the driver information density, pre-training quality, or task specificity?
- Empirical: What's the optimal rank distribution across vision encoder → cross-modal projector → language decoder?
- Efficiency: Can we predict optimal rank allocation without expensive search or training?
- Generalization: Does modality-aware rank allocation transfer across different VLM architectures (LLaVA, BLIP-2, Qwen-VL)?
Potential Research Directions
- Modality-Aware AdaLoRA: Extend AdaLoRA with modality-specific importance metrics
- Cross-Modal Budget Allocation: Learn to allocate parameter budget between modalities
- Zero-Shot Rank Prediction: Predict optimal rank from layer statistics without training (see the sketch after this list)
- Hierarchical Rank Allocation: Coarse-grained (modality-level) + fine-grained (layer-level)
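One way to make zero-shot rank prediction concrete is to use the effective rank of each layer's pretrained weight matrix (the exponential of the entropy of its normalized singular-value spectrum) as a training-free proxy, then rescale to a maximum rank. The heuristic below is our own illustration, not a published method; whether spectral statistics of pretrained weights actually predict good LoRA ranks is itself one of the open problems listed above.

```python
import torch
import torch.nn as nn

def effective_rank(weight: torch.Tensor) -> float:
    """exp(entropy) of the normalized singular-value spectrum of a weight matrix."""
    s = torch.linalg.svdvals(weight.float())
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return float(torch.exp(entropy))

def predict_ranks(model: nn.Module, max_rank: int = 64) -> dict[str, int]:
    """Assign each linear layer a LoRA rank proportional to its effective rank."""
    scores = {name: effective_rank(m.weight)
              for name, m in model.named_modules() if isinstance(m, nn.Linear)}
    top = max(scores.values())
    return {name: max(1, round(max_rank * s / top)) for name, s in scores.items()}
```

A hierarchical variant would first split the budget at the modality level (vision vs. cross-modal vs. language) and then apply the same per-layer heuristic within each group.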
Key Insight: The field has separately studied (1) adaptive rank allocation for LLMs and (2) PEFT for multimodal models, but NOT their intersection. This is the structural gap.