
PEFT Landscape Summary - Adaptive Rank Allocation for Multimodal Models

Current State of the Field

1. Dynamic Rank Allocation Methods

Recent Work (2024-2025):

Key Finding: All existing adaptive-rank methods treat the model as a single-modality system; none explicitly addresses modality-specific rank requirements.

2. Multimodal PEFT

Vision-Language Models:

Key Finding: Existing multimodal PEFT methods recognize modality differences but do not dynamically allocate rank to match modality-specific needs.

3. Layer-wise Heterogeneity

Empirical Observations:

4. Identified Structural Gaps

  1. Modality-Aware Rank Allocation: No method dynamically adjusts rank based on whether a layer processes visual, textual, or cross-modal information

  2. Cross-Modal Alignment Budget: Unclear how much parameter budget should go to alignment layers vs. within-modality layers

  3. Training Dynamics: Existing adaptive methods (AdaLoRA) use gradient-based importance, but don't account for modality-specific learning rates

  4. Evaluation: No benchmark comparing fixed vs. adaptive rank allocation specifically for multimodal models
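To make gap 1 concrete, the sketch below contrasts a fixed-rank baseline with a modality-aware allocation that splits a global rank budget across layers tagged by modality. The layer names, modality tags, budget size, and per-modality weights are all illustrative assumptions, not taken from any published method:

```python
import numpy as np

# Hypothetical layer table for a LLaVA-style VLM: (name, modality) pairs.
LAYERS = [
    ("vision.block0", "vision"),
    ("vision.block1", "vision"),
    ("projector.fc", "cross"),
    ("language.block0", "text"),
    ("language.block1", "text"),
]

def fixed_allocation(layers, rank=8):
    """Baseline: every layer gets the same LoRA rank."""
    return {name: rank for name, _ in layers}

def modality_aware_allocation(layers, total_budget=40,
                              weights={"vision": 1.0, "cross": 2.0, "text": 1.0}):
    """Split a total rank budget across layers in proportion to a
    per-modality weight. The weights here (cross-modal layers favored
    2x) are made-up placeholders, not empirically derived values."""
    scores = np.array([weights[m] for _, m in layers], dtype=float)
    shares = scores / scores.sum()
    ranks = np.maximum(1, np.round(shares * total_budget).astype(int))
    return {name: int(r) for (name, _), r in zip(layers, ranks)}
```

With the placeholder weights above, the cross-modal projector receives a larger rank than any within-modality layer, which is exactly the kind of allocation decision no existing method makes explicitly.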

Recurring Limitations in Literature

Open Problems

  1. Theoretical: Why do different modalities need different ranks? Is the driver information density, pre-training quality, or task specificity?

  2. Empirical: What's the optimal rank distribution across vision encoder → cross-modal projector → language decoder?

  3. Efficiency: Can we predict optimal rank allocation without expensive search or training?

  4. Generalization: Does modality-aware rank allocation transfer across different VLM architectures (LLaVA, BLIP-2, Qwen-VL)?

Potential Research Directions

  1. Modality-Aware AdaLoRA: Extend AdaLoRA with modality-specific importance metrics
  2. Cross-Modal Budget Allocation: Learn to allocate parameter budget between modalities
  3. Zero-Shot Rank Prediction: Predict optimal rank from layer statistics without training
  4. Hierarchical Rank Allocation: Coarse-grained (modality-level) + fine-grained (layer-level)
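Direction 4 can be sketched as a two-stage procedure: split the global budget across modalities (coarse), then distribute each modality's share across its layers in proportion to an importance score (fine). The importance values below stand in for an AdaLoRA-style sensitivity estimate; the budgets and scores are illustrative placeholders:

```python
import numpy as np

def hierarchical_allocate(layer_modality, importance, modality_budget,
                          per_layer_max=32):
    """Coarse-to-fine rank allocation sketch.

    layer_modality: dict mapping layer name -> modality tag
    importance: dict mapping layer name -> importance score (proxy for an
        AdaLoRA-style sensitivity metric; values here are hypothetical)
    modality_budget: dict mapping modality tag -> total rank for that modality
    """
    ranks = {}
    for modality, budget in modality_budget.items():
        names = [n for n, m in layer_modality.items() if m == modality]
        scores = np.array([importance[n] for n in names], dtype=float)
        shares = scores / scores.sum()
        for n, s in zip(names, shares):
            ranks[n] = int(min(per_layer_max, max(1, round(s * budget))))
    return ranks
```

The design choice is that the coarse stage answers the cross-modal budget question (gap 2) while the fine stage reuses existing layer-level importance machinery, so the two levels can be studied and ablated independently.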

Key Insight: The field has separately studied (1) adaptive rank allocation for LLMs and (2) PEFT for multimodal models, but NOT their intersection. This is the structural gap.