Research Idea Report
Direction: Adaptive rank allocation for multimodal large models: designing dynamic, modality-aware LoRA rank-allocation strategies for vision-language models (e.g., LLaVA) that account for differing parameter requirements across modalities and layers
Generated: 2026-05-04
Ideas evaluated: 10 generated → 6 survived filtering → 3 deep validated → 1 recommended
Pilot status: Skipped due to resource constraints (8GB GPU insufficient for LLaVA-7B)
Executive Summary
After systematic literature review, brainstorming, and novelty validation, we identified 1 high-priority research direction: Cross-Modal Budget Allocation (CMBA). This idea addresses a clear gap in the literature: no prior work systematically studies parameter budget allocation ratios across the vision encoder, cross-modal projector, and language decoder in multimodal LoRA fine-tuning.
Key finding: Existing adaptive methods (AdaLoRA, NAS-based search) are expensive and do not provide interpretable allocation guidelines. Our proposed ablation study, a simple fixed-budget grid search, would be the first to answer: "What ratio (vision : projector : language) should I use?"
Recommended next step: Run full-scale experiments on CMBA with proper GPU resources (A100/H100).
Landscape Summary
Current State of PEFT for Multimodal Models
Dynamic Rank Allocation (Single-Modality):
- AdaLoRA (ICLR 2023): Importance-based adaptive allocation
- ALoRA (2024): Dynamic adjustment during training
- ARD-LoRA (2025): Heterogeneous allocation across layers/heads
- Gap: all of these are single-modality focused
Multimodal PEFT:
- MokA (2025): Modality-aware parameters (shared vs specific)
- Gated Modality LoRA (2024): Key-space alignment
- Cross-Modal LoRA (2024): Cross-modal interactions
- Gap: Fixed rank per modality, no ratio optimization
Key Insight: The field has separately studied (1) adaptive rank for LLMs and (2) PEFT for multimodal models, but NOT their intersection with systematic ratio analysis.
Recommended Ideas (Ranked)
🏆 Idea 1: Cross-Modal Budget Allocation (CMBA) — RECOMMENDED
Summary: Systematically study optimal parameter budget allocation ratios (vision encoder : cross-modal projector : language decoder) in vision-language models.
Hypothesis: Cross-modal alignment layers (projector/Q-Former) are bottlenecks and should receive a disproportionately large share of the rank budget (2-4x relative to their parameter count) compared to within-modality layers.
Minimum Experiment:
- Model: LLaVA-7B or BLIP-2
- Dataset: VQAv2 validation set
- Ablation: Test 6 allocation ratios (vision : projector : language)
  - Uniform (1:1:1), baseline
  - Projector-heavy (1:4:1)
  - Projector-heavy (1:2:1)
  - Vision-heavy (2:1:1)
  - Language-heavy (1:1:2)
  - Balanced-high-projector (2:4:1)
- Metrics: VQA accuracy, cross-modal alignment quality
- Fixed total parameter budget across all conditions
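The fixed-budget constraint above can be operationalized as in the following sketch: each ratio is converted into per-component LoRA ranks such that the total added parameter count stays constant across conditions. The module counts and hidden dimensions below are illustrative placeholders, not LLaVA's exact configuration.

```python
# Per-component adapted-module counts and (d_in + d_out) sums.
# LoRA adds roughly r * (d_in + d_out) parameters per adapted matrix.
# These numbers are illustrative assumptions, not LLaVA's real values.
COMPONENTS = {
    "vision":    {"num_modules": 48,  "dim_sum": 1024 + 1024},
    "projector": {"num_modules": 2,   "dim_sum": 1024 + 4096},
    "language":  {"num_modules": 128, "dim_sum": 4096 + 4096},
}

def ranks_for_ratio(ratio, total_budget):
    """Split `total_budget` LoRA parameters across components in
    proportion to `ratio`, then convert each share into a rank."""
    total = sum(ratio.values())
    ranks = {}
    for name, share in ratio.items():
        params = total_budget * share / total
        cost_per_rank = (COMPONENTS[name]["num_modules"]
                         * COMPONENTS[name]["dim_sum"])
        ranks[name] = max(1, round(params / cost_per_rank))
    return ranks

# Uniform baseline vs projector-heavy (1:4:1) at the same budget:
uniform = ranks_for_ratio({"vision": 1, "projector": 1, "language": 1},
                          total_budget=20_000_000)
heavy = ranks_for_ratio({"vision": 1, "projector": 4, "language": 1},
                        total_budget=20_000_000)
```

Because the projector is tiny, the same parameter share buys it a much larger rank, which is exactly what the projector-heavy conditions test.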
Expected Outcome:
- Projector-heavy allocation (1:4:1 or 2:4:1) improves accuracy by 1-2% over uniform
- Reveals that projector needs 40-50% of parameter budget despite being <10% of model size
- Provides actionable ratio guidelines for practitioners
Novelty: 8/10
- Novel: First systematic ratio ablation across modalities
- Novel: Isolates cross-modal projector's rank requirements
- Novel: Simple diagnostic approach (grid search) vs expensive NAS/AdaLoRA
- Not novel: Idea that different modalities need different ranks (MokA, VLA paper)
Closest Prior Work:
- Adaptive Capacity Allocation for VLA: targets vision-language-action (VLA) models, not VLMs; no projector analysis
- Neural Architecture Search for Variable LoRA Rank: uses NAS; no systematic ratio study
- MokA: modality-aware parameters, but no ratio optimization
Key Differentiator: First to answer "What ratio should I allocate?" with interpretable guidelines.
Feasibility:
- Compute: ~24 GPU-hours (6 ratios × 4h each on A100)
- Data: VQAv2 ✅ (publicly available)
- Implementation: Easy (modify LoRA config)
- Risk: LOW
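As a sketch of why implementation is easy: with Hugging Face peft, per-component ranks can be expressed as a `rank_pattern` dict mapping module-name regexes to ranks. The module-name patterns below are assumptions about LLaVA's layer naming and would need checking against the actual model.

```python
# Hypothetical per-component ranks chosen by the ablation
# (e.g. a projector-heavy allocation).
ranks = {"vision": 8, "projector": 32, "language": 8}

# Map module-name regexes to ranks. HF peft's LoraConfig accepts such
# a dict via its `rank_pattern` argument; the regexes here are assumed
# LLaVA module names, not verified.
rank_pattern = {
    ".*vision_tower.*(q_proj|v_proj)": ranks["vision"],
    ".*mm_projector.*": ranks["projector"],
    ".*language_model.*(q_proj|v_proj)": ranks["language"],
}

# Then, roughly:
#   from peft import LoraConfig
#   config = LoraConfig(r=8, target_modules=["q_proj", "v_proj"],
#                       rank_pattern=rank_pattern)
```

Each ablation condition then differs only in the `ranks` dict, so the six runs share one training script.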
Contribution Type: Empirical + Diagnostic
Reviewer's Likely Objection:
- "This is just hyperparameter tuning"
- Counter: We're revealing a structural property (projector bottleneck) that generalizes across tasks, not just tuning for one task.
Why We Should Do This:
- Clear gap in literature (no systematic ratio study)
- Actionable output (recommended ratios for practitioners)
- Low risk, high impact
- Cheap to run (simple ablation, no expensive search)
Pilot Result: SKIPPED - insufficient GPU memory (8GB RTX 4060 Laptop)
Next Steps
- Secure A100/H100 access and run the full six-ratio CMBA ablation on VQAv2 as specified above (~24 GPU-hours).
Generated: 2026-05-04 00:50
Document Version: v1.0
Status: Ready for implementation (pending GPU resources)