Novelty Check: Cross-Modal Budget Allocation (CMBA)
Proposed Method
Systematically study optimal parameter budget allocation ratios (vision encoder : cross-modal projector : language decoder) when fine-tuning vision-language models such as LLaVA with LoRA. Hypothesis: cross-modal alignment layers need disproportionately high rank (2-4x relative to their share of model parameters) compared to within-modality layers.
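As a concrete illustration of the setup, below is a minimal sketch of how non-uniform ranks could be wired up with Hugging Face PEFT's rank_pattern. It assumes a LLaVA-style model whose module paths contain vision_tower, multi_modal_projector, and language_model (as in transformers' LlavaForConditionalGeneration); the target module names and the 2:4:1 multipliers are illustrative assumptions, not verified settings.

```python
# Sketch: non-uniform LoRA rank allocation across VLM components.
# Module paths ("vision_tower", "multi_modal_projector", "language_model")
# and target module names follow HF's LlavaForConditionalGeneration naming;
# treat them as assumptions for other models. The 2:4:1 ratio is illustrative.
from peft import LoraConfig

BASE_RANK = 8
RATIO = {"vision_tower": 2, "multi_modal_projector": 4, "language_model": 1}

config = LoraConfig(
    r=BASE_RANK,  # default rank, overridden per component via rank_pattern
    lora_alpha=16,
    # attention projections plus the projector's two MLP layers (assumed names)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "out_proj", "linear_1", "linear_2"],
    # rank_pattern maps layer-name regexes to ranks that differ from r,
    # so each component receives its own share of the adapter budget
    rank_pattern={
        ".*vision_tower.*": BASE_RANK * RATIO["vision_tower"],
        ".*multi_modal_projector.*": BASE_RANK * RATIO["multi_modal_projector"],
        ".*language_model.*": BASE_RANK * RATIO["language_model"],
    },
)
```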
Core Claims to Verify
- Claim 1: Systematic study of budget allocation RATIOS across modalities (vision:projector:language) has not been done
- Claim 2: Cross-modal projector layers are bottlenecks that need disproportionately high rank
- Claim 3: Optimal allocation is NOT uniform across modalities
- Claim 4: The specific ratio (e.g., 1:4:1 or 2:4:1) matters for performance
- Claim 5: This can be studied via ablation without expensive search algorithms
Phase B: Multi-Source Literature Search
Search Results Summary
Claim 1: Systematic study of budget allocation RATIOS
Relevant papers found:
- Multimodal Scaling Laws - Mentions "optimal vision:language parameter ratio is 1:3" but unclear if this is about LoRA rank or model architecture
- Adaptive Capacity Allocation for VLA - Studies capacity allocation but for robotics VLA, not general VLMs
- Neural Architecture Search for Variable LoRA Rank - Uses NAS to find variable ranks, but doesn't study systematic ratio ablations
Claim 2: Cross-modal projector needs high rank
Relevant papers found:
- Towards Efficient Visual-Language Alignment of Q-Former - Applies AdaLoRA to Q-Former, but doesn't compare projector rank needs vs other components
- Locality-enhanced Projector for Multimodal LLM - Focuses on projector design, not rank allocation
- Efficient Visual Projector for Multimodal LLM - Studies projector efficiency via token reduction, not rank
Claim 3: Non-uniform allocation is better
Relevant papers found:
- Dynamic Rank Allocation for Foundation Models - Heterogeneous rank allocation across layers, but single-modality focus
- Multimodal Low-Rank Adaptation (MokA) - Modality-aware parameters, but doesn't study allocation ratios
- Cross-Modal Low-rank Adaptation - Cross-modal LoRA, but uniform rank within each modality
Claim 4: Specific ratios matter (1:1:1 vs 1:4:1 vs 2:4:1)
No papers found that systematically ablate allocation ratios like 1:1:1, 1:4:1, 2:4:1 across vision:projector:language.
Claim 5: Simple ablation approach (no expensive search)
Most papers use:
- NAS (Neural Architecture Search) - expensive
- AdaLoRA - requires SVD and importance scoring during training
- Gradient-based methods
No paper does simple grid search over allocation ratios.
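By contrast, the proposed ablation is a plain loop over candidate ratios. A minimal sketch follows; finetune_with_ranks and evaluate are hypothetical stand-ins for a LoRA fine-tuning run and a VQA-benchmark evaluation, stubbed out here so the loop executes.

```python
# Sketch: grid search over vision:projector:language allocation ratios.
# finetune_with_ranks() and evaluate() are HYPOTHETICAL stand-ins for one
# LoRA fine-tuning run and a VQA evaluation; stubbed so the loop executes.
import random

BASE_RANK = 4

def finetune_with_ranks(ranks):
    # Hypothetical: run LoRA fine-tuning with these per-component ranks.
    return ranks

def evaluate(model):
    # Hypothetical: return VQA accuracy; random stub for illustration.
    return random.random()

results = {}
for v, p, l in [(1, 1, 1), (1, 4, 1), (2, 4, 1), (4, 1, 1), (1, 2, 1)]:
    ranks = {
        "vision_tower": BASE_RANK * v,
        "multi_modal_projector": BASE_RANK * p,
        "language_model": BASE_RANK * l,
    }
    # NOTE: a fixed-budget study would renormalize ranks so the TOTAL number
    # of added LoRA parameters stays constant across ratios.
    results[(v, p, l)] = evaluate(finetune_with_ranks(ranks))

best = max(results, key=results.get)
print(f"best vision:projector:language ratio = {best}")
```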
Phase C: Cross-Model Verification
Key Findings from Literature
- Multimodal Scaling Laws mentions a 1:3 vision:language ratio, but this appears to be about model size (parameters), not LoRA rank allocation during fine-tuning
- Adaptive Capacity Allocation for VLA (arXiv 2603.07404) is the CLOSEST work:
  - Studies capacity allocation for vision-language-action models
  - Finds "robotics transfer exhibits a higher and task-dependent capacity need"
  - BUT: focuses on VLA (robotics), not general VLMs like LLaVA/BLIP-2
  - Does NOT study the cross-modal projector separately
- Neural Architecture Search for Variable LoRA Rank (arXiv 2508.12512):
  - Uses NAS to find optimal ranks for different layers
  - BUT: treats allocation as a search problem, not a systematic ablation
  - Does NOT report specific allocation ratios
- No paper systematically studies vision:projector:language allocation ratios
Closest Prior Work Analysis
| Paper | Year | Overlap | Key Difference |
|---|---|---|---|
| Adaptive Capacity Allocation for VLA | 2026 | Studies capacity allocation across modalities | (1) VLA not VLM, (2) No projector analysis, (3) No ratio ablation |
| Neural Architecture Search for Variable LoRA Rank | 2025 | Variable rank in VLMs | Uses NAS not ablation, no ratio study |
| MokA: Multimodal Low-Rank Adaptation | 2025 | Modality-aware LoRA | Shared vs specific params, not allocation ratios |
| Q-Former PEFT | 2024 | Applies AdaLoRA to Q-Former | Doesn't compare projector vs encoder/decoder needs |
| Dynamic Rank Allocation | 2025 | Heterogeneous rank allocation | Single-modality LLMs, not multimodal |
Phase D: Novelty Report
Overall Novelty Assessment
Score: 8/10
Recommendation: ✅ PROCEED
Core Novelty
What is novel:
- Systematic ratio ablation - No prior work tests allocation ratios like 1:1:1, 1:4:1, 2:4:1 across vision:projector:language
- Cross-modal projector focus - Existing work treats projector as part of "alignment" but doesn't isolate its rank requirements
- Simple diagnostic approach - Grid search over ratios is cheaper than NAS or AdaLoRA, provides interpretable results
- Actionable guidelines - Output is a recommended ratio (e.g., "use 2:4:1"), not a learned allocation policy
What is NOT novel:
- The idea that different modalities need different ranks (MokA and the VLA paper already show this)
- Using LoRA for multimodal models (standard practice)
- The hypothesis that projector is important (many papers study projector design)
Key Differentiator
The unique angle is the DIAGNOSTIC QUESTION: "What is the optimal budget allocation RATIO?"
Existing work either:
- Uses adaptive methods (AdaLoRA, NAS) that find allocations but don't report interpretable ratios
- Studies modality-aware parameters but doesn't quantify allocation ratios
- Focuses on projector design (architecture) not rank allocation
This work would be the FIRST to answer: "If I have a fixed parameter budget, what ratio should I allocate to vision:projector:language?"
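To make the question concrete, here is a back-of-envelope sketch of the budget-to-rank conversion, using the fact that a LoRA adapter on a d_in x d_out matrix adds r * (d_in + d_out) parameters; all layer dimensions below are rough LLaVA-style assumptions (24-block ViT with d=1024, 2-layer MLP projector, 32-block 7B LM with d=4096), not measured values.

```python
# Sketch: convert (fixed parameter budget, allocation ratio) into concrete
# per-component LoRA ranks. A LoRA adapter on a d_in x d_out matrix adds
# r * (d_in + d_out) params, so each component contributes
# rank * unit_params, where unit_params sums (d_in + d_out) over its
# adapted matrices. Dimensions below are illustrative LLaVA-style guesses.

def ranks_for_budget(budget, ratio, unit_params):
    base = budget / sum(ratio[c] * unit_params[c] for c in ratio)
    return {c: max(1, round(base * ratio[c])) for c in ratio}

unit_params = {
    "vision": 24 * 4 * (1024 + 1024),            # q/k/v/out proj, 24 ViT blocks
    "projector": (1024 + 4096) + (4096 + 4096),  # 2-layer MLP projector
    "language": 32 * 4 * (4096 + 4096),          # q/k/v/o proj, 32 LM blocks
}
ranks = ranks_for_budget(
    budget=10_000_000,                           # ~10M trainable adapter params
    ratio={"vision": 2, "projector": 4, "language": 1},
    unit_params=unit_params,
)
print(ranks)  # e.g. {'vision': 13, 'projector': 27, 'language': 7}
```

Note that a rank ratio and a parameter-share ratio are not the same thing, since components differ in how many matrices they contribute; the sketch converts between the two explicitly, which is worth reporting alongside any recommended ratio.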
Risk: What a Reviewer Would Cite
Potential objections:
- "Adaptive Capacity Allocation for VLA (arXiv 2603.07404) already studies this"
  - Counter: That's for VLA (robotics), not general VLMs. We study LLaVA/BLIP-2 on standard VQA benchmarks.
- "Neural Architecture Search already finds optimal ranks"
  - Counter: NAS is expensive and doesn't provide interpretable ratios. Our ablation is simpler and gives actionable guidelines.
- "This is just hyperparameter tuning"
  - Counter: We're revealing a structural property (the projector needs high rank) that generalizes across tasks, not just tuning for one task.
Suggested Positioning
Title angle: "Where Should Parameters Go? A Systematic Study of Budget Allocation in Multimodal LoRA"
Framing:
- Problem: Existing adaptive methods (AdaLoRA, NAS) are expensive and don't provide interpretable allocation guidelines
- Gap: No one has systematically studied allocation ratios across vision:projector:language
- Contribution: Simple ablation reveals that cross-modal projector needs 2-4x higher rank than within-modality layers
- Impact: Practitioners can use our recommended ratios (e.g., 2:4:1) without expensive search
Key result to emphasize: "Cross-modal projector is a bottleneck - allocating 40-50% of parameter budget to projector (despite it being <10% of model size) improves accuracy by X%"
Sources
Key papers to cite and differentiate from:
- Adaptive Capacity Allocation for VLA (arXiv 2603.07404)
- Neural Architecture Search for Variable LoRA Rank (arXiv 2508.12512)
- MokA: Multimodal Low-Rank Adaptation
- Towards Efficient Visual-Language Alignment of Q-Former
- Dynamic Rank Allocation for Foundation Models
- Multimodal Scaling Laws
- Cross-Modal Low-rank Adaptation
- Locality-enhanced Projector for Multimodal LLM
- Efficient Visual Projector for Multimodal LLM