Novelty Check: Cross-Modal Budget Allocation (CMBA)
Proposed Method
Systematically study optimal parameter budget allocation ratios (vision encoder : cross-modal projector : language decoder) when fine-tuning vision-language models such as LLaVA with LoRA. Hypothesis: cross-modal alignment layers need disproportionately high rank (2-4x relative to their share of model parameters) compared to within-modality layers.
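As a concrete illustration of the setup, below is a minimal sketch of how non-uniform ranks could be wired up with Hugging Face PEFT's rank_pattern. It assumes a LLaVA-style model whose module paths contain vision_tower, multi_modal_projector, and language_model (as in transformers' LlavaForConditionalGeneration); the target module names and the 2:4:1 multipliers are illustrative assumptions, not verified settings.

```python
# Sketch: non-uniform LoRA rank allocation across VLM components.
# Module paths ("vision_tower", "multi_modal_projector", "language_model")
# and target module names follow HF's LlavaForConditionalGeneration naming;
# treat them as assumptions for other models. The 2:4:1 ratio is illustrative.
from peft import LoraConfig

BASE_RANK = 8
RATIO = {"vision_tower": 2, "multi_modal_projector": 4, "language_model": 1}

config = LoraConfig(
    r=BASE_RANK,  # default rank, overridden per component via rank_pattern
    lora_alpha=16,
    # attention projections plus the projector's two MLP layers (assumed names)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "out_proj", "linear_1", "linear_2"],
    # rank_pattern maps layer-name regexes to ranks that differ from r,
    # so each component receives its own share of the adapter budget
    rank_pattern={
        ".*vision_tower.*": BASE_RANK * RATIO["vision_tower"],
        ".*multi_modal_projector.*": BASE_RANK * RATIO["multi_modal_projector"],
        ".*language_model.*": BASE_RANK * RATIO["language_model"],
    },
)
```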
Core Claims to Verify
- Claim 1: Systematic study of budget allocation RATIOS across modalities (vision:projector:language) has not been done
- Claim 2: Cross-modal projector layers are bottlenecks that need disproportionately high rank
- Claim 3: Optimal allocation is NOT uniform across modalities
- Claim 4: The specific ratio (e.g., 1:4:1 or 2:4:1) matters for performance
- Claim 5: This can be studied via ablation without expensive search algorithms
Phase B: Multi-Source Literature Search
Search Results Summary
Claim 1: Systematic study of budget allocation RATIOS
Relevant papers found:
- Multimodal Scaling Laws - Mentions "optimal vision:language parameter ratio is 1:3" but unclear if this is about LoRA rank or model architecture
- Adaptive Capacity Allocation for VLA - Studies capacity allocation but for robotics VLA, not general VLMs
- Neural Architecture Search for Variable LoRA Rank - Uses NAS to find variable ranks, but doesn't study systematic ratio ablations
Claim 2: Cross-modal projector needs high rank
Relevant papers found:
- Towards Efficient Visual-Language Alignment of Q-Former - Applies AdaLoRA to Q-Former, but doesn't compare projector rank needs vs other components
- Locality-enhanced Projector for Multimodal LLM - Focuses on projector design, not rank allocation
- Efficient Visual Projector for Multimodal LLM - Studies projector efficiency via token reduction, not rank
Claim 3: Non-uniform allocation is better
Relevant papers found:
- Dynamic Rank Allocation for Foundation Models - Heterogeneous rank allocation across layers, but single-modality focus
- Multimodal Low-Rank Adaptation (MokA) - Modality-aware parameters, but doesn't study allocation ratios
- Cross-Modal Low-rank Adaptation - Cross-modal LoRA, but uniform rank within each modality
Claim 4: Specific ratios matter (1:1:1 vs 1:4:1 vs 2:4:1)
No papers found that systematically ablate allocation ratios like 1:1:1, 1:4:1, 2:4:1 across vision:projector:language.
Claim 5: Simple ablation approach (no expensive search)
Most papers use:
- NAS (Neural Architecture Search) - expensive
- AdaLoRA - requires SVD and importance scoring during training
- Gradient-based methods
No paper does simple grid search over allocation ratios.
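By contrast, the proposed ablation is a plain loop over candidate ratios. A minimal sketch follows; finetune_with_ranks and evaluate are hypothetical stand-ins for a LoRA fine-tuning run and a VQA-benchmark evaluation, stubbed out here so the loop executes.

```python
# Sketch: grid search over vision:projector:language allocation ratios.
# finetune_with_ranks() and evaluate() are HYPOTHETICAL stand-ins for one
# LoRA fine-tuning run and a VQA evaluation; stubbed so the loop executes.
import random

BASE_RANK = 4

def finetune_with_ranks(ranks):
    # Hypothetical: run LoRA fine-tuning with these per-component ranks.
    return ranks

def evaluate(model):
    # Hypothetical: return VQA accuracy; random stub for illustration.
    return random.random()

results = {}
for v, p, l in [(1, 1, 1), (1, 4, 1), (2, 4, 1), (4, 1, 1), (1, 2, 1)]:
    ranks = {
        "vision_tower": BASE_RANK * v,
        "multi_modal_projector": BASE_RANK * p,
        "language_model": BASE_RANK * l,
    }
    # NOTE: a fixed-budget study would renormalize ranks so the TOTAL number
    # of added LoRA parameters stays constant across ratios.
    results[(v, p, l)] = evaluate(finetune_with_ranks(ranks))

best = max(results, key=results.get)
print(f"best vision:projector:language ratio = {best}")
```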
Phase C: Cross-Model Verification
Key Findings from Literature
- Multimodal Scaling Laws mentions a 1:3 vision:language ratio, but this appears to be about model size (parameters), not LoRA rank allocation during fine-tuning
- Adaptive Capacity Allocation for VLA (arXiv 2603.07404) is the CLOSEST work:
  - Studies capacity allocation for vision-language-action models
  - Finds "robotics transfer exhibits a higher and task-dependent capacity need"
  - BUT: focuses on VLA (robotics), not general VLMs like LLaVA/BLIP-2
  - Does NOT study the cross-modal projector separately
- Neural Architecture Search for Variable LoRA Rank (arXiv 2508.12512):
  - Uses NAS to find optimal ranks for different layers
  - BUT: treats allocation as a search problem, not a systematic ablation
  - Does NOT report specific allocation ratios
- No paper systematically studies vision:projector:language allocation ratios
Closest Prior Work Analysis
| Paper | Year | Overlap | Key Difference |
|---|---|---|---|
| Adaptive Capacity Allocation for VLA | 2026 | Studies capacity allocation across modalities | (1) VLA not VLM, (2) No projector analysis, (3) No ratio ablation |
| Neural Architecture Search for Variable LoRA Rank | 2025 | Variable rank in VLMs | Uses NAS not ablation, no ratio study |
| MokA: Multimodal Low-Rank Adaptation | 2025 | Modality-aware LoRA | Shared vs specific params, not allocation ratios |
| Q-Former PEFT | 2024 | Applies AdaLoRA to Q-Former | Doesn't compare projector vs encoder/decoder needs |
| Dynamic Rank Allocation | 2025 | Heterogeneous rank allocation | Single-modality LLMs, not multimodal |
Phase D: Novelty Report
Overall Novelty Assessment
Score: 8/10
Recommendation: ✅ PROCEED
Core Novelty
What is novel:
- Systematic ratio ablation - No prior work tests allocation ratios like 1:1:1, 1:4:1, 2:4:1 across vision:projector:language
- Cross-modal projector focus - Existing work treats projector as part of "alignment" but doesn't isolate its rank requirements
- Simple diagnostic approach - Grid search over ratios is cheaper than NAS or AdaLoRA, provides interpretable results
- Actionable guidelines - Output is a recommended ratio (e.g., "use 2:4:1"), not a learned allocation policy
What is NOT novel:
- The idea that different modalities need different ranks (MokA and the VLA paper already show this)
- Using LoRA for multimodal models (standard practice)
- The hypothesis that projector is important (many papers study projector design)
Key Differentiator
The unique angle is the DIAGNOSTIC QUESTION: "What is the optimal budget allocation RATIO?"
Existing work either:
- Uses adaptive methods (AdaLoRA, NAS) that find allocations but don't report interpretable ratios
- Studies modality-aware parameters but doesn't quantify allocation ratios
- Focuses on projector design (architecture) not rank allocation
This work would be the FIRST to answer: "If I have a fixed parameter budget, what ratio should I allocate to vision:projector:language?"
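To make the question concrete, here is a back-of-envelope sketch of the budget-to-rank conversion, using the fact that a LoRA adapter on a d_in x d_out matrix adds r * (d_in + d_out) parameters; all layer dimensions below are rough LLaVA-style assumptions (24-block ViT with d=1024, 2-layer MLP projector, 32-block 7B LM with d=4096), not measured values.

```python
# Sketch: convert (fixed parameter budget, allocation ratio) into concrete
# per-component LoRA ranks. A LoRA adapter on a d_in x d_out matrix adds
# r * (d_in + d_out) params, so each component contributes
# rank * unit_params, where unit_params sums (d_in + d_out) over its
# adapted matrices. Dimensions below are illustrative LLaVA-style guesses.

def ranks_for_budget(budget, ratio, unit_params):
    base = budget / sum(ratio[c] * unit_params[c] for c in ratio)
    return {c: max(1, round(base * ratio[c])) for c in ratio}

unit_params = {
    "vision": 24 * 4 * (1024 + 1024),            # q/k/v/out proj, 24 ViT blocks
    "projector": (1024 + 4096) + (4096 + 4096),  # 2-layer MLP projector
    "language": 32 * 4 * (4096 + 4096),          # q/k/v/o proj, 32 LM blocks
}
ranks = ranks_for_budget(
    budget=10_000_000,                           # ~10M trainable adapter params
    ratio={"vision": 2, "projector": 4, "language": 1},
    unit_params=unit_params,
)
print(ranks)  # e.g. {'vision': 13, 'projector': 27, 'language': 7}
```

Note that a rank ratio and a parameter-share ratio are not the same thing, since components differ in how many matrices they contribute; the sketch converts between the two explicitly, which is worth reporting alongside any recommended ratio.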
Risk: What a Reviewer Would Cite
Potential objections:
- "Adaptive Capacity Allocation for VLA (arXiv 2603.07404) already studies this"
  - Counter: That's for VLA (robotics), not general VLMs. We study LLaVA/BLIP-2 on standard VQA benchmarks.
- "Neural Architecture Search already finds optimal ranks"
  - Counter: NAS is expensive and doesn't provide interpretable ratios. Our ablation is simpler and gives actionable guidelines.
- "This is just hyperparameter tuning"
  - Counter: We're revealing a structural property (the projector needs high rank) that generalizes across tasks, not just tuning for one task.
Suggested Positioning
Title angle: "Where Should Parameters Go? A Systematic Study of Budget Allocation in Multimodal LoRA"
Framing:
- Problem: Existing adaptive methods (AdaLoRA, NAS) are expensive and don't provide interpretable allocation guidelines
- Gap: No one has systematically studied allocation ratios across vision:projector:language
- Contribution: Simple ablation reveals that cross-modal projector needs 2-4x higher rank than within-modality layers
- Impact: Practitioners can use our recommended ratios (e.g., 2:4:1) without expensive search
Key result to emphasize: "Cross-modal projector is a bottleneck - allocating 40-50% of parameter budget to projector (despite it being <10% of model size) improves accuracy by X%"
Sources
Key papers to cite and differentiate from:
- Adaptive Capacity Allocation for VLA (arXiv 2603.07404)
- Neural Architecture Search for Variable LoRA Rank (arXiv 2508.12512)
- MokA: Multimodal Low-Rank Adaptation
- Towards Efficient Visual-Language Alignment of Q-Former
- Dynamic Rank Allocation for Foundation Models
- Multimodal Scaling Laws
- Cross-Modal Low-rank Adaptation
- Locality-enhanced Projector for Multimodal LLM
- Efficient Visual Projector for Multimodal LLM