Novelty Check: Cross-Modal Budget Allocation (CMBA)

Proposed Method

Systematically study optimal parameter-budget allocation ratios (vision encoder : cross-modal projector : language decoder) for LoRA fine-tuning of vision-language models such as LLaVA. Hypothesis: cross-modal alignment layers need disproportionately high rank (2-4x that of within-modality layers) relative to their share of model parameters.
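The ratio-to-rank mapping implied above can be sketched as follows. All layer shapes and the budget figure are illustrative placeholders, not LLaVA's actual dimensions:

```python
# Sketch (assumptions: illustrative layer shapes, placeholder budget):
# convert an allocation ratio (vision:projector:language) into per-group
# LoRA ranks under a fixed total adapter-parameter budget.

def ranks_from_ratio(ratio, total_budget, group_dims):
    """ratio: weights per group, in the same order as group_dims.
    group_dims: {group: [(d_in, d_out), ...]} for each adapted linear layer.
    LoRA adds r * (d_in + d_out) parameters per adapted layer."""
    total_w = sum(ratio)
    ranks = {}
    for (group, dims), w in zip(group_dims.items(), ratio):
        group_budget = total_budget * w / total_w
        cost_per_rank = sum(d_in + d_out for d_in, d_out in dims)
        ranks[group] = max(1, round(group_budget / cost_per_rank))
    return ranks

# Illustrative shapes: 4 vision layers, 1 projector layer, 8 language layers.
group_dims = {
    "vision":    [(1024, 1024)] * 4,
    "projector": [(1024, 4096)],
    "language":  [(4096, 4096)] * 8,
}
# A 2:4:1 ratio concentrates rank in the single small projector layer.
print(ranks_from_ratio((2, 4, 1), 2_000_000, group_dims))
```

Under this mapping, a projector-heavy ratio translates into a much higher projector rank, because the projector has few layers over which to spread its share of the budget.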

Core Claims to Verify

  1. Claim 1: Systematic study of budget allocation RATIOS across modalities (vision:projector:language) has not been done
  2. Claim 2: Cross-modal projector layers are bottlenecks that need disproportionately high rank
  3. Claim 3: Optimal allocation is NOT uniform across modalities
  4. Claim 4: The specific ratio (e.g., 1:4:1 or 2:4:1) matters for performance
  5. Claim 5: This can be studied via ablation without expensive search algorithms

Search Results Summary

Claim 1: Systematic study of budget allocation RATIOS

Relevant papers found:

Claim 2: Cross-modal projector needs high rank

Relevant papers found:

Claim 3: Non-uniform allocation is better

Relevant papers found:

Claim 4: Specific ratios matter (1:1:1 vs 1:4:1 vs 2:4:1)

No papers found that systematically ablate allocation ratios like 1:1:1, 1:4:1, 2:4:1 across vision:projector:language.

Claim 5: Simple ablation approach (no expensive search)

Most papers use:

No paper does simple grid search over allocation ratios.

Phase C: Cross-Model Verification

Key Findings from Literature

  1. Multimodal Scaling Laws mentions a 1:3 vision:language ratio, but this appears to concern model size (parameters), not LoRA rank allocation during fine-tuning

  2. Adaptive Capacity Allocation for VLA (arXiv 2603.07404) is the CLOSEST work:

    • Studies capacity allocation for vision-language-action models
    • Finds "robotics transfer exhibits a higher and task-dependent capacity need"
    • BUT: focuses on VLA (robotics), not general VLMs like LLaVA/BLIP-2
    • Does NOT study cross-modal projector separately
  3. Neural Architecture Search for Variable LoRA Rank (arXiv 2508.12512):

    • Uses NAS to find optimal ranks for different layers
    • BUT: treats it as a search problem, not a systematic ablation
    • Does NOT report specific allocation ratios
  4. No paper systematically studies vision:projector:language allocation ratios

Closest Prior Work Analysis

| Paper | Year | Overlap | Key Difference |
|---|---|---|---|
| Adaptive Capacity Allocation for VLA | 2026 | Studies capacity allocation across modalities | (1) VLA, not VLM; (2) no projector analysis; (3) no ratio ablation |
| Neural Architecture Search for Variable LoRA Rank | 2025 | Variable rank in VLMs | Uses NAS, not ablation; no ratio study |
| MokA: Multimodal Low-Rank Adaptation | 2025 | Modality-aware LoRA | Shared vs. specific params, not allocation ratios |
| Q-Former PEFT | 2024 | Applies AdaLoRA to Q-Former | Doesn't compare projector vs. encoder/decoder needs |
| Dynamic Rank Allocation | 2025 | Heterogeneous rank allocation | Single-modality LLMs, not multimodal |

Phase D: Novelty Report

Overall Novelty Assessment

Score: 8/10

Recommendation: ✅ PROCEED

Core Novelty

What is novel:

  1. Systematic ratio ablation - No prior work tests allocation ratios like 1:1:1, 1:4:1, 2:4:1 across vision:projector:language
  2. Cross-modal projector focus - Existing work treats projector as part of "alignment" but doesn't isolate its rank requirements
  3. Simple diagnostic approach - Grid search over ratios is cheaper than NAS or AdaLoRA and yields interpretable results
  4. Actionable guidelines - Output is a recommended ratio (e.g., "use 2:4:1"), not a learned allocation policy
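The grid-search ablation in point 3 could look like the sketch below. `finetune_and_eval` is a hypothetical stub standing in for an actual LoRA fine-tuning and VQA evaluation run; its toy score is not a real result:

```python
# Hedged sketch of the proposed ratio ablation: a plain grid search over
# candidate allocation ratios, selecting the best by validation accuracy.

CANDIDATE_RATIOS = [(1, 1, 1), (1, 2, 1), (1, 4, 1), (2, 4, 1), (4, 1, 1)]

def finetune_and_eval(ratio):
    # Placeholder: a real study would fine-tune the model with LoRA ranks
    # derived from `ratio` and return validation accuracy. Here we fake a
    # score that peaks at a projector-heavy ratio (NOT a real result).
    v, p, l = ratio
    return p / (v + p + l)

def grid_search(ratios):
    results = {r: finetune_and_eval(r) for r in ratios}
    best = max(results, key=results.get)
    return best, results

best_ratio, results = grid_search(CANDIDATE_RATIOS)
print(best_ratio)
```

The point of the sketch is the workflow's simplicity: a handful of fine-tuning runs and an argmax, with no search algorithm to tune or allocation policy to learn.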

What is NOT novel:

Key Differentiator

The unique angle is the DIAGNOSTIC QUESTION: "What is the optimal budget allocation RATIO?"

Existing work either:

This work would be the FIRST to answer: "If I have a fixed parameter budget, what ratio should I allocate to vision:projector:language?"

Risk: What a Reviewer Would Cite

Potential objections:

  1. "Adaptive Capacity Allocation for VLA (arXiv 2603.07404) already studies this"

    • Counter: That's for VLA (robotics), not general VLMs. We study LLaVA/BLIP-2 on standard VQA benchmarks.
  2. "Neural Architecture Search already finds optimal ranks"

    • Counter: NAS is expensive and doesn't provide interpretable ratios. Our ablation is simpler and gives actionable guidelines.
  3. "This is just hyperparameter tuning"

    • Counter: We're revealing a structural property (projector needs high rank) that generalizes across tasks, not just tuning for one task.

Suggested Positioning

Title angle: "Where Should Parameters Go? A Systematic Study of Budget Allocation in Multimodal LoRA"

Framing:

  1. Problem: Existing adaptive methods (AdaLoRA, NAS) are expensive and don't provide interpretable allocation guidelines
  2. Gap: No one has systematically studied allocation ratios across vision:projector:language
  3. Contribution: Simple ablation reveals that cross-modal projector needs 2-4x higher rank than within-modality layers
  4. Impact: Practitioners can use our recommended ratios (e.g., 2:4:1) without expensive search

Key result to emphasize: "Cross-modal projector is a bottleneck - allocating 40-50% of parameter budget to projector (despite it being <10% of model size) improves accuracy by X%"
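A back-of-the-envelope sketch of that framing, using illustrative (not measured) parameter counts:

```python
# Assumption-laden sketch: with made-up, roughly LLaVA-scale parameter
# counts, the projector is a small fraction of the base model yet receives
# a large fraction of the LoRA budget under a 2:4:1 ratio.

# Illustrative base-model parameter counts (NOT real measurements).
base_params = {"vision": 300e6, "projector": 20e6, "language": 7000e6}
projector_model_share = base_params["projector"] / sum(base_params.values())

# LoRA budget split implied by a 2:4:1 vision:projector:language ratio.
ratio = {"vision": 2, "projector": 4, "language": 1}
projector_budget_share = ratio["projector"] / sum(ratio.values())

print(f"projector share of model:  {projector_model_share:.1%}")
print(f"projector share of budget: {projector_budget_share:.1%}")
```

Even with generous placeholder numbers, the asymmetry the report highlights (small module, large budget share) falls out of the arithmetic directly.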

Sources

Key papers to cite and differentiate from: