Multimodal Structured Reinforcement Learning (MSRL): uses 22k data pairs focusing on textual accuracy.
Qwen2.5-VL-72B-Instruct is used as the judge model for calculating visual rewards during training [11].

4. Experimental Results
Increasing the SFT data from 2M to 2.8M samples yields no further performance gains, confirming the plateau [22].
The paper demonstrates that MSRL significantly outperforms pure SFT models by optimizing for both textual structure and visual fidelity, effectively surpassing the performance limit reached at 2.8M SFT samples [11, 25].

| | SFT Stage | MSRL Stage |
|---|---|---|
| Max Dataset Size | 2.8 million samples [11, 22] | 33k curated samples [11] |
| GPU Requirement | 16 H800 GPUs [11] | 24 H800 GPUs [11] |
| Training Goal | Min. Negative Log-Likelihood [22] | Hybrid Text-Visual Reward [11] |
| Outcome | Performance Plateaus [22] | Breaks SFT Performance Limit [11] |
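The hybrid text-visual reward named as the MSRL training goal can be illustrated with a minimal sketch. This is not the paper's formulation, which this summary does not give. The function names, the character-level similarity used as the textual score, and the `text_weight` parameter are all assumptions made for illustration. The visual score is assumed to be a number in [0, 1] supplied by a judge model.

```python
from difflib import SequenceMatcher


def textual_reward(predicted_code: str, reference_code: str) -> float:
    """Hypothetical textual score: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, predicted_code, reference_code).ratio()


def hybrid_reward(
    predicted_code: str,
    reference_code: str,
    visual_score: float,
    text_weight: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Blend a textual score with an externally supplied visual score.

    visual_score is assumed to come from a judge model that compares the
    rendered chart with the target image and returns a value in [0, 1].
    """
    if not 0.0 <= visual_score <= 1.0:
        raise ValueError("visual_score must be in [0, 1]")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be in [0, 1]")
    text_score = textual_reward(predicted_code, reference_code)
    return text_weight * text_score + (1.0 - text_weight) * visual_score
```

A weighted sum is only one way to combine the two signals. It is used here because it keeps the reward in [0, 1] and makes the trade-off between textual and visual fidelity explicit.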
The paper addresses the "SFT plateau," a phenomenon where Supervised Fine-Tuning (SFT) performance on Large Language Models (LLMs) stops improving even as the dataset size increases [11, 22]. The authors use a chart-to-code dataset to demonstrate this limitation and propose Multimodal Structured Reinforcement Learning (MSRL) as a solution [11, 22].

2. Methodology

Supervised Fine-Tuning (SFT) Phase:
Baseline Model: Qwen2.5-VL-7B-Instruct [11, 22].
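The SFT training goal listed in this summary, minimizing negative log-likelihood, can be made concrete with a small sketch. It assumes the model's probabilities for each reference token are already available as plain floats. The function name and the averaging over tokens are illustrative choices, not details from the paper.

```python
import math


def negative_log_likelihood(token_probs: list[float]) -> float:
    """Mean negative log-likelihood of the reference tokens.

    token_probs holds the probability the model assigned to each
    ground-truth token; lower output means a better fit.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    for p in token_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("each probability must be in (0, 1]")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)
```

Because this objective only scores token-level agreement with the reference code, it does not directly measure whether the rendered chart looks right, which is the gap the visual reward in MSRL is meant to address.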