02 CMU · MMML

Fine-tuning LLaVA for Web Agents

2024

Lede

Group project at CMU. We took LLaVA-v1.5-7B, fine-tuned it on a web-UI dataset, and raised the open-model scores on VisualWebBench.

Results

Action Prediction: 77.94%
Heading OCR: 54.82%
Hardware: 1× A100

Report

Final report (PDF)

Fine-tuning

Trained on a WebUI dataset for the right reason.

We LoRA-tuned LLaVA-v1.5-7B on MultiUI, a diverse dataset of web-based UI interactions. As far as we could tell at the time, this was the first time a model was trained on web-UI tasks specifically to improve generalization to VisualWebBench, rather than treating the benchmark as just another evaluation target. We also ran detailed visual attention analyses along the way to track alignment between the textual and visual modalities.
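For readers who have not used LoRA, here is a rough sketch of why it fits on a single GPU. It is not our training code. The rank and alpha values are assumptions chosen for illustration, and the helper names (LoraSpec, lora_param_count, lora_effective_weight) are hypothetical. The only fixed number is the 4096 hidden size of the 7B language model behind LLaVA-v1.5-7B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSpec:
    """Hypothetical LoRA settings; assumed values, not the ones from our report."""
    rank: int = 16       # assumed adapter rank r
    alpha: float = 32.0  # assumed scaling numerator

    @property
    def scale(self) -> float:
        # LoRA scales the low-rank update by alpha / r.
        return self.alpha / self.rank


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)


def matmul(a, b):
    """Plain-Python matrix product, only meant for tiny illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, scale):
    """Return W + scale * (B @ A). W stays frozen; only A and B are trained."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


spec = LoraSpec()
hidden = 4096  # hidden size of the 7B language model in LLaVA-v1.5-7B
full = hidden * hidden
adapter = lora_param_count(hidden, hidden, spec.rank)
print(f"full matrix: {full:,} params; LoRA adapter: {adapter:,} ({adapter / full:.2%})")
```

Under these assumed settings, each adapted 4096 × 4096 projection trains well under 1% of the parameters it would need under full fine-tuning, which is the property that makes a 7B model tunable on modest hardware.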

Prompting

Task-specific prompt designs and preprocessing did most of the work.

The bigger lever turned out to be the prompt: task-specific designs and preprocessing for OCR, grounding, and captioning moved scores more than several rounds of additional fine-tuning. Bounding-box hints in the prompt mattered more than explicit visual cues in the image.
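As an illustration of what a bounding-box hint in the prompt can look like, here is a minimal sketch. The templates, the normalized-coordinate format, and the function names (normalize_bbox, build_prompt) are assumptions made up for this example, not the exact prompts from our report.

```python
# Hypothetical task templates; the wording is illustrative only.
TEMPLATES = {
    "ocr": "Read the text inside the region {box} of the screenshot.",
    "captioning": "Describe the element inside the region {box} of the screenshot.",
}


def normalize_bbox(bbox, width, height):
    """Convert a pixel (left, top, right, bottom) box to [0, 1] coordinates."""
    left, top, right, bottom = bbox
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError(f"box {bbox} does not fit in a {width}x{height} screenshot")
    return (
        round(left / width, 2),
        round(top / height, 2),
        round(right / width, 2),
        round(bottom / height, 2),
    )


def build_prompt(task, bbox, width, height):
    """Fill the task template with a textual bounding-box hint."""
    if task not in TEMPLATES:
        raise KeyError(f"no template for task {task!r}")
    box = normalize_bbox(bbox, width, height)
    box_text = "[" + ", ".join(f"{v:.2f}" for v in box) + "]"
    return TEMPLATES[task].format(box=box_text)


prompt = build_prompt("ocr", (128, 72, 640, 144), 1280, 720)
print(prompt)
```

The point of the design is that the region is described in text and the screenshot itself is left untouched, as opposed to drawing a rectangle onto the image before passing it to the model.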

Results

Best open-source small-model score, briefly.

On VisualWebBench (as of December 15, 2024), we scored 77.94% on Action Prediction, the best result we found at that scale, and 54.82% on Heading OCR, the best among the open low-parameter models we compared against. The numbers will have moved since; this was a snapshot from the course's final report.

Visual attention helped most where the model was bluffing.