01 HRI '26

SCOPE

2025

Lede

When you chain a language model and a vision model together, how do you know which one failed? SCOPE is an evaluation harness for modular multimodal pipelines, built around PTZ camera control as the test domain. Published at HRI '26.

Detail

The research question behind SCOPE isn't really about cameras; it's about how to evaluate modular AI systems. When a language model does the planning and a vision model does the perception, and the pipeline fails a task, you need infrastructure to isolate where it broke down and why. SCOPE builds that in three parts: a Blender simulation environment whose camera API mirrors real PTZ hardware, a 536-task benchmark across 8 categories (including counting, spatial reasoning, OCR, multi-step commands, and comparative relational analysis), and a harness that lets you swap language models and vision models independently to measure each component's contribution to system performance. We tested 19 SLM+VLM combinations. The top configuration, Qwen3-30B-A3B paired with Moondream3, hit 73.8% overall. The more interesting finding: mixture-of-experts architectures consistently outperformed dense models, which wasn't a given going in.
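The idea of swapping the two components independently can be sketched as a grid evaluation. This is a minimal illustration, not SCOPE's actual code: the `Task` type, the `evaluate_grid` and `marginal_means` helpers, and the stub planners and perceivers are all hypothetical, and exact string matching stands in for the LLM-as-Judge scoring listed in the stack.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Task:
    category: str
    prompt: str
    expected: str


# Hypothetical interfaces: a planner turns a user prompt into a perception
# query, and a perceiver answers that query from the (simulated) camera view.
Planner = Callable[[str], str]
Perceiver = Callable[[str], str]


def evaluate_grid(
    planners: Dict[str, Planner],
    perceivers: Dict[str, Perceiver],
    tasks: List[Task],
) -> Dict[Tuple[str, str], float]:
    """Score every planner/perceiver pairing on the same task set."""
    scores = {}
    for (p_name, plan), (v_name, perceive) in product(
        planners.items(), perceivers.items()
    ):
        # Exact match is an assumption made for this sketch only.
        correct = [perceive(plan(t.prompt)) == t.expected for t in tasks]
        scores[(p_name, v_name)] = sum(correct) / len(correct)
    return scores


def marginal_means(
    scores: Dict[Tuple[str, str], float], axis: int
) -> Dict[str, float]:
    """Average accuracy per component (axis 0 = planner, 1 = perceiver)."""
    buckets: Dict[str, List[float]] = {}
    for key, acc in scores.items():
        buckets.setdefault(key[axis], []).append(acc)
    return {name: sum(v) / len(v) for name, v in buckets.items()}


# Toy tasks and stub models, invented purely for illustration.
TASKS = [
    Task("counting", "how many cups", "3"),
    Task("ocr", "read the sign", "EXIT"),
    Task("counting", "how many chairs", "2"),
    Task("spatial", "what is left of the lamp", "book"),
]
ANSWERS = {t.prompt: t.expected for t in TASKS}


def planner_a(prompt: str) -> str:
    return prompt  # passes every prompt through unchanged


def planner_b(prompt: str) -> str:
    # Only handles counting prompts; drops everything else.
    return prompt if prompt.startswith("how many") else ""


def perceiver_x(query: str) -> str:
    return ANSWERS.get(query, "unknown")


def perceiver_y(query: str) -> str:
    # Fails on the OCR-style query.
    return "unknown" if "sign" in query else ANSWERS.get(query, "unknown")


SCORES = evaluate_grid(
    {"planner_a": planner_a, "planner_b": planner_b},
    {"perceiver_x": perceiver_x, "perceiver_y": perceiver_y},
    TASKS,
)
PLANNER_MEANS = marginal_means(SCORES, axis=0)
PERCEIVER_MEANS = marginal_means(SCORES, axis=1)
```

Holding one axis fixed while varying the other is what lets a failure be attributed to a component: in this toy grid, `planner_b` caps every pairing at the same score regardless of perceiver, which points at planning rather than perception as the bottleneck.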

Stack

Python
Blender
Qwen3
Moondream
vLLM
LLM-as-Judge

Links