SCOPE
When you chain a language model and a vision model together, how do you know which one failed? SCOPE is an evaluation harness for modular multimodal pipelines, built around PTZ camera control as the test domain. Published at HRI '26.
- Tasks: 536
- SLM × VLM combos: 19
- Best config: 73.8%
Modular pipelines are hard to evaluate.
When a vision-language pipeline gets something wrong, the error can sit anywhere: the language model planning the wrong action, the vision model misreading the frame, the glue code between them losing state. Standard benchmarks score the system end-to-end and tell you nothing about which component to swap.
Sim environment with hardware-faithful camera control.
We built a Blender simulator whose PTZ API mirrors the command surface of a real camera, so models trained on the simulator transfer to hardware without rewriting the control layer. On top of that sits a 536-task benchmark across 8 categories (counting, spatial reasoning, OCR, multi-step commands, comparative relational analysis, and three others), which lets us probe each capability separately. The harness swaps the language model and the vision model independently, so a change in score can be traced to the component that changed.
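The independent swap described above can be sketched as a grid run over every planner and perceiver pair, scored per category. This is a minimal illustration, not SCOPE's actual code: the `Task` fields, the `Planner` and `Perceiver` callables, `run_grid`, and the toy stand-in models are all hypothetical names, and a string stands in for a rendered frame.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions, not SCOPE's API):
# a perceiver turns a frame into a text description,
# a planner turns an instruction plus that description into an answer.
Perceiver = Callable[[str], str]
Planner = Callable[[str, str], str]


@dataclass(frozen=True)
class Task:
    category: str
    instruction: str
    frame: str      # stand-in for a rendered image
    expected: str


def run_pipeline(planner: Planner, perceiver: Perceiver, task: Task) -> bool:
    """Run one task through perception, then planning, and check the answer."""
    description = perceiver(task.frame)
    answer = planner(task.instruction, description)
    return answer == task.expected


def run_grid(
    planners: Dict[str, Planner],
    perceivers: Dict[str, Perceiver],
    tasks: List[Task],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Evaluate every (planner, perceiver) pair; return per-category accuracy."""
    results: Dict[Tuple[str, str], Dict[str, float]] = {}
    for (p_name, planner), (v_name, perceiver) in product(
        planners.items(), perceivers.items()
    ):
        hits: Dict[str, List[bool]] = {}
        for task in tasks:
            hits.setdefault(task.category, []).append(
                run_pipeline(planner, perceiver, task)
            )
        results[(p_name, v_name)] = {
            cat: sum(flags) / len(flags) for cat, flags in hits.items()
        }
    return results


# Toy stand-ins so the sketch runs without any real model.
def good_vlm(frame: str) -> str:
    return frame.split(":", 1)[1]       # "frame:3 chairs" -> "3 chairs"


def blind_vlm(frame: str) -> str:
    return ""                           # perceives nothing


def toy_planner(instruction: str, description: str) -> str:
    return description.split()[0] if description else ""


def lazy_planner(instruction: str, description: str) -> str:
    return "unknown"                    # ignores its inputs


tasks = [
    Task("counting", "how many chairs?", "frame:3 chairs", "3"),
    Task("ocr", "read the sign", "frame:EXIT", "EXIT"),
]

grid = run_grid(
    {"toy": toy_planner, "lazy": lazy_planner},
    {"good": good_vlm, "blind": blind_vlm},
    tasks,
)
```

Holding one component fixed and reading across the grid shows which swap moves the score: here the toy planner only succeeds when paired with the perceiver that actually reads the frame, and the lazy planner fails regardless of what it is paired with.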
MoE consistently beat dense at the same parameter budget.
Across 19 SLM × VLM pairs, Qwen3-30B-A3B + Moondream3 topped the leaderboard at 73.8% overall. The finding that surprised me most was structural: mixture-of-experts language models outperformed dense models of comparable size in almost every category, not just the ones with clear routing signal.
The research question behind SCOPE isn't really about cameras. It's about how to evaluate modular AI systems. When you have a language model doing the planning and a vision model doing the perception, and the pipeline fails a task, you need infrastructure to isolate where it broke down and why. SCOPE builds that: a Blender simulation environment where the camera API mirrors real PTZ hardware exactly, a 536-task benchmark across 8 categories (including counting, spatial reasoning, OCR, multi-step commands, and comparative relational analysis), and a harness that lets you swap language models and vision models independently to measure each component's contribution to system performance.
We tested 19 SLM × VLM combinations. The top configuration, Qwen3-30B-A3B paired with Moondream3, hit 73.8% overall. The more interesting finding: mixture-of-experts language models consistently outperformed dense models of comparable size, which wasn't a given going in.
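One simple way to read "each component's contribution" off a grid of results is to average each model's score over every partner it was paired with. The sketch below assumes that approach; it is not necessarily how SCOPE reports contributions. The function name `marginal_means`, the model labels, and the accuracy values are all placeholders invented for illustration, not measured results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def marginal_means(
    scores: Dict[Tuple[str, str], float],
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Average accuracy per language model and per vision model.

    `scores` maps (slm_name, vlm_name) to overall accuracy for that pair.
    """
    by_slm: Dict[str, List[float]] = defaultdict(list)
    by_vlm: Dict[str, List[float]] = defaultdict(list)
    for (slm, vlm), acc in scores.items():
        by_slm[slm].append(acc)
        by_vlm[vlm].append(acc)
    slm_means = {name: mean(vals) for name, vals in by_slm.items()}
    vlm_means = {name: mean(vals) for name, vals in by_vlm.items()}
    return slm_means, vlm_means


# Placeholder numbers for illustration only; these are NOT SCOPE results.
toy_scores = {
    ("slm_a", "vlm_x"): 0.5,
    ("slm_a", "vlm_y"): 0.75,
    ("slm_b", "vlm_x"): 0.25,
    ("slm_b", "vlm_y"): 0.5,
}

slm_means, vlm_means = marginal_means(toy_scores)
```

A marginal mean only makes sense when every language model has been paired with the same set of vision models; with an incomplete grid, the averages mix model quality with partner quality.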
- Python
- Blender
- Qwen3
- Moondream
- vLLM
- LLM-as-Judge