01 HRI '26
SCOPE — PTZ camera and Blender simulation

SCOPE

2025/2026
Lede

When you chain a language model and a vision model together, how do you know which one failed? SCOPE is an evaluation harness for modular multimodal pipelines, built around PTZ camera control as the test domain. Published at HRI '26.

Results
  • Tasks: 536
  • SLM × VLM combos: 19
  • Best config: 73.8%

Problem

Modular pipelines are hard to evaluate.

When a vision-language pipeline gets something wrong, the error can sit anywhere: the language model planning the wrong action, the vision model misreading the frame, the glue code between them losing state. Standard benchmarks score the system end-to-end and tell you nothing about which component to swap.

Approach

Sim environment with hardware-faithful camera control.

We built a Blender simulator whose PTZ API mirrors the command surface of a real camera, so models trained on the simulator transfer to hardware without rewriting the control layer. On top of that, a 536-task benchmark across 8 categories (counting, spatial reasoning, OCR, multi-step commands, comparative relational analysis, and three others) lets us probe each capability separately. The harness swaps the language model and the vision model independently, so a change in score can be traced to the component that changed.
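One way to picture a shared command surface is a small interface that both the simulator and the physical camera implement. The sketch below is illustrative only: `PTZCamera`, `PTZState`, `SimulatedPTZ`, `frame_target`, the method names, and the pan, tilt, and zoom limits are all assumptions made for this example, not SCOPE's actual API, and the simulated backend only tracks state rather than driving Blender.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class PTZState:
    pan: float   # degrees, positive to the right
    tilt: float  # degrees, positive upward
    zoom: float  # magnification factor


class PTZCamera(Protocol):
    """Hypothetical command surface shared by every backend."""

    def move_to(self, pan: float, tilt: float) -> PTZState: ...
    def set_zoom(self, zoom: float) -> PTZState: ...


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


class SimulatedPTZ:
    """Toy stand-in for a simulator backend: tracks state, renders nothing."""

    def __init__(self, pan_limit: float = 170.0, tilt_limit: float = 90.0,
                 max_zoom: float = 20.0) -> None:
        # The limits are made-up defaults, not values from real hardware.
        self.pan_limit = pan_limit
        self.tilt_limit = tilt_limit
        self.max_zoom = max_zoom
        self.state = PTZState(pan=0.0, tilt=0.0, zoom=1.0)

    def move_to(self, pan: float, tilt: float) -> PTZState:
        # Clamp to the configured limits, as a physical camera head would.
        self.state = PTZState(
            pan=_clamp(pan, -self.pan_limit, self.pan_limit),
            tilt=_clamp(tilt, -self.tilt_limit, self.tilt_limit),
            zoom=self.state.zoom,
        )
        return self.state

    def set_zoom(self, zoom: float) -> PTZState:
        self.state = PTZState(self.state.pan, self.state.tilt,
                              _clamp(zoom, 1.0, self.max_zoom))
        return self.state


def frame_target(camera: PTZCamera, pan: float, tilt: float,
                 zoom: float) -> PTZState:
    """Agent-side code depends only on the protocol, not on the backend."""
    camera.move_to(pan, tilt)
    return camera.set_zoom(zoom)


# The requested pan of 200 degrees exceeds the limit and is clamped to 170.
final_state = frame_target(SimulatedPTZ(), pan=200.0, tilt=-10.0, zoom=4.0)
```

Because `frame_target` is typed against the protocol, a hardware-backed class with the same two methods could be passed in without touching the agent code, which is the property the paragraph above describes.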

Result

MoE consistently beat dense at a comparable parameter budget.

Across 19 SLM × VLM pairs, Qwen3-30B-A3B + Moondream3 topped the leaderboard at 73.8% overall. The finding that surprised me most was structural: mixture-of-experts language models outperformed dense models of comparable size in almost every category, not just the ones with clear routing signal.

Detail

The research question behind SCOPE isn't really about cameras; it's about how to evaluate modular AI systems. When a language model does the planning, a vision model does the perception, and the pipeline fails a task, you need infrastructure to isolate where it broke down and why. SCOPE builds that in three parts: a Blender simulation environment where the camera API mirrors real PTZ hardware exactly, a 536-task benchmark across 8 categories (including counting, spatial reasoning, OCR, multi-step commands, and comparative relational analysis), and a harness that lets you swap language models and vision models independently to measure each component's contribution to system performance. We tested 19 SLM+VLM combinations. The top configuration, Qwen3-30B-A3B paired with Moondream3, hit 73.8% overall. The more interesting finding: mixture-of-experts architectures consistently outperformed dense models, which wasn't a given going in.
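The idea of swapping components independently can be sketched as a grid evaluation followed by marginal averages. Everything here is a toy under stated assumptions: `run_grid`, `marginal_accuracy`, the stub planners and perceivers, and the two-task list are invented for illustration, and exact string match stands in for whatever scoring the real harness uses (the stack lists LLM-as-Judge). The numbers it produces are not SCOPE results.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, List, Tuple

Task = Tuple[str, str, str]       # (category, prompt, expected answer)
Planner = Callable[[str], str]    # prompt -> question passed to perception
Perceiver = Callable[[str], str]  # question -> answer
Scores = Dict[Tuple[str, str], float]


def run_grid(planners: Dict[str, Planner],
             perceivers: Dict[str, Perceiver],
             tasks: List[Task]) -> Scores:
    """Score every planner x perceiver pair on the same task list."""
    scores: Scores = {}
    for (p_name, plan), (v_name, see) in product(planners.items(),
                                                 perceivers.items()):
        hits = [see(plan(prompt)) == expected for _, prompt, expected in tasks]
        scores[(p_name, v_name)] = sum(hits) / len(hits)
    return scores


def marginal_accuracy(scores: Scores, axis: int) -> Dict[str, float]:
    """Average each component's accuracy over all of its partners.

    axis=0 groups by planner, axis=1 groups by perceiver.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for pair, accuracy in scores.items():
        grouped[pair[axis]].append(accuracy)
    return {name: sum(vals) / len(vals) for name, vals in grouped.items()}


# Hypothetical stubs standing in for real model calls.
KNOWN = {"how many cars": "3", "read the sign": "STOP"}
tasks: List[Task] = [
    ("counting", "how many cars", "3"),
    ("ocr", "read the sign", "STOP"),
]
planners: Dict[str, Planner] = {
    "pass_through": lambda prompt: prompt,
    # Loses every prompt that is not about cars.
    "lossy": lambda prompt: prompt if "cars" in prompt else "",
}
perceivers: Dict[str, Perceiver] = {
    "full_vlm": lambda q: KNOWN.get(q, ""),
    # Answers counting questions only.
    "counting_only_vlm": lambda q: KNOWN.get(q, "") if "how many" in q else "",
}

scores = run_grid(planners, perceivers, tasks)
by_planner = marginal_accuracy(scores, axis=0)
by_perceiver = marginal_accuracy(scores, axis=1)
```

Holding the task list fixed and varying one axis at a time is what lets a drop in accuracy be attributed to the planner or the perceiver rather than to the pair as a whole. Grouping `hits` by the category field would give the per-category breakdown in the same way.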

Gallery
Agent running in Blender across three urban scenes. The terminal trace on the left logs every tool call, VLM response, and planner reasoning step.
More simulation frames across different task categories. Same harness, same prompts, swappable SLM and VLM backbones. You can also see the scripts opening different worlds and the SLM/VLM stack spinning up immediately as each one loads.
Real PTZ camera in the office. Same agent, same exposed system prompt, same tool definitions; only the planner LLM changes between runs. The first pass uses Qwen3-30B-A3B (the 30B mixture-of-experts model with 3B active parameters). The second swaps it for the dense Qwen3-32B from the same series. The visible difference is speed: the MoE plans and acts noticeably faster than the dense model, even at comparable capability scores. A practical reminder that architecture choice can matter more than parameter count.
Stack
  • Python
  • Blender
  • Qwen3
  • Moondream
  • vLLM
  • LLM-as-Judge