# VideoAgent V4 Implementation Plan

## Core Move

V3 was a multi-model voting system. V4 should be an evidence-grounded agent harness:

1. Parse the question.
2. Decide which tools are required.
3. Build a timeline of evidence from the video.
4. Run a specialized solver for that question shape.
5. Verify the answer against cited evidence windows.
6. Spend remaining time only on low-confidence cases.
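
The six steps above can be sketched as a single per-question function. This is a minimal illustration, not the actual `run.py`: the function name `answer_question`, the callable parameters, the result keys, and the `LOW_CONFIDENCE` cutoff of 0.6 are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical cutoff; the plan does not specify a confidence threshold.
LOW_CONFIDENCE = 0.6


def answer_question(
    question: str,
    choose_tools: Callable[[str], list],
    build_timeline: Callable[[list], list],
    solve: Callable[[str, list], dict],
    verify: Callable[[dict, list], bool],
) -> dict:
    """Run steps 1-5 once and flag the result for step 6 if it looks weak."""
    tools = choose_tools(question)        # steps 1-2: parse the question, pick tools
    timeline = build_timeline(tools)      # step 3: evidence timeline from the video
    result = solve(question, timeline)    # step 4: solver for this question shape
    supported = verify(result, timeline)  # step 5: check cited evidence windows
    result["verifier_status"] = "supported" if supported else "unsupported"
    # Step 6: only unsupported or low-confidence answers earn a heavier second pass.
    result["needs_second_pass"] = (not supported) or result["confidence"] < LOW_CONFIDENCE
    return result


if __name__ == "__main__":
    # Stub callables stand in for the real router, timeline, solver, and verifier.
    demo = answer_question(
        "How many cards are visible?",
        choose_tools=lambda q: ["asr", "dense_frames"],
        build_timeline=lambda tools: [{"time": "03:20-03:23", "source": "dense_frames"}],
        solve=lambda q, tl: {"answer": "C", "confidence": 0.74, "evidence": tl},
        verify=lambda res, tl: bool(res["evidence"]),
    )
    print(demo["verifier_status"], demo["needs_second_pass"])
```

Passing the stages in as callables keeps the control flow testable without any video tooling installed.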

## Runtime Budget

The system must finish 20 videos in 15 minutes, an average of 45 seconds per video end to end.

| Window | Work |
| --- | --- |
| 0:00-1:00 | Inventory all videos, read prompts, extract duration/resolution/audio-track metadata |
| 1:00-4:00 | Run cheap parallel preprocessing: scene cuts, ASR, OCR preview, target timestamp clips |
| 4:00-9:00 | Run first-pass solver for every question |
| 9:00-13:30 | Re-run only low-confidence answers with heavier tools |
| 13:30-15:00 | Assemble answer string, persist evidence JSON, fail-safe fallback |
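
One way to make the budget enforceable is to encode the table as data and let the scheduler ask which phase the clock is in. The sketch below assumes that approach; `PHASES`, `phase_at`, and `seconds_left_in_phase` are hypothetical names, and only the boundaries come from the table above.

```python
# Phase boundaries in seconds since start, copied from the budget table.
PHASES = [
    (0, 60, "inventory"),       # 0:00-1:00
    (60, 240, "preprocess"),    # 1:00-4:00
    (240, 540, "first_pass"),   # 4:00-9:00
    (540, 810, "second_pass"),  # 9:00-13:30
    (810, 900, "assemble"),     # 13:30-15:00
]


def phase_at(elapsed_s: float) -> str:
    """Return the phase name for a given elapsed time, or 'overtime' past 15:00."""
    for start, end, name in PHASES:
        if start <= elapsed_s < end:
            return name
    return "overtime"


def seconds_left_in_phase(elapsed_s: float) -> float:
    """Return how long the current phase may still run (0.0 once overtime)."""
    for start, end, _name in PHASES:
        if start <= elapsed_s < end:
            return end - elapsed_s
    return 0.0


if __name__ == "__main__":
    for t in (30, 200, 500, 800, 850, 905):
        print(t, phase_at(t), seconds_left_in_phase(t))
```

With 20 videos, the first-pass window of 300 seconds leaves 15 seconds per question if the solvers run one at a time, so parallelism or a hard per-question timeout is likely needed.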

## Modules

```text
videoagent_v4/
  src/
    run.py
    router.py
    scheduler.py
    evidence_store.py
    verifier.py
    timeline/
      media_probe.py
      scene_cuts.py
      asr.py
      ocr.py
      keyframes.py
      audio_segments.py
      object_tracks.py
      face_clusters.py
    solvers/
      timestamp_solver.py
      audio_trigger_solver.py
      trailer_scene_solver.py
      recipe_sequence_solver.py
      sports_count_solver.py
      magic_cards_solver.py
      assembly_count_solver.py
      generic_videoqa_solver.py
  outputs/
    run.json
    answer.txt
    evidence/
```

## Solver Routing

| Prompt shape | Solver | Tool calls |
| --- | --- | --- |
| `At exactly 5:00`, `2:30`, `first 19 seconds` | `timestamp_solver` | extract a short clip on either side (+/-) of the target time, sample dense frames, classify with a VLM |
| quoted speech such as `oh my god` | `audio_trigger_solver` | Whisper ASR, locate phrase, sample around timestamp |
| `how many songs`, `voice chat turns` | `audio_segments` + verifier | ASR/music segmentation, turn counting |
| trailer / movie scene / Statue of Liberty / character count | `trailer_scene_solver` | scene cuts, face clustering, landmark/object detection, VLM verifier |
| recipe / ingredient order | `recipe_sequence_solver` | ASR transcript, scene cuts, ingredient entity ledger |
| football goals / outside penalty area | `sports_count_solver` | first-N-min clip, event segmentation, field geometry review |
| magician cards | `magic_cards_solver` | ASR trigger, dense local frames, hand/card count verifier |
| IKEA dowel insertions | `assembly_count_solver` | object/action event ledger, duplicate suppression |
| unknown or mixed | `generic_videoqa_solver` | V3-style frame set A/B + model ensemble |
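
The routing table could be implemented as an ordered list of patterns. The sketch below is one hedged possibility, not the actual `router.py`: the regular expressions are illustrative guesses at each prompt shape, and returning every matching solver (rather than only the first) is an assumption.

```python
import re

# (solver name, pattern) pairs in table order. The patterns are illustrative
# assumptions; a real router would need tuning against the actual prompts.
ROUTES = [
    ("timestamp_solver", re.compile(r"\b\d{1,2}:\d{2}\b|\bfirst \d+ (?:seconds|minutes)\b", re.I)),
    ("audio_trigger_solver", re.compile(r'["“][^"”]+["”]')),  # quoted speech
    # The table names the audio_segments timeline module for this row.
    ("audio_segments", re.compile(r"how many songs|voice chat", re.I)),
    ("trailer_scene_solver", re.compile(r"trailer|movie scene|statue of liberty", re.I)),
    ("recipe_sequence_solver", re.compile(r"recipe|ingredient", re.I)),
    ("sports_count_solver", re.compile(r"football|penalty area|\bgoals?\b", re.I)),
    ("magic_cards_solver", re.compile(r"magician|\bcards?\b", re.I)),
    ("assembly_count_solver", re.compile(r"ikea|dowel", re.I)),
]


def route(prompt: str) -> list:
    """Return every solver whose pattern matches, else the generic fallback."""
    matches = [name for name, pattern in ROUTES if pattern.search(prompt)]
    return matches or ["generic_videoqa_solver"]


if __name__ == "__main__":
    print(route('When someone says "oh my god", how many cards does the magician hold?'))
    print(route("What is on screen at exactly 5:00?"))
    print(route("What color is the car?"))
```

Returning a list lets a mixed prompt, such as a quoted phrase plus a card count, reach more than one solver.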

## Evidence Contract

Every answer must write this shape:

```json
{
  "video": 5,
  "answer": "C",
  "confidence": 0.74,
  "question_type": ["audio_trigger", "card_count"],
  "evidence": [
    {
      "time": "03:12-03:20",
      "source": "asr",
      "claim": "panel member says oh my god"
    },
    {
      "time": "03:20-03:23",
      "source": "dense_frames",
      "claim": "three visible cards on the mat or in hand"
    }
  ],
  "verifier": {
    "status": "supported",
    "reason": "answer choice C matches the visible card count in the cited window"
  }
}
```
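
A small validator can reject records that drift from this shape before they are persisted. The following is a sketch under stated assumptions: the helper names `window_seconds` and `validate_record` are hypothetical, the `MM:SS-MM:SS` window format is inferred from the example, and requiring at least one evidence item is an added assumption rather than something the contract states.

```python
import re

REQUIRED_KEYS = {"video", "answer", "confidence", "question_type", "evidence", "verifier"}
# Assumed window format, inferred from the example: "MM:SS-MM:SS".
TIME_WINDOW = re.compile(r"^(\d+):([0-5]\d)-(\d+):([0-5]\d)$")


def window_seconds(window: str) -> tuple:
    """Parse 'MM:SS-MM:SS' into (start_s, end_s); raise ValueError if malformed."""
    match = TIME_WINDOW.match(window)
    if match is None:
        raise ValueError(f"bad time window: {window!r}")
    m1, s1, m2, s2 = (int(g) for g in match.groups())
    start, end = m1 * 60 + s1, m2 * 60 + s2
    if end < start:
        raise ValueError(f"window ends before it starts: {window!r}")
    return start, end


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record fits the shape."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    conf = record["confidence"]
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0, 1]")
    if not record["evidence"]:
        # Assumption: an answer with no cited evidence is treated as invalid.
        problems.append("at least one evidence item is required")
    for i, item in enumerate(record["evidence"]):
        for key in ("time", "source", "claim"):
            if key not in item:
                problems.append(f"evidence[{i}] is missing {key!r}")
        if "time" in item:
            try:
                window_seconds(item["time"])
            except ValueError as exc:
                problems.append(f"evidence[{i}]: {exc}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log everything wrong with a record in the final assembly window.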

## Why This Is Better Than V3

| V3 | V4 |
| --- | --- |
| Answered from sampled frames and model voting | Answers from a tool-generated evidence timeline |
| Treated disagreement as a voting problem | Treats disagreement as a missing-evidence problem |
| Spent roughly similar effort on every video | Reallocates remaining time to low-confidence answers |
| Had weak explicit audio support | Makes ASR/audio segmentation a first-class evidence source |
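
The reallocation idea can be illustrated with a simple greedy pick: re-run the weakest answers first, for as many as the remaining time allows. This is a sketch only. The function name `pick_reruns`, the fixed per-rerun cost, and the 0.6 threshold are assumptions, not values from the plan.

```python
def pick_reruns(confidences: dict, budget_s: float, cost_s: float,
                threshold: float = 0.6) -> list:
    """Choose video ids to re-run, weakest first, within the time budget.

    confidences maps video id -> first-pass confidence.
    cost_s is an assumed fixed cost per heavy re-run, in seconds.
    """
    if cost_s <= 0:
        raise ValueError("cost_s must be positive")
    # Keep only answers below the threshold, sorted from least confident.
    weak = sorted((conf, vid) for vid, conf in confidences.items() if conf < threshold)
    slots = int(budget_s // cost_s)  # how many re-runs fit in the window
    return [vid for _conf, vid in weak[:slots]]


if __name__ == "__main__":
    first_pass = {1: 0.9, 2: 0.4, 3: 0.55, 4: 0.2}
    # The 9:00-13:30 window is 270 seconds; assume 100 seconds per heavy re-run.
    print(pick_reruns(first_pass, budget_s=270, cost_s=100))
```

A fixed cost per re-run is a simplification; actual costs will vary with the solver and clip length.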

## Portfolio Positioning

This plan does not claim to know the official hidden score.
It is a concrete improvement plan for the system, which can be summarized as:

> I turned a hackathon Video QA ensemble into an agent harness with planning, tool use, evidence memory, verifier loops, and benchmark-style traceability.