summaryrefslogblamecommitdiff
path: root/docs/proposals/feature-multi-agent-review.md
blob: d678756ec151f93cac222b0b5033c6a2e1569ae2 (plain) (tree)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448































































































































































































































































































































































































































































                                                                                                                                                                                                                       
# Feature Proposal: Multi-Agent Parallel Review System

> **Priority:** High
> **Complexity:** Medium
> **Estimated Effort:** 12-18 days
> **Status:** Proposal
> **Date:** 2026-02-09
> **Dependencies:** [Findings Tracking](feature-findings-tracking.md) (recommended)
> **Related:** [Overview Analysis](compound-engineering-analysis.md) · [Workflow Presets](feature-workflow-presets.md)

---

## Problem Statement

Makima's contract lifecycle includes a **Review** phase, but it currently has:

- **No automated review mechanism** — the review phase relies entirely on manual user inspection or a single supervisor task
- **Single-perspective review** — even when a review task is spawned, it examines code from one viewpoint
- **No structured review output** — findings are captured as unstructured text in task output
- **No review templates** — each review must be configured from scratch
- **No synthesis** — when multiple reviewers exist, there's no mechanism to deduplicate and prioritize findings

For complex contracts touching security, performance, and architecture, a single-pass review consistently misses category-specific issues that specialized reviewers would catch.

---

## How Compound Engineering Solves This

The compound engineering plugin spawns **12-15 specialized review agents in parallel**, each examining the code from a unique perspective:

| Agent | Focus Area | Example Findings |
|-------|-----------|-----------------|
| Security Sentinel | Auth, injection, secrets, CSRF | SQL injection in user input handler |
| Performance Oracle | N+1 queries, memory leaks, caching | Unbounded list growth in event handler |
| Architecture Strategist | Coupling, SOLID, layering | Service directly accessing repository internals |
| Code Philosopher | Readability, naming, complexity | Cyclomatic complexity > 15 in payment flow |
| Data Integrity Guardian | Validation, constraints, migrations | Missing NOT NULL constraint on required field |
| Error Resilience Analyzer | Error handling, retries, fallbacks | Unhandled timeout in external API call |
| API Contract Validator | Breaking changes, versioning | Removed required field from response |
| Dependency Health Checker | Vulnerabilities, licensing, freshness | CVE-2025-XXXX in transitive dependency |
| Test Coverage Analyzer | Coverage gaps, edge cases, mocking | No tests for error path in checkout flow |
| Documentation Completeness | Docs accuracy, examples, changelog | Public API endpoint undocumented |
| Concurrency Safety | Race conditions, deadlocks, atomicity | Non-atomic read-modify-write on shared counter |

After all agents complete, a **synthesis agent** deduplicates findings, resolves contradictions, and produces a prioritized report.

```
┌───────────────────────────────────────────────────────┐
│                  Review Orchestrator                   │
│                                                       │
│  spawn-group "review"                                 │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐   │
│  │Security │ │ Perf    │ │  Arch   │ │  Code   │   │
│  │Sentinel │ │ Oracle  │ │Strategy │ │  Phil   │   │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘   │
│       │           │           │           │         │
│  ┌────┴────┐ ┌────┴────┐ ┌────┴────┐ ┌────┴────┐   │
│  │  Data   │ │ Error   │ │  API    │ │  Deps   │   │
│  │Guardian │ │Resilien.│ │Contract │ │ Health  │   │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘   │
│       │           │           │           │         │
│  ┌────┴────┐ ┌────┴────┐ ┌────┴────┐               │
│  │  Test   │ │  Docs   │ │Concurr. │               │
│  │Coverage │ │Complete │ │ Safety  │               │
│  └────┬────┘ └────┬────┘ └────┬────┘               │
│       │           │           │                     │
│  wait-group "review"                                 │
│       ▼           ▼           ▼                     │
│  ┌──────────────────────────────────────────┐       │
│  │         Synthesis Agent                   │       │
│  │  - Deduplicate findings                   │       │
│  │  - Resolve contradictions                 │       │
│  │  - Prioritize by severity                 │       │
│  │  - Generate summary report                │       │
│  └──────────────────────────────────────────┘       │
│                      │                               │
│                      ▼                               │
│             Structured Findings                      │
│             (P1 / P2 / P3)                           │
└───────────────────────────────────────────────────────┘
```

---

## Proposed Makima Implementation

### 1. New Supervisor Commands

#### `makima supervisor spawn-group`

Spawns multiple tasks as a named group and returns immediately:

```bash
# Spawn a review group with 5 agents
makima supervisor spawn-group "review" \
  --tasks '[
    {"name": "security-review", "plan": "Review for security vulnerabilities..."},
    {"name": "performance-review", "plan": "Review for performance issues..."},
    {"name": "architecture-review", "plan": "Review for architecture concerns..."}
  ]' \
  --share-worktree \
  --read-only
```

**Key parameters:**
- `--tasks` — JSON array of task definitions
- `--share-worktree` — All tasks in the group share the supervisor's worktree (read-only access)
- `--read-only` — Tasks cannot modify files, only produce output
- `--max-concurrent N` — Limit parallel execution (default: unlimited)

#### `makima supervisor wait-group`

Waits for all tasks in a named group to complete:

```bash
# Wait for all review tasks, timeout after 10 minutes
makima supervisor wait-group "review" --timeout 600

# Returns JSON with all task results
```

**Output format:**
```json
{
  "group": "review",
  "status": "completed",
  "tasks": [
    {"name": "security-review", "status": "done", "output": "..."},
    {"name": "performance-review", "status": "done", "output": "..."}
  ],
  "duration_seconds": 127
}
```

#### `makima supervisor review`

High-level command that orchestrates the full review pipeline:

```bash
# Run review with default agent config
makima supervisor review

# Run review with custom config
makima supervisor review --config .makima/review-agents.yaml

# Run only specific review categories
makima supervisor review --only security,performance,architecture
```

### 2. Review Agent Configuration

#### Repository-Level Configuration (`.makima/review-agents.yaml`)

```yaml
# .makima/review-agents.yaml
version: 1
review:
  # Maximum number of concurrent review agents
  max_concurrent: 8

  # Timeout per agent (seconds)
  agent_timeout: 300

  # Auto-trigger review when phase transitions to 'review'
  auto_trigger: true

  # Finding severity that blocks merge
  merge_blocking_severity: P1

  agents:
    - name: security-sentinel
      enabled: true
      plan: |
        You are a Security Sentinel reviewing code changes.

        Focus areas:
        - Authentication and authorization flaws
        - Injection vulnerabilities (SQL, XSS, command injection)
        - Secret/credential exposure
        - CSRF and session management
        - Input validation gaps

        Output format: One finding per section with severity (P1/P2/P3),
        affected file/line, description, and suggested fix.
      priority: critical  # Always runs

    - name: performance-oracle
      enabled: true
      plan: |
        You are a Performance Oracle reviewing code changes.

        Focus areas:
        - N+1 query patterns
        - Memory leaks and unbounded growth
        - Missing caching opportunities
        - Algorithmic complexity issues
        - Database index utilization

        Output format: One finding per section with severity (P1/P2/P3),
        affected file/line, description, and suggested fix.
      priority: standard

    - name: architecture-strategist
      enabled: true
      plan: |
        You are an Architecture Strategist reviewing code changes.

        Focus areas:
        - SOLID principle violations
        - Inappropriate coupling between modules
        - Layering violations (e.g., handler accessing DB directly)
        - Missing abstraction boundaries
        - Inconsistency with existing patterns

        Output format: One finding per section with severity (P1/P2/P3),
        affected file/line, description, and suggested fix.
      priority: standard

    - name: test-coverage-analyzer
      enabled: true
      plan: |
        You are a Test Coverage Analyzer reviewing code changes.

        Focus areas:
        - Missing test coverage for new code paths
        - Untested error/edge cases
        - Test quality (meaningful assertions vs superficial)
        - Integration test gaps
        - Mock appropriateness

        Output format: One finding per section with severity (P1/P2/P3),
        affected file/line, description, and suggested fix.
      priority: standard

    # Users can add custom agents here
    - name: custom-domain-reviewer
      enabled: false
      plan: "Review for domain-specific business logic concerns..."
      priority: optional
```

#### Contract-Level Override

```yaml
# In contract configuration or via CLI
review:
  agents:
    # Disable agents not relevant to this contract
    - name: concurrency-safety
      enabled: false
    # Add contract-specific reviewer
    - name: migration-safety
      enabled: true
      plan: "Review database migrations for data loss risks..."
```

### 3. Synthesis Step

After all review agents complete, a synthesis task:

1. **Collects** all findings from group task outputs
2. **Deduplicates** findings about the same issue from different perspectives
3. **Resolves contradictions** (e.g., one agent says "add caching" while another says "caching adds complexity")
4. **Prioritizes** by severity and cross-agent agreement
5. **Produces** a structured review report as a contract file

```bash
# Synthesis is automatically run after wait-group completes
makima supervisor synthesize-review "review" \
  --output-format findings \
  --create-contract-file
```

### 4. Auto-Review Trigger

When a contract's phase transitions to `review`:

```rust
// In phase transition handler
if new_phase == "review" && contract.review_config.auto_trigger {
    // Spawn review group automatically
    spawn_review_group(contract, review_config).await?;
}
```

---

## Integration with Existing Makima Features

### Supervisor/Worker Hierarchy

Review agents are spawned as **worker tasks** under the supervisor, using existing `spawn-task` infrastructure. The new `spawn-group`/`wait-group` commands are syntactic sugar over batch `spawn-task` + `wait` calls.

### Git Worktree Isolation

Review agents share the supervisor's worktree in **read-only mode** (a new capability). This avoids creating N separate worktrees for review-only tasks. Implementation:
- New `supervisor_worktree_task_id` parameter (already exists in SpawnTask)
- New `read_only: true` flag to prevent file modifications
- Workers see the same code state that triggered the review

### Contract Files

The synthesized review report is stored as a **contract file** attached to the review phase:
```rust
File {
    contract_id: contract.id,
    contract_phase: "review",
    name: "Review Report — 2026-02-09",
    body: vec![
        BodyElement::Heading { level: 1, text: "Review Summary" },
        BodyElement::Paragraph { text: "3 P1 findings, 7 P2 findings, 12 P3 findings" },
        // ... structured findings
    ],
}
```

### Phase Guards

If `phase_guard` is enabled and P1 findings exist, the phase transition from Review to Execute (or Compound) is blocked until P1s are resolved. This integrates with the existing `advance-phase` confirmation flow.

### Completion Gates

Each review agent uses the existing `<COMPLETION_GATE>` mechanism to signal when its review is complete:
```xml
<COMPLETION_GATE>
ready: true
reason: "Security review complete. Found 2 P1 and 3 P2 findings."
progress: "Reviewed 47 files across 12 modules."
</COMPLETION_GATE>
```

### Circuit Breaker

The existing CircuitBreaker protects against review agents getting stuck. If a review agent loops without progress for 3 iterations, it's terminated and its partial findings are included in synthesis.

---

## Implementation Plan

### Phase 1: Group Task Infrastructure (5-7 days)

| Task | Effort | Description |
|------|--------|-------------|
| `spawn-group` command | 2 days | Batch task spawning with named groups |
| `wait-group` command | 1 day | Wait for all tasks in group |
| Group tracking in DB | 1 day | Task group table, membership, status |
| Shared worktree (read-only) | 1-2 days | Workers share supervisor worktree |
| Tests | 1 day | Unit + integration tests |

### Phase 2: Review Agent System (4-6 days)

| Task | Effort | Description |
|------|--------|-------------|
| Review config YAML parser | 1 day | Parse `.makima/review-agents.yaml` |
| `supervisor review` command | 2 days | Orchestrate review pipeline |
| Synthesis agent logic | 1-2 days | Deduplicate, prioritize, format |
| Review report as contract file | 1 day | Store structured output |

### Phase 3: Automation & Polish (3-5 days)

| Task | Effort | Description |
|------|--------|-------------|
| Auto-trigger on phase transition | 1 day | Hook into `advance-phase` |
| P1 merge blocking | 1 day | Phase guard integration |
| Default review agent templates | 1-2 days | Ship 8-10 built-in agents |
| Documentation | 1 day | User guide and config reference |

---

## Configuration Examples

### Minimal Setup (Zero Config)

```bash
# Uses built-in review agents with default settings
makima supervisor review
```

### Custom Review for a Specific Contract

```bash
# Override for this contract only
makima supervisor review \
  --only security,performance \
  --merge-blocking P1 \
  --timeout 300
```

### Full Custom Configuration

```yaml
# .makima/review-agents.yaml
version: 1
review:
  max_concurrent: 6
  agent_timeout: 300
  auto_trigger: true
  merge_blocking_severity: P1

  synthesis:
    dedup_threshold: 0.8        # Similarity score for deduplication
    min_agreement: 2             # Findings flagged by 2+ agents get priority boost
    output_format: "findings"    # "findings" | "report" | "both"
    create_contract_file: true

  agents:
    - name: security-sentinel
      enabled: true
      priority: critical
      plan: |
        ...
    - name: performance-oracle
      enabled: true
      priority: standard
      plan: |
        ...
    # ... more agents
```

---

## Open Questions

1. **Shared worktree read-only enforcement**: Should this be enforced at the filesystem level (mount read-only) or via convention (instructions to the agent)?
2. **Review scope**: Should review agents see all files or only changed files (git diff)?
3. **Incremental review**: When new commits are added during review, should agents re-review or only review the delta?
4. **Agent output parsing**: Should agents output structured YAML findings, or should the synthesis step parse natural language?
5. **Cost control**: With 10+ parallel agents, how do we manage API costs? Should there be a budget ceiling per review?
6. **Finding deduplication**: What similarity threshold should trigger deduplication? How to handle partial overlaps?

---

## Alternatives Considered

| Alternative | Pros | Cons | Decision |
|-------------|------|------|----------|
| Single comprehensive review agent | Simple, no coordination overhead | Misses perspective-specific issues | Rejected — diminishes review quality |
| Sequential reviews (one after another) | Simpler orchestration | 5-10x slower; later reviews can't benefit from earlier ones | Rejected — latency unacceptable |
| External review tools integration | Leverage existing static analysis | Limited to tool capabilities; no semantic review | Complement — can integrate alongside agent review |
| User-configured number of agents | Maximum flexibility | Analysis paralysis for new users | Adopted — sensible defaults + customization |

---

## Priority & Complexity Assessment

- **Priority: HIGH** — Multi-agent review is the highest-impact feature from the compound engineering plugin. It directly improves code quality with no change to developer workflow.
- **Complexity: MEDIUM** — The core `spawn-group`/`wait-group` pattern is straightforward. The synthesis step requires careful design. Shared worktree read-only mode is a new capability.
- **Risk: LOW-MEDIUM** — Main risks are resource consumption (manageable with concurrency limits) and synthesis quality (improvable iteratively).