GitLab Duo CLI Advanced Patterns — Batch Automation, Multi-Model Comparison & Noise Reduction

This is the follow-up to my getting started article. If you haven't tried Duo CLI yet, start there.

Why Batch Automation

Here's the story that got me started.

I was reviewing a large refactoring MR that touched about a dozen modules. Using Duo Chat on the web, I asked about each module one by one. By the fifth one, I'd already forgotten the conclusions from the first. And each time I had to re-provide context, because Duo Chat didn't know what I'd asked earlier (this was before session sync existed).

After that, I started scripting. I queued up review tasks for all twelve modules, ran them in one go, results saved to files. Made myself a coffee, came back, everything waiting for me.

That's the core value of batch automation: decoupling "waiting for AI" from your attention.

Minimum Viable Batch Setup

It doesn't need to be complicated. My setup is three directories:

my-review/
├── cells/        # One JSON file per task
├── state/        # Completion markers (.done files)
└── results/      # AI responses land here

A cell looks like this:

{
  "id": "auth-module-error-handling",
  "target": "/path/to/repo",
  "model": "claude-opus-4-6",
  "goal": "Review the auth module's error handling, focusing on token expiry edge cases"
}

The runner script:

#!/bin/bash
for cell_file in cells/*.json; do
  cell_id=$(jq -r '.id' "$cell_file")

  # Skip completed cells
  [ -f "state/${cell_id}.done" ] && continue

  goal=$(jq -r '.goal' "$cell_file")
  target=$(jq -r '.target' "$cell_file")
  model=$(jq -r '.model' "$cell_file")

  echo "⏳ Running: $cell_id"

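  # Mark the cell done only when the run exits with status 0, so a
  # dropped connection leaves it pending for the next pass
  # (assumes glab returns a non-zero status on failure)
  if glab duo cli run \
    -C "$target" \
    --goal "$goal" \
    --model "$model" \
    --dangerously-skip-permissions \
    > "results/${cell_id}.md" 2>&1; then
    touch "state/${cell_id}.done"
    echo "✅ Done: $cell_id"
  else
    echo "❌ Failed: $cell_id (left pending)"
  fi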

  sleep 30  # avoid rate limits
done

echo "All done!"

The key is the .done file. Connection drops mid-run? Just re-run and it skips what's already finished. No database, just a touch.

I sleep 30 seconds between cells. Not strictly required, but hitting the API too fast occasionally triggers rate limits. 30 seconds has been stable for me.
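
Because the state lives in plain files, checking progress needs nothing beyond the shell. Here's a minimal sketch, assuming each cell file is named after its id (<id>.json), which the runner itself doesn't require. pending_cells is a hypothetical helper name, not something glab provides.

```shell
# pending_cells CELLS_DIR STATE_DIR
# Prints the id of every cell that has no .done marker yet.
# Assumption: cell files are named <id>.json, so the id can be read
# from the filename without parsing the JSON.
pending_cells() {
  for cell_file in "$1"/*.json; do
    [ -e "$cell_file" ] || continue        # no cells: the glob stayed literal
    cell_id=$(basename "$cell_file" .json)
    [ -f "$2/$cell_id.done" ] || echo "$cell_id"
  done
}
```

Running pending_cells cells state | wc -l before bed tells you how many cells are still queued.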

Don't Forget Timeouts

Learned this one the hard way. Some tasks make the AI think forever, especially if you give it a huge repo with a vague question. Once I started a 10-cell batch before bed, woke up to find it stuck on cell 3, running for 6 hours straight.

Now I always add a timeout:

# Run in background + track time
glab duo cli run -C "$target" --goal "$goal" \
  --model "$model" --dangerously-skip-permissions \
  > "results/${cell_id}.md" 2>&1 &
PID=$!

ELAPSED=0
while kill -0 "$PID" 2>/dev/null; do
  sleep 10
  ELAPSED=$((ELAPSED + 10))
  if [ "$ELAPSED" -ge 900 ]; then  # 15-minute cap
    kill "$PID" 2>/dev/null || true
    echo "TIMEOUT after 15min" > "results/${cell_id}.md"
    break
  fi
done
wait "$PID" 2>/dev/null || true

touch "state/${cell_id}.done"

15 minutes is my current sweet spot. Normal review tasks finish in 2-5 minutes. Beyond 15 usually means it's stuck in some loop. Mark timeout, skip, next.
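
If GNU coreutils is installed, its timeout command can stand in for the polling loop. This is a sketch under that assumption (stock macOS doesn't ship it), and run_with_cap is a hypothetical wrapper of my own naming.

```shell
# run_with_cap SECONDS OUTFILE COMMAND [ARGS...]
# Runs COMMAND with stdout and stderr captured in OUTFILE and stops it
# after SECONDS. Relies on GNU coreutils `timeout`, which exits with
# status 124 when the time limit is hit.
run_with_cap() {
  cap=$1
  outfile=$2
  shift 2
  status=0
  timeout "$cap" "$@" > "$outfile" 2>&1 || status=$?
  if [ "$status" -eq 124 ]; then
    # Append rather than overwrite, so partial output stays readable
    echo "TIMEOUT after ${cap}s" >> "$outfile"
  fi
  return "$status"
}

# Example inside the runner loop (same flags as the runner script):
#   run_with_cap 900 "results/${cell_id}.md" \
#     glab duo cli run -C "$target" --goal "$goal" \
#     --model "$model" --dangerously-skip-permissions
```

Unlike the loop version, this one appends the timeout marker, so whatever the model produced before the cap is still in the file.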

Multi-Model Comparison: The Real Data

I mentioned in part one that I cross-compare models. Here are the actual numbers from a formal comparison: same code review task, 9 models.

Model	Output Size	My Take
Claude Opus 4.6	10.2 KB	Covers everything. Good when you're new to the codebase.
Claude Opus 4.8	7.9 KB	Practical: tells you what to fix, not just what's wrong.
Claude Opus 4.7	5.9 KB	Conclusions only. For quick confirmation.
GPT 5.5	7.4 KB	Sometimes approaches from a completely different angle. Surprises you.
Gemini 3.1 Pro	7.1 KB	Clean structure, precise scoping.
Gemini 3.5 Flash	4.4 KB	Fastest. Use as a quick filter.
GPT 5.4	0.3 KB	Refused entirely (safety guardrails too aggressive).

Practical takeaways:

Stop looking for "the best model." There isn't one. My current rotation:

Quick screening: Gemini 3.5 Flash (fastest, cheapest)
Full analysis: Claude Opus 4.6 (when you can't afford to miss anything)
Second opinion: GPT 5.5 (genuinely different reasoning paths)

Agreement between two models is worth more than anything a single model says on its own. If two models with completely different reasoning styles both flag the same issue, it's almost certainly real. Conversely, something only one model mentions deserves extra scrutiny, since it could be a false positive.

Scripting multi-model runs is straightforward:

MODELS=("claude-opus-4-6" "claude-opus-4-7" "gemini-3-5-flash")

for model in "${MODELS[@]}"; do
  glab duo cli run -C "$TARGET" \
    --goal "$GOAL" \
    --model "$model" \
    --dangerously-skip-permissions \
    > "results/${CELL_ID}-${model}.md" 2>&1
  sleep 30
done
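
Once the per-model result files exist, a rough first pass at "what did both models flag" is to intersect the file paths each one mentions. This is only a sketch: it assumes findings name files by path (something like src/auth/token.py), and both function names are hypothetical. Two models agreeing on a file doesn't mean they agree on the issue, so treat the output as a reading order, not a verdict.

```shell
# mentioned_paths RESULT_FILE
# Prints the unique path-like tokens in a result file: at least one
# slash plus a file extension, e.g. src/auth/token.py.
mentioned_paths() {
  { grep -oE '[A-Za-z0-9_.-]+(/[A-Za-z0-9_.-]+)+\.[A-Za-z0-9]+' "$1" || true; } | sort -u
}

# shared_paths RESULT_A RESULT_B
# Prints the paths that both result files mention.
shared_paths() {
  list_a=$(mktemp)
  list_b=$(mktemp)
  mentioned_paths "$1" > "$list_a"
  mentioned_paths "$2" > "$list_b"
  comm -12 "$list_a" "$list_b"   # keep only lines present in both sorted lists
  rm -f "$list_a" "$list_b"
}
```

For example: shared_paths "results/${CELL_ID}-claude-opus-4-6.md" "results/${CELL_ID}-gemini-3-5-flash.md"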

Git Log Anchoring: Giving AI a Map

This technique made the biggest difference in my output quality.

The problem: when you tell AI to "review this repo," it doesn't know where to start. With thousands of files, it has to decide what to look at and what to skip. Often it spends its time on irrelevant areas while the actual issues go unexamined.

The fix: give it a map first.

# Pull commits related to a specific topic
git log --all \
  --grep='error\|exception\|retry\|timeout\|fallback' \
  --regexp-ignore-case \
  --name-only \
  --format='%h %cs %s' \
  > error-handling-anchors.txt

This file tells the AI: "These are the recent changes related to error handling. Focus here."

glab duo cli run -C ~/project \
  --goal "Here are recent commits related to error handling:
$(head -50 error-handling-anchors.txt)

Based on these commits:
1. Is the error handling pattern consistent across modules?
2. Are there modules that should have retry logic but don't?
3. Any signs of swallowed exceptions (caught but not handled)?"

Why it works so well: git log --grep surfaces places that were actually changed. Developers spent time fixing things there, which means that's where problems existed. Asking the AI "are there similar unfixed issues nearby?" is, in my experience, far more precise than letting it wander.

Keyword sets I rotate through:

Error handling: error|exception|retry|timeout|fallback
Refactoring: refactor|extract|rename|move|split
Performance: perf|slow|cache|optimize|batch
Tech debt: TODO|FIXME|HACK|workaround|temporary
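
The keyword sets above use a plain | between terms, which git only treats as alternation in extended-regex mode. A small wrapper makes the sets pasteable as-is. This is a sketch, with anchors as a hypothetical function name. It also caps the number of commits with --max-count instead of cutting the output with head, so the list never ends halfway through a commit's files.

```shell
# anchors PATTERN [MAX_COMMITS]
# Lists commits whose message matches PATTERN (extended regex, case
# insensitive), each followed by the files it touched. Newest first,
# capped at MAX_COMMITS (default 50).
anchors() {
  git log --all \
    --extended-regexp --regexp-ignore-case \
    --grep="$1" \
    --max-count="${2:-50}" \
    --name-only \
    --format='%h %cs %s'
}

# Example:
#   anchors 'TODO|FIXME|HACK|workaround|temporary' 30 > tech-debt-anchors.txt
```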

Prompt Validation Gates: 90% Noise Reduction

After using Duo CLI for a while, my biggest pain point wasn't "AI can't find problems." It was "AI finds too many problems, most are false positives."

A code review task returns 20+ findings. You go through them one by one. Turns out 18 are "theoretically an issue but practically can't happen" or "already handled by something else." Time wasted on filtering exceeds time saved by running AI in the first place.

The turning point: adding validation criteria to the end of prompts.

glab duo cli run -C ~/project \
  --goal "Review error handling patterns in this module.

Before reporting each finding, verify:
1. Does this issue exist under default configuration? (Not only triggered by special setup)
2. Is there already an existing mechanism handling this? (Check context for catches/retries/guards)
3. Does the impact extend beyond normal usage? (Not something requiring extreme actions to trigger)
4. Is this a real functional issue, not a style preference?

Mark failures as 'Not applicable — reason: ___'.
Only report findings that pass all four checks in detail."

Measured result: one task originally returned 21 findings. With validation gates, 2 survived. The AI filtered out the other 19 (about 90%) on its own, each with a clear explanation of why it didn't qualify. I only had to read 2 reports instead of 21.

The insight: make the AI judge twice. First pass finds issues. Second pass validates whether they're real. The gap between those two passes is the manual filtering time you saved.

Validation criteria are customizable per use case. Code review criteria differ from tech debt assessment criteria, but the structure is the same: give the AI a set of standards for "this finding is worth reporting."
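
Since the gate structure stays the same across tasks, it can live in one place and get appended to whatever the task is. A minimal sketch: gated_goal and count_filtered are hypothetical helpers, the gate wording is shortened from the prompt above, and the count only works if the model echoes the 'Not applicable' marker once per rejected finding.

```shell
# gated_goal TASK
# Prints TASK followed by the shared validation gates.
gated_goal() {
  cat <<EOF
$1

Before reporting each finding, verify:
1. Does this issue exist under default configuration?
2. Is there already an existing mechanism handling this?
3. Does the impact extend beyond normal usage?
4. Is this a real functional issue, not a style preference?

Mark failures as 'Not applicable: <reason>'.
Only report findings that pass all four checks in detail.
EOF
}

# count_filtered RESULT_FILE
# Counts the lines the model marked as filtered out.
count_filtered() {
  grep -c 'Not applicable' "$1" || true   # grep -c exits 1 on zero matches
}
```

Usage would look like glab duo cli run -C ~/project --goal "$(gated_goal "Review error handling patterns in this module.")", and count_filtered on the result file gives a quick read on how much noise the gates removed.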

The Bigger Toolchain

Last piece: how this fits into a broader workflow. I don't use Duo CLI alone. It's one stage in a pipeline.

Stage	Tool	Why
Breadth scan	Duo CLI	Multi-model, batch-capable, cheapest on GitLab Credits
Deep analysis	Claude Code	Has memory, tool integrations, good for multi-turn
Mechanical verification	Codex CLI	Sandboxed execution, runs confirmation scripts

The flow:

Duo CLI batch-scans all modules (breadth), 2-3 models cross-compared
Findings marked "needs deeper look" go to Claude Code for focused analysis (depth)
Consolidate conclusions, confirm fix direction

Duo CLI's role here is "first filter." It doesn't need to be perfect. It needs to be fast and wide. What it misses, the next stage's deep analysis catches.
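
The handoff between stages can be as simple as a grep over the results directory. A sketch, assuming the phrase "needs deeper look" appears literally in the result file, whether the prompt asked the model to write it or you added it by hand during triage. needs_deeper_look is a hypothetical helper name.

```shell
# needs_deeper_look RESULTS_DIR
# Prints the result files that contain the marker phrase
# (case-insensitive, searched recursively).
needs_deeper_look() {
  grep -ril 'needs deeper look' "$1" || true   # no matches is not an error
}
```

The printed file list is what goes to the next stage for focused analysis.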

Cost-wise, Duo CLI runs on GitLab Credits. For anyone with an Enterprise seat, it's the most economical entry point. Offloading bulk initial screening to it saves budget on other tools.

Scattered Tips

Save results as .md, not .txt. AI responses are usually markdown-formatted. Saved as .md, they render cleanly in any markdown-aware viewer.
Set GITLAB_DUO_MODEL as a default. Saves typing --model every time.
Split large repos into modules. Throwing an entire monorepo at one command rarely works well. Target subdirectories or use --ai-context-items to inject only relevant files.
Be specific in your prompts. "Review this code" loses to "review this code's error handling, especially timeout edge cases" every time.
Keep a prompt template library. Good prompts deserve reuse. I have a folder of templates for different task types. Different job, different template.
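
For the template library, plain text files with a placeholder are enough. A sketch of how one could be filled in, assuming a {{MODULE}} placeholder convention and a templates/ folder, both of which are illustrative choices of mine. render_template is a hypothetical helper.

```shell
# render_template TEMPLATE_FILE MODULE
# Prints the template with every {{MODULE}} placeholder replaced.
# MODULE must not contain '|', '&' or '\', which sed treats specially here.
render_template() {
  sed "s|{{MODULE}}|$2|g" "$1"
}

# Example:
#   glab duo cli run -C "$target" \
#     --goal "$(render_template templates/error-handling.txt auth)"
```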

Wrapping Up

Advanced Duo CLI usage comes down to three things: batch it, multi-model it, de-noise it. Get those three right and you go from "asking one question at a time" to "scanning an entire codebase at once."

Nothing magical here, just the composability that CLI tools are supposed to have. GitLab put AI in the terminal, and shell scripting superpowers naturally follow.

Hope these patterns are useful. If you've found better approaches, share them in the community. I'm still adjusting my own workflow.