Repo Security Scanner

The Problem

Public repositories leak secrets constantly. API keys buried in commit history, email addresses in configs, internal hostnames in documentation. Tools like gitleaks catch pattern-based secrets well, but miss contextual leaks — things that look innocuous individually but reveal sensitive information in context.

I wanted one command that scans all my public repos, combines regex scanning with AI analysis, and gives me an actionable report.

How It Works

docker run --rm -v "$(pwd)/output:/output" -e GITHUB_USERS=myusername repo-security-scanner --all

The scanner discovers all public repos for the given accounts, clones them with full history, runs gitleaks with custom PII rules, then sends each repo’s source to Claude for contextual analysis.

Discover — GitHub/GitLab API to enumerate public repos
Clone — git clone --mirror for full history, cached between runs
gitleaks — regex scan of entire git history
AI scan — Claude analyzes HEAD + recent commits for contextual leaks
Report — JSON + Markdown to mounted volume
Email — optional SMTP delivery
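
The discovery step can be sketched roughly as follows for GitHub. This is a minimal illustration, not the scanner's actual code: it assumes the public "list repositories for a user" REST endpoint, and the names fetch_github_page and iter_clone_urls are hypothetical. The page fetcher is passed in as a parameter so the pagination loop can be exercised without network access.

```python
import json
import urllib.request
from typing import Callable, Iterator

PER_PAGE = 100  # GitHub's maximum page size for list endpoints


def fetch_github_page(user: str, page: int) -> list[dict]:
    """Fetch one page of a user's public repos from the GitHub REST API."""
    url = (
        f"https://api.github.com/users/{user}/repos"
        f"?type=owner&per_page={PER_PAGE}&page={page}"
    )
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def iter_clone_urls(
    user: str,
    fetch_page: Callable[[str, int], list[dict]] = fetch_github_page,
) -> Iterator[str]:
    """Yield clone URLs page by page until a short page signals the end."""
    page = 1
    while True:
        repos = fetch_page(user, page)
        for repo in repos:
            yield repo["clone_url"]
        if len(repos) < PER_PAGE:
            break  # fewer results than requested: this was the last page
        page += 1
```

Unauthenticated requests are rate limited, so a real run over several accounts would probably send a token in an Authorization header.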

What It Finds

gitleaks catches: API keys, tokens, passwords, private keys, emails, phone numbers, internal IPs — anything matching a pattern across the entire git history.

The AI catches what regex can’t:

Comments mentioning internal tooling or credentials
Filesystem paths revealing usernames
Dev credentials that look like placeholders but aren’t
Commit metadata linking pseudonymous accounts to real identities

In testing against 11 repos, gitleaks found 523 raw hits. After applying user exclusions, that dropped to 8. The AI found 4 additional contextual findings that gitleaks missed entirely.

Features

One command — Docker container, no installation
Dual scanning — gitleaks (regex, full history) + Claude AI (contextual, HEAD)
Flag-controlled — --gitleaks, --ai, --all
User exclusions — user-ignores.toml for personal allowlists
Cached clones — mount a volume for fast incremental rescans
Structured output — JSON + Markdown with severity classification
Exit codes — 0=clean, 1=findings, 2=error (CI/CD friendly)
Multi-account — comma-separated GitHub/GitLab usernames
Private repos — token-authenticated discovery and cloning
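
As a rough sketch of the flag handling and exit-code contract listed above: the flag names and the 0/1/2 meanings come from the list, while the function names parse_args and exit_code are made up for illustration and the real CLI may be structured differently.

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse scanner flags; --all simply switches on both scanners."""
    parser = argparse.ArgumentParser(prog="repo-security-scanner")
    parser.add_argument("--gitleaks", action="store_true", help="run the regex scan")
    parser.add_argument("--ai", action="store_true", help="run the contextual AI scan")
    parser.add_argument("--all", action="store_true", help="run both scanners")
    args = parser.parse_args(argv)
    if args.all:
        args.gitleaks = args.ai = True
    return args


def exit_code(finding_count: int, had_error: bool) -> int:
    """Map a scan outcome to 0=clean, 1=findings, 2=error."""
    if had_error:
        return 2
    return 1 if finding_count > 0 else 0
```

A CI job can then fail the build on any non-zero status, or treat 1 and 2 differently. Conveniently, argparse itself exits with status 2 on a usage error, which matches the error code.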

User Exclusions

The scanner supports a user-ignores.toml file for personal exclusions that never get committed:

[gitleaks_allowlist]
patterns = [".*@example\\.com", "contact@mysite\\.com"]

[gitleaks_file_excludes]
patterns = ["package-lock\\.json$", "go\\.sum$"]

[gitleaks_rule_excludes]
rules = ["pii-phone"]

[ai_exclusions]
notes = ["contact@mysite.com is my public email — not a leak"]

Allowlist patterns filter gitleaks findings after the scan. File excludes and rule excludes drop every finding from a matching path or rule. AI exclusions are injected into Claude’s prompt as context before analysis.

Architecture

A single Docker container running a Python pipeline:

CLI — argparse, flags control which scanners run
Discovery — GitHub + GitLab APIs with pagination
Clone Manager — mirror clone with token injection and redaction
gitleaks — custom .gitleaks.toml with PII rules, severity classification
AI Scanner — content extraction, Bedrock API, structured JSON output
Reporter — JSON + Markdown generation, optional SMTP delivery
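
The Clone Manager's three jobs (cached mirror clones, token injection, redaction) might look something like the sketch below. It is an illustration under assumptions: the token is embedded as the username part of an HTTPS URL, which GitHub accepts but other hosts may want in a different form such as oauth2:<token>, and every function name here is hypothetical.

```python
import subprocess
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def with_token(url: str, token: str) -> str:
    """Embed a token in an HTTPS clone URL (assumes the URL has no userinfo yet)."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=f"{token}@{parts.netloc}"))


def redact(text: str, token: str) -> str:
    """Mask the token wherever it appears, e.g. in git error output."""
    return text.replace(token, "***") if token else text


def clone_command(url: str, dest: Path) -> list[str]:
    """Fresh mirror clone, or a fetch when a cached mirror already exists."""
    if (dest / "HEAD").exists():  # a mirror (bare) repo keeps HEAD at its top level
        return ["git", "-C", str(dest), "remote", "update", "--prune"]
    return ["git", "clone", "--mirror", url, str(dest)]


def sync_repo(url: str, dest: Path, token: str = "") -> None:
    """Clone or update one repo, never letting the token reach an error message."""
    auth_url = with_token(url, token) if token else url
    result = subprocess.run(clone_command(auth_url, dest), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(redact(result.stderr, token))
```

One caveat with this approach: git stores the remote URL, token included, in the mirror's config file, so the cache volume should be treated as sensitive.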

The AI scanner extracts source files from HEAD plus the last 50 commit messages, capped at 80,000 characters per repo.
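
A minimal sketch of that extraction budget, assuming the simplest possible policy: concatenate everything and cut at the limit. The limits come from the text; the function name, the section headers, and the assumption that commit_messages arrives newest first are illustrative. How the real scanner prioritizes files when a repo exceeds the cap is not stated here.

```python
MAX_CHARS = 80_000   # per-repo character cap from the text
MAX_COMMITS = 50     # number of recent commit messages included


def build_ai_input(
    files: dict[str, str],
    commit_messages: list[str],
    max_chars: int = MAX_CHARS,
) -> str:
    """Concatenate commit messages and file contents, truncated to max_chars."""
    sections = ["## Recent commit messages"]
    sections += [f"- {m}" for m in commit_messages[:MAX_COMMITS]]
    for path, text in sorted(files.items()):
        sections.append(f"## File: {path}\n{text}")
    payload = "\n".join(sections)
    return payload[:max_chars]  # hard cut; later files may be dropped entirely
```

Putting commit messages first means they survive truncation even in large repos, which matters because commit metadata is one of the things the contextual scan is looking at.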

Web Frontend

There’s also a PWA for running scans from a phone: enter GitHub/GitLab usernames or paste individual repo URLs, pick a scan type, and hit scan. Results are stored in PostgreSQL, and past scans can be browsed with severity filters.

The frontend spawns the scanner container via Docker socket and parses the output JSON. Findings are paginated for large scan results.

Tech Stack

Runtime: Python 3.12, Docker
Scanning: gitleaks, custom TOML config
AI: Claude via AWS Bedrock (Haiku for cost-effective scanning)
Frontend: FastAPI, PostgreSQL, Tailwind, mkcert HTTPS
Output: JSON, Markdown, SMTP

What’s Next

Trace + Scrub: feed findings into a history tracer (git log -S to find the first commit in which a secret appeared), then generate git-filter-repo scripts to rewrite history. Finding a secret is only useful if you can also remove it.

Status

Built and tested, producing actionable results. TLS termination has moved from uvicorn to an Nginx sidecar: app containers are reachable only on the internal Docker network, and HTTPS is served on the exposed port. Built with The Forge framework conventions.

Source: Scanner | Frontend