AI Code Audit Agent v2.5.3

The instrument that doesn't average away your catastrophes.

408 quality atoms across 28 areas. 5 rubric archetypes. Evidence-tagged findings with a full reliability bundle — paradox-resistant, type-stratified, governance-gated.

408 · Quality Atoms
28 · Audit Areas
5 · Rubric Types
8 · Compliance Overlays
R* · Reliability Bundle
OBSERVED / INFERRED / ABSENT evidence · Binary gate — no averaging catastrophes · κw + AC1 + ppos/pneg reliability bundle · 408 atoms · 28 areas · 5 archetypes · WARN → ROUTE → BLOCK governance ladder · 8 compliance overlays, stackable · 4 deployment models: SaaS · BYOK · VPC · Analyst · Seven-phase ingestion engine · PPR feature attribution
Built for your world
Different orgs, different nightmares.

6 engineers. Demo on Tuesday.

"We need to know if this thing is going to embarrass us on launch day."
  • Advisory mode: scores without gates. No blocked PRs, no broken sprints.
  • CI/CD in 4 lines: drop in the GitHub Action. Trends visible in week one.
  • Binary gate: secrets, hardcoded creds, and unprotected admin routes fire even in advisory mode.
  • 5-page executive report: hand it to your investor. It looks like you have a 40-person eng team.
What your CTO tells the board: "We have automated quality governance from day one. Our technical debt is measured, not guessed."

12 clients. 12 codebases. 1 reputation.

"Half our projects are inherited code we didn't write."
  • Multi-repo batch auditing: sweep your entire portfolio.
  • Operational readiness overlay: race conditions, failover, load capacity.
  • Delta tracking: sprint-over-sprint comparison. Measurable improvement.
  • Coding standards compliance: "61% conformance, 0% automation."
What your delivery head tells the client: "Every project comes with an independently scored quality audit. Here's the evidence."

The CISO needs the report by Thursday.

"200 microservices, 4 compliance frameworks, a vendor who swears everything is fine."
  • 8 compliance overlays: HIPAA + SOC 2 + PCI + GDPR + ISO 27001 + NIST + OWASP + SOX.
  • VPC / on-prem: code never leaves your boundary. Cryptographic erasure.
  • Full reliability bundle: κw, AC1, ppos/pneg, R* gating on the 95% CI lower bound.
  • Enforcement mode: one secret = area F. No override without a logged reason.
What your CIO tells the auditor: "Every codebase has a scored, evidence-tagged, compliance-mapped audit with full provenance. The confidence tier is T4."
Deploy your way
One instrument. Four deployment modes.
☁️
Fastest to start

SaaS Platform

Upload your repo. Scored report in minutes. Zero infra, zero config. Code purged after delivery.

→ app.codesentinel.dev/audit/new → Drag & drop or connect GitHub
🔁
Best for teams

CI/CD Pipeline

Gate every merge. GitHub Actions, GitLab CI, or Jenkins. Binary gates fail the build. Scores trend over sprints.

- uses: sentinel/audit@v2
  with:
    strictness: enforcement
    overlay: soc2
▸_
Max control

CLI Tool

Terminal audits. JSON output to your dashboards. Offline with local models or BYOK cloud keys.

$ sentinel audit ./src \
    --strictness standard \
    --output technical \
    --overlay hipaa,soc2
🔒
Enterprise / Air-gapped

VPC / On-Prem

Deploy inside your network. Code never leaves your boundary. Your GPU infra. SOC 2 & HIPAA-ready.

$ helm install sentinel sentinel/agent \
    --values custom-values.yaml
How it works
From repo to governance action.
Every stage of the measurement instrument — the typing, the evidence, the gates, the reliability, the governance.
Step 1

The Big Picture

The core insight: Code audit is a measurement instrument, not an opinion. Every finding declares its type, its evidence, and its confidence.
Step 2

Classify Every Atom by Type

B
Binary
Yes/No. One fail = area collapse.
C
Coverage
What % meets the standard?
A
Architectural
Pattern judgment. 4-level rubric.
T
Tooling
Automation maturity level.
F
Config
Correct vs. misconfigured.
Why this matters: Without typing, audits average a security breach (B:0.0) with "nice architecture" (A:0.85), hiding catastrophes behind a decent number.
Step 3

Every Finding Declares Its Evidence

OBSERVED

"I saw it."

Direct evidence: exact file path, line number, verbatim code snippet.

INFERRED

"I deduced it."

Indirect evidence: logical deduction from other findings. Higher false-positive risk.

ABSENT

"It's not there."

Confirmed missing. Not "I didn't look" — "I looked and it's gone."

Why this matters: INFERRED findings have a higher false-positive rate. The framework tracks this separately so you know which findings to verify.
Step 4

The Binary Gate

Scenario A — No Binary failures

B:1 · B:1 · C:.8 · A:.7 → 0.82
Normal weighted average applies.

Scenario B — One Binary failure

B:0 · B:1 · C:.8 · A:.7 → 0.00
Binary gate triggers — entire area fails. One API key in code = security score zero.
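A minimal sketch of the gate logic behind the two scenarios, assuming uniform atom weights (the 0.82 in Scenario A implies a weighting scheme this page doesn't specify):

```python
# Hypothetical sketch of the binary gate. Uniform weights are an assumption.

def area_score(atoms):
    """atoms: list of (atom_type, score). Any Binary zero collapses the area."""
    if any(t == "B" and s == 0.0 for t, s in atoms):
        return 0.0  # gate fires: one leaked API key = security score zero
    return sum(s for _, s in atoms) / len(atoms)

scenario_a = [("B", 1.0), ("B", 1.0), ("C", 0.8), ("A", 0.7)]
scenario_b = [("B", 0.0), ("B", 1.0), ("C", 0.8), ("A", 0.7)]
print(round(area_score(scenario_a), 3))  # 0.875 under uniform weights
print(area_score(scenario_b))            # 0.0 -- gate triggered
```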
Step 4b

Strictness Levels

Advisory

Discovery Mode

Gate OFF. All findings are recommendations. Nothing blocks.

Use: First engagement, onboarding.
Standard

Dual-Lens

Gate fires as a flag but doesn't override the composite. Both views shown side by side.

Use: Sprint reviews, health checks.
Enforcement

Full Governance

Gate auto-Fs the area. R*-gated. ABSENT enforced. Can BLOCK pipelines.

Use: Release readiness, compliance.
Step 5

The Reliability Bundle

v2.5.3 moves beyond simple κ. Three independent models audit, then a full reliability bundle measures how trustworthy the measurements are — type by type.

κw — Weighted Kappa

Quadratic weighted for ordinal atoms (C, A, F). Large disagreements penalised more than small ones.

AC1 — Gwet's AC1

Paradox-resistant fallback. Standard κ collapses under high prevalence. AC1 fires when po ≥ 0.90 AND κ ≤ 0.60.

ppos / pneg

Agreement decomposition for binary atoms. "Agree on failures" ≠ "agree on passes." ppos ≥ 0.70 required for BLOCK.

R* = min(Rord, Rbin)

Composite gate. Both ordinal AND binary must be reliable. Gates on 95% CI lower bound.

The κ paradox: High agreement can produce low κ when prevalence is skewed. The bundle detects this and engages appropriate fallbacks. The audit measures how trustworthy its own measurements are.
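For the binary-atom pieces, the standard two-rater formulas can be sketched as follows. How the product folds these into R* alongside the ordinal side is described only at a high level, so the composite step is omitted; the item counts below are made-up inputs.

```python
# Standard two-rater agreement decomposition plus Gwet's AC1.

def binary_agreement(pairs):
    """pairs: (rater1, rater2) booleans for one binary atom across items."""
    a = sum(1 for x, y in pairs if x and y)          # both flag a failure
    d = sum(1 for x, y in pairs if not x and not y)  # both pass
    b_c = len(pairs) - a - d                         # disagreements
    p_pos = 2 * a / (2 * a + b_c) if (2 * a + b_c) else 1.0
    p_neg = 2 * d / (2 * d + b_c) if (2 * d + b_c) else 1.0
    p_o = (a + d) / len(pairs)                       # raw agreement
    pi = (2 * a + b_c) / (2 * len(pairs))            # mean failure prevalence
    p_e = 2 * pi * (1 - pi)                          # AC1 chance agreement
    ac1 = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return p_pos, p_neg, p_o, ac1

# 20 items: 1 joint failure, 18 joint passes, 1 disagreement. The raters
# "agree" 95% of the time, yet agree on failures only 67% of the time.
pairs = [(True, True)] + [(False, False)] * 18 + [(True, False)]
p_pos, p_neg, p_o, ac1 = binary_agreement(pairs)
print(f"ppos={p_pos:.2f} pneg={p_neg:.2f} po={p_o:.2f} AC1={ac1:.2f}")
# ppos=0.67 pneg=0.97 po=0.95 AC1=0.94
```

This is exactly the "agree on failures ≠ agree on passes" point: ppos here falls below the 0.70 threshold required for BLOCK even though raw agreement is 95%.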
Step 6

R*-Gated Governance

R* ≤ 0.60 — WARN ONLY

Informational. No action forced.

0.60–0.80 — WARN + ROUTE

Flagged to decision-makers. Ticket requiring human review.

R* > 0.80 — FULL AUTHORITY

Block deployments, auto-create remediation tickets.
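The ladder reduces to a small mapping. Whether the 0.60 and 0.80 boundaries fall into the lower or upper band is an assumption; the page doesn't say.

```python
# R*-gated governance ladder, as described above.

def governance_action(r_star: float) -> str:
    if r_star <= 0.60:
        return "WARN"            # informational only, no action forced
    if r_star <= 0.80:
        return "WARN+ROUTE"      # ticket requiring human review
    return "FULL_AUTHORITY"      # may block deployments, open tickets

for r in (0.55, 0.72, 0.83):
    print(r, governance_action(r))
```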

Step 7

Observer Declaration & Confidence

68%
Coverage — files examined
87%
Resolution — deps traced
R* 0.83
Reliability — cross-validation

Audit Confidence Score

T1 · 0.40–0.54
T2 · 0.55–0.69
T3 · 0.70–0.84
T4 · 0.85–1.00
Source code · +0.40
Multi-model (R* ≥ 0.80) · +0.08
Enriched mode · +0.05
Infrastructure artifacts · +0.10
Team context · +0.07
Coding standards · +0.05
Design documentation · +0.06
Load test results · +0.05
Monitoring data · +0.05
SBOM / dependency data · +0.04
Previous audit · +0.05
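Under the assumption that the Audit Confidence Score is simply the sum of the contributions present (capped at 1.0), the tier lookup can be sketched as:

```python
# Assumed model: ACS = capped sum of the listed contributions, then a
# lookup against the T1-T4 bands. The cap at 1.0 is an assumption.

CONTRIBUTIONS = {
    "source_code": 0.40, "multi_model": 0.08, "enriched_mode": 0.05,
    "infrastructure_artifacts": 0.10, "team_context": 0.07,
    "coding_standards": 0.05, "design_documentation": 0.06,
    "load_test_results": 0.05, "monitoring_data": 0.05,
    "sbom": 0.04, "previous_audit": 0.05,
}

def acs(inputs_present: set[str]) -> tuple[float, str]:
    score = round(min(1.0, sum(CONTRIBUTIONS[k] for k in inputs_present)), 2)
    for tier, low in (("T4", 0.85), ("T3", 0.70), ("T2", 0.55), ("T1", 0.40)):
        if score >= low:
            return score, tier
    return score, "below T1"

# Source code plus infra artifacts and a reliable multi-model run:
print(acs({"source_code", "infrastructure_artifacts", "multi_model"}))
# (0.58, 'T2')
```

Note that all eleven contributions sum to exactly 1.00, so T4 requires nearly every input class to be present.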
Compliance Overlays
Not a checkbox. A lens.
HIPAA touches 15+ of 28 areas. When active, it amplifies what matters, adds framework-specific questions, and requires deeper evidence.
🏥

HIPAA

45 CFR 160, 162, 164
13 mandatory · 43 overrides · 10 Qs
ePHI identification, column-level encryption, audit log separation.
🔒

SOC 2 Type II

Trust Services Criteria
13 mandatory · 26 overrides · 5 Qs
Change management, IaC enforcement, trust service scoping.
💳

PCI DSS 4.0

Requirements 1–12
13 mandatory · 27 overrides · 6 Qs
PAN tokenisation, CDE segmentation, 3.0× audit trail.
🇪🇺

GDPR

EU 2016/679
6 mandatory · 19 overrides · 7 Qs
Erasure cascading, portability, consent management.
🛡️

ISO 27001

Annex A (2022)
14 mandatory · 38 overrides · 8 Qs
ISMS alignment, risk traceability, access control.
🏛️

NIST 800-53

Rev 5 Moderate
16 mandatory · 52 overrides · 12 Qs
FedRAMP controls, continuous monitoring, FIPS 140.
🕸️

OWASP ASVS

Level 2
11 mandatory · 34 overrides · 9 Qs
Auth verification, session management, API security.
📈

SOX / J-SOX

§302, §404
9 mandatory · 22 overrides · 6 Qs
Financial integrity, audit trail immutability.
1

Elevate Areas

Force mandatory areas

2

Amplify Weights

1.5×–3.0× multipliers

3

Raise Evidence

Deeper proof required

4

Add Questions

Framework-specific atoms

5

Score Maturity

Ad Hoc → Optimised

Output Styles

Same data. Right format.

📋

Executive

2–5 pages

Traffic-light scorecard, narrative, top 5 findings, risk heatmap.

CTO · Board · Client Sponsor
🔬

Technical

20–40 pages

Full Q-by-Q with code evidence, NC log, remediation roadmap.

Dev Team · Tech Lead · Architect
📜

Compliance

40+ pages

Full evidence chain, risk register, instrument certificate.

Compliance Officer · Auditor
Complete Pipeline

The audit isn't linear. It's a system.

Ingestion feeds classification. Classification feeds measurement. Measurement feeds cross-validation. Cross-validation feeds governance. Governance feeds the report. The report feeds the organisational loop.

1
Ingest
Codebase + context + overlays
2
Classify
408 atoms × B/C/A/T/F
3
Measure
OBS / INF / ABS evidence
4
Gate
Binary collapse per strictness
8
Purge
AES-256 key destroyed
7
Report
EXEC · TECH · COMPLIANCE
6
Observe
Coverage + overlays + ACS
5
Validate
κw + AC1 + ppos/pneg → R*
v3
Forward Architecture
What the instrument becomes when its outputs are consumed by downstream analysis.
Everything below is designed and specified. The three-layer separation, the ingestion engine, the graph schema, the feature attribution engine, and the modules that compose on top of them.
01

Three-Layer Separation

Architecture

LLM layer (non-deterministic, runs once) → Math layer (deterministic, replayable, milliseconds) → Presentation layer (interactive). CTO adjusts thresholds; report updates instantly. No re-running the audit.

Layer 1: expensive, stochastic · Layer 2: cheap, deterministic · Layer 3: interactive, zero-cost
02

Seven-Phase Ingestion

Engine

tree-sitter AST parsing → import resolution → call graph → heritage chains → Leiden community detection → BFS process tracing. Single ingestion feeds all downstream modules.

Structure → Parse → Imports → Calls → Heritage → Communities → Processes
03

Graph Schema

Data Model

15 node types (Repo through Artifact), 12 edge types with epistemic typing. Confidence-scored relationships flow through to governance. Postgres hybrid: JSONB events + adjacency + pgvector.

15 nodes · 12 edges · CALLS · IMPLEMENTS · SATISFIES · EVIDENCES (OBS/INF/ABS)
04

Confidence Scoring

Epistemic

Every edge carries a confidence band. Static imports: 85–95%. Heuristic calls: 70–85%. Confidence propagates through evidence chains — uncertainty compounds multiplicatively.

≥80% → OBSERVED downstream · 60–79% → INFERRED · <60% → UNRESOLVED → review
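The multiplicative compounding and band mapping described above, as a sketch; the band boundaries are taken from the text, while the two-hop chain is a made-up example.

```python
# Confidence propagation along an evidence chain.

def chain_confidence(edge_confidences):
    conf = 1.0
    for c in edge_confidences:
        conf *= c  # uncertainty compounds: two 0.9 hops -> 0.81
    return conf

def epistemic_band(conf: float) -> str:
    if conf >= 0.80:
        return "OBSERVED"
    if conf >= 0.60:
        return "INFERRED"
    return "UNRESOLVED"  # routed to human review

# A static import (0.92) followed by a heuristic call edge (0.78):
c = chain_confidence([0.92, 0.78])
print(round(c, 4), epistemic_band(c))  # 0.7176 INFERRED
```

Two individually decent edges land the chain in INFERRED territory, which is exactly why compounding matters.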
05

Feature Attribution

PPR Engine

Personalised PageRank from each entrypoint with centrality-based infrastructure dampening. Two-layer scoring: feature-local findings vs Epic 0 platform health. Epistemic typing on every linkage.

weight = PPR × (1 − dampener)
dampener = min(1, BC / threshold)
threshold = 95th-percentile BC
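The dampening formula written out, with made-up PPR and betweenness-centrality inputs:

```python
# weight = PPR x (1 - dampener), dampener = min(1, BC / threshold).
# The PPR, BC, and threshold values below are illustrative only.

def feature_weight(ppr: float, bc: float, bc_threshold: float) -> float:
    dampener = min(1.0, bc / bc_threshold)  # 95th-percentile BC as threshold
    return ppr * (1.0 - dampener)

# A hot utility module: high PPR, but centrality says "infrastructure".
print(round(feature_weight(ppr=0.30, bc=0.95, bc_threshold=1.0), 3))  # 0.015
# A genuine feature file: moderate PPR, low centrality.
print(round(feature_weight(ppr=0.20, bc=0.10, bc_threshold=1.0), 3))  # 0.18
```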
06

Epic 0 Isolation

Infrastructure

Shared infrastructure (auth middleware, logging, DB pools) stripped from feature epics into auto-generated Epic 0. Features scored accurately. Systemic issues visible separately.

"Epics 1–15 avg 0.82. Epic 0 (platform): 0.41. Your features are fine. Your infrastructure isn't."
07

Algorithm Selection

Self-Tuning

The codebase tells the system which math to use. Graph topology, distribution shape, and data availability select the solver. Every decision is an immutable event. Self-calibrating across engagements.

High entrypoint → backward slice
Low entrypoint → PPR + seeds
Power-law dist → degree + BC
Flat dist → eigenvector
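The qualitative selection rules can be sketched as a dispatch table. The 0.5 entrypoint-ratio cutoff and the distribution labels are assumptions; the page states only the heuristics themselves.

```python
# Topology-driven solver selection, per the rules listed above.

def pick_solvers(entrypoint_ratio: float, dist_shape: str) -> dict:
    attribution = ("backward_slice" if entrypoint_ratio >= 0.5
                   else "ppr_with_seeds")
    centrality = ("degree_plus_betweenness" if dist_shape == "power_law"
                  else "eigenvector")
    return {"attribution": attribution, "centrality": centrality}

print(pick_solvers(0.7, "power_law"))
# {'attribution': 'backward_slice', 'centrality': 'degree_plus_betweenness'}
```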
08

Epic Mapping Reverse

Submodule

Second extraction pass over the same ingested graph. Produces a structured Epic Map: hierarchical numbering, epic grouping, role assignment, per-feature quality scores — all evidence-tagged.

OBSERVED: /api/checkout → Stripe
INFERRED: subscriptions table
ABSENT: vs forward map only
09

Delta Engine

v2 Future

Forward map (specified) vs reverse map (built). Entity resolution: embeddings → LLM scoring → ILP global optimisation (not Hungarian — handles 1:many). CONFIRMED / DISPUTED / ABSENT / ORPHAN.

Step 1: embed candidates
Step 2: LLM pairwise score
Step 3: ILP global optimise
Step 4: threshold governance
10

Graph RAG + MCP

Integration

Query the audit graph in natural language via semantic + graph traversal. Exposed as MCP server: your AI coding agent can ask "what's the quality score of this file?" in real time.

get_file_quality · get_call_chain_risk · get_feature_findings · search_findings (NL)
11

Temporal Diff

v3 Future

Same audit, different time. Detects: regression/improvement, hidden scope creep, false completion (mock APIs, zero-assertion tests), remediation verification, velocity validation.

Event store makes diff trivial: query by audit_run_id, compare. Hidden scope creep = files changed with no sprint item.
12

Full-Stack Cross-Ref

v3 Future

Mobile + web + APIs + microservices in one graph. Detects: dead APIs, broken integrations, auth inconsistency, feature parity gaps, data model divergence. INFERRED → OBSERVED end-to-end.

Single-repo: INFERRED
Multi-repo: OBSERVED e2e
Observation cone: repo → system
13

Interactive Recomputation

Layer 2

Every parameter is a slider. Infrastructure threshold, alignment confidence, κ vs AC1 toggle, governance strictness. All Layer 2 — millisecond recomputation. Implicit sensitivity analysis.

PPR 10K nodes: ms
Centrality: ms
ILP 200 stories: seconds
Fragile scores auto-flagged
Architecture

Three layers. One invariant.

The LLM layer runs once — expensive, stochastic, unrepeatable. Its outputs freeze as immutable events. The Math layer takes those frozen outputs and computes everything else: scores, attribution, feature boundaries, reliability, alignment. All deterministic. All replayable. All parameterisable.

The Presentation layer reads the math and renders. When a parameter changes, it asks the Math layer to recompute. Milliseconds. No LLM calls. No waiting. No cost.

The LLM outputs are frozen. The math is live.

Layer Stack
Layer 1 · LLM: Non-deterministic · Runs once · Expensive
Layer 2 · Math: Deterministic · Replayable · Milliseconds
Layer 3 · Presentation: Interactive · Zero-cost · Parameterisable
Seven-Phase Pipeline
1 · Structure: dir tree + file classification
2 · Parse: tree-sitter ASTs (multi-lang)
3 · Import Resolution: TS aliases, Rust, Java, Go
4 · Call Graph: 3-tier symbol resolution
5 · Heritage: inheritance + interfaces + mixins
6 · Communities: Leiden clusters → feature candidates
7 · Process Tracing: BFS from entrypoints
Ingestion Engine

Single ingestion. Multiple consumers.

The audit engine, the reverse Epic Mapper, and the documentation generator all consume the same graph. Two graphs means two realities means untrustworthy traceability.

Context enrichment injects dependency signatures into each chunk — method names, type hints, public interfaces. Without it, architectural assessment degenerates into file-level pattern matching.

Precompute at index time, serve complete context in one query. The graph answers "what depends on UserService?" in a single structured response.
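A minimal adjacency-table sketch of that one-query dependency lookup, using in-memory SQLite in place of the product's Postgres setup; the table and column names are illustrative assumptions.

```python
import sqlite3

# Illustrative adjacency table; the product stores adjacency in Postgres
# alongside JSONB events and pgvector embeddings.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT, kind TEXT, confidence REAL);
    INSERT INTO edges VALUES
        ('OrderController', 'UserService',  'CALLS',   0.92),
        ('AuthMiddleware',  'UserService',  'IMPORTS', 0.95),
        ('ReportJob',       'EmailService', 'CALLS',   0.81);
""")

# "What depends on UserService?" -- answered in one structured query.
dependents = db.execute(
    "SELECT src, kind, confidence FROM edges WHERE dst = ? ORDER BY src",
    ("UserService",),
).fetchall()
for src, kind, conf in dependents:
    print(f"{src} --{kind}--> UserService ({conf:.2f})")
```

Because the edges are precomputed at index time, the lookup is a single indexed scan rather than a re-parse of the codebase.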

Data Model

15 node types. 12 edge types.

The graph represents the full relational structure of audited codebases and the measurement outputs attached to them. Stored as adjacency tables in Postgres, projectable into a graph database when traversal becomes the bottleneck.

Edges carry epistemic metadata: IMPLEMENTS has weight and epistemic type. SATISFIES has alignment score. EVIDENCES carries OBSERVED/INFERRED/ABSENT. This isn't a code graph — it's a measurement graph.

Graph Schema
Repo · Commit · AuditRun · File · Symbol · Route · Feature · Story · Atom · Area · Finding · Requirement · Claim · Override · Tool · Report · Artifact
CALLS · DEFINES · DECLARES_ROUTE · IMPLEMENTS · SATISFIES · EVIDENCES · MEASURES · CHANGED_IN · AFFECTS · READS · WRITES · IMPORTS
Area 28

Business Logic Hygiene

Business logic correctness requires domain knowledge. Business logic hygiene — whether code is structured so a domain expert could verify correctness — is measurable from static analysis. 7 new atoms.

28.1 · Domain Layer Separation A
28.2 · Business Rule Centralisation C
28.3 · Named Constants + Rationale C
28.4 · Explicit State Machines A
28.5 · Business Expectation Tests C
28.6 · Calculation Isolation A
28.7 · Edge Case Explicitness C
Trust & Security
How your code is handled.
Where does the code go? Who sees it? How is it purged?
Model A

Cloud SaaS

Upload → ZDR API transit → findings in client tenant → cryptographic erasure after configurable window.

AES-256-GCM · TLS 1.3 · ZDR API
Model B

BYOK

Your API key. Platform becomes stateless orchestrator. No code stored beyond active session.

Client key · Stateless · No storage
Model C

VPC / On-Prem

Zero egress. Container in your infra. Bedrock, Vertex, or self-hosted LLM. Code never leaves your boundary.

Air-gap · FedRAMP · ITAR
Model D

Analyst-Assisted

Human analysts on your infrastructure. NDA'd, background-checked. No external software deployed.

Human-mediated · NDA'd

Data Lifecycle — Cryptographic Erasure

INGEST → CHUNK → PROCESS → ASSEMBLE → DELIVER → PURGE 🔥
Why this matters: Model C satisfies Fortune 50 requirements. Model B leverages your existing LLM relationship. The trust architecture is the sales enabler — methodology only matters after security says yes.
The bigger picture
Code Audit as Organizational Sensing
In the Fractal Org framework, code audit isn't a standalone tool. It's a sensing function of a self-similar control loop that runs at every level of the organization.
👁 Sensing
🧠 Decisioning
⚡ Execution
✓ Verification
📚 Learning

Code Audit = The Sensing Function

It produces the telemetry that everything else acts on. If the sensing function produces unreliable data — no type distinction, no evidence tags, no reliability measurement — every downstream decision is corrupted.

What the instrument sees

Your codebase as a knowledge graph.

Every dependency, call chain, cluster, and execution flow — mapped, measured, and queryable. Hover nodes to inspect. Watch data flow through the system.

47 nodes · 68 edges · 6 clusters
6 findings · R* 0.83 · coverage 68%
Route / Entrypoint
Service / Controller
Model / Repository
Shared Infra (Epic 0)
External Integration
Finding (NC)
CALLS
IMPORTS
READS/WRITES
EVIDENCES