Living document · Architecture Decision Register

Every Decision, Documented

Every architectural decision made in the youPersonic build — with the context that made it a real choice, the alternatives that were seriously considered, and the exact condition under which each decision will be revisited. Decisions are never deleted. When they change, they are superseded. The history is the point.

ADR-001

Open-source foundations over managed SaaS

✓ Accepted · All phases · Cross-cutting

Decision

All platform components must be self-hostable and open-source licensed. No managed SaaS at the data layer.

Context

In a regulated environment, three properties are non-negotiable: you must be able to explain what the component does, control where data flows, and not depend on a vendor for compliance-critical functionality. A proprietary component in the audit trail or vector store creates dependency precisely where dependency is most dangerous.

Alternatives considered

Pinecone (managed vector): Rejected. Data leaves the infrastructure boundary; sovereignty cannot be guaranteed.
Auth0 / Okta (managed identity): Rejected. Identity tokens and RBAC policies must be sovereign.
LangSmith (managed observability): Rejected. Agent reasoning traces contain sensitive context.

Consequences

Positive

Zero vendor lock-in
Predictable costs at scale
Total data residency control
Components auditable by regulators

Trade-offs

Higher operational burden
Slower initial deployment vs managed
Team must manage Kubernetes operators

Review trigger

Revisit when: A managed service offers true on-premises or private-cloud deployment with full data sovereignty guarantees AND the operational burden of self-hosting exceeds the regulatory risk of the managed alternative.

ADR-002

Two-VPC topology — Data Gravity and Compute Gravity

✓ Accepted · All phases · Cross-cutting

Decision

Separate the platform into two AWS VPCs: Data Gravity region (us-east-1) for everything that persists — sovereign, auditable. Compute Gravity region (us-west-1) for everything that executes transiently — stateless, replaceable.

Context

Data accumulates. Compute executes. These are fundamentally different forces. Data in a regulated context cannot move freely — it has sovereignty requirements and pulls everything dependent on it toward where it lives. Compute is stateless by design — an inference pod rebuilt tonight loses nothing of lasting consequence. Putting both in the same VPC either violates data sovereignty or prevents GPU access.

Alternatives considered

Single-region us-east-1: Rejected. No GPU availability, no compute gravity separation.
Single-region us-west-1: Rejected. Data sovereignty violation; PII would sit in a compute-optimised region.
Multi-account separation: Deferred. Adds isolation but increases operational complexity; revisit at Phase 2.

Consequences

Positive

Data sovereignty by topology not policy
GPU costs optimised independently
Compliance boundary architecturally explicit

Trade-offs

~60ms cross-region latency on data retrieval
~$15/month Transit Gateway
More complex networking

Cost impact

~$15/month Transit Gateway

Review trigger

Revisit when: A single AWS region offers both sovereign data isolation guarantees AND GPU compute at parity with us-west-1 pricing.

ADR-004

RDS PostgreSQL as primary data store

✓ Accepted · POC · L3 Knowledge · L7 Traceability

Decision

AWS RDS PostgreSQL with pgvector extension for the POC phase. Single-region, right-sized for POC load (~db.t3.medium).

Context

The platform needs a primary relational store for agent decision records, RAG metadata, policy version archives, and the L7 audit trail. PostgreSQL is the right long-term choice — ACID compliance for audit records, pgvector for embeddings. The POC does not require distributed SQL. That complexity is a Phase 2 concern.

Alternatives considered

YugabyteDB: Phase 2. Correct when multi-region active-active is required; PostgreSQL-compatible, so migration is schema-only.
Aurora PostgreSQL: Rejected. Higher cost; Aurora-specific APIs create mild lock-in.
DynamoDB: Rejected. No SQL transactions or joins, no pgvector, wrong model for audit records.

Consequences

Positive

ACID guarantees for L7 audit
pgvector co-located with metadata
SQLAlchemy ORM compatibility

Trade-offs

Single-AZ recovery time risk at POC
Migration to YugabyteDB needed at scale

Cost impact

~$35/month (db.t3.medium, single-AZ)

Review trigger

Revisit when: Vector count exceeds 5M rows (pgvector degrades) OR multi-region active-active becomes a hard requirement. Either triggers YugabyteDB migration.

ADR-005

pgvector for vector embeddings (POC)

✓ Accepted · POC · L3 Knowledge

Decision

pgvector as the vector store for POC, running as an extension on the existing RDS PostgreSQL instance. Zero additional infrastructure.

Context

L3 requires vector storage for RAG retrieval. The POC benefit of pgvector is operational simplicity — one database, one backup strategy, one query interface. Metadata and vectors co-located enable JOIN queries without cross-service calls. Milvus is the Phase 2 choice when vector count and ANN performance justify a separate cluster.

Cost impact

Zero additional (PostgreSQL extension)

Review trigger

Revisit when: Average vector query latency exceeds 100ms OR vector count exceeds 5 million rows. Either triggers ADR-025 (Milvus).

ADR-006

AWS Bedrock for LLM inference (POC)

✓ Accepted · POC · L3 · L4 · L6

Decision

AWS Bedrock for all LLM inference at POC. All inference calls go through an OpenAI-compatible abstraction layer — the agent code is model-agnostic. Validate on Bedrock. Earn the GPU when the architecture is proven.
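One way to picture the model-agnostic abstraction layer is sketched below. It is a sketch under assumptions, not the platform's actual interface: `ChatBackend`, `ChatMessage`, `EchoBackend`, `run_agent_step`, and the model name `poc-default` are hypothetical names introduced for illustration. The point is only that agent code depends on one small interface, so a Bedrock adapter could later be swapped for a NIM adapter without touching it.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ChatMessage:
    role: str      # "system", "user", or "assistant", as in OpenAI-style chat APIs
    content: str


class ChatBackend(Protocol):
    """Hypothetical interface every inference backend would implement."""

    def complete(self, model: str, messages: list[ChatMessage]) -> str: ...


class EchoBackend:
    """Stand-in backend for local tests; a Bedrock or NIM adapter would replace it."""

    def complete(self, model: str, messages: list[ChatMessage]) -> str:
        # Echo the most recent user message, tagged with the requested model.
        last_user = next(m for m in reversed(messages) if m.role == "user")
        return f"[{model}] {last_user.content}"


def run_agent_step(backend: ChatBackend, question: str) -> str:
    """Agent code depends only on ChatBackend, never on a vendor SDK."""
    messages = [
        ChatMessage("system", "You are a compliance analyst."),
        ChatMessage("user", question),
    ]
    return backend.complete(model="poc-default", messages=messages)
```

Calling `run_agent_step(EchoBackend(), "hello")` exercises the agent path without any network call, which is also how such a layer could be unit-tested.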

Alternatives considered

OpenAI API: Rejected. Data leaves the AWS boundary; sovereignty violation.
NVIDIA NIM: Phase 2. Requires a GPU cluster; not cost-justified until the architecture is validated.
Ollama (self-hosted): Local dev only. Correct for development; not production-ready without GPU.

Cost impact

Variable — ~$0.003–$0.015 per 1K tokens

Review trigger

Revisit when: Monthly Bedrock cost exceeds $200 OR inference latency fails to meet the 200ms requirement. Either triggers ADR-024 (NVIDIA NIM).

ADR-007

Redis for agent state and pub-sub

✓ Accepted · POC · L4 Planning · L5 Execution

Decision

Redis for ephemeral agent working memory, pub-sub messaging for agent handoffs, and hot-vector caching. Redis state is ephemeral by design — durable state lives in Temporal and PostgreSQL.

Cost impact

~$15/month (ElastiCache cache.t3.micro)

Review trigger

Revisit when: Pub-sub volume exceeds 5,000 messages/second OR audit event stream requires replay. Either triggers Kafka migration (ADR-016).

ADR-009

Temporal for durable workflow execution

✓ Accepted · POC · L4 Planning · L7 Traceability

Decision

Temporal.io as the durable workflow engine for all L4 multi-step agent workflows. Workflows survive pod crashes, network failures, and Bedrock timeouts — resuming from the exact last completed step.

Context

Temporal provides three properties essential for regulated agentic workflows: fault tolerance (resumes from last completed step), state persistence (every intermediate step persisted and replayable), and built-in audit history (immutable record of every workflow action). This is not a convenience tool. It is how L4's durable state checkpointing capability is implemented.
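The "resume from the last completed step" property can be illustrated with a toy runner. This is plain Python, not the Temporal SDK, and it omits everything that makes Temporal production-grade (deterministic replay, timers, retries, durable storage). The `checkpoint` dict is an assumed stand-in for persisted workflow history, and `run_workflow` is a hypothetical name.

```python
from typing import Callable

# A step is a name plus a function that takes the state and returns updates to it.
Step = tuple[str, Callable[[dict], dict]]


def run_workflow(steps: list[Step], checkpoint: dict) -> dict:
    """Run steps in order, skipping any already recorded in the checkpoint.

    `checkpoint` stands in for durable storage: {"done": [...], "state": {...}}.
    """
    done = checkpoint.setdefault("done", [])
    state = checkpoint.setdefault("state", {})
    for name, fn in steps:
        if name in done:
            continue  # completed before the failure; do not re-execute
        state.update(fn(state))
        done.append(name)  # recorded only after the step succeeds
    return state
```

If a step raises (for example on a simulated inference timeout), calling `run_workflow` again with the same `checkpoint` skips the steps already listed under `done` and continues from the one that failed.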

Alternatives considered

Celery + Redis: Rejected. A task queue, not a durable workflow engine; tasks cannot resume mid-execution after failure.
AWS Step Functions: Rejected. Managed, but AWS lock-in; event history less queryable than Temporal.
Custom checkpoint implementation: Rejected. Temporal solves this correctly; reinventing it is engineering waste.

Cost impact

~$20/month (EKS pod, shared PostgreSQL backend)

Review trigger

Revisit when: Temporal's workflow history storage costs exceed the value provided OR a purpose-built agentic workflow engine with a better compliance story emerges.

ADR-013

Event-driven microservices as system topology

✓ Accepted · All phases · Architecture Layer 2A

Decision

Event-driven microservices. Each capability layer boundary maps to independently deployable services. Services communicate via events (Redis Pub-Sub for POC, Kafka at Phase 2) rather than synchronous REST where possible.

Context

Three properties of agentic regulated workloads make event-driven microservices correct: different capability layers have different scaling requirements; event-driven communication allows L7 to record every inter-layer event without blocking the producing layer; each service can be deployed and updated independently, reducing blast radius.

Review trigger

Revisit when: Operational overhead of managing multiple services exceeds scaling benefits for the current team size. At that point, merge L4+L5 into one service.

ADR-014

CAP positioning per data store, not per system

✓ Accepted · All phases · Architecture Layer 2A

Decision

Each data store has its own CAP position:

PostgreSQL / L7 audit: CP (never lose a record)
Redis / L4 agent state: AP (availability over consistency; state is ephemeral)
Temporal / L4 workflow: CP (consistent or fail)
pgvector / L3: AP (stale context acceptable; freshness validation in L3 catches it)

Review trigger

Revisit when: An L7 audit record is lost during a partition event. That outcome is architecturally inadmissible and requires immediate redesign.

ADR-015

Append-only PostgreSQL as L7 audit trail (CQRS write model)

✓ Accepted · POC · L7 Traceability

Decision

Append-only PostgreSQL tables as the L7 write model. No UPDATE or DELETE on audit records. SHA-256 hash chaining within PostgreSQL for tamper detection. Separate read model (materialised views) for regulator queries. CQRS without full event sourcing complexity at POC.
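The hash-chaining idea can be sketched in a few lines. The example below works on an in-memory list rather than a PostgreSQL table, and the row layout (`payload`, `prev_hash`, `hash`), the all-zero genesis value, and the function names are assumptions made for illustration, not the platform's schema.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed sentinel used as prev_hash for the first record


def record_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_record(chain: list[dict], payload: dict) -> dict:
    """Append-only write: each row stores the hash of the row before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    row = {
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": record_hash(prev_hash, payload),
    }
    chain.append(row)
    return row


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; an edited row or one removed mid-chain fails the check."""
    prev_hash = GENESIS_HASH
    for row in chain:
        if row["prev_hash"] != prev_hash:
            return False
        if row["hash"] != record_hash(prev_hash, row["payload"]):
            return False
        prev_hash = row["hash"]
    return True
```

Note that a chain like this detects modification and mid-chain deletion, but not truncation of the most recent rows; catching that would need the latest hash to be anchored somewhere outside the table.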

Context

Full event sourcing is architecturally correct long-term. But at POC scale, its complexity exceeds its benefit. Append-only tables provide immutability with minimal overhead. The CQRS discipline established now makes the Phase 2 upgrade to full event sourcing an evolution, not a redesign.

Review trigger

Revisit when: Audit record volume exceeds 10 million rows OR a cryptographically verifiable audit trail (blockchain-grade) is required by regulation.

ADR-016

Redis Pub-Sub (POC) → Kafka (Phase 2) migration path

✓ Accepted · POC · L4 · L5

Decision

Redis Pub-Sub for all async messaging at POC. Kafka (AWS MSK) at Phase 2. The migration trigger is message volume and replay requirements, not a calendar date.
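A migration driven by a trigger rather than a date is easier if services publish through a transport-neutral interface. The sketch below assumes a hypothetical `EventBus` protocol and an in-memory test double; the topic name `l4.handoff` and the event fields are invented for illustration. Redis Pub-Sub and Kafka adapters would expose the same two methods.

```python
from collections import defaultdict
from typing import Callable, Protocol

Handler = Callable[[dict], None]


class EventBus(Protocol):
    """Hypothetical transport-neutral interface for inter-service events."""

    def publish(self, topic: str, event: dict) -> None: ...
    def subscribe(self, topic: str, handler: Handler) -> None: ...


class InMemoryBus:
    """Test double; real adapters would wrap Redis Pub-Sub or Kafka."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver synchronously to every handler registered for the topic.
        for handler in self._handlers[topic]:
            handler(event)


def hand_off(bus: EventBus, case_id: str) -> None:
    """L4 announces a handoff without knowing which transport carries it."""
    bus.publish("l4.handoff", {"case_id": case_id, "next": "l5-execution"})
```

An L7 recorder can subscribe to the same topic as the consuming service, which is how the audit layer could observe inter-layer events without the producer calling it directly.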

Review trigger

Revisit when: Audit event stream requires message replay OR pub-sub throughput exceeds 5,000 messages/second.

ADR-017

Server-Sent Events for real-time compliance officer alerts

✓ Accepted · POC · L5 Execution

Decision

Server-Sent Events (SSE) via FastAPI for the mid-analysis compliance officer alert channel. Communication is unidirectional — server pushes events to the dashboard. WebSocket complexity is not justified for one-way push.
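For reference, the SSE wire format itself is simple enough to sketch with the standard library. The example below only builds `text/event-stream` frames; the event name `compliance_alert`, the alert fields, and the function names are assumptions for illustration rather than the dashboard's actual contract.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(event: str, data: dict, event_id: Optional[str] = None) -> str:
    """Encode one message in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data, sort_keys=True)}")
    return "\n".join(lines) + "\n\n"  # a blank line terminates the message


def alert_stream(alerts: Iterable[dict]) -> Iterator[str]:
    """Yield one SSE frame per alert, numbering them so clients can track position."""
    for n, alert in enumerate(alerts, start=1):
        yield format_sse("compliance_alert", alert, event_id=str(n))
```

In a FastAPI service, a generator of this shape would typically be returned through `StreamingResponse` with `media_type="text/event-stream"`; the browser side needs only the built-in `EventSource` API, which is part of why one-way push is cheaper than a WebSocket here.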

Review trigger

Revisit when: CCO dashboard requires bidirectional interaction through the alert channel. That triggers WebSocket migration.

ADR-018

LangGraph for L4 agent orchestration state machine

✓ Accepted · POC · L4 Planning

Decision

LangGraph as the state machine framework for all L4 agent orchestration. Nodes = reasoning steps. Edges = conditional transitions. State = accumulated context. External checkpointing to PostgreSQL via the LangGraph checkpointer interface.

Context

LangGraph's node-and-edge model maps directly to the L4 requirement: the graph is explicit and auditable, supports external state persistence, and enables SOP constraint enforcement at the edge level — certain transitions only allowed if policy conditions are met.
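Edge-level constraint enforcement can be illustrated without any framework. The sketch below is plain Python and deliberately does not use the LangGraph API; the node names, the `risk_tier` and `approved` state keys, and the `transition` helper are hypothetical. It shows only the principle: a transition happens if, and only if, an explicit edge exists and its policy guard passes.

```python
from typing import Callable

State = dict
Guard = Callable[[State], bool]

# Edges of a hypothetical L4 graph: (from_node, to_node) -> policy guard.
EDGES: dict[tuple[str, str], Guard] = {
    ("assess", "execute"): lambda s: s.get("risk_tier") == "low",
    ("assess", "human_review"): lambda s: s.get("risk_tier") != "low",
    ("human_review", "execute"): lambda s: s.get("approved") is True,
}


def transition(current: str, target: str, state: State) -> str:
    """Move to `target` only if the edge exists and its guard passes."""
    guard = EDGES.get((current, target))
    if guard is None:
        raise ValueError(f"no edge {current} -> {target}")
    if not guard(state):
        raise PermissionError(f"policy blocks {current} -> {target}")
    return target
```

Because the allowed transitions live in one table, an auditor can read the full set of possible paths directly, which is the auditability argument made above against a free-running ReAct loop.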

Alternatives considered

LangChain ReAct Agents: Rejected. The ReAct loop is less auditable; the agent decides the next step each iteration without explicit graph constraints.
AutoGen: Rejected. Conversation-based multi-agent; less suited to structured SOP-constrained workflows.
Custom state machine: Rejected. Reinvents what LangGraph already provides.

Review trigger

Revisit when: LangGraph's checkpointing model is incompatible with Temporal at Phase 2 scale.

ADR-019

CrewAI for multi-agent specialist coordination

◎ Proposed · Phase 1 · L4 · L5

Decision (proposed)

Defer CrewAI to Phase 1. POC uses LangGraph's supervisor pattern for multi-agent coordination. CrewAI evaluated when the Fraud Detection use case (Ch13) makes multi-agent complexity concrete.

Open question

Does CrewAI's crew abstraction add measurable value over LangGraph's supervisor pattern for seven-specialist-agent coordination? This question cannot be answered correctly without a real use case.

Review trigger

Accelerate when: Fraud Detection use case reveals LangGraph supervisor pattern is insufficient for seven-specialist-agent coordination.

ADR-020

Keycloak for L1/L2 identity and RBAC

◎ Proposed · POC (minimal) · L1 Identity · L2 Policy

Decision (proposed)

Keycloak deployed at POC in minimal configuration (single realm, basic RBAC). Full configuration — fine-grained permissions, maker-checker roles, per-agent identity scopes — at Phase 1. Authentication established from day one to avoid retrofitting.

Alternatives considered

AWS Cognito: Rejected. Does not support fine-grained ABAC for L2; AWS lock-in for identity.
Simple JWT (custom): Rejected. No RBAC framework; complete replacement required at Phase 1.
Auth0: Rejected. Data sovereignty concern; tokens processed outside sovereign infrastructure.

Review trigger

Revisit when: Keycloak operational overhead exceeds one engineer-hour/month at POC scale.

ADR-022

Casbin for SOP policy constraint enforcement

! Open · Phase 1 · L2 Policy

Open questions (must answer before accepting)

1. Does Casbin support policy versioning? Can two policy versions coexist and be evaluated independently?

2. Does Casbin's ABAC model support context-dependent permissions — same agent, different permissions based on risk tier?

3. What is the performance overhead of a Casbin evaluation at the L4→L5 boundary for every tool call?

Proposed direction

Casbin with PostgreSQL adapter for policy storage. Policy versions stored as separate Casbin policy sets, pinned at request time via L2.

ADR-023

Per-agent identity management — GAP-001

! Open · Phase 1 · L1 Identity

Context (GAP-001)

The current platform has no per-agent identity system. When Agent A delegates to Agent B, the audit trail cannot distinguish their individual actions. This directly violates IMDA MGF §2.1.2 and is documented in the Regulatory Traceability Matrix as REG-004.

Proposed direction

OAuth 2.0 Token Exchange (RFC 8693) via Keycloak. Human token → Orchestrator agent token → Specialist agent tokens, each with narrower scope. Delegation chain recorded in L7 audit trail.
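The scope-narrowing rule in this proposal can be modelled independently of Keycloak. The sketch below does not perform a real RFC 8693 exchange; `Token`, `exchange`, `delegation_chain`, and the scope strings are hypothetical. It only shows the invariant the audit trail would rely on: each delegated token carries a subset of its parent's scopes and remembers where it came from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    subject: str                      # who holds this token
    scopes: frozenset                 # permissions granted
    parent: Optional["Token"] = None  # token this one was exchanged from


def exchange(parent: Token, subject: str, scopes: set) -> Token:
    """Issue a child token; it may only carry a subset of the parent's scopes."""
    requested = frozenset(scopes)
    if not requested <= parent.scopes:
        raise PermissionError(f"{subject} requested scopes beyond {parent.subject}")
    return Token(subject, requested, parent)


def delegation_chain(token: Token) -> list:
    """Subjects from the original human down to this token, for the audit record."""
    chain = []
    node: Optional[Token] = token
    while node is not None:
        chain.append(node.subject)
        node = node.parent
    return list(reversed(chain))
```

Recording the output of `delegation_chain` alongside each L7 entry is one way the trail could distinguish Agent A's actions from Agent B's, which is the gap described under GAP-001.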

Open questions

1. Does RFC 8693 Token Exchange in Keycloak support the delegation depth required (human → L4 orchestrator → L5 specialist → external tool)?

2. What is the token exchange latency overhead per delegation level?

3. How are agent identities registered — pre-defined at deploy time or dynamic at workflow instantiation?

ADR-024

NVIDIA NIM for Phase 2 LLM inference

◎ Proposed · Phase 2 · L3 · L4

Decision (proposed)

NVIDIA NIM replaces AWS Bedrock when monthly inference cost exceeds $200 OR inference latency fails to meet the 200ms requirement. The OpenAI-compatible API is maintained, so agent code requires no changes. Earn the GPU when the architecture is proven.

Cost impact

~$500/month (GPU instance + NIM at Phase 2 scale)

ADR-025

Milvus for Phase 2 vector store

◎ Proposed · Phase 2 · L3 Knowledge

Decision (proposed)

Milvus replaces pgvector when vector count exceeds 5M OR average query latency exceeds 100ms. Same abstraction layer — L3 retrieval code swaps implementations without agent code changes.

Cost impact

~$40/month (Milvus cluster, 3 nodes on EKS)

ADR-026

Agenta for prompt version registry — GAP-003

! Open (deferred) · Phase 2 · L7 Traceability

Why deferred

Prompt CI/CD requires real production prompts and real regression cases to design correctly. Writing this ADR before the first use case is built means designing against hypothetical prompts. This ADR is opened when Ch23 (Prompts Are Code) is written.

GAP-003 — L7 hook required now

Even though the Agenta pipeline is deferred, the L7 integration contract for prompt version recording must be designed now. Every inference call must record which prompt template version was used. Retrofitting this breaks the audit chain.
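A minimal shape for that L7 contract might look like the sketch below. Every field and function name here is an assumption made for illustration; the real contract is still to be designed. The content hash is included as one possible safeguard against a template being edited without its version being bumped.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InferenceAuditRecord:
    """Hypothetical L7 row written for every inference call."""
    workflow_id: str
    model_id: str
    prompt_template_id: str
    prompt_template_version: str
    prompt_template_sha256: str  # detects a template edited without a version bump


def build_record(workflow_id: str, model_id: str, template_id: str,
                 template_version: str, template_text: str) -> dict:
    """Assemble the audit row; the caller would append it to the L7 write model."""
    digest = hashlib.sha256(template_text.encode("utf-8")).hexdigest()
    record = InferenceAuditRecord(workflow_id, model_id, template_id,
                                  template_version, digest)
    return asdict(record)
```

Because the record needs only a template identifier, a version string, and the template text, it can be written from day one regardless of whether Agenta or another registry eventually manages the versions.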

Version History

Apr 2026

ADR-013 through ADR-026 authored. Architecture layer (2A–2D) decisions formalised. Component layer ADRs for LangGraph, CrewAI, Keycloak, NIM, Milvus. GAP-001 (ADR-023) and GAP-003 (ADR-026) opened. Casbin research pending (ADR-022).

Mar 2026

ADR-001 through ADR-012 established during infrastructure build. Open-source foundations, Two-VPC topology, Terraform, PostgreSQL, pgvector, Bedrock, Redis, MinIO, Temporal, aws-vault, GitLab, CloudFront.