Living document
·
Architecture Decision Register
Every Decision, Documented
Every architectural decision made in the youPersonic build — with the context that made it a real choice, the alternatives that were seriously considered, and the exact condition under which each decision will be revisited. Decisions are never deleted. When they change, they are superseded. The history is the point.
ADR-001
Decision
All platform components must be self-hostable and open-source licensed. No managed SaaS at the data layer.
Context
In a regulated environment, three properties are non-negotiable: you must be able to explain what the component does, control where data flows, and not depend on a vendor for compliance-critical functionality. A proprietary component in the audit trail or vector store creates dependency precisely where dependency is most dangerous.
Alternatives considered
- Pinecone (managed vector): Rejected. Data leaves the infrastructure boundary; sovereignty cannot be guaranteed.
- Auth0 / Okta (managed identity): Rejected. Identity tokens and RBAC policies must be sovereign.
- LangSmith (managed observability): Rejected. Agent reasoning traces contain sensitive context.
Consequences
Positive
- Zero vendor lock-in
- Predictable costs at scale
- Total data residency control
- Components auditable by regulators
Trade-offs
- Higher operational burden
- Slower initial deployment vs managed
- Team must manage Kubernetes operators
Review trigger
Revisit when: A managed service offers true on-premises or private-cloud deployment with full data sovereignty guarantees AND the operational burden of self-hosting exceeds the regulatory risk of the managed alternative.
ADR-002
Decision
Separate the platform into two AWS VPCs: Data Gravity region (us-east-1) for everything that persists — sovereign, auditable. Compute Gravity region (us-west-1) for everything that executes transiently — stateless, replaceable.
Context
Data accumulates. Compute executes. These are fundamentally different forces. Data in a regulated context cannot move freely — it has sovereignty requirements and pulls everything dependent on it toward where it lives. Compute is stateless by design — an inference pod rebuilt tonight loses nothing of lasting consequence. Putting both in the same VPC either violates data sovereignty or prevents GPU access.
Alternatives considered
- Single-region us-east-1: Rejected. No GPU availability, no compute gravity separation.
- Single-region us-west-1: Rejected. Data sovereignty violation; PII in a compute-optimised region.
- Multi-account separation: Deferred. Adds isolation but increases operational complexity; revisit at Phase 2.
Consequences
Positive
- Data sovereignty by topology not policy
- GPU costs optimised independently
- Compliance boundary architecturally explicit
Trade-offs
- ~60ms cross-region latency on data retrieval
- ~$15/month Transit Gateway
- More complex networking
Cost impact
~$15/month Transit Gateway
Review trigger
Revisit when: A single AWS region offers both sovereign data isolation guarantees AND GPU compute at parity with us-west-1 pricing.
ADR-004
Decision
AWS RDS PostgreSQL with the pgvector extension for the POC phase. Single-region, right-sized for POC load (approximately db.t3.medium).
Context
The platform needs a primary relational store for agent decision records, RAG metadata, policy version archives, and the L7 audit trail. PostgreSQL is the right long-term choice — ACID compliance for audit records, pgvector for embeddings. The POC does not require distributed SQL. That complexity is a Phase 2 concern.
Alternatives considered
- YugabyteDB: Phase 2. Correct when multi-region active-active is required; PostgreSQL-compatible, so migration is schema-only.
- Aurora PostgreSQL: Rejected. Higher cost; Aurora-specific APIs create mild lock-in.
- DynamoDB: Rejected. Transactions are limited in scope, there is no pgvector, and the key-value model is a poor fit for audit records.
Consequences
Positive
- ACID guarantees for L7 audit
- pgvector co-located with metadata
- SQLAlchemy ORM compatibility
Trade-offs
- Single-AZ recovery time risk at POC
- Migration to YugabyteDB needed at scale
Cost impact
~$35/month (db.t3.medium, single-AZ)
Review trigger
Revisit when: Vector count exceeds 5M rows (pgvector degrades) OR multi-region active-active becomes a hard requirement. Either triggers YugabyteDB migration.
ADR-005
Decision
pgvector as the vector store for POC, running as an extension on the existing RDS PostgreSQL instance. Zero additional infrastructure.
Context
L3 requires vector storage for RAG retrieval. The POC benefit of pgvector is operational simplicity — one database, one backup strategy, one query interface. Metadata and vectors co-located enable JOIN queries without cross-service calls. Milvus is the Phase 2 choice when vector count and ANN performance justify a separate cluster.
Cost impact
Zero additional (PostgreSQL extension)
Review trigger
Revisit when: Average vector query latency exceeds 100ms OR vector count exceeds 5 million rows. Either triggers ADR-025 (Milvus).
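The review trigger above is a simple OR over two measurable thresholds, so it can be checked mechanically. A minimal sketch, assuming a hypothetical `VectorStoreMetrics` snapshot; how the numbers are collected is not specified in this register.

```python
from dataclasses import dataclass

# Thresholds taken from the ADR-005 review trigger.
MAX_VECTOR_ROWS = 5_000_000
MAX_AVG_LATENCY_MS = 100.0


@dataclass(frozen=True)
class VectorStoreMetrics:
    """Hypothetical snapshot of pgvector health (names are illustrative)."""
    vector_rows: int
    avg_query_latency_ms: float


def milvus_review_due(metrics: VectorStoreMetrics) -> bool:
    """Return True when either ADR-005 condition is met (logical OR)."""
    return (
        metrics.vector_rows > MAX_VECTOR_ROWS
        or metrics.avg_query_latency_ms > MAX_AVG_LATENCY_MS
    )
```

Keeping the thresholds as named constants means the ADR and the alerting rule can be compared line by line when either changes.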
ADR-006
Decision
AWS Bedrock for all LLM inference at POC. All inference calls go through an OpenAI-compatible abstraction layer — the agent code is model-agnostic. Validate on Bedrock. Earn the GPU when the architecture is proven.
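One way to picture the abstraction layer described above is a narrow interface that agent code depends on, with provider adapters behind it. This is a sketch only: `ChatBackend`, `EchoBackend`, `run_agent_step` and the model name `poc-default` are hypothetical names, not the project's actual layer or any provider's SDK.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class ChatMessage:
    role: str      # "system", "user" or "assistant"
    content: str


class ChatBackend(Protocol):
    """Hypothetical interface the agent code depends on."""

    def complete(self, model: str, messages: List[ChatMessage]) -> str: ...


class EchoBackend:
    """Stand-in backend for local tests. A Bedrock or NIM adapter would
    implement the same method and translate to the provider's API."""

    def complete(self, model: str, messages: List[ChatMessage]) -> str:
        return f"[{model}] {messages[-1].content}"


def run_agent_step(backend: ChatBackend, question: str) -> str:
    # Agent code only sees ChatBackend, never a provider SDK.
    messages = [
        ChatMessage("system", "You are a compliance analyst."),
        ChatMessage("user", question),
    ]
    return backend.complete(model="poc-default", messages=messages)
```

Swapping providers then means writing a new adapter class; `run_agent_step` and everything above it stay unchanged.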
Alternatives considered
- OpenAI API: Rejected. Data leaves the AWS boundary; sovereignty violation.
- NVIDIA NIM: Phase 2. Requires a GPU cluster; not cost-justified until the architecture is validated.
- Ollama (self-hosted): Local dev only. Correct for development; not production-ready without GPU.
Cost impact
Variable — ~$0.003–$0.015 per 1K tokens
Review trigger
Revisit when: Monthly Bedrock cost exceeds $200 OR latency fails to meet the 200ms requirement. Either triggers ADR-024 (NVIDIA NIM).
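The $200 cost trigger can be read as a monthly token budget. A rough worked calculation, assuming a single blended per-1K-token price at each end of the range quoted under Cost impact (real pricing usually differs between input and output tokens):

```python
BUDGET_USD = 200.0          # monthly cost trigger from ADR-006
PRICE_PER_1K_LOW = 0.003    # low end of the quoted price range
PRICE_PER_1K_HIGH = 0.015   # high end of the quoted price range


def tokens_for_budget(budget_usd: float, price_per_1k: float) -> int:
    """Tokens that fit in the budget at a flat price per 1K tokens."""
    return int(budget_usd / price_per_1k * 1000)


if __name__ == "__main__":
    print(tokens_for_budget(BUDGET_USD, PRICE_PER_1K_HIGH))  # tokens at the high price
    print(tokens_for_budget(BUDGET_USD, PRICE_PER_1K_LOW))   # tokens at the low price
```

Under that assumption the trigger fires somewhere between roughly 13 million and 67 million tokens per month, depending on the model mix.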
ADR-007
Decision
Redis for ephemeral agent working memory, pub-sub messaging for agent handoffs, and hot-vector caching. Redis state is ephemeral by design — durable state lives in Temporal and PostgreSQL.
Cost impact
~$15/month (ElastiCache cache.t3.micro)
Review trigger
Revisit when: Pub-sub volume exceeds 5,000 messages/second OR audit event stream requires replay. Either triggers Kafka migration (ADR-016).
ADR-009
Decision
Temporal.io as the durable workflow engine for all L4 multi-step agent workflows. Workflows survive pod crashes, network failures, and Bedrock timeouts — resuming from the exact last completed step.
Context
Temporal provides three properties essential for regulated agentic workflows: fault tolerance (resumes from last completed step), state persistence (every intermediate step persisted and replayable), and built-in audit history (immutable record of every workflow action). This is not a convenience tool. It is how L4's durable state checkpointing capability is implemented.
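The "resume from the last completed step" property can be illustrated without Temporal. The sketch below is a library-free toy, not Temporal's API: `run_workflow` and the plain `dict` store are hypothetical stand-ins for a workflow engine and its durable history.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], Dict]]


def run_workflow(steps: List[Step], store: Dict) -> Dict:
    """Run steps in order, recording each completed step in `store`.

    `store` stands in for durable storage. If a step raises, nothing is
    recorded for it, so a later call resumes at that same step.
    """
    done = store.setdefault("done", [])
    state = store.setdefault("state", {})
    for name, fn in steps:
        if name in done:
            continue  # completed in an earlier attempt; do not re-run
        state = fn(state)       # may raise, e.g. on an inference timeout
        store["state"] = state  # persist the result before moving on
        done.append(name)
    return state
```

Calling `run_workflow` again with the same `store` after a failure skips every step already listed under `done`. A real engine adds what this toy lacks: retries, timers, deterministic replay and an immutable history.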
Alternatives considered
- Celery + Redis: Rejected. A task queue, not a durable workflow engine; tasks cannot resume mid-execution after failure.
- AWS Step Functions: Rejected. Managed, but AWS lock-in; event history is less queryable than Temporal's.
- Custom checkpoint implementation: Rejected. Temporal solves this correctly; reinventing it is engineering waste.
Cost impact
~$20/month (EKS pod, shared PostgreSQL backend)
Review trigger
Revisit when: Temporal's workflow history storage costs exceed the value provided OR a purpose-built agentic workflow engine with a better compliance story emerges.
ADR-013
Decision
Event-driven microservices. Each capability layer boundary maps to independently deployable services. Services communicate via events (Redis Pub-Sub for POC, Kafka at Phase 2) rather than synchronous REST where possible.
Context
Three properties of agentic regulated workloads make event-driven microservices correct: different capability layers have different scaling requirements; event-driven communication allows L7 to record every inter-layer event without blocking the producing layer; each service can be deployed and updated independently, reducing blast radius.
Review trigger
Revisit when: Operational overhead of managing multiple services exceeds scaling benefits for the current team size. At that point, merge L4+L5 into one service.
ADR-014
Decision
Each data store has its own CAP position:
- PostgreSQL / L7 audit: CP (never lose a record).
- Redis / L4 agent state: AP (availability over consistency; ephemeral).
- Temporal / L4 workflow: CP (consistent or fail).
- pgvector / L3: AP (stale context acceptable; freshness validation in L3 catches it).
Review trigger
Revisit when: An L7 audit record is lost during a partition event. That outcome is architecturally inadmissible and requires immediate redesign.
ADR-015
Decision
Append-only PostgreSQL tables as the L7 write model. No UPDATE or DELETE on audit records. SHA-256 hash chaining within PostgreSQL for tamper detection. Separate read model (materialised views) for regulator queries. CQRS without full event sourcing complexity at POC.
Context
Full event sourcing is architecturally correct long-term. But at POC scale, its complexity exceeds its benefit. Append-only tables provide immutability with minimal overhead. The CQRS discipline established now makes the Phase 2 upgrade to full event sourcing an evolution, not a redesign.
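The hash-chaining idea can be sketched in a few lines: each record's hash covers the previous record's hash, so editing or reordering any row invalidates every hash after it. This is an in-memory illustration only; the field names, the canonical JSON encoding and the all-zero genesis value are assumptions, and the real write model lives in PostgreSQL.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # assumed sentinel for the first record


def record_hash(prev_hash: str, payload: Dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: List[Dict], payload: Dict) -> None:
    """Append only: existing entries are never modified or removed."""
    prev = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(
        {"payload": payload, "prev_hash": prev, "hash": record_hash(prev, payload)}
    )


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every hash; any edited or reordered row breaks the chain."""
    prev = GENESIS_HASH
    for row in chain:
        if row["prev_hash"] != prev:
            return False
        if row["hash"] != record_hash(prev, row["payload"]):
            return False
        prev = row["hash"]
    return True
```

A verifier like `verify_chain` detects tampering but does not prevent it. Blocking UPDATE and DELETE at the database level is a separate control.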
Review trigger
Revisit when: Audit record volume exceeds 10 million rows OR a cryptographically verifiable audit trail (blockchain-grade) is required by regulation.
ADR-016
Decision
Redis Pub-Sub for all async messaging at POC. Kafka (AWS MSK) at Phase 2. The migration trigger is message volume and replay requirements, not a calendar date.
Review trigger
Revisit when: Audit event stream requires message replay OR pub-sub throughput exceeds 5,000 messages/second.
ADR-017
Decision
Server-Sent Events (SSE) via FastAPI for the mid-analysis compliance officer alert channel. Communication is unidirectional — server pushes events to the dashboard. WebSocket complexity is not justified for one-way push.
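On the wire, an SSE message is plain text: optional `id:` and `event:` fields, one or more `data:` lines, and a blank line as terminator. A minimal sketch of the formatting side, using only the standard library; the event name `compliance_alert` and the payload shape are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def format_sse(event: str, data: Dict, event_id: str = "") -> str:
    """Serialise one message in the text/event-stream wire format."""
    lines = []
    if event_id:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one data field suffices.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"  # the blank line terminates the message


def alert_stream(alerts: Iterable[Dict]) -> Iterator[str]:
    """Yield one SSE message per alert, numbered for client-side resume."""
    for i, alert in enumerate(alerts, start=1):
        yield format_sse("compliance_alert", alert, event_id=str(i))
```

In FastAPI, a generator like `alert_stream` would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`.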
Review trigger
Revisit when: CCO dashboard requires bidirectional interaction through the alert channel. That triggers WebSocket migration.
ADR-018
Decision
LangGraph as the state machine framework for all L4 agent orchestration. Nodes = reasoning steps. Edges = conditional transitions. State = accumulated context. External checkpointing to PostgreSQL via the LangGraph checkpointer interface.
Context
LangGraph's node-and-edge model maps directly to the L4 requirement: the graph is explicit and auditable, supports external state persistence, and enables SOP constraint enforcement at the edge level — certain transitions only allowed if policy conditions are met.
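To make "constraint enforcement at the edge level" concrete, here is a library-free toy graph in which a transition is taken only if its guard passes. It deliberately does not use LangGraph's API; `GuardedGraph`, the node names and the 10,000 threshold are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Node = Callable[[State], State]
Guard = Callable[[State], bool]


class GuardedGraph:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: Dict[str, List[Tuple[str, Guard]]] = {}

    def add_node(self, name: str, fn: Node) -> None:
        self.nodes[name] = fn
        self.edges.setdefault(name, [])

    def add_edge(self, src: str, dst: str, guard: Guard = lambda s: True) -> None:
        self.edges[src].append((dst, guard))

    def run(self, start: str, state: State) -> Tuple[State, List[str]]:
        """Execute nodes from `start`; return final state and the path taken."""
        path: List[str] = []
        current: Optional[str] = start
        while current is not None:
            state = self.nodes[current](state)
            path.append(current)
            # The first edge whose guard passes wins; no passing edge ends the run.
            current = next(
                (dst for dst, ok in self.edges[current] if ok(state)), None
            )
        return state, path


def build_review_graph() -> GuardedGraph:
    g = GuardedGraph()
    g.add_node("assess", lambda s: {**s, "risk": "high" if s["amount"] > 10_000 else "low"})
    g.add_node("human_review", lambda s: {**s, "approved_by": "officer"})
    g.add_node("execute", lambda s: {**s, "executed": True})
    # Policy: high-risk cases may only reach "execute" via "human_review".
    g.add_edge("assess", "human_review", lambda s: s["risk"] == "high")
    g.add_edge("assess", "execute", lambda s: s["risk"] == "low")
    g.add_edge("human_review", "execute")
    return g
```

The returned path is the auditable part: it records which transitions were actually taken, and no path from a high-risk assessment to execution exists without the review node.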
Alternatives considered
- LangChain ReAct Agents: Rejected. The ReAct loop is less auditable; the agent decides the next step each iteration without explicit graph constraints.
- AutoGen: Rejected. Conversation-based multi-agent; less suited to structured SOP-constrained workflows.
- Custom state machine: Rejected. Reinvents what LangGraph already provides correctly.
Review trigger
Revisit when: LangGraph's checkpointing model is incompatible with Temporal at Phase 2 scale.
ADR-019
Decision (proposed)
Defer CrewAI to Phase 1. POC uses LangGraph's supervisor pattern for multi-agent coordination. CrewAI evaluated when the Fraud Detection use case (Ch13) makes multi-agent complexity concrete.
Open question
Does CrewAI's crew abstraction add measurable value over LangGraph's supervisor pattern for seven-specialist-agent coordination? This question cannot be answered correctly without a real use case.
Review trigger
Accelerate when: Fraud Detection use case reveals LangGraph supervisor pattern is insufficient for seven-specialist-agent coordination.
ADR-020
Decision (proposed)
Keycloak deployed at POC in minimal configuration (single realm, basic RBAC). Full configuration — fine-grained permissions, maker-checker roles, per-agent identity scopes — at Phase 1. Authentication established from day one to avoid retrofitting.
Alternatives considered
- AWS Cognito: Rejected. Does not support fine-grained ABAC for L2; AWS lock-in for identity.
- Simple JWT (custom): Rejected. No RBAC framework; complete replacement required at Phase 1.
- Auth0: Rejected. Data sovereignty concern; tokens processed outside sovereign infrastructure.
Review trigger
Revisit when: Keycloak operational overhead exceeds one engineer-hour/month at POC scale.
ADR-022
Open questions (must answer before accepting)
1. Does Casbin support policy versioning? Can two policy versions coexist and be evaluated independently?
2. Does Casbin's ABAC model support context-dependent permissions — same agent, different permissions based on risk tier?
3. What is the performance overhead of a Casbin evaluation at the L4→L5 boundary for every tool call?
Proposed direction
Casbin with PostgreSQL adapter for policy storage. Policy versions stored as separate Casbin policy sets, pinned at request time via L2.
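The proposed direction amounts to "several policy sets coexist, and each request names the version it is evaluated against". The toy below shows that shape with plain Python sets. It is not Casbin's API and does not answer the open questions above; the rule tuple layout and the version labels are assumptions.

```python
from typing import Dict, FrozenSet, Tuple

Rule = Tuple[str, str, str]  # (agent role, tool, risk tier)

# Hypothetical policy sets keyed by version; both versions coexist.
POLICY_SETS: Dict[str, FrozenSet[Rule]] = {
    "v1": frozenset({
        ("analyst", "read_ledger", "low"),
        ("analyst", "read_ledger", "high"),
    }),
    "v2": frozenset({
        ("analyst", "read_ledger", "low"),  # v2 removes high-risk access
    }),
}


def is_allowed(version: str, role: str, tool: str, risk_tier: str) -> bool:
    """Evaluate one tool call against the policy version pinned for the request."""
    return (role, tool, risk_tier) in POLICY_SETS[version]
```

Because the version is an explicit argument, a request pinned to `v1` keeps its original answer even after `v2` is published, which is the property the audit trail needs to be able to reproduce.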
ADR-023
Context (GAP-001)
The current platform has no per-agent identity system. When Agent A delegates to Agent B, the audit trail cannot distinguish their individual actions. This directly violates IMDA MGF §2.1.2 and is documented in the Regulatory Traceability Matrix as REG-004.
Proposed direction
OAuth 2.0 Token Exchange (RFC 8693) via Keycloak. Human token → Orchestrator agent token → Specialist agent tokens, each with narrower scope. Delegation chain recorded in L7 audit trail.
Open questions
1. Does RFC 8693 Token Exchange in Keycloak support the delegation depth required (human → L4 orchestrator → L5 specialist → external tool)?
2. What is the token exchange latency overhead per delegation level?
3. How are agent identities registered — pre-defined at deploy time or dynamic at workflow instantiation?
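The delegation chain in the proposed direction has two invariants worth stating in code: each exchanged token may only narrow scope, and the chain back to the human must be recoverable for the audit record. The sketch below models just those invariants. It is not Keycloak's or RFC 8693's wire format, and `DelegatedToken`, `exchange` and the scope names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class DelegatedToken:
    subject: str                               # who holds this token
    scopes: FrozenSet[str]                     # what it may do
    parent: Optional["DelegatedToken"] = None  # who delegated it


def exchange(parent: DelegatedToken, subject: str, scopes: Iterable[str]) -> DelegatedToken:
    """Issue a child token; refuse any scope the parent does not hold."""
    requested = frozenset(scopes)
    if not requested <= parent.scopes:
        extra = sorted(requested - parent.scopes)
        raise PermissionError(f"scope escalation refused: {extra}")
    return DelegatedToken(subject, requested, parent)


def delegation_chain(token: DelegatedToken) -> List[str]:
    """Subjects from the original human down to this token, for the audit record."""
    chain: List[str] = []
    node: Optional[DelegatedToken] = token
    while node is not None:
        chain.append(node.subject)
        node = node.parent
    return chain[::-1]
```

With this shape, every action can be attributed both to the specialist that performed it and to each principal above it in the chain.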
ADR-024
Decision (proposed)
NVIDIA NIM replaces AWS Bedrock when monthly inference cost exceeds $200 OR latency fails to meet the 200ms requirement. The OpenAI-compatible API is maintained, so agent code requires no changes. Earn the GPU when the architecture is proven.
Cost impact
~$500/month (GPU instance + NIM at Phase 2 scale)
ADR-025
Decision (proposed)
Milvus replaces pgvector when vector count exceeds 5M OR average query latency exceeds 100ms. Same abstraction layer — L3 retrieval code swaps implementations without agent code changes.
Cost impact
~$40/month (Milvus cluster, 3 nodes on EKS)
ADR-026
Why deferred
Prompt CI/CD requires real production prompts and real regression cases to design correctly. Writing this ADR before the first use case is built means designing against hypothetical prompts. This ADR is opened when Ch23 (Prompts Are Code) is written.
GAP-003 — L7 hook required now
Even though the Agenta pipeline is deferred, the L7 integration contract for prompt version recording must be designed now. Every inference call must record which prompt template version was used. Retrofitting this breaks the audit chain.
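One possible shape for that contract is a small immutable record written alongside every inference call. The field names below are assumptions, not the project's schema. The content hash is an extra safeguard, suggested here, against a template being edited without a version bump.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceAuditRecord:
    """Hypothetical L7 fields recorded for one inference call."""
    workflow_id: str
    model: str
    prompt_template_id: str
    prompt_template_version: str
    prompt_template_sha256: str  # detects silent edits under the same version


def record_inference(
    workflow_id: str,
    model: str,
    template_id: str,
    template_version: str,
    template_text: str,
) -> InferenceAuditRecord:
    """Build the audit record from the template actually sent to the model."""
    digest = hashlib.sha256(template_text.encode("utf-8")).hexdigest()
    return InferenceAuditRecord(
        workflow_id, model, template_id, template_version, digest
    )
```

Recording the version at call time, not at deploy time, is what lets a regulator query tie a specific decision to the exact prompt that produced it.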
Version History
Apr 2026
ADR-013 through ADR-026 authored. Architecture layer (2A–2D) decisions formalised. Component layer ADRs for LangGraph, CrewAI, Keycloak, NIM, Milvus. GAP-001 (ADR-023) and GAP-003 (ADR-026) opened. Casbin research pending (ADR-022).
Mar 2026
ADR-001 through ADR-012 established during infrastructure build. Open-source foundations, Two-VPC topology, Terraform, PostgreSQL, pgvector, Bedrock, Redis, MinIO, Temporal, aws-vault, GitLab, CloudFront.