GEDS: Generalized Evaluation & Decision System

HISTORICAL DEEP-DIVE (June 2026) — this proposal predates the factory architecture and the five-mode reframe of GEDS. Its compression core became Mode 1 (anomaly extraction) and its anti-psychosis gates became the handoff-packet discipline. The component named “GEDS Bridge” inside this document is unrelated to the GUTS Bridge, the factory's built control surface. Current framing: geds.html · blueprint.html

You are the single point of failure

Every ticket, every goal, every evaluation begins with you knowing what to ask. You are the sensor. Sensors have blind spots. GEDS changes the primitive operation from evaluation-on-request to continuous structural inference — finding what you didn't know to look for.


In code, you don't know what good looks like — so the ticket "this architecture will bottleneck the GONS routing layer at scale" never gets written.

In sales, you don't know what good looks like — so the ticket "your offer framing is burying the outcome clients actually want to buy" never gets written.

Not because the system can't execute them. Because nobody knew to write them.

Two processes. Two clocks.

The LLM handles communication on a fast clock. The compression engine runs structural inference on a slow clock. GEDS bridges them.

Fast Clock · Reactive
LLM Layer
  • Handles communication and ticket generation
  • Executes tasks through GOMS
  • Interfaces with you through GONS
  • Translates structural inferences into language
  • Fires on prompt — strictly reactive
[Diagram: fast clock and slow clock, with GEDS bridging them]
Slow Clock · Continuous
Compression Engine
  • Accumulates every output, ticket, revision, outcome
  • Compresses into an evolving structural model
  • Tracks where execution flows vs. hits friction
  • Detects gaps between what's worked on and what matters
  • Runs continuously — never waits for a prompt
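The split between the two clocks can be sketched as two objects that share only a queue. This is a toy illustration under stated assumptions: `CompressionEngine`, `LLMLayer`, `observe`, and `on_prompt` are hypothetical names, and a simple occurrence count stands in for real structural inference.

```python
import queue
from collections import Counter


class CompressionEngine:
    """Slow clock: accumulates every event and surfaces findings unprompted."""

    def __init__(self, surfaced: queue.Queue, min_count: int = 3):
        self.surfaced = surfaced
        self.min_count = min_count  # hypothetical stand-in for a real signal
        self.counts = Counter()

    def observe(self, event_type: str) -> None:
        # Runs on every event, whether or not anyone asked a question.
        self.counts[event_type] += 1
        if self.counts[event_type] == self.min_count:
            self.surfaced.put(f"pattern: {event_type} seen {self.min_count} times")


class LLMLayer:
    """Fast clock: strictly reactive, does nothing until prompted."""

    def __init__(self, surfaced: queue.Queue):
        self.surfaced = surfaced

    def on_prompt(self, prompt: str) -> dict:
        # Drain whatever the slow clock has surfaced since the last prompt.
        pending = []
        while True:
            try:
                pending.append(self.surfaced.get_nowait())
            except queue.Empty:
                break
        return {"reply": f"ack: {prompt}", "unprompted": pending}
```

The only coupling is the queue: the engine never waits on the LLM, and the LLM never performs the inference itself.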
Three components of the compression engine
[Figure: embedding space]
01 — Embedding Model · Operational History as Geometry
Every ticket, output, revision, and outcome gets embedded — not as retrievable text but as a geometric structure where similar patterns cluster and divergent ones separate. The two clusters above emerge naturally — code patterns and sales patterns — without being labeled. You see the shape of what's happening, not just the sequence of events.
[Figure: sparse autoencoder, N variables compressed to 5–20 latent factors]
02 — Sparse Autoencoder · Finding the Latent Variables
A network trained to find the minimum set of latent variables that explain variance in your operational data. The outer ring is all the noise — every ticket, output, revision. The core is the 5–20 underlying factors that actually predict most of what happens. When a new unexplained factor emerges, that's the signal that surfaces upward to GEDS.
[Figure: GEDS bridge between the compression engine and the LLM layer]
03 — GEDS Bridge · Inference to Action
Takes what the compression engine surfaces, runs it through the LLM for articulation, and generates the unprompted ticket — the reframing you didn't know to ask for. The LLM doesn't do the structural inference. It translates inference into something actionable. You evaluate the inference. You don't have to generate it. You move from sensor to judge.
GEDS in the G Stack

Every other system in the stack executes within the frame it was given. GEDS is the only system whose job is to question whether the frame is right.

[Figure: the G Stack. GOMS (execute), GONS (coordinate), GRAMS (market), GIMS (memory), GEMS (artifacts), GAMS (prioritize), with the GEDS compression engine at the center]
GEDS — Evaluation & Decision
Center of the stack. Compression engine plus bridge. Generates what nobody asked for.
GOMS — Orchestration & Execution
Executes tickets via agents. Does the work.
GONS — Operations Nervous System
Routes, coordinates, communicates. Interfaces with you directly.
GAMS — Allocation Management
Prioritizes work. Allocates time and agent effort.
GIMS / GEMS — Memory & Artifacts
Source of truth. File and output storage.
GRAMS — Market Interface
Handles economic interactions: proposals, contracts, labor exchanges.
The key shift: you move from sensor (knowing what to ask) to judge (deciding whether what was inferred is worth acting on).
Unprompted Inferences: 3 (awaiting review)
Latent Factors: 11 (+2 new this week)
Structural Gaps: 2 (high confidence)
Acceptance Rate: 71% (last 30 days)
Unprompted Inferences · 3 pending
Codebase · High · 2h ago
GONS routing layer will bottleneck at scale — architectural decision needed now
Evidence: 7 of last 9 routing revisions touched the same two modules. Friction pattern suggests a structural assumption that won't hold past ~50 concurrent threads.
Sales · Medium · 6h ago
Offer framing is burying the outcome clients actually want to buy
Evidence: 3 recent outreach drafts led with process and capability. High-converting patterns lead with the specific problem being removed from the client's life.
Time Allocation · Low · 1d ago
High-leverage work is being deferred for visible but low-impact tasks
Evidence: GAMS logs show 68% of last week on execution tasks. GEDS-flagged strategic work deferred 4 times across 3 days.
Latent Factor Map · 11 tracked
Scope underestimation — 84%
Routing abstraction debt — 71%
Strategic deferral — 68%
Sales framing — 61%
Verification rate — 44%
Context exhaustion — 38%
Structural Drift · 14 Days · 2 gaps widening
Gap between the compression model's predicted trajectory and actual output direction.
[Chart: predicted trajectory vs. actual codebase and sales output, from 14 days ago to today]
Compression Log · live
09:14:32 · New factor emerging: GOMS_TICKET_CHURN (same tickets revised 2x+ before completion). May indicate an upstream specification problem.
08:51:07 · Compression model updated with 23 new GOMS outputs. Routing abstraction debt confidence 64% → 71%.
07:30:18 · Sales framing inference generated. Evidence threshold met (3 corroborating patterns). Queued for approval.
yesterday · Structural gap: TIME_ALLOCATION diverging from stated priorities for the 4th consecutive day.
2d ago · GONS routing inference accepted. Ticket ARCH-0041 queued in GOMS. Tracking outcome for model calibration.
Approval Queue · Awaiting Your Judgment · 3 pending
GONS routing bottleneck — Convert inference to GOMS architectural review ticket. Est. 2–4 hours. Will pause two dependent tickets.
Sales framing — Rewrite outreach template to lead with client problem removal rather than system capability. Generate 2 variant drafts.
Time allocation drift — Reschedule GRAMS schema work to tomorrow AM. Block 3 hours. Flag to GAMS as protected time.

What it's made of

Not a single model. A stack of well-understood components arranged around a new objective function. Each piece exists. The arrangement is the invention.

[Figure: full stack. GUTS trace in SQLite and a vector store, sparse autoencoder (SAE), scoring loop, LLM articulation]

A system that watches everything that happens in the operational stack, maintains a set of competing structural explanations for why things happen the way they do, and fires an alert when one of those explanations suddenly gets better at predicting the record — without being asked. The alert becomes a ticket. You judge whether the inference is real. That judgment feeds back into which explanations survive.

Five components in build order
[Figure: GUTS event trace with columns event, ts, type, data; example rows ticket_created, inference_rejected, time_block_displaced, sales_attempt with outcome]
01 — GUTS · Structured Event Log
Before any compression is possible you need a precise event log — not conversation history, not text blobs. Typed events with schema. Every ticket created, revised, or abandoned. Every inference accepted or rejected. Every time block allocated or displaced. Every sales attempt and outcome. Every code change and revert.

The schema is the hardest design decision in the whole system. Vague events produce vague latent factors. The more precisely you define what counts as an event and what its attributes are, the more useful the compression becomes.

Storage: SQLite — append-only, fixed schema, modest data volume, handles concurrent reads. A handful of tables: events, inferences, model_scores, feedback.
[Figure: sparse autoencoder, from input through the latent bottleneck to the sparse layer]
02 — Sparse Autoencoder · The Compression Engine
A standard autoencoder compresses input to a bottleneck and reconstructs it. A sparse autoencoder adds a constraint: most hidden neurons must be inactive at any given time. That sparsity pressure forces interpretable latent features rather than distributed encodings that work mathematically but mean nothing to a human.
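One common way to write that sparsity pressure is an L1 penalty on the hidden activations added to the reconstruction loss. The text does not say which penalty GEDS would use, so the form below, the ReLU encoder, and the symbols $W_e$, $W_d$, $b_e$, $b_d$, and $\lambda$ are assumptions:

```latex
h = \mathrm{ReLU}(W_e x + b_e), \qquad
\hat{x} = W_d h + b_d, \qquad
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert h \rVert_1
```

Here $x$ is one embedded event, $h$ is the hidden layer, and $\lambda$ trades reconstruction accuracy against how many hidden units stay inactive.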

The diagram shows it: many input nodes → small bottleneck → sparse hidden layer where only a few neurons activate per event. Those active neurons are the latent factors GEDS tracks — the underlying variables that explain why your operational stack behaves the way it does.

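Trained in: PyTorch. Once trained, only the encoder is needed to embed new events as they arrive. The decoder can be discarded for that purpose, but it has to be kept wherever reconstruction error is used as the compression metric.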

Embeddings stored in: ChromaDB or LanceDB — both embeddable, no separate server. SQLite blobs work fine at V1 scale.

Theoretical grounding: The objective function the autoencoder serves — rewarding compression progress rather than compression achieved — is formalized in Schmidhuber's 2009 paper Driven by Compression Progress. The core argument is that interestingness is not raw novelty but the rate of improvement in a system's ability to compress its observations. That is the reward signal GEDS is built around. arxiv.org/pdf/0812.4360
[Chart: compression progress detector. Δ compression progress over time; GEDS fires at the jump]
03 — Compression Progress Detector · The GEDS Signal
This is what makes GEDS different from a dashboard. A dashboard tells you what happened. The detector tells you when a new explanation suddenly accounts for a lot of previously scattered behavior.

The chart above shows it: two competing models improving slowly over time, then one suddenly jumps — a new latent factor collapses a lot of previously uncompressed variance. That jump is the signal. GEDS fires.

The metric: delta compression per unit of model complexity. Did this new explanation make a lot of scattered stuff simpler without adding too much machinery? Stated technically: did reconstruction error drop significantly relative to the increase in model parameters?
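As a rough sketch, the metric can be computed as the drop in reconstruction error divided by the number of parameters added. The function name `compression_progress` and the choice to floor the complexity cost at one parameter are assumptions, not something the text specifies.

```python
def compression_progress(
    err_before: float,
    err_after: float,
    params_before: int,
    params_after: int,
) -> float:
    """Drop in reconstruction error per added model parameter.

    Positive values mean the new explanation made the record simpler;
    larger values mean it did so with less added machinery.
    """
    error_drop = err_before - err_after
    # Assumption: a model that adds no parameters is charged a cost of 1,
    # which avoids division by zero and never rewards shrinking twice.
    added_params = max(params_after - params_before, 1)
    return error_drop / added_params


# Error fell from 0.40 to 0.25 after adding 10 parameters:
# roughly 0.015 error units gained per added parameter.
progress = compression_progress(0.40, 0.25, 1000, 1010)
```

Whether a given value counts as "significant" is a threshold that would have to be calibrated against the historical trace.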

In V1 this is not classical RL. It's a Bayesian scoring function. Each competing model has a score. New events raise or lower that score based on whether the model predicted them. A significant score jump triggers a surface event.
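One simple way to realize such a scoring function is a Beta-Bernoulli update: each model's score is the posterior mean of "this model predicts the next event", and a surface event fires when the score rises sharply over a short window. The Beta(1, 1) prior, the window length, the threshold, and all names here are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ModelScore:
    """Running score for one competing structural explanation."""

    model_id: str
    hits: int = 0
    misses: int = 0
    history: list = field(default_factory=list)

    @property
    def score(self) -> float:
        # Posterior mean under a Beta(1, 1) prior on the hit probability.
        return (self.hits + 1) / (self.hits + self.misses + 2)

    def update(self, predicted: bool) -> float:
        """Record whether the model predicted the new event; return the new score."""
        if predicted:
            self.hits += 1
        else:
            self.misses += 1
        self.history.append(self.score)
        return self.score


def score_jump(model: ModelScore, window: int) -> float:
    """Score change over the last `window` updates (0.0 until enough history)."""
    if len(model.history) <= window:
        return 0.0
    return model.history[-1] - model.history[-1 - window]


def should_surface(model: ModelScore, window: int = 5, threshold: float = 0.15) -> bool:
    """True when the score has jumped by at least `threshold` within the window."""
    return score_jump(model, window) >= threshold
```

Replaying the historical trace through `update` and checking where `should_surface` turns true is one way to calibrate the threshold before running live.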
[Figure: anti-psychosis validator. Gates P (predicts), S (survives), T (testable), D (decays), U (uncertain); a PASS on all gates lets the inference surface]
04 — Anti-Psychosis Validator · Before Anything Surfaces
Compression engines can go wrong. Conspiracy theories are compression engines. Delusions are compression engines. "Everything is about X" is often a high-compression, low-truth model. Before any inference reaches you it must pass five gates.

Improves Prediction
Must make a falsifiable prediction about something not yet in the trace. If it can't be wrong, it isn't an inference.
Survives Contradiction
Must account for the times the pattern didn't hold. A model that ignores counterevidence is a delusion.
Produces Testable Consequences
Must imply a concrete action whose outcome will confirm or disconfirm it. No untestable inferences reach the queue.
Decays If It Stops Predicting
Models that were right but stop predicting get retired. No permanent conclusions. Confidence must be earned continuously.
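The gates spelled out above can be sketched as a rule-based filter. This covers only the four gates the text describes. The `Candidate` fields, the gate labels, and the `min_hit_rate` default are hypothetical, chosen to make each rule checkable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A candidate inference as the validator might see it (fields assumed)."""

    prediction: Optional[str]        # falsifiable claim about events not yet in the trace
    counterexamples: int             # times the pattern did not hold
    counterexamples_explained: int   # how many of those the model accounts for
    test_action: Optional[str]       # concrete action that would confirm or disconfirm
    recent_hit_rate: float           # share of recent predictions that came true


def failed_gates(c: Candidate, min_hit_rate: float = 0.5) -> list:
    """Return the names of the gates this candidate fails (empty list means pass)."""
    failures = []
    if not c.prediction:
        failures.append("improves_prediction")
    if c.counterexamples_explained < c.counterexamples:
        failures.append("survives_contradiction")
    if not c.test_action:
        failures.append("testable_consequences")
    if c.recent_hit_rate < min_hit_rate:
        failures.append("decays_if_it_stops_predicting")
    return failures


def passes(c: Candidate, min_hit_rate: float = 0.5) -> bool:
    """True only when every gate passes."""
    return not failed_gates(c, min_hit_rate)
```

Returning the list of failed gates, rather than a bare boolean, gives a rejected candidate a recorded reason that later calibration can use.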
[Figure: LLM articulation layer. A latent factor with score 0.84 passes through the LLM and becomes an unprompted ticket: "GONS routing will bottleneck at scale. Evidence: 7/9 revisions → 2 modules"]
05 — LLM Articulation Layer · Translation Only
The last component to touch an inference, and the narrowest. Its job is strictly translation: take what the compression engine surfaced and the validator approved, and render it as a human-readable unprompted ticket with evidence attached.

The LLM does not evaluate whether the inference is true. It does not add interpretation. It does not generate its own structural analysis. It renders. This is the division that keeps the "what's the point" organ outside the language model, where it belongs.

The LLM also handles the second-pass jobs: explaining the evidence in plain language, generating the GOMS ticket from an accepted inference, writing the sales draft, producing the code review. All of that is articulation work. None of it is compression work.

API cost stays low because the LLM only fires when a compression passes the validator threshold — not on every event, not on every cycle of the scoring loop.
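The translation-only contract can be sketched as a function that refuses to call the model unless the validator has already approved the inference, and that attaches evidence verbatim instead of routing it through the model. The `articulate` name, the dictionary keys, and the `llm` callable are placeholders; no real LLM API is assumed.

```python
from typing import Callable, Optional


def articulate(
    inference: dict,
    llm: Callable[[str], str],
    validated: bool,
) -> Optional[dict]:
    """Render a validated inference as an unprompted ticket, or return None."""
    if not validated:
        # No LLM call at all for inferences that did not pass the validator.
        return None

    evidence = list(inference["evidence"])
    prompt = (
        "Restate the following finding as a one-line ticket title. "
        "Do not add interpretation or new claims.\n"
        f"Factor: {inference['model_id']}\n"
        "Evidence:\n" + "\n".join(f"- {item}" for item in evidence)
    )
    return {
        "title": llm(prompt).strip(),
        "model_id": inference["model_id"],
        "score": inference["score"],
        # Evidence is copied through unchanged, never paraphrased by the model.
        "evidence": evidence,
    }
```

Passing the model in as a plain callable keeps the sketch independent of any provider and makes the "fires only after validation" property easy to test with a stub.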
Event schema design

The schema is the most consequential design decision in the system. Vague events produce vague latent factors. These are the event types the V1 trace needs to capture.

Event Type | Key Attributes | Why it matters
ticket_created | domain, source, scope_estimate | Reveals what kind of work is entering the system and from where
ticket_revised | ticket_id, revision_num, modules_touched | Repeated revision of same modules → abstraction debt signal
ticket_abandoned | ticket_id, reason_tag, time_spent | Abandoned work reveals where scope estimation failed
time_block_allocated | domain, task_type, strategic_flag | Tracks what categories of work are actually receiving time
time_block_displaced | original_task, displaced_by, reason | Displacement patterns reveal the real priority ordering
inference_surfaced | model_id, delta_score, domain, evidence_count | Creates a record of what GEDS believed and when
inference_accepted | inference_id, action_taken | The central training signal: what judgments were made
inference_rejected | inference_id, rejection_reason | Rejection reasons calibrate the validator thresholds
sales_attempt | framing_type, channel, outcome, response_lag | Connects offer framing to conversion; the signal for sales latent factors
code_changed | modules, change_type, preceded_by_ticket | Changes without preceding tickets reveal unplanned work patterns
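A sketch of how the table above could be enforced at write time, so that malformed events never reach the trace. The event types and attribute names are taken from the table; the `validate_event` helper and its strictness (rejecting unknown attributes as well as missing ones) are assumptions.

```python
# Required attributes per event type, copied from the schema table.
REQUIRED_ATTRS = {
    "ticket_created": {"domain", "source", "scope_estimate"},
    "ticket_revised": {"ticket_id", "revision_num", "modules_touched"},
    "ticket_abandoned": {"ticket_id", "reason_tag", "time_spent"},
    "time_block_allocated": {"domain", "task_type", "strategic_flag"},
    "time_block_displaced": {"original_task", "displaced_by", "reason"},
    "inference_surfaced": {"model_id", "delta_score", "domain", "evidence_count"},
    "inference_accepted": {"inference_id", "action_taken"},
    "inference_rejected": {"inference_id", "rejection_reason"},
    "sales_attempt": {"framing_type", "channel", "outcome", "response_lag"},
    "code_changed": {"modules", "change_type", "preceded_by_ticket"},
}


def validate_event(event_type: str, data: dict) -> None:
    """Raise ValueError unless `data` has exactly the attributes for `event_type`."""
    if event_type not in REQUIRED_ATTRS:
        raise ValueError(f"unknown event type: {event_type}")
    required = REQUIRED_ATTRS[event_type]
    present = set(data)
    missing = sorted(required - present)
    extra = sorted(present - required)
    if missing:
        raise ValueError(f"{event_type} is missing attributes: {missing}")
    if extra:
        # Assumption: unknown attributes are rejected to keep the schema fixed.
        raise ValueError(f"{event_type} has unexpected attributes: {extra}")
```

Calling `validate_event` before every write is one way to keep vague or partial events out of the log in the first place.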
V1: one domain, one model, one inference

The minimum viable version is narrow by design. Prove the architecture works in one domain before building the rest.

1. Define the event schema for time allocation only
Time allocation is the most tractable domain. Events are measurable, stakes of a wrong inference are low, and the pattern (strategic deferral under execution load) is already visible. Build GUTS for this domain only and log two weeks of real operational data before touching anything else.
2. Train one sparse autoencoder on the trace
Two weeks of time allocation events should be enough to find 3–5 latent factors. Train in PyTorch. Store embeddings in SQLite blobs initially. Read the latent factors manually first — before adding any scoring logic — to verify they're meaningful rather than artifacts.
3. Implement the Bayesian scoring loop
For each latent factor, maintain a score that updates when new events arrive. Define a delta threshold for "compression progress detected." Run it on the historical trace first — did it fire at the right moments looking backward? Calibrate the threshold before running live.
4. Build the validator and surface one inference
Implement the five gates as a simple rule-based filter. Let the system surface one inference. Evaluate it manually. Was it right? Did the evidence actually support the conclusion? Does it pass all five gates in practice, not just in theory? This is the experiment that tells you if the architecture has legs.
5. Wire the feedback loop and let it run
Accept or reject the inference. Track whether accepting it improved downstream prediction. Let the scoring loop update from the feedback. Run for another two weeks. The capability that doesn't exist anywhere else is this loop — the system getting better at finding the real question every time you tell it whether its last guess was right.
The thing that makes this worth building isn't the individual inferences. It's that the system gets better at finding the real question every time you tell it whether its last guess was right. That feedback loop is the product. Everything else is scaffolding.
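A minimal sketch of that feedback step, assuming each competing explanation carries a weight that your judgment nudges up or down, and that explanations falling below a floor are retired. The multiplicative update, the `step` and `floor` values, and both function names are hypothetical.

```python
def apply_feedback(
    weights: dict,
    model_id: str,
    accepted: bool,
    step: float = 0.2,
    floor: float = 0.1,
) -> dict:
    """Return new weights after one accept/reject judgment on `model_id`.

    Accepted inferences strengthen their source model, rejected ones weaken it,
    and any model whose weight drops below `floor` is retired.
    """
    updated = dict(weights)  # leave the caller's dict untouched
    factor = 1 + step if accepted else 1 - step
    updated[model_id] = updated[model_id] * factor
    return {m: w for m, w in updated.items() if w >= floor}


def acceptance_rate(judgments: list) -> float:
    """Share of surfaced inferences that were accepted (0.0 if none yet)."""
    if not judgments:
        return 0.0
    return sum(1 for accepted in judgments if accepted) / len(judgments)
```

A rejected model is weakened rather than deleted outright, so a single bad guess does not remove an explanation that has otherwise been predicting well.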