Celebal Technologies

Autonomous Remediation at Speed with Eagle Eye IQ

12 min read | June 08, 2026
How we built a multi-agent AI system that closes the data quality loop, and why Databricks Lakebase is the operational infrastructure that makes it run

The loop in data observability has always had an open end.

An anomaly is detected. A score drops. An alert fires. A Slack message lands in an engineer's DMs at 11 pm. They open the dashboard, stare at the spike, open the pipeline, trace the issue backward through a lineage graph that may or may not be current, find the root cause, fix it, validate the fix, and close the incident. Hours later, the loop is closed, manually, expensively, and entirely because a human did what the observability platform could not.

Every observability tool on the market is excellent at opening that loop. Almost none of them close it.

Eagle Eye IQ was built specifically to close it. Aquila AI, Eagle Eye IQ's autonomous agent engine, runs 35+ specialized agents that detect, investigate, diagnose, and remediate data quality incidents without human intervention. The platform does not page an engineer. It resolves the incident and logs what it did.

But building a system that genuinely closes the loop, rather than just claiming to, required an operational infrastructure that most Databricks-native applications do not need. This post is about that infrastructure: why we chose Databricks Lakebase as the operational backbone of Eagle Eye IQ, and what Aquila AI's remediation architecture actually looks like in production.

What Autonomous Remediation Demands Architecturally

The distinction between a monitoring platform and an autonomous remediation platform is not a product distinction. It is an architectural one.

A monitoring platform reads data and presents it. Its operational requirements are forgiving: analytical queries, tolerable latency, and eventual consistency are all fine. The human on the other end of the dashboard absorbs the inconsistency and fills in the gaps with judgment.

An autonomous remediation system cannot defer to human judgment. It has to act, and it has to act correctly, consistently, and quickly. Aquila's agents need to:

Coordinate across 35+ parallel workstreams without stepping on each other, which requires a task queue with genuine transactional semantics and no race conditions.

Persist investigation context across agent handoffs so that when one agent passes work to another, nothing is lost and nothing is duplicated.

Write every remediation action transactionally so the audit trail is always consistent with what actually happened.

Serve live incident status to the UI without requiring an analytical cluster to wake up first.

None of these requirements is analytical. They are operational. And operational workloads have fundamentally different infrastructure needs: sub-10ms transactional reads and writes on small records, not high-throughput scans across large datasets.

This is where our V1 architecture broke down. We initially ran everything through Spark and Delta Lake, the natural choice for a Databricks-native platform. It worked well for the analytical side of Eagle Eye IQ: running DQ checks across hundreds of millions of rows, computing column-level lineage, scanning notebooks for code quality violations. Spark handles all of that well.

It did not work for Aquila. Agent coordination on Spark meant spinning up compute every time an agent needed to update its task state. The latency was unacceptable, not because Spark is slow at what it does, but because what Aquila needs is fundamentally different from what Spark is designed to do. We were using a high-throughput analytical engine as a transactional operational database. The mismatch was architectural, not a tuning problem.

The alternative, an external managed database like RDS or Cloud SQL, addressed the latency requirement but introduced a different problem. Eagle Eye IQ's value proposition is built on data governance and compliance. Routing our own operational state through an external network boundary, outside Unity Catalog's scope, into a second security perimeter that enterprise customers would need to independently audit, was a structural contradiction we were not willing to accept.

Databricks Lakebase gave us what we needed: a PostgreSQL-compatible, sub-10ms transactional, serverless database running natively inside the Databricks workspace. Aquila gets its operational state store. Eagle Eye IQ keeps its single security perimeter. There is nothing to audit outside the workspace boundary.

Figure: Eagle Eye IQ Databricks Platform Architecture

The Control Plane: 10 Modules, One Loop

Autonomous remediation at speed means Eagle Eye IQ has to do five things in sequence, without human intervention: Detect, Investigate, Diagnose, Remediate, and Govern. Eagle Eye IQ's control plane is built from 10 modules, each owning a specific step in that sequence. Together they cover the full surface area of enterprise data reliability, from ingestion to governance.

Each step has a Lakebase dependency that is not incidental; it is structural. Lakebase is not doing the remediation. Aquila is. But without Lakebase's transactional guarantees and sub-10ms performance, Aquila cannot coordinate at the speed that makes autonomous remediation meaningfully faster than human remediation.

Detect

DQ Guardian

The quality gate at ingestion. Over 100 predefined rules, automated data profiling, column-level drift detection, and SLA monitoring. Bad data is caught before it reaches a dashboard or a downstream consumer. Results are written immediately to Lakebase, not queued for a batch process, so the UI reflects failures the moment they happen.

ObserveIQ

Real-time pipeline health monitoring. Freshness failures, schema change events, and DQ scores are served live to the UI without spinning up analytical compute. Lakebase holds the operational metadata that makes this possible at sub-10ms without a Spark cluster.

Code Inspector

Quality gates for notebooks and pipelines before they reach production. Catches structural issues, anti-patterns, and policy violations at the code layer, closing the loop before a bad pipeline ever runs.

Agent Lens

AI-specific observability. Hallucinations, PII leaks, token cost overruns, and model drift, tracked and surfaced as first-class observability signals alongside traditional data quality metrics.

Investigate & Diagnose

Lineage Lens

Column-level lineage across the full pipeline graph. When Aquila is investigating a quality failure, Lineage Lens tells it exactly where the problem originated (which transformation, which upstream dataset, which schema change), so the investigation narrows quickly instead of fanning out across the whole graph.

Recon Engine

Source-to-target reconciliation. ETL and migration outcomes are not assumed correct; they are verified mathematically. Row counts, schema fidelity, business rule checks. When a reconciliation fails, Recon Engine feeds the finding directly into Aquila's investigation queue.
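As a rough illustration of the three checks named above (row counts, schema fidelity, and a business rule), the following Python sketch reconciles two small in-memory tables. The function name `reconcile`, the row format, and the sample rule are assumptions made for this example. They are not Recon Engine's actual interface, and reconciliation at volume would run on Spark rather than in plain Python.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def reconcile(source: List[Row], target: List[Row],
              rule: Callable[[Row], bool]) -> Dict[str, bool]:
    """Compare a source and a target table on three independent checks."""
    # Row-count check: the load must not drop or duplicate rows.
    counts_match = len(source) == len(target)

    # Schema-fidelity check: the column names seen on each side must agree.
    source_cols = {col for row in source for col in row}
    target_cols = {col for row in target for col in row}
    schema_match = source_cols == target_cols

    # Business-rule check: every target row must satisfy the rule.
    rule_holds = all(rule(row) for row in target)

    return {
        "row_count": counts_match,
        "schema": schema_match,
        "business_rule": rule_holds,
    }


# Hypothetical sample data: the second target row has a flipped sign.
source = [{"order_id": 1, "amount": 40.0}, {"order_id": 2, "amount": 15.5}]
target = [{"order_id": 1, "amount": 40.0}, {"order_id": 2, "amount": -15.5}]

# Hypothetical rule: order amounts must never be negative.
report = reconcile(source, target, rule=lambda row: row["amount"] >= 0)
print(report)  # {'row_count': True, 'schema': True, 'business_rule': False}
```

A failed entry in a report like this is the kind of finding that would be handed to an investigation queue.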

Remediate

Aquila AI

Eagle Eye IQ's autonomous agent engine. 35+ specialized agents that coordinate through Lakebase's transactional task queue, persist investigation context across handoffs, execute remediation actions, and log every step with full audit evidence. Aquila is the closing mechanism of the loop. Lakebase is the operational layer it runs on.

Govern

Contract Vault

Machine-executable data contracts. Schema expectations, SLA thresholds, quality rules, and access control terms are defined once and enforced continuously. Every breach is written to Lakebase transactionally; the detection and the breach record are committed together or not at all.

Perch Hub

Data product marketplace. Teams discover, share, and govern data products across the organization. Lineage, quality scores, and contract status surface alongside each data product so consumers know what they are working with before they build on it.

Insight Studio

Role-based dashboards for every stakeholder. Executives see business impact. Engineers see pipeline health. Compliance teams see contract status and audit trails. One platform, one version of truth, no reconciliation between tools.

Every module writes its operational state (alerts, breach records, reconciliation results, agent findings, contract versions) to Lakebase. Every module reads its live metadata from Lakebase. The control plane is unified not because the modules share a UI, but because they share a transactional operational database inside the workspace boundary.
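The "committed together or not at all" behavior described for Contract Vault can be sketched with any transactional SQL database. The example below uses Python's built-in `sqlite3` module purely as a local stand-in so that it runs anywhere. It is not Lakebase, and the table and column names (`check_results`, `contract_breaches`) are assumptions made for illustration.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables; the real Eagle Eye IQ schema is not shown in this post.
    conn.execute(
        "CREATE TABLE check_results (contract_id TEXT, check_name TEXT, passed INTEGER)")
    conn.execute(
        "CREATE TABLE contract_breaches (contract_id TEXT, detail TEXT)")


def record_breach(conn, contract_id, check_name, detail, crash_midway=False):
    """Write the failed check and its breach record in one transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "INSERT INTO check_results VALUES (?, ?, 0)", (contract_id, check_name))
        if crash_midway:
            raise RuntimeError("simulated failure between the two writes")
        conn.execute(
            "INSERT INTO contract_breaches VALUES (?, ?)", (contract_id, detail))


def row_counts(conn):
    checks = conn.execute("SELECT COUNT(*) FROM check_results").fetchone()[0]
    breaches = conn.execute("SELECT COUNT(*) FROM contract_breaches").fetchone()[0]
    return checks, breaches


conn = sqlite3.connect(":memory:")
init_schema(conn)

record_breach(conn, "orders_v1", "freshness_sla", "late by 3h")
after_success = row_counts(conn)

try:
    record_breach(conn, "orders_v1", "null_rate", "nulls above limit", crash_midway=True)
except RuntimeError:
    pass
# Unchanged: the half-finished write was rolled back, so no orphaned check row exists.
after_failure = row_counts(conn)

print(after_success, after_failure)  # (1, 1) (1, 1)
```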

The large-scale analytical work (DQ checks across hundreds of millions of rows, column-level lineage computation, source-to-target reconciliation at volume) runs on Spark and Delta Lake. Lakebase handles everything that needs to be instant, transactional, and available without cluster warmup. That split is what delivered a 60% reduction in infrastructure cost on metadata operations compared to our V1 approach, which routed all of it through Spark.

See Eagle Eye IQ in Action: AI-Powered Data Observability Platform | Eagle Eye IQ by Celebal Technologies (YouTube)

Three Lakebase Capabilities Aquila Depends On

Sub-10ms Transactions: The Reason Aquila Can Remediate at Speed

Aquila's parallel agents coordinate through a shared task queue in Lakebase. When a quality incident fires, multiple agents begin working simultaneously, each updating its state dozens of times as the investigation progresses.

For Aquila's coordination to be faster than a human doing the same investigation, the task queue needs to respond in single-digit milliseconds. The moment database latency adds meaningful overhead to each agent state update, autonomous remediation stops being faster than human remediation, and Eagle Eye IQ's central promise breaks down.

The agent_tasks table is the critical path for every Aquila investigation:

-- Aquila agents poll this constantly during active investigations
SELECT *
FROM agent_tasks
WHERE status = 'QUEUED'
  AND agent_type = $1
ORDER BY priority DESC, created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;

The FOR UPDATE SKIP LOCKED pattern is how 35+ agents coordinate on a shared queue without stepping on each other: each agent locks the row it claims, and every other agent skips locked rows instead of waiting on them, so no two agents pick up the same task. It requires genuine PostgreSQL semantics, not something replicable on Delta Lake or approximable with Spark. Lakebase's full PostgreSQL 16 compatibility made this pattern available in production without workarounds.
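One detail worth making explicit: the row lock taken by FOR UPDATE SKIP LOCKED lasts only until the transaction ends, so a worker typically marks the task as claimed in the same transaction that selects it. The Python sketch below shows one common way to do that with a single UPDATE ... RETURNING statement. It assumes a DB-API 2.0 connection whose driver uses `%s` placeholders (as psycopg does), an `id` primary key, and a `'RUNNING'` status value. None of these are confirmed by the query shown above, and this is not necessarily how Aquila claims tasks.

```python
from typing import Any, Optional

# Select one queued task, lock it, and flip its status in a single statement.
# Rows already locked by other agents are skipped instead of waited on.
CLAIM_SQL = """
UPDATE agent_tasks
   SET status = 'RUNNING'
 WHERE id = (
        SELECT id
          FROM agent_tasks
         WHERE status = 'QUEUED'
           AND agent_type = %s
         ORDER BY priority DESC, created_at ASC
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING id
"""


def claim_next_task(conn: Any, agent_type: str) -> Optional[Any]:
    """Claim one queued task for this agent type; return its id, or None if idle."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (agent_type,))
        row = cur.fetchone()
        conn.commit()      # ends the transaction; the task is now marked RUNNING
        return row[0] if row else None
    except Exception:
        conn.rollback()    # undo the status change so the task stays QUEUED
        raise
    finally:
        cur.close()
```

Because the status change and the lock live in the same transaction, an agent that crashes before committing leaves the task queued for another agent.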

JSONB: How Aquila Gets Smarter Without Breaking the Database

Aquila's diagnostic models improve continuously: new failure categories are added, root cause taxonomies are refined, and confidence scoring methods evolve. Every improvement needs to be deployable without a schema migration.

In a normalized relational schema, every structural change to agent output means a migration: new columns, updated foreign keys, coordinated deployments across every customer environment simultaneously. In a multi-tenant production platform, every model improvement becomes a coordinated risk event. Aquila gets smarter more slowly every time a migration stands between improvement and production.

We store all AI-generated outputs (agent investigation results, root cause hypotheses, remediation action details, quality check payloads) in JSONB columns. Structure is owned by the application layer. Aquila updates its taxonomy; the database absorbs the change. No migration, no downtime window, no coordination across environments.
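To make "structure is owned by the application layer" concrete, here is a minimal Python sketch of how an agent finding might be serialized into a JSON document before being written to a JSONB column. The field names, the `schema_version` convention, and both helper functions are hypothetical. The post does not describe Aquila's actual payload format.

```python
import json
from typing import Any, Dict, List


def build_finding(root_cause: str, severity: str,
                  affected_columns: List[str], **extra: Any) -> str:
    """Serialize an agent finding to JSON text suitable for a JSONB column."""
    payload: Dict[str, Any] = {
        "schema_version": 2,  # bumped by the application, not by a migration
        "root_cause": root_cause,
        "severity": severity,
        "affected_columns": affected_columns,
    }
    # New diagnostic fields ride along without any change to the table definition.
    payload.update(extra)
    return json.dumps(payload, sort_keys=True)


def read_finding(raw: str) -> Dict[str, Any]:
    """Parse a stored finding, tolerating older payloads that lack newer fields."""
    doc = json.loads(raw)
    doc.setdefault("confidence", None)  # field added in a later, hypothetical version
    return doc


old_style = build_finding("schema_change", "critical", ["order_total"])
new_style = build_finding("schema_change", "critical", ["order_total"], confidence=0.93)

print(read_finding(old_style)["confidence"])  # None
print(read_finding(new_style)["confidence"])  # 0.93
```

The trade-off is that compatibility between old and new payloads becomes the reader's job, which is why the sketch fills in a default for the missing field.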

The critical thing we validated before committing to this pattern: JSONB in PostgreSQL does not sacrifice queryability for flexibility. Breach detail, root cause classification, and affected column lists are all queryable at performance that meets production requirements:

-- Querying breach detail from JSONB without sacrificing index performance
SELECT contract_id,
       breach_detail->>'root_cause' AS root_cause,
       breach_detail->'affected_columns' AS columns
FROM contract_breaches
WHERE breach_detail->>'severity' = 'critical'
  AND created_at > NOW() - INTERVAL '24 hours';

This pattern (JSONB for structural flexibility, PostgreSQL indexing for query performance) is what allows Eagle Eye IQ to deploy model improvements on a continuous cadence rather than a migration-gated release cycle.
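One caveat when applying this pattern: PostgreSQL can only use an index for a filter such as breach_detail->>'severity' = 'critical' if a matching index exists, for example a B-tree expression index on that exact expression. The Python sketch below pairs such an index with a parameterized version of the breach query. The index name, the `%s` placeholder style (psycopg-like), and the helper function are assumptions. The post does not say which indexes Eagle Eye IQ actually defines.

```python
from typing import Tuple

# Expression index on the extracted severity text plus the timestamp,
# matching the two predicates in the WHERE clause below.
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS idx_breaches_severity_created
    ON contract_breaches ((breach_detail->>'severity'), created_at)
"""

BREACH_QUERY_SQL = """
SELECT contract_id,
       breach_detail->>'root_cause'      AS root_cause,
       breach_detail->'affected_columns' AS columns
  FROM contract_breaches
 WHERE breach_detail->>'severity' = %s
   AND created_at > NOW() - (%s * INTERVAL '1 hour')
"""


def breach_query(severity: str = "critical", hours: int = 24) -> Tuple[str, tuple]:
    """Return the SQL text and bound parameters for a recent-breach lookup."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return BREACH_QUERY_SQL, (severity, hours)


sql, params = breach_query()
print(params)  # ('critical', 24)
```

Passing severity and the time window as bound parameters keeps the query text stable, so the same statement serves every severity level.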

Serverless Scale-to-Zero: The Cost Model That Makes Per-Workspace Isolation Viable

Eagle Eye IQ runs isolated Lakebase instances per customer workspace. Under a traditional managed database model, that would mean N always-on database instances, each accumulating cost regardless of whether any investigation is active. The cost model would make per-workspace isolation impractical at scale.

Lakebase's serverless architecture means idle instances cost nothing. Customers who are not running active investigations are not paying for compute. When an incident fires and Aquila begins coordinating, Lakebase is available immediately. When the investigation closes, the instance scales back to zero.

This is not a minor operational detail. It is what makes the deployment model (one isolated Lakebase instance per customer workspace, inside their security perimeter, governed by their Unity Catalog) economically viable as a product architecture.

Deploying Eagle Eye IQ: One Click, Inside the Perimeter

Eagle Eye IQ ships as a Databricks App: one-click installation into a customer workspace, no external SaaS, no data egress, and no infrastructure provisioning before the first quality check runs. Databricks Apps injects DATABASE_URL at runtime; create_all() bootstraps the schema on first start. The Lakebase connection is a hard dependency: if it fails, nothing else starts. There is no external service to debug because there is no external service.
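The fail-fast startup described above can be sketched in a few lines of Python. This example only covers reading and validating the injected DATABASE_URL. The accepted URL schemes, the `StartupError` class, and the function names are assumptions. The schema bootstrap itself is left as a comment, on the further assumption that `create_all()` refers to SQLAlchemy's `MetaData.create_all`, so that the sketch stays dependency-free.

```python
import os
from urllib.parse import urlsplit


class StartupError(RuntimeError):
    """Raised when the operational database is not configured; the app must not start."""


def load_database_url(env=os.environ) -> str:
    """Read DATABASE_URL and refuse to continue if it is missing or malformed."""
    url = env.get("DATABASE_URL", "").strip()
    if not url:
        raise StartupError("DATABASE_URL is not set; refusing to start")
    parts = urlsplit(url)
    # Assumed schemes: 'postgresql' or a driver-qualified form like 'postgresql+psycopg'.
    if not parts.scheme.startswith("postgres") or not parts.hostname:
        raise StartupError("DATABASE_URL is not a usable PostgreSQL URL")
    return url


def main(env=os.environ) -> str:
    url = load_database_url(env)
    # With SQLAlchemy (assumed), the next steps would be roughly:
    #   engine = create_engine(url)
    #   Base.metadata.create_all(engine)   # bootstrap the schema on first start
    return url


# Hypothetical URL, for demonstration only.
print(main({"DATABASE_URL": "postgresql://app@db.internal:5432/eagleeye"}))
```

Raising before any other component initializes mirrors the hard-dependency behavior: a missing or invalid connection string stops startup rather than surfacing later as a partial failure.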

For enterprise procurement this matters directly. When a security team asks whether data leaves the environment, the answer is no, not as a policy claim but as an architectural fact. Unity Catalog governs everything inside the workspace. There is no second perimeter to audit, no external endpoint to scope into a data processing agreement, no third-party vendor holding operational records.

What Eagle Eye IQ Delivers in Production

Moving Eagle Eye IQ's operational layer to Lakebase, our V2 architecture, produced three measurable changes:

60% reduction in infrastructure cost on metadata operations

The always-on Spark cluster for metadata queries is gone, replaced by Lakebase's serverless model that costs nothing when idle.

Sub-10ms agent state operations

Aquila's task queue operations that previously required cluster compute now run as direct PostgreSQL queries against Lakebase, removing the latency bottleneck that prevented parallel agent coordination at speed.

85% fewer quality incidents reaching production

DQ Guardian runs at ingestion with results written immediately to Lakebase, not queued for a batch process that runs after data has already propagated downstream.

Industry Case Studies: Where Data Trust Pays Off

Eagle Eye IQ is horizontal: it is the trust layer on top of any Databricks estate, but the cost of untrustworthy data is industry-specific. The pattern is consistent: the larger and more regulated the estate, the higher the return. Celebal's Databricks footprint maps directly onto the estates Eagle Eye IQ observes.

Banking & Financial Services: Model risk, compliance evidence, governed platforms

Governed platforms and model-risk regimes demand provable data quality and lineage, which are the regulatory and CFO use cases in the Databricks Banking map. Contract Vault, DQ Guardian, and Lineage Lens enforce data contracts and prove lineage, while ObserveIQ keeps compute cost under control. Celebal has delivered Unity Catalog governance and platform migrations on Databricks for sovereign investment authorities, NBFCs, and insurers, where governed estates and compliance evidence have to be a query, not a reconstruction.

Life Sciences & Healthcare: Regulated pipelines (GxP, ALCOA+, GDPR)

Integrating clinical, safety, regulatory, supply-chain, and ERP systems introduces schema drift, incomplete audit trails, delayed anomaly detection, and rising cloud cost across regulated domains. Eagle Eye IQ delivered end-to-end observability and regulatory-aligned data-quality validation for a global pharmaceutical organization running 12 production sites and 250+ pipelines across MES, LIMS, and ERP, using DQ Guardian for automated checks and anomaly detection, ObserveIQ for cost and FinOps observability, and Aquila's multi-agent engine for autonomous investigation and remediation.

Public Sector: Audit-ready governed platforms

Governed public-sector platforms must be auditable and reliable end-to-end. Eagle Eye IQ supplies lineage, contracts, and audit logs aligned to Unity Catalog. Celebal's public-sector Databricks footprint includes energy regulators, grid operators and ISOs, and health departments.

Media & Entertainment and Digital Natives: Event-ingestion quality and SaaS governance

Client-event ingestion runs at massive scale, and SaaS data products must be both trustworthy and shareable. Contract Vault enforces data-product contracts and, together with Perch Hub, provides governance, while Agent Lens observes AI workloads for hallucination, PII, and cost. Celebal's footprint spans game publishers, OTT streamers, audience-measurement platforms, and SaaS scale-ups, all built on the event-ingestion and Unity Catalog governance workloads Eagle Eye IQ hardens.

Manufacturing & Energy: Pipeline reliability and FinOps at scale

Hundreds of pipelines, idle or oversized clusters, and late issue detection drive both risk and cost. ObserveIQ and DQ Guardian deliver pipeline health and cost observability while Aquila remediates autonomously. Celebal runs vast Databricks and Unity Catalog migration estates across semiconductors, oil & gas, steel, and HVAC, and has delivered explicit FinOps cost-transparency work for an HVAC manufacturer.

Retail & CPG: Data quality across the customer-data backbone

Customer and identity data feeds both personalization and supply chain, so bad data corrupts both. DQ Guardian validates at ingestion and Contract Vault enforces producer-consumer agreements across domains. Celebal runs governed Databricks customer-data estates for grocery, fashion, and discount retailers.

It Was Never a Product Problem

The most important thing building Eagle Eye IQ taught us is that autonomous remediation at speed is not a product problem. It is an infrastructure problem.

The reason detection-only tools dominate the market is not that nobody thought to add remediation. It is that building a system like Aquila, one that coordinates 35+ agents in parallel, acts on live data, and closes incidents without human intervention, requires a transactional operational database with sub-10ms performance inside the same governed environment as the analytical platform. Before Lakebase, Databricks customers had to choose between keeping everything in the workspace and having the transactional performance that agent coordination demands. Lakebase removed that trade-off.

PostgreSQL compatibility turned out to matter in more specific ways than we anticipated. The FOR UPDATE SKIP LOCKED pattern for agent queue coordination, JSONB for flexible AI output schemas, and ACID transactions for contract breach recording are not generic database features. They are specific PostgreSQL capabilities Aquila depends on in production. Lakebase's full PostgreSQL 16 compatibility is not a convenience. It is a technical prerequisite.

The third thing we did not fully anticipate is what workspace-native deployment does for enterprise conversations. When the operational database lives inside the customer's Databricks workspace, governed by Unity Catalog, with no external endpoints, you are not asking a security team to accept a new perimeter. You are telling them Eagle Eye IQ lives inside the one they already manage.

That is a different conversation. Aquila closes the loop on data quality. Lakebase is the infrastructure that lets it do so at speed, at scale, and inside the perimeter.

To see how Eagle Eye IQ can bring autonomous remediation to your Databricks environment, reach out at enterprisesales@celebaltech.com