By Santosh Pradhan, MarTech Solutions Architect · Munich, Germany
Every Customer Data Platform has the same silent flaw. A customer updates their phone number in your Shopify storefront at 09:00. At 09:05, a Salesforce full-sync writes the complete contact record, carrying the old phone number but a newer LastModifiedDate. The CDP resolves on the timestamp and overwrites the correct Shopify value with the stale Salesforce one. The customer's phone is now wrong everywhere that matters.
This is record-level timestamp pollution. It is not an edge case. It is the default behaviour of every system that timestamps rows rather than fields. The Golden Record Framework addresses it with a flat attribute history table. AtomicAttributeGraph (AAG) goes further — it models every distinct attribute value as an independent graph node, making field-level truth a structural property of the data model rather than a query-time computation.
The Core Problem, Stated Precisely
Traditional CDPs and even most modern data warehouses associate timestamps with records, not fields. When a record is written, every field in that record inherits the same write timestamp — regardless of when each individual field last actually changed.
Consider what happens during a Salesforce full-sync:
Contact record written at 2026-04-15T09:05:00
Email: santosh@email.com (genuinely updated today)
Phone: +49 111 000 0000 (not changed since 2024, but timestamp says today)
City: Munich (not changed since 2023, but timestamp says today)
A record-timestamped CDP sees this as: "Salesforce has the freshest version of email, phone, and city as of 09:05 today." It wins every conflict. The customer's actual phone — correctly updated in Shopify at 09:00 — gets overwritten.
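The field-level alternative can be sketched in a few lines of plain Python. The data mirrors the example above; the helper function is illustrative, not part of the framework: it resolves each field by its own event time, so the Salesforce full-sync cannot win on a field it did not actually change.

```python
from datetime import datetime

# Hypothetical per-field updates: (field, value, field_event_time, source)
updates = [
    ("phone", "+49 987 654 3210", datetime(2026, 4, 15, 9, 0), "shopify"),
    # The 09:05 Salesforce full-sync re-sends the stale phone; its *field*
    # event time is still 2026-04-10, whatever the record write time says
    ("phone", "+49 111 000 0000", datetime(2026, 4, 10, 14, 22), "salesforce"),
    ("email", "santosh@email.com", datetime(2026, 4, 15, 9, 5), "salesforce"),
]

def resolve_field_level(updates):
    """Pick the value with the newest field event time, per field."""
    best = {}
    for field, value, event_time, source in updates:
        if field not in best or event_time > best[field][1]:
            best[field] = (value, event_time, source)
    return {field: v[0] for field, v in best.items()}

resolved = resolve_field_level(updates)
print(resolved["phone"])   # +49 987 654 3210 — Shopify's value survives
print(resolved["email"])   # santosh@email.com
```

A record-timestamped resolver would compare 09:05 against 09:00 once, for the whole record, and pick Salesforce for every field.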
The Databricks + Hightouch implementation guide mitigates this with an attribute history table and source-specific extraction logic. AAG replaces the row-based history table with a graph model that makes the same guarantee structurally.
The AAG Data Model
AAG uses exactly two Delta tables. graph.nodes stores every entity — customers, companies, assets, and discrete attribute values. graph.relationships stores every directed edge between any two nodes. No column is scoped to a specific entity type; the model is fully generic.
-- graph.nodes: any entity — customer, company, asset, or a discrete attribute value
CREATE TABLE IF NOT EXISTS graph.nodes (
node_id STRING NOT NULL, -- UUID for entity nodes; SHA-256(node_type:value) for value nodes
node_type STRING NOT NULL, -- "customer" | "company" | "asset" | "email" | "phone" | "city" | ...
value STRING, -- NULL for entity nodes; the raw value for attribute-value nodes
checksum STRING, -- SHA-256(value) for value nodes; used for deduplication
created_at TIMESTAMP NOT NULL,
properties MAP<STRING, STRING> -- arbitrary metadata (e.g. company name, asset serial number)
)
USING DELTA
CLUSTER BY (node_type, checksum);
-- graph.relationships: a directed, timestamped edge between any two nodes
CREATE TABLE IF NOT EXISTS graph.relationships (
source_node_id STRING NOT NULL, -- origin of the edge (any node type)
target_node_id STRING NOT NULL, -- destination of the edge (any node type)
rel_type STRING NOT NULL, -- "HAS_EMAIL" | "HAS_PHONE" | "WORKS_AT" | "OWNS" | ...
updated_at TIMESTAMP NOT NULL, -- when THIS edge last changed — field-level, not record-level
confidence FLOAT NOT NULL DEFAULT 1.0,
is_current BOOLEAN NOT NULL,
source_system STRING,
source_event_id STRING,
properties MAP<STRING, STRING> -- arbitrary edge metadata
)
USING DELTA
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
CLUSTER BY (source_node_id, rel_type);
Two properties define the model's behaviour:
Value nodes are shared across any entity. If 50,000 customers and 200 companies all list "Munich" as their city, there is one node_type="city" node with a value of "Munich". The checksum is the deduplication key — writing the same value for the same type is always idempotent. This is not an optimisation. It is what makes cross-entity identity resolution possible: two customers sharing a phone number share a node, and the graph edge between them is visible.
Relationships carry the field's own timestamp. The updated_at on an edge is the event time for that specific field change — not the write time of the source record that carried it. A Salesforce full-sync does not update the timestamp of phone or city edges if those values have not changed.
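The deterministic value-node id from the table definition (SHA-256 over node_type:value) is what makes writes idempotent. A minimal sketch, using only the hashing scheme stated in the schema comments:

```python
import hashlib

def value_node_id(node_type: str, value: str) -> str:
    """Deterministic node_id for a value node: SHA-256 over 'node_type:value'.

    Writing the same (type, value) pair from any source, for any entity,
    always produces the same node id, so the write is idempotent by
    construction.
    """
    return hashlib.sha256(f"{node_type}:{value}".encode("utf-8")).hexdigest()

# 50,000 customers listing "Munich" all point at the same node
assert value_node_id("city", "Munich") == value_node_id("city", "Munich")

# The same raw string under a different node_type is a different node
assert value_node_id("city", "Munich") != value_node_id("company", "Munich")
```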
Any Entity, Any Relationship
Because neither table has a customer_id column, the model represents any domain. A B2B customer graph, for example:
CustomerNode(C123) ──HAS_EMAIL──► ValueNode("santosh@email.com")
                   ──HAS_PHONE──► ValueNode("+49 987 654 3210")
                   ──WORKS_AT──► CompanyNode("ZEISS")
                                    │
                                    ├──OWNS──► AssetNode("AEM License #4821")
                                    ├──OWNS──► AssetNode("Salesforce Org #EU-09")
                                    └──HAS_ADDRESS──► ValueNode("Oberkochen, DE")
Traversing "which assets does this customer's company own?" is a two-hop walk across graph.relationships — no schema change, no new table, no join to a separate asset store. The same two tables express the entire enterprise knowledge graph.
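The two-hop walk can be sketched in plain Python over a toy in-memory edge list. The data and helper below are illustrative, not part of the framework; they show only the traversal shape:

```python
from collections import defaultdict

# Toy edge list mirroring graph.relationships: (source, rel_type, target)
edges = [
    ("C123", "WORKS_AT", "ZEISS"),
    ("ZEISS", "OWNS", "AEM License #4821"),
    ("ZEISS", "OWNS", "Salesforce Org #EU-09"),
    ("C123", "HAS_CITY", "Munich"),
]

# Index edges by (source, rel_type) for O(1) hops
adjacency = defaultdict(list)
for src, rel, dst in edges:
    adjacency[(src, rel)].append(dst)

def assets_of_employer(customer_id):
    """Two-hop walk: customer --WORKS_AT--> company --OWNS--> asset."""
    return [
        asset
        for company in adjacency[(customer_id, "WORKS_AT")]
        for asset in adjacency[(company, "OWNS")]
    ]

print(assets_of_employer("C123"))
# ['AEM License #4821', 'Salesforce Org #EU-09']
```

Adding a new entity type means adding rows to `edges`, never changing the index or the walk, which is the same property the two-table Delta model has.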
How a Customer Profile Resolves
For the flat Customer 360 use case — resolving the best known value per attribute for a customer — the resolution query starts from the customer node and walks one hop to attribute value nodes:
-- Resolved attributes for a single customer
SELECT
src.node_id AS customer_id,
r.rel_type AS attribute_type,
tgt.value AS attribute_value,
r.source_system,
r.updated_at AS field_updated_at,
r.confidence
FROM (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY source_node_id, rel_type
ORDER BY updated_at DESC, confidence DESC
) AS rn
FROM graph.relationships
WHERE is_current = TRUE
) r
JOIN graph.nodes src ON r.source_node_id = src.node_id AND src.node_type = 'customer'
JOIN graph.nodes tgt ON r.target_node_id = tgt.node_id
WHERE r.rn = 1;
The node_type = 'customer' filter on the source node scopes resolution to customer-originated edges. Removing that filter returns resolved attributes across all entity types — the same query, different scope.
Returning to the opening example — the timestamp conflict between Salesforce and Shopify — this is what the graph holds:
CustomerNode(C123)
──HAS_EMAIL @ 2026-04-15T09:05 (salesforce) ──► ValueNode("santosh@email.com")
──HAS_PHONE @ 2026-04-10T14:22 (salesforce) ──► ValueNode("+49 111 000 0000")
──HAS_PHONE @ 2026-04-15T09:00 (shopify) ──► ValueNode("+49 987 654 3210")
Email resolves to the Salesforce node (09:05). Phone resolves to the Shopify node (09:00) — Shopify's field-level timestamp is newer than Salesforce's field-level timestamp for that specific edge, even though the Salesforce record timestamp is later.
Multi-Hop Traversal
The real power of the generic model shows in queries that span entity types. "Which customers work at companies that own an AEM license?" — a common account-based marketing query — is a two-hop traversal with no extra tables:
-- Two-hop: customer → company → asset
SELECT DISTINCT
r1.source_node_id AS customer_id
FROM graph.relationships r1
JOIN graph.relationships r2
ON r1.target_node_id = r2.source_node_id
JOIN graph.nodes asset
ON r2.target_node_id = asset.node_id
WHERE r1.rel_type = 'WORKS_AT'
AND r2.rel_type = 'OWNS'
AND asset.node_type = 'asset'
AND asset.properties['product'] = 'Adobe Experience Manager'
AND r1.is_current = TRUE
AND r2.is_current = TRUE;
Extend to three hops and the same pattern surfaces "customers whose company's AEM instance was deployed in the last 90 days" — without any schema change. The graph grows by adding nodes and edges, never by altering tables.
Architecture: Four Layers, Two Tables
BRONZE Raw events — API ingest, EventHub, Synapse, Fivetran landing zone
↓ DLT Streaming (APPLY CHANGES INTO)
SILVER graph.nodes + graph.relationships (the entire graph lives here)
↓ GraphFrames / PuppyGraph for traversal and identity algorithms
GOLD Materialised views scoped per use case (customer_profile, company_assets, ...)
↓ Serverless SQL endpoint
SERVING Hightouch reverse ETL · BI tools · Chatbots · Model Serving REST APIs
The Silver layer is deliberately narrow: two tables regardless of how many entity types or source systems exist. Every source feeds the same ingestion contract — a normalised event with a source_node_id, target_node_id, rel_type, and updated_at. Gold views are projections over this graph scoped to specific consumption patterns.
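A sketch of the ingestion contract in practice: a hypothetical normaliser that diffs an incoming Salesforce record against the last known values and emits one graph event per field that genuinely changed. The `FIELD_TO_REL` mapping and `normalise` helper are assumptions for illustration, not part of the published pipeline:

```python
from datetime import datetime

# Hypothetical mapping from flat source fields to edge types
FIELD_TO_REL = {"email": "HAS_EMAIL", "phone": "HAS_PHONE", "city": "HAS_CITY"}

def normalise(record, last_known, record_written_at, source_system):
    """Emit one graph event per field whose value actually changed.

    Unchanged fields produce no event, so a full-sync cannot refresh
    their edge timestamps. This is the structural fix for record-level
    timestamp pollution.
    """
    events = []
    for field, rel_type in FIELD_TO_REL.items():
        value = record.get(field)
        if value is None or value == last_known.get(field):
            continue  # no event, no timestamp pollution for this field
        events.append({
            "source_node_id": record["customer_id"],
            "source_node_type": "customer",
            "target_node_id": f"{field}:{value}",  # simplified; real ids are hashed
            "target_node_type": field,
            "rel_type": rel_type,
            "value": value,
            "updated_at": record_written_at,
            "source_system": source_system,
        })
    return events

# Salesforce full-sync at 09:05: only email actually changed
events = normalise(
    {"customer_id": "C123", "email": "santosh@email.com", "phone": "+49 111 000 0000"},
    last_known={"email": "old@email.com", "phone": "+49 111 000 0000"},
    record_written_at=datetime(2026, 4, 15, 9, 5),
    source_system="salesforce",
)
print([e["rel_type"] for e in events])   # ['HAS_EMAIL'] — phone emits nothing
```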
Ingestion: DLT Streaming with APPLY CHANGES
The normalised ingestion contract feeds two parallel DLT pipelines. The node pipeline deduplicates value nodes by checksum; the relationship pipeline uses APPLY CHANGES INTO with SCD Type 2 to maintain full edge history.
import dlt
from pyspark.sql import functions as F
@dlt.table(name="staging_graph_events")
def staging_graph_events():
"""
Normalised graph events from all sources.
Schema: source_node_id, source_node_type, target_node_id, target_node_type,
rel_type, value (for value nodes), updated_at, confidence,
source_system, source_event_id, properties
"""
return spark.readStream.table("bronze.graph_events_unified")
@dlt.table(name="graph_nodes")
def graph_nodes():
    """
    Deduplicated nodes. Entity nodes (customer, company, asset) carry a NULL
    value and deduplicate on their own node_id. Value nodes (email, phone,
    city, ...) deduplicate by (node_type, checksum). Both ends of every event
    are materialised, so the entity nodes the resolution joins expect exist.
    """
    events = dlt.read_stream("staging_graph_events")
    # Source side: entity nodes. checksum falls back to node_id so that
    # NULL values never collapse distinct entities into one row.
    entity_nodes = events.selectExpr(
        "source_node_id AS node_id",
        "source_node_type AS node_type",
        "CAST(NULL AS STRING) AS value",
        "source_node_id AS checksum",
        "updated_at AS created_at",
        "properties",
    )
    # Target side: attribute value nodes, plus entity targets such as companies
    value_nodes = events.selectExpr(
        "target_node_id AS node_id",
        "target_node_type AS node_type",
        "value",
        "coalesce(sha2(value, 256), target_node_id) AS checksum",
        "updated_at AS created_at",
        "properties",
    )
    return entity_nodes.unionByName(value_nodes).dropDuplicates(["node_type", "checksum"])
# Relationship edges — SCD Type 2 preserves full history.
# apply_changes requires a DLT-managed streaming target, declared first.
dlt.create_streaming_table("graph_relationships")

dlt.apply_changes(
    target = "graph_relationships",
    source = "staging_graph_events",
    keys = ["source_node_id", "target_node_id", "rel_type", "source_system"],
    sequence_by = "updated_at",
    stored_as_scd_type = 2
)
APPLY CHANGES with SCD Type 2 closes old edges and inserts new ones automatically, using updated_at as the sequence column; validity lands in the generated __START_AT and __END_AT columns, from which an is_current flag (__END_AT IS NULL) can be derived. Late-arriving events with an earlier timestamp are correctly sequenced without manual MERGE logic.
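The sequencing guarantee is easy to state in plain Python. The `current_edge` helper below is a toy illustration of the property, not the DLT implementation: the current edge per key is determined by the sequence column, so arrival order is irrelevant.

```python
from datetime import datetime

def current_edge(events):
    """The current edge is the one with the greatest sequence value
    (updated_at), regardless of the order in which events arrived."""
    return max(events, key=lambda e: e["updated_at"])

on_time = [
    {"value": "+49 987 654 3210", "updated_at": datetime(2026, 4, 15, 9, 0)},
    {"value": "+49 111 000 0000", "updated_at": datetime(2026, 4, 10, 14, 22)},
]
late = list(reversed(on_time))  # same events, opposite arrival order

assert current_edge(on_time) == current_edge(late)
print(current_edge(late)["value"])  # +49 987 654 3210
```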
Identity Resolution via Graph Algorithms
Shared value nodes are natural identity bridges. Two customer nodes that both hold a HAS_EMAIL edge to the same value node are candidates for the same person. GraphFrames makes this traversal explicit without a separate identity resolution service.
from graphframes import GraphFrame
from pyspark.sql import functions as F
# All nodes (entity + value) become graph vertices
vertices = spark.table("graph.nodes") \
.select(F.col("node_id").alias("id"), "node_type", "value")
# All current edges become graph edges
edges = spark.table("graph.relationships") \
.filter("is_current = TRUE") \
.select(
F.col("source_node_id").alias("src"),
F.col("target_node_id").alias("dst"),
"rel_type",
"confidence"
)
# connectedComponents() checkpoints between iterations; a checkpoint
# directory must be set first (path here is illustrative)
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

g = GraphFrame(vertices, edges)
# Connected components: entity nodes linked through shared value nodes
# are identity candidates — no separate entity resolution service needed
components = g.connectedComponents()
# Filter to customer nodes only to get identity clusters
customer_clusters = components \
    .join(vertices.filter("node_type = 'customer'"), "id") \
    .select("id", "component")
Where TigerGraph's enterprise graph analytics requires a dedicated graph database, AAG achieves the same traversal on the lakehouse — no additional infrastructure, same Unity Catalog governance boundary.
PuppyGraph is the right integration point if your team needs Gremlin or openCypher query expressiveness — it presents a graph query interface directly over the two Delta tables without copying data.
GDPR and Field-Level Consent
Right-to-erasure and consent expiration operate on edges, not nodes. Invalidating a customer's relationships removes them from all resolution views without touching the shared value nodes that other entities still reference.
-- Right to erasure: invalidate all edges originating from this customer node
UPDATE graph.relationships
SET is_current = FALSE, updated_at = CURRENT_TIMESTAMP()
WHERE source_node_id = 'C123';
-- Field-level consent expiration across all customer nodes
UPDATE graph.relationships
SET is_current = FALSE
WHERE rel_type = 'HAS_EMAIL_CONSENT'
AND updated_at < DATEADD(YEAR, -2, CURRENT_TIMESTAMP())
AND is_current = TRUE;
-- Shared value nodes are untouched — a "Munich" city node used by
-- 10,000 customers is not affected by erasing one customer's edges
Gold Layer and Activation
The Gold layer is a set of scoped views over the graph — one per consumption pattern. The customer profile view pivots one-hop attribute edges into a flat row:
-- Gold view: flat customer profile (one-hop from customer nodes to attribute value nodes)
CREATE OR REPLACE VIEW graph.gold.customer_profile AS
SELECT
src.node_id AS customer_id,
MAX(CASE WHEN r.rel_type = 'HAS_EMAIL' THEN tgt.value END) AS email,
MAX(CASE WHEN r.rel_type = 'HAS_EMAIL' THEN r.source_system END) AS email_source,
MAX(CASE WHEN r.rel_type = 'HAS_EMAIL' THEN r.updated_at END) AS email_updated_at,
MAX(CASE WHEN r.rel_type = 'HAS_FIRST_NAME' THEN tgt.value END) AS first_name,
MAX(CASE WHEN r.rel_type = 'HAS_LAST_NAME' THEN tgt.value END) AS last_name,
MAX(CASE WHEN r.rel_type = 'HAS_PHONE' THEN tgt.value END) AS phone,
MAX(CASE WHEN r.rel_type = 'HAS_PHONE' THEN r.source_system END) AS phone_source,
MAX(CASE WHEN r.rel_type = 'HAS_CITY' THEN tgt.value END) AS city,
MAX(CASE WHEN r.rel_type = 'HAS_COUNTRY' THEN tgt.value END) AS country,
MAX(CASE WHEN r.rel_type = 'HAS_EMAIL_CONSENT' THEN tgt.value END) AS email_consent,
MAX(CASE WHEN r.rel_type = 'HAS_EMAIL_CONSENT' THEN r.updated_at END) AS email_consent_updated_at,
CURRENT_TIMESTAMP() AS profile_resolved_at
FROM (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY source_node_id, rel_type
ORDER BY updated_at DESC, confidence DESC
) AS rn
FROM graph.relationships
WHERE is_current = TRUE
) r
JOIN graph.nodes src ON r.source_node_id = src.node_id AND src.node_type = 'customer'
JOIN graph.nodes tgt ON r.target_node_id = tgt.node_id
WHERE r.rn = 1
GROUP BY src.node_id;
A second Gold view can scope to company assets without touching the customer profile view:
-- Gold view: company asset inventory (two-hop)
CREATE OR REPLACE VIEW graph.gold.company_assets AS
SELECT
co.node_id AS company_id,
co.properties['name'] AS company_name,
a.node_id AS asset_id,
a.properties['product'] AS product,
a.properties['license_type'] AS license_type,
r2.updated_at AS asset_linked_at
FROM graph.relationships r1
JOIN graph.nodes co ON r1.source_node_id = co.node_id AND co.node_type = 'company'
JOIN graph.relationships r2 ON r1.target_node_id = r2.source_node_id
JOIN graph.nodes a ON r2.target_node_id = a.node_id AND a.node_type = 'asset'
WHERE r1.rel_type = 'WORKS_AT'
AND r2.rel_type = 'OWNS'
AND r1.is_current = TRUE
AND r2.is_current = TRUE;
Both Gold views are projections over the same two Silver tables. Hightouch, BI tools, and Model Serving REST endpoints consume the Gold views without awareness of the graph beneath them.
Performance at Scale
| Operation | Latency (10M customers) | Notes |
|---|---|---|
| Single profile lookup (Gold view) | 5ms | Clustered on source_node_id; Photon vectorised aggregation |
| 10K customer bulk export | 800ms | Serverless SQL endpoint against materialised Gold table |
| Graph identity resolution (connected components) | ~2s | GraphFrames on Photon-enabled cluster |
| DLT incremental ingest (100K events) | <30s | Streaming micro-batch, APPLY CHANGES |
| Field-level erasure (GDPR) | <1s | UPDATE on source_node_id — no node table scan |
Key optimisations: cluster graph.relationships on (source_node_id, rel_type); add a Bloom filter index on target_node_id for join-heavy traversal queries; run OPTIMIZE weekly on both tables so liquid clustering keeps data co-located (ZORDER BY does not apply to tables defined with CLUSTER BY); materialise the Gold customer profile as a Delta table (not just a view) and refresh it after each DLT pipeline run for single-digit-millisecond Hightouch query latency.
Why Not a Dedicated Graph Database?
TigerGraph, Neo4j, and Amazon Neptune all handle graph traversal well. The operational trade-off is a separate ingestion pipeline, a separate governance perimeter, and a separate operations team. For most CDP workloads — scheduled identity resolution, segmentation, reverse ETL activation — that complexity is not justified.
PuppyGraph eliminates the trade-off: it presents a Gremlin/openCypher interface directly over Delta tables, with no data movement. If your team needs multi-hop graph query expressiveness without a dedicated store, PuppyGraph over AAG's two Silver tables is the correct integration.
How AAG Extends the Golden Record Framework
The Golden Record Framework uses a flat attribute history table. AAG replaces that table with a generic graph model. The resolution guarantee is identical; the graph model adds three capabilities the flat model cannot provide:
Cross-entity relationships. A flat history table is scoped to one entity type. The graph natively expresses customer → company → asset hierarchies, shared value nodes across entity types, and any relationship that emerges as the domain grows — without schema changes.
Structural identity signals. Shared value nodes are identity bridges visible in the data model itself, not derived by a separate matching service. Two entity nodes sharing a phone value node are connected in the graph — connected components finds them automatically.
GraphRAG readiness. An AI agent traversing "what do we know about this customer, their employer, and their employer's software contracts?" walks the graph natively. A flat table must be joined, enriched, and denormalised at query time to answer the same question. The graph is the right foundation for LLM-driven customer intelligence.
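The structural identity signal can be shown with a minimal union-find in plain Python. The node ids below are hypothetical; the point is that two customers sharing one phone value node land in the same component with no matching service involved:

```python
def connected_components(edges):
    """Union-find over undirected edges; returns {node: component_root}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in edges:
        union(a, b)
    return {node: find(node) for node in parent}

# Two customer nodes share one phone value node: an identity bridge
edges = [
    ("C123", "phone:+49 987 654 3210"),
    ("C456", "phone:+49 987 654 3210"),
    ("C789", "phone:+49 555 000 1111"),
]
comp = connected_components(edges)
assert comp["C123"] == comp["C456"]   # same cluster via the shared phone node
assert comp["C123"] != comp["C789"]
```

GraphFrames' connectedComponents does the same thing at scale; the sketch only makes the mechanism visible.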
Summary
AtomicAttributeGraph delivers what record-timestamped CDPs have always promised but never structurally guaranteed: a customer profile where every field carries its own truth, independent of the record that carried it. Two Delta tables. A fully generic node-relationship model where source and target can be any entity type. DLT streaming ingest. GraphFrames for identity resolution. Scoped Gold views that any existing activation tool can consume without modification.
The two-table model is not a simplification — it is the correct representation of a domain where customers relate to companies, companies relate to assets, assets relate to contracts, and every relationship has its own timestamp and confidence. Start with customer attribute resolution. The graph is already ready for everything else.
The full implementation — DLT pipeline, dbt models with versioning, Python ingestion library, and GitHub Actions CI — is open source at github.com/pradhan-is/atomic-attribute-graph. The interactive demo walks through all three scenarios — timestamp conflict, multi-hop B2B traversal, and GDPR erasure — directly in the browser.
Frequently Asked Questions
What is AtomicAttributeGraph (AAG)?
AtomicAttributeGraph (AAG) is a graph-native customer data modelling framework designed by Santosh Pradhan, a MarTech Solutions Architect based in Munich. It represents every distinct attribute value as an independent graph node linked to entities through timestamped edges — each carrying the field's own event time, not the record's write time.
How does AAG solve timestamp pollution in customer data platforms?
Traditional CDPs timestamp entire records. When a source system writes a full record, every field inherits the same write timestamp regardless of when each field last changed. AAG stores a separate timestamped edge for every field-value pair. Conflict resolution compares edge timestamps per field independently, so a newer record timestamp in Salesforce cannot overwrite a genuinely newer field value from Shopify.
What infrastructure does AAG require?
AAG requires two Delta tables on Databricks with Unity Catalog: graph.nodes and graph.relationships. The ingestion layer uses Delta Live Tables (DLT) with APPLY CHANGES INTO for SCD Type 2 history. dbt transforms the Silver graph into Gold customer profiles. The full open-source implementation is at github.com/pradhan-is/atomic-attribute-graph.
How is AAG different from a traditional golden record approach?
A traditional golden record uses an attribute history table — one row per customer-field-source combination. AAG replaces this flat table with a labelled property graph: value nodes are deduplicated by content hash across all customers and sources, and resolution is a graph traversal rather than a window function over a wide table. This also enables multi-hop entity traversal (customer → company → asset) with no schema change.
Can AAG model B2B entities beyond individual customers?
Yes. Neither graph table has a customer_id column. Any node — customer, company, asset, product — can be a source or target of any edge. A customer linked to a company via WORKS_AT and that company linked to assets via OWNS uses the same two tables and the same resolution logic as resolving a customer's email address.