
The Golden Record Framework: Field-Level Identity Resolution for Customer 360

Santosh Pradhan · April 14, 2026

By Santosh Pradhan, MarTech Solutions Architect · Munich, Germany

Most Customer 360 implementations contain a silent flaw. Data engineers build a merge pipeline that joins customer records from CRM, commerce, support, and web — and they resolve conflicts by picking the most recently updated record. It seems logical. It is wrong. A record updated two minutes ago may carry an email address that was last changed eighteen months ago, overwriting a more recent update from a different system. The problem is not data quality. It is resolution granularity. Record-level timestamps are not a proxy for field-level freshness.

The Golden Record Framework is a structured approach to solving this at the attribute level. It defines how to ingest, model, and resolve customer data so that each field in the customer profile reflects the most authoritative, most recent value from any source — independently of every other field. The framework is implementable on any modern data platform and maps cleanly onto Databricks Delta Lake, but the patterns apply equally to Snowflake, BigQuery, and on-premises warehouses.

Why Record-Level Timestamps Fail

Every source system publishes a record_updated_at timestamp. That timestamp tells you one thing: that something in the row changed at that point in time. It does not tell you which field changed, by how much, or whether the change was to a field you care about. In enterprise environments, a CRM record may be touched dozens of times a day — by a sync process updating a metadata field, by a sales rep logging a call, by an integration writing a status code. Each of those touches updates record_updated_at. None of them necessarily change the customer's email address, phone number, or consent status.

The consequence is predictable. Imagine a customer who updated their email address via a self-service portal at 09:00. At 09:15, a CRM integration sync writes back a field called last_campaign_sent, updating the record's timestamp in the process. Your merge pipeline runs at 10:00 and sees two versions of the customer record: the portal record with the correct email and a 09:00 timestamp, and the CRM record with the stale email and a 09:15 timestamp. The pipeline picks the CRM record. The customer's email update is silently discarded. This is not an edge case. In any organisation with more than three source systems, it is the default outcome of record-level resolution.

The correct mental model is this: each customer attribute is its own change stream. Email address has a history. Phone number has a history. Consent has a history. Preferred channel has a history. Those histories are independent. Any resolution logic that treats them as a single stream attached to a record timestamp is conflating things that should not be conflated.
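The portal-versus-CRM scenario above can be reduced to a few lines of Python; the records, field names, and timestamps here are illustrative, not a prescribed schema:

```python
from datetime import datetime

# Two versions of the same customer record, as seen by the merge pipeline.
portal = {
    "email": "new@example.com",
    "email_updated_at": datetime(2026, 4, 14, 9, 0),    # the real email change
    "record_updated_at": datetime(2026, 4, 14, 9, 0),
}
crm = {
    "email": "old@example.com",
    "email_updated_at": datetime(2024, 10, 1, 12, 0),   # email untouched for months
    "record_updated_at": datetime(2026, 4, 14, 9, 15),  # touched by a sync at 09:15
}

# Record-level resolution: the whole row with the newest record timestamp wins.
record_level = max([portal, crm], key=lambda r: r["record_updated_at"])

# Field-level resolution: each attribute is compared on its own timestamp.
field_level = max([portal, crm], key=lambda r: r["email_updated_at"])

print(record_level["email"])  # old@example.com: the stale CRM value wins
print(field_level["email"])   # new@example.com: the portal update survives
```

Same two rows, opposite answers. The only difference is which timestamp the comparison reads.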

The Golden Record Framework

The Golden Record Framework has four layers: an ingestion contract, a resolution data model, a field-level freshness algorithm, and a serving architecture. Each layer has a defined responsibility. Together they produce a customer profile where every attribute can be traced to its source, its timestamp, and the logic used to select it.

Layer 1 — Ingestion Contract

The ingestion contract defines what every source system must provide, or what your pipeline must reconstruct if the source cannot provide it. There are three source tiers, each requiring a different ingestion strategy.

Tier 1 — Attribute-event sources. These are sources that emit an event per field change: "email address changed from X to Y at time T by user U on system S." Modern event-driven CRMs, preference centres built on event sourcing, and customer portal backends can often provide this natively. When a source provides attribute-level events, capture them directly into a raw attribute events table with five mandatory columns: customer_id, attribute_name, attribute_value, source_event_time, and source_system. Add source_event_id for idempotency.

Tier 2 — Row-change sources (CDC/CDF). Most enterprise systems fall here. They publish a full row whenever anything changes, with a single updated_at timestamp. The correct approach is to use Change Data Feed (CDF) on Delta tables, or source-level CDC via Debezium or equivalent, to reconstruct row version history. Compare consecutive versions of the same row to infer which fields changed. The resulting field-level change events are equivalent to Tier 1 output. This reconstruction step must happen before the data reaches the serving layer — in the bronze-to-silver transition on a medallion architecture, or in the staging layer equivalent.

Tier 3 — Snapshot sources. Some systems provide only periodic full snapshots: a daily file export, a nightly database dump. Snapshot diffing works by comparing the current snapshot against the previous snapshot for each customer entity. Any field that differs between snapshots generates a field-level change event with the snapshot timestamp as the effective time. This is the least precise option — the effective time is the snapshot timestamp, not the actual change time — but it is far more accurate than using record-level timestamps for resolution, and it is the appropriate fallback when no better source is available.
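As a sketch, snapshot diffing is a nested comparison over two keyed dictionaries. The following Python is a minimal illustration with hypothetical names; a production version would also handle deleted customers and removed fields:

```python
def diff_snapshots(previous, current, snapshot_time, source_system):
    """Emit one field-level change event per attribute that differs
    between two snapshots of the same customer entity.

    `previous` and `current` map customer_id -> {attribute: value}.
    The snapshot timestamp stands in for the (unknown) real change time.
    """
    events = []
    for customer_id, row in current.items():
        old_row = previous.get(customer_id, {})
        for attribute, value in row.items():
            if old_row.get(attribute) != value:
                events.append({
                    "customer_id": customer_id,
                    "attribute_name": attribute,
                    "attribute_value": value,
                    "source_event_time": snapshot_time,  # approximate, by design
                    "source_system": source_system,
                })
    return events

# A customer whose phone changed between nightly dumps yields exactly one event.
prev = {"c1": {"email": "a@x.com", "phone": "111"}}
curr = {"c1": {"email": "a@x.com", "phone": "222"}}
events = diff_snapshots(prev, curr, "2026-04-14T02:00:00Z", "erp_dump")
print(events)  # one event: c1 / phone / 222
```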

Layer 2 — Resolution Data Model

The Golden Record Framework uses two tables per customer entity. This separation of concerns is non-negotiable. Mixing current state with history in a single table forces compromises in both query patterns and write logic.

The current profile table contains one row per customer. It holds the latest resolved value for every attribute, the source system that last wrote each attribute, and the timestamp of that source event. It is the serving table — the one your CDP activation layer, campaign engines, and real-time APIs query. It must be fast to read and trivial to join. Schema example:

CREATE TABLE customer_profile (
  customer_id       STRING        NOT NULL,
  email             STRING,
  email_source      STRING,
  email_updated_at  TIMESTAMP,
  phone             STRING,
  phone_source      STRING,
  phone_updated_at  TIMESTAMP,
  consent_email     BOOLEAN,
  consent_source    STRING,
  consent_updated_at TIMESTAMP,
  -- ... additional attributes
  profile_updated_at TIMESTAMP,
  PRIMARY KEY (customer_id)
);

The attribute history table contains one row per field change per customer. This is the source of truth for the resolution algorithm and the audit trail for compliance. It implements the Slowly Changing Dimension Type 2 (SCD-2) pattern at field granularity — not row granularity. Schema:

CREATE TABLE customer_attribute_history (
  customer_id      STRING     NOT NULL,
  attribute_name   STRING     NOT NULL,
  attribute_value  STRING,
  effective_from   TIMESTAMP  NOT NULL,
  effective_to     TIMESTAMP,
  is_current       BOOLEAN    NOT NULL,
  source_system    STRING     NOT NULL,
  source_event_id  STRING,
  change_reason    STRING,
  ingestion_time   TIMESTAMP  NOT NULL
);

The two tables answer two fundamentally different business questions. "What is the customer's current email?" is answered from customer_profile. "What was the customer's email on the 15th of last month?" — relevant for GDPR audits, campaign replay, and dispute resolution — is answered from attribute_history with a point-in-time filter. These questions must not share the same query path.
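The point-in-time question maps onto the effective_from / effective_to interval in the history table. A minimal Python sketch of the lookup, assuming history rows are dicts shaped like the table above:

```python
from datetime import datetime

def value_as_of(history, customer_id, attribute_name, at):
    """Point-in-time lookup against the attribute history table: return the
    value whose validity interval [effective_from, effective_to) covers `at`."""
    for row in history:
        if (row["customer_id"] == customer_id
                and row["attribute_name"] == attribute_name
                and row["effective_from"] <= at
                and (row["effective_to"] is None or at < row["effective_to"])):
            return row["attribute_value"]
    return None  # attribute did not exist yet at that time

history = [
    {"customer_id": "c1", "attribute_name": "email",
     "attribute_value": "old@x.com",
     "effective_from": datetime(2025, 1, 1), "effective_to": datetime(2026, 3, 15)},
    {"customer_id": "c1", "attribute_name": "email",
     "attribute_value": "new@x.com",
     "effective_from": datetime(2026, 3, 15), "effective_to": None},
]

print(value_as_of(history, "c1", "email", datetime(2026, 2, 1)))  # old@x.com
print(value_as_of(history, "c1", "email", datetime(2026, 4, 1)))  # new@x.com
```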

Layer 3 — Field-Level Freshness Algorithm

The resolution algorithm runs after new attribute events land in the history table. Its job is to update the current profile table so that every attribute reflects the most authoritative, most recent value available. The algorithm has four steps.

Step 1 — Partition by customer and attribute. For each combination of customer_id and attribute_name, collect all known events from all source systems.

Step 2 — Order by field event time, descending. Use source_event_time as the primary sort key. This is the timestamp of the business event that caused the field to change — not the ingestion time, not the record update time. Use ingestion_time only as a tiebreaker when two sources report the same field with the same event time.

Step 3 — Apply source priority as a secondary tiebreaker. When timestamps are genuinely equal or unavailable, resolve by source trust rank: a consent management platform beats the CRM, which beats commerce, which beats data enrichment vendors. The source priority table should be a configuration artefact that business and engineering own jointly — not a hardcoded constant in a pipeline.

Step 4 — Select the latest non-null value. Null propagation is a common mistake. If a source sends a null for an attribute that already has a valid value from an earlier, more authoritative source, the null should not overwrite the existing value unless the null is explicitly meaningful (for example, a consent withdrawal where null carries business meaning). Distinguish between "this source does not know the value" (suppress the null) and "this source confirms the value has been removed" (apply the null).
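As a sketch of Steps 1 through 4 together, here is a toy Python resolver. It assumes each event dict carries the history-table columns, a numeric source priority map as described in Step 3, and a change_reason field used to mark meaningful nulls; ISO-8601 timestamp strings compare correctly as plain strings:

```python
def resolve(events, source_priority):
    """Steps 1-4: group by (customer, attribute), order by field event time,
    tiebreak by ingestion time then source trust, and suppress nulls unless
    the source explicitly cleared the field (change_reason set)."""
    groups = {}
    for e in events:  # Step 1: partition by customer and attribute
        groups.setdefault((e["customer_id"], e["attribute_name"]), []).append(e)

    resolved = {}
    for key, evts in groups.items():
        # Steps 2-3: event time desc, ingestion time desc, trust rank asc
        evts.sort(
            key=lambda e: (e["source_event_time"], e["ingestion_time"],
                           -source_priority[e["source_system"]]),
            reverse=True,
        )
        for e in evts:  # Step 4: latest non-null, unless the null is meaningful
            if e["attribute_value"] is not None or e.get("change_reason") == "cleared":
                resolved[key] = e["attribute_value"]
                break
    return resolved

priority = {"consent_platform": 1, "crm": 2, "enrichment": 3}
events = [
    {"customer_id": "c1", "attribute_name": "email", "attribute_value": "a@x.com",
     "source_event_time": "2026-04-14T09:00", "ingestion_time": "2026-04-14T10:00",
     "source_system": "crm"},
    {"customer_id": "c1", "attribute_name": "email", "attribute_value": None,
     "source_event_time": "2026-04-14T09:30", "ingestion_time": "2026-04-14T10:00",
     "source_system": "enrichment"},  # "does not know" null: suppressed
]
print(resolve(events, priority))  # {('c1', 'email'): 'a@x.com'}

cleared = [
    {"customer_id": "c1", "attribute_name": "phone", "attribute_value": None,
     "source_event_time": "2026-04-14T11:00", "ingestion_time": "2026-04-14T11:05",
     "source_system": "crm", "change_reason": "cleared"},  # meaningful null: applied
]
print(resolve(cleared, priority))  # {('c1', 'phone'): None}
```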

The SQL implementation of this algorithm using a window function:

-- Resolve the current value for every attribute of every customer.
-- source_priority_config is the jointly owned configuration table from
-- Step 3: one row per source_system with a numeric trust rank.
WITH ranked AS (
  SELECT
    h.customer_id,
    h.attribute_name,
    h.attribute_value,
    h.source_system,
    h.source_event_time,
    h.ingestion_time,
    ROW_NUMBER() OVER (
      PARTITION BY h.customer_id, h.attribute_name
      ORDER BY
        h.source_event_time  DESC NULLS LAST,
        h.ingestion_time     DESC,
        p.source_priority    ASC   -- lower number = higher trust
    ) AS rn
  FROM customer_attribute_history AS h
  JOIN source_priority_config AS p
    ON h.source_system = p.source_system
  -- Simplification: this filter drops all nulls. Meaningful nulls
  -- (change_reason set, e.g. consent withdrawal) need a separate branch
  -- per Step 4.
  WHERE h.is_current = TRUE
    AND h.attribute_value IS NOT NULL
)
SELECT
  customer_id,
  attribute_name,
  attribute_value,
  source_system,
  source_event_time
FROM ranked
WHERE rn = 1;

This query returns exactly one resolved value per customer per attribute. Pivot the result to produce the flat row that populates customer_profile.
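The pivot step can be sketched as follows, assuming the profile columns follow the attribute, attribute_source, attribute_updated_at naming convention from the schema above:

```python
def pivot_profile(resolved_rows):
    """Turn one (customer_id, attribute_name, value, source, event_time) row
    per attribute into one flat dict per customer for customer_profile."""
    profiles = {}
    for r in resolved_rows:
        p = profiles.setdefault(r["customer_id"], {"customer_id": r["customer_id"]})
        a = r["attribute_name"]
        p[a] = r["attribute_value"]
        p[f"{a}_source"] = r["source_system"]
        p[f"{a}_updated_at"] = r["source_event_time"]
    return list(profiles.values())

rows = [
    {"customer_id": "c1", "attribute_name": "email", "attribute_value": "a@x.com",
     "source_system": "portal", "source_event_time": "2026-04-14T09:00"},
    {"customer_id": "c1", "attribute_name": "phone", "attribute_value": "111",
     "source_system": "crm", "source_event_time": "2026-04-13T17:30"},
]
print(pivot_profile(rows))
# one flat row: customer_id, email, email_source, email_updated_at, phone, ...
```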

Layer 4 — Serving Architecture

The resolved customer profile must be available in at least two modes: batch (for campaign segmentation, BI, and full-profile exports) and real-time or near-real-time (for personalisation APIs, next-best-action engines, and journey triggers). The architecture that serves both modes without duplication is the medallion pattern with a clear gold layer contract.

Bronze ingests raw events and row changes from all source systems with no transformation. It is append-only. Nothing is deleted. This is your audit log.

Silver contains the customer_attribute_history table. All field-change extraction (CDC replay, snapshot diffing, attribute event normalisation) happens in the bronze-to-silver transition. Silver is where the resolution algorithm runs. Silver tables are the source of truth.

Gold contains customer_profile — the resolved, denormalised, serving-ready current state. Gold is what your activation tools connect to. Gold rows are rebuilt from silver on each resolution cycle, not updated in-place, which prevents the corruption that accumulates when multiple processes write to the same profile row.

Databricks Implementation

On Databricks, the framework maps directly onto Delta Lake with three available implementation patterns depending on latency requirements.

Delta Change Data Feed (CDF) is the right choice for Tier 2 sources already on Delta. Enable CDF on source tables with ALTER TABLE ... SET TBLPROPERTIES (delta.enableChangeDataFeed = true) and read the change feed with table_changes() to extract field-level diffs. This avoids full table scans for incremental processing and makes field-change detection efficient at scale.

DLT APPLY CHANGES is the managed path for pipelines built entirely in Delta Live Tables. APPLY CHANGES INTO handles SCD-2 history automatically when you set STORED AS SCD TYPE 2, generating __START_AT and __END_AT validity columns without manual MERGE logic; map them to effective_from and effective_to in a downstream view if you want the framework's column names. This is the lowest-friction implementation for greenfield Customer 360 projects on Databricks.

MERGE-based SCD is the most portable pattern, working on any Delta table and any warehouse that supports MERGE syntax. Write a MERGE that inserts new attribute change events, closes the previous row's effective_to, and sets is_current = FALSE on the superseded version. Trigger this from an Airflow or Databricks Jobs DAG on each ingestion cycle.

-- MERGE pattern for attribute history (Databricks SQL / Delta).
-- A single WHEN MATCHED clause cannot both close the old row and insert
-- the new version, so each changed event is staged twice: once with its
-- real merge key (to close the current row) and once with a NULL key
-- (to insert the new version via WHEN NOT MATCHED).
MERGE INTO customer_attribute_history AS target
USING (
  SELECT e.customer_id AS merge_key, e.* FROM new_attribute_events AS e
  UNION ALL
  SELECT NULL AS merge_key, e.*
  FROM new_attribute_events AS e
  JOIN customer_attribute_history AS h
    ON  h.customer_id     = e.customer_id
    AND h.attribute_name  = e.attribute_name
    AND h.is_current      = TRUE
    AND h.attribute_value <> e.attribute_value
) AS source
  ON  target.customer_id    = source.merge_key
  AND target.attribute_name = source.attribute_name
  AND target.is_current     = TRUE
WHEN MATCHED AND source.attribute_value <> target.attribute_value THEN
  UPDATE SET
    target.is_current   = FALSE,
    target.effective_to = source.source_event_time
WHEN NOT MATCHED THEN
  INSERT (
    customer_id, attribute_name, attribute_value,
    effective_from, effective_to, is_current,
    source_system, source_event_id, ingestion_time
  )
  VALUES (
    source.customer_id, source.attribute_name, source.attribute_value,
    source.source_event_time, NULL, TRUE,
    source.source_system, source.source_event_id, CURRENT_TIMESTAMP()
  );

Aligning the Serving Schema with MACH ODM

The MACH Alliance Open Data Model (ODM), published in April 2025, defines a canonical customer entity schema intended as a shared translation layer across composable stacks — CMS, CRM, CDP, commerce engine, loyalty platform. The customer entity specifies core fields including id, email, firstName, lastName, dateOfBirth, and a nested profile object, with an open extensions namespace for vendor-specific and platform-specific metadata. Field-level timestamps and source attribution are not part of the core spec — they belong in extensions. That is a deliberate design choice that has direct consequences for how the Golden Record Framework connects to a MACH stack.

The alignment problem has two dimensions. The first is naming: your internal attribute_name values in the history table — whatever you called them at ingestion time — may not match ODM canonical field names. The second is shape: your flat customer_profile table does not match the ODM's nested JSON structure, and the source attribution metadata that makes the golden record traceable has nowhere to go in the core ODM fields. Both are solvable without changing the history table or the resolution algorithm. The solution is a two-step mapping strategy applied before the gold layer is exposed.

Step 1 — Align Attribute Names at Ingestion, Not at Serving Time

The cleanest approach is to normalise source field names to ODM canonical names during the bronze-to-silver transition, before events reach the attribute history table. This means maintaining an attribute name registry — a configuration table that maps each source system's field names to the corresponding ODM canonical name. Every field-change event is looked up against this registry on ingestion and written to history with the ODM name as attribute_name.

CREATE TABLE attribute_name_registry (
  source_system      STRING NOT NULL,
  source_field_name  STRING NOT NULL,
  odm_field_name     STRING NOT NULL,  -- e.g. 'email', 'firstName', 'dateOfBirth'
  odm_version        STRING NOT NULL,  -- e.g. 'v1.0'
  PRIMARY KEY (source_system, source_field_name, odm_version)
);

A registry row for Salesforce CRM might map Email → email, FirstName → firstName, Birthdate → dateOfBirth. Mapping at ingestion means the history table is already ODM-aligned. No downstream transformation is needed to know which history rows belong to which ODM field. When a new ODM version ships and renames or restructures a field, you add a new registry row for the new version without touching existing history.

Fields that have no ODM equivalent — internal operational fields, source system metadata, derived scores — are stored with a namespaced name under the extensions prefix: extensions.salesforce.leadScore, extensions.loyalty.tierName. This keeps the history table clean and makes the ODM boundary explicit.
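A minimal Python sketch of the registry lookup at ingestion time, with illustrative registry entries and the namespaced fallback for fields outside the ODM:

```python
# Illustrative registry rows: (source_system, source_field) -> ODM canonical name.
REGISTRY = {
    ("salesforce_crm", "Email"): "email",
    ("salesforce_crm", "FirstName"): "firstName",
    ("salesforce_crm", "Birthdate"): "dateOfBirth",
}

def normalise(event):
    """Rewrite a field-change event to its ODM canonical name on ingestion.
    Fields with no ODM equivalent fall back to a namespaced extensions key."""
    key = (event["source_system"], event["attribute_name"])
    odm_name = REGISTRY.get(key)
    if odm_name is None:
        odm_name = f"extensions.{event['source_system']}.{event['attribute_name']}"
    return {**event, "attribute_name": odm_name}

print(normalise({"source_system": "salesforce_crm", "attribute_name": "Email",
                 "attribute_value": "a@x.com"})["attribute_name"])
# email
print(normalise({"source_system": "salesforce_crm", "attribute_name": "Lead_Score__c",
                 "attribute_value": "87"})["attribute_name"])
# extensions.salesforce_crm.Lead_Score__c
```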

Step 2 — Expose a Versioned ODM Semantic View

The customer_profile table remains a flat, column-per-attribute structure — optimised for the resolution algorithm and for BI tooling. MACH-consuming systems expect the ODM shape: nested JSON, camelCase field names, and source metadata in extensions. Bridge this with a versioned semantic view that maps the internal schema to the ODM contract without touching the underlying table.

-- ODM v1.0 semantic view over customer_profile
CREATE OR REPLACE VIEW customer_profile_odm_v1 AS
SELECT
  customer_id                         AS id,
  STRUCT(
    email                             AS email,
    email_source                      AS source,
    email_updated_at                  AS updatedAt
  )                                   AS emailAddress,
  STRUCT(
    first_name                        AS firstName,
    last_name                         AS lastName,
    date_of_birth                     AS dateOfBirth
  )                                   AS profile,
  -- Source attribution for all fields, packed into extensions.
  -- (A bare MAP, not a MAP wrapped in an unnamed STRUCT field.)
  MAP(
    'email.source',         email_source,
    'email.updatedAt',      CAST(email_updated_at AS STRING),
    'phone.source',         phone_source,
    'phone.updatedAt',      CAST(phone_updated_at AS STRING),
    'consent.source',       consent_source,
    'consent.updatedAt',    CAST(consent_updated_at AS STRING)
  )                                   AS extensions,
  profile_updated_at                  AS updatedAt
FROM customer_profile;

The view is versioned by name — customer_profile_odm_v1, customer_profile_odm_v2. MACH-compliant APIs and tools point to a specific view version. When the ODM specification updates, a new view is created and the consuming API version is updated on its own release cycle. Existing consumers keep reading the old view until they migrate. The customer_profile table itself is never modified to accommodate ODM versioning — it has no awareness of the ODM at all.

Source Attribution in extensions

The ODM extensions namespace is the right home for the golden record's traceability metadata. Each field's source and updatedAt values should be packed into extensions under a consistent key pattern: {odmFieldName}.source and {odmFieldName}.updatedAt. This makes the provenance of every resolved value readable to any MACH-compliant consumer that knows to look in extensions, without polluting the core ODM fields.

The practical consequence is that a downstream personalisation engine or API gateway reading the ODM view gets both the resolved value (emailAddress.email) and the evidence for that value (extensions["email.source"], extensions["email.updatedAt"]) in a single read. No secondary lookup against the history table is needed for the common case. The history table remains available for point-in-time queries and compliance audits.

Schema Evolution Without Breaking the History Table

The most important architectural property of this approach is that the history table is insulated from ODM version changes. ODM v1 to v2 transitions — field renames, structural changes, new required fields — are absorbed entirely in the attribute name registry (new mapping rows) and the semantic view (new view version). The resolution algorithm, the MERGE logic, and the history table schema are unchanged. This is the correct separation of concerns: the history table owns the facts; the view layer owns the presentation contract.

When a new ODM version introduces a field that was never captured by any source system — for example, a new preferredLanguage field added to the ODM — the field simply resolves to null in the semantic view until a source system begins emitting it. No pipeline failure, no schema migration. The attribute name registry is updated to map the new field, and the next ingestion cycle from any source that carries it will populate it correctly.

Common Mistakes and How to Avoid Them

Using ingestion time as a proxy for event time. Ingestion time is when your pipeline processed the event, not when the business change occurred. A CDC batch that runs hourly will assign the same ingestion timestamp to changes that happened at any point during that hour. Use source_event_time from the source payload wherever available. Use ingestion time only as a last-resort tiebreaker.

Overwriting history on re-ingestion. When a source re-sends historical records — for backfills, corrections, or system migrations — pipelines that upsert by customer ID will silently overwrite attribute history. Design ingestion to be idempotent via source_event_id: if the event has already been written to history, skip it. Never overwrite a history row.
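A toy sketch of the idempotency guard, assuming events carry a stable source_event_id:

```python
def ingest_idempotent(history, new_events):
    """Append-only, idempotent write: an event whose source_event_id is
    already in history is skipped, so backfills and re-sends never
    overwrite or duplicate attribute history. Returns rows appended."""
    seen = {row["source_event_id"] for row in history}
    appended = 0
    for e in new_events:
        if e["source_event_id"] in seen:
            continue  # already written: re-delivery, backfill, or replay
        history.append(e)
        seen.add(e["source_event_id"])
        appended += 1
    return appended

history = [{"source_event_id": "evt-1", "attribute_name": "email"}]
batch = [
    {"source_event_id": "evt-1", "attribute_name": "email"},  # re-sent: skipped
    {"source_event_id": "evt-2", "attribute_name": "phone"},  # new: appended
]
print(ingest_idempotent(history, batch))  # 1
print(len(history))                       # 2
```

Replaying the same batch a second time appends nothing, which is the property that makes backfills safe.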

Applying null values indiscriminately. A source that does not capture a field should not emit a null for it. A source that intentionally clears a field — consent withdrawal, address removal — should emit a null with a change_reason. Treat these as different cases in your resolution logic. A pipeline that cannot distinguish them will randomly blank out customer attributes during re-ingestion.

Building the golden record in the serving layer. Some teams skip the attribute history table and run resolution logic directly against source tables at query time. This works at prototype scale and fails at production scale: the resolution query becomes expensive, hard to test, and impossible to audit. The attribute history table is not optional. It is where correctness lives.

A Decision Guide for Source Integration

Not every source requires the same integration approach. Use this guide to select the right pattern:

Does the source emit an event per field change with a field-level timestamp? Yes → Tier 1. Ingest directly into the attribute events table.

Does the source emit full row changes via CDC (Debezium, Fivetran, native CDC)? Yes → Tier 2. Use CDF or CDC replay to extract field diffs in the bronze-to-silver transition.

Does the source provide only periodic snapshots (files, exports, dumps)? Yes → Tier 3. Implement snapshot diffing. Accept that effective timestamps will be approximate.

Does the source provide no history at all — only a current-state API? → Implement periodic polling and treat each poll as a snapshot. The history table accumulates over time and gradually improves freshness as the polling cadence increases.

If This Is the Right Approach, Why Don't Adobe and Salesforce Build It This Way?

It is a fair question. Adobe Experience Platform and Salesforce Data Cloud are the two most heavily funded customer data platforms on the market. Both have large engineering teams and direct access to the problem. If field-level resolution is the correct approach, why do both platforms default to record-level merge, and why do architects still need to build this pattern themselves?

The answer has four parts: acquisition history, performance economics, market timing, and incentive structure. None of them is a permanent condition, but together they explain why the right pattern and the default vendor behaviour have diverged.

Acquisition-Driven Architecture

Neither Adobe nor Salesforce built their customer data capabilities from scratch. Adobe's data layer is a stitched-together stack: Omniture became Adobe Analytics, Demdex became Audience Manager, Marketo was acquired in 2018, and the Experience Platform was assembled on top of all of it. Salesforce Data Cloud runs on top of Sales Cloud, Marketing Cloud (acquired as ExactTarget), Service Cloud, and Commerce Cloud (acquired as Demandware). Each of those products was designed independently, with its own data model and its own notion of what a customer record means.

When you acquire five products with five different customer schemas and need to ship a "unified profile" that works for existing customers of all five, you reach for the fastest available join key — the record. A field-level history table requires agreement on a canonical attribute vocabulary across all acquired systems before you can build it. That is a multi-year data modelling exercise. Record-level merge ships in quarters. The architectural debt was incurred deliberately in exchange for time to market, and it accumulated with every subsequent acquisition.

Real-Time Serving Creates Pressure to Denormalise

A field-level history table is correct but expensive. For a CDP handling 200 million customer profiles with 80 attributes each, changing at a realistic frequency across 10 source systems, the attribute history table accumulates billions of rows. Writes are amplified: every field change in any source produces a new history row and triggers a resolution query. Reads from the history table require a window function across potentially thousands of rows per customer per attribute before producing the flat profile that an activation API needs to return in under 50 milliseconds.

Both Adobe and Salesforce optimise aggressively for that 50-millisecond profile read, because that is what real-time personalisation requires. They materialise a flat, denormalised profile record as the primary serving artefact and update it in-place on each merge event. That flat record is fast to read. The cost is the loss of field-level provenance. The vendors made a deliberate trade: serving speed over resolution correctness. For the majority of their customers at the time those architectural decisions were made, the trade was reasonable.

Record-Level Merge Was Sufficient for the Initial Use Case

The first generation of CDP use cases — audience segmentation, campaign suppression, basic personalisation — did not require field-level freshness. If you are building an audience of "customers who have purchased in the last 30 days," a record-level merge that is occasionally wrong about which email address is most recent causes negligible harm. The segment is still approximately correct. The campaign still reaches most of the right people.

The use cases where field-level correctness becomes critical are more demanding: real-time consent enforcement, regulatory-grade data subject access requests, AI personalisation that depends on precise attribute freshness, and cross-channel journey orchestration where a stale channel preference causes a customer to receive a communication they have explicitly opted out of. These use cases are mainstream now. They were edge cases when Adobe and Salesforce made their core architectural choices.

Complexity Is Commercially Useful

Enterprise software vendors generate significant revenue from professional services engagements that exist specifically because their platforms do not solve the hardest data problems out of the box. The gap between what AEP's merge policies can do and what a sophisticated Customer 360 actually requires is, in practice, filled by a system integrator charging day rates. Both Adobe and Salesforce have large, well-compensated partner ecosystems whose existence depends on that gap remaining. A platform that solved field-level identity resolution cleanly and automatically would reduce the services revenue attached to every enterprise implementation. That is not a conspiracy — it is an ordinary consequence of how enterprise software businesses work.

Both Vendors Are Moving in This Direction

To be precise: neither platform is standing still. Adobe Experience Platform's merge policies now support dataset priority and "last fragment wins" at the field level within a defined source hierarchy, which is a step toward field-level attribution. Salesforce Data Cloud's harmonisation rules allow per-field source priority configuration, which approximates a trust-ranked field resolution. Both are iterating toward the model described in this framework, constrained by the need to maintain backward compatibility with millions of existing customer configurations.

The practical implication is that if you are building on AEP or Data Cloud today, you can approximate the Golden Record Framework by configuring field-level merge policies deliberately, exporting the resolved profile to a Delta or Snowflake layer where you maintain the history table yourself, and using the platform as an activation surface rather than as the source of resolution truth. The pattern is implementable on both platforms — it just requires you to know what you are building toward and why.

What "Golden Record" Actually Means

The term golden record is used loosely — often to mean "the merged profile" without reference to how conflicts were resolved. That is not a golden record. A golden record is a profile in which every attribute is the most authoritative and most recent value available from any source, where "most recent" is measured at field level, not record level, and "most authoritative" is defined by an explicit, versioned source priority configuration.

A golden record has three properties a merged profile does not. It is traceable: you can identify the source system, the source event time, and the resolution logic for every field value. It is auditable: the attribute history table contains the complete change log, satisfying GDPR right-of-access requests without requiring source system queries. And it is correctable: when a source sends incorrect data, the correction flows through the same history table and resolution algorithm, and the corrected value surfaces in the profile on the next resolution cycle without requiring manual intervention.

These properties matter not just for data quality. They matter for trust. When a personalisation engine surfaces the wrong product recommendation, or a campaign sends to an opted-out customer, the audit trail that traces that decision back to a specific field value from a specific source at a specific time is what turns a data incident into a resolvable root cause. Without it, investigations stall and fixes guess.

Build the history table first. The golden record follows from it naturally.

Frequently Asked Questions

What is a golden record in a Customer Data Platform?

A golden record is a single, authoritative customer profile assembled from multiple source systems. It resolves conflicts by determining the most trustworthy value for each attribute independently. A true golden record is traceable — you can identify the source system, event time, and resolution logic for every field value.

Why do record-level timestamps fail for golden record resolution?

Record-level timestamps tell you when a row was last written, not when each field changed. A CRM sync that updates a metadata field stamps the entire record as fresh — including stale email and phone values. When your merge pipeline selects the most recently updated record, it overwrites genuinely newer field values from other systems.

What is field-level resolution and how does it work?

Field-level resolution tracks a separate update timestamp for every individual attribute. When two sources disagree on a field value, the comparison uses the field's own event time, not the record's write time. Tier 1 sources provide native field-level events; Tier 2 sources need CDC row-version diffing at ingestion; Tier 3 sources need snapshot diffing to derive per-field change events.

What is the Golden Record Framework?

The Golden Record Framework, described by Santosh Pradhan (MarTech Solutions Architect, Munich), is a structured methodology for field-level Customer 360 identity resolution. It defines ingestion tiers, an attribute history table, a conflict resolution algorithm (field event time → ingestion time → source priority), and a gold serving layer. The implementation guide covers Databricks Delta Lake via Change Data Feed, DLT APPLY CHANGES, and MERGE-based SCD-2.
