Data Vault · Hubs · Links · Satellites · Raw & Business Vault

Data Vault Modeling for Scalable and Auditable Data Warehouses

I design and implement Data Vault architectures that scale with growing source systems, remain fully auditable end to end, and support parallel loading at every stage. From the vault I derive queryable star schemas, bridging the flexibility of Data Vault with the intuitive clarity of dimensional models.

Positioning

Data Vault is a data warehouse modeling method that excels when many source systems need to be integrated, requirements change frequently, and complete traceability is mandatory. I have been applying Data Vault for many years in large-scale data warehouse projects - in the public sector, in the insurance industry, and in retail - combining the methodology deliberately with classical dimensional models for the analytical layer.

My approach is pragmatic. Data Vault is not an end in itself, and it is not the right choice for every project. Where the method is a good fit, however, it delivers enormous value: it makes a data warehouse extensible without breaking existing structures, it enables highly parallel loading, and it records the origin and load timestamp of every single record. These are precisely the properties that are often decisive in regulated environments.

Over the years I have encountered Data Vault in very different forms - from classical on-premises data warehouses on SQL Server to cloud-based platforms with Delta Lake. That breadth of experience has shown me that the method is not tied to any particular technology; it represents a pattern of thinking: separate the key from the relationship and from the descriptive attributes, persist every delivery in a revision-safe manner, and derive the analytical views from that foundation. This pattern remains stable while the tools underneath keep evolving - a key reason why well-built Data Vault solutions have such a long service life.

The methodology was originated by Dan Linstedt and has developed over the past two decades into an established standard for enterprise-wide data warehouses. Data Vault 2.0 added hash keys, clear conventions for loading, and a bridge to agile development practices. That is exactly the modern flavor I work with: hash keys for parallel loading, a strict separation of Raw Vault and Business Vault, and metadata-driven generation of structures and load processes. This keeps the data warehouse manageable even when the number of sources and tables grows substantially over time.

In my consulting work I am careful not to sell Data Vault as an ideology but as a tool with clear strengths and clear weaknesses. I explain which problems the method solves, what price it carries, and where its limits lie. That honesty matters to me, because a poorly applied Data Vault can harm a project just as much as an over-stretched star schema. The skill lies in deploying the method where it genuinely plays to its strengths.

Core idea: Data Vault strictly separates business keys (hubs), relationships (links), and descriptive, historized attributes (satellites). This separation is the key to flexibility, parallelism, and auditability.

Why Data Vault?

Classical dimensional models following Kimball or normalized models following Inmon are proven approaches, but they run into limits when a data warehouse grows over years, continuously absorbs new sources, and must simultaneously remain revision-safe. Every new source, every new relationship, and every new attribute often requires extensive changes to existing tables in a pure star schema - bringing corresponding testing and regression effort.

Data Vault addresses this problem through a strict separation of structure and content. New sources are connected via additional satellites, new relationships via additional links - without any existing structures needing to change. The model grows additively. This property, known in the Data Vault world as 'auditability and adaptability', is the primary reason I apply the method in long-lived enterprise data warehouses.

  • Extensibility: add new sources and attributes additively without breaking anything existing
  • Auditability: every record carries its origin (record source) and load timestamp
  • Parallelism: hubs, links, and satellites can be loaded independently and concurrently
  • Historization: changes are captured completely and in a revision-safe manner
  • Separation of concerns: raw data in the Raw Vault, business logic in the Business Vault

Honest context matters here: Data Vault produces more tables and more joins than a lean star schema. For a small, stable reporting environment with few sources, that would be overkill. The method shows its strength when complexity, change, and compliance requirements are present - and that is exactly where I recommend and implement it.

Another often decisive advantage is the ability to represent erroneous or delayed deliveries cleanly. Because the Raw Vault records every delivery with a load timestamp and origin, it is possible to reconstruct retrospectively which data were available at any given point in time. Corrections from sources are appended as new versions rather than overwriting the original - the original delivery remains visible. In audits and under regulatory disclosure requirements, this completeness is invaluable because every figure can be traced back to the original delivery with full transparency.

The often-cited tension between Inmon and Kimball also dissolves elegantly with Data Vault. Data Vault is neither one nor the other, but an independent integration layer that can serve both worlds: it integrates and historizes like an Inmon core, and it provides the foundation for any number of Kimball marts. This role as a neutral, long-lived integration layer is the rightful position of Data Vault in a modern data architecture.

The Building Blocks: Hub, Link, Satellite

Data Vault works with three central building blocks. Their clarity is the true strength of the method: each table type has exactly one responsibility, and any model - however complex - can be built from the combination of these three types.

Data Vault core model of hubs, links, and satellites

The Data Vault core model: hubs hold business keys, links connect hubs to express relationships, satellites carry the descriptive, historized attributes.

This strict division of roles is the real trick of the method. In classical models, tables often mix keys, relationships, and attributes together - which makes them hard to change. Data Vault consistently decomposes these three aspects into dedicated table types. As a result, a change to the attributes of a source affects only that source's satellite, a new relationship requires only a new link, and a new business object requires only a new hub. Existing structures remain untouched. It is exactly this locality of change that makes the model so robust against the inevitable evolution of source systems.

Hub - the Business Key

A hub represents a central business object - for example, customer, product, contract, or account. It contains exclusively the business key, a hash key derived from that business key as a surrogate, the load timestamp, and the record source. Deliberately, a hub carries no descriptive attributes: it is the stable, permanent list of all business keys of a given object type that have ever been seen.

Link - the Relationship

A link models a relationship between two or more hubs - for example, 'customer has contract' or 'order contains product'. The link also carries no descriptive attributes, only the hash keys of the participating hubs, its own link hash key, load timestamp, and record source. Relationships in Data Vault are always modeled as many-to-many, which makes the model robust against later changes in cardinality.

Satellite - the Descriptive Attributes

The satellite carries the actual descriptive and changeable attributes of a hub or link - for example, name, address, status, or amount. It is the place of historization: when an attribute changes, a new record with a new load timestamp is appended while the old record is retained. A hash diff makes it possible to detect efficiently whether anything has changed at all. A hub can have multiple satellites - for example, separated by source system or by rate of change.

Distributing attributes across multiple satellites is a deliberate design decision. I often separate by source system so that each source gets its own independently loadable satellite and changes in one source do not affect the others. Another useful split runs along the rate of change: slowly changing master data belong in a different satellite than frequently updating status information. This keeps the history lean, because not the entire attribute set is duplicated with every minor change.

In addition to these three core building blocks, Data Vault defines several useful specializations. Effectivity satellites record the time period during which a relationship was valid - for example, which contract a customer was assigned to at a given time. Multi-active satellites represent multi-valued attributes, such as multiple phone numbers for a single customer. Reference tables hold code lists and master data that do not need to be historized. I apply these extensions selectively, without overloading the model unnecessarily.

Hash Keys and Their Advantages

A central concept of modern Data Vault is the hash key. Instead of sequential surrogate keys that must be assigned centrally, a deterministic hash value is computed from the business key - for example, using SHA-256 via HASHBYTES in SQL Server. This approach offers several decisive advantages for loading.

  • Hash keys can be computed without first consulting a lookup table
  • Hubs, links, and satellites can therefore be loaded fully in parallel and independently
  • The same business key always yields the same hash key across all systems
  • Hash diffs over descriptive attributes detect changes efficiently

The trade-off is higher storage consumption and slightly more compute for hashing. In practice, however, the loading advantages clearly outweigh these costs, especially with large data volumes processed in parallel. Careful normalization of keys prior to hashing is important - consistent casing, trimming of whitespace, and a defined separator between composite keys - so that the same logical business key reliably produces the same hash.

When choosing a hashing algorithm I weigh collision safety against storage requirements. SHA2_256 provides a virtually collision-free 32-byte representation and is my standard in security-critical environments. Where storage and performance are especially critical and the collision risk is professionally acceptable, a shorter algorithm may occasionally be considered. I do not make this decision categorically, but based on data volume, compliance requirements, and the expected lifetime of the data warehouse.

Practical tip: Always compute the hash on a normalized, canonical representation of the business key. Otherwise minor differences such as a trailing space or inconsistent casing produce different hash keys for the same logical key - a data error that is very hard to track down.

Data Model and DDL

What does a Data Vault model look like in practice? The following example shows the DDL for a hub, a satellite, and a link using the customer, contract, and their relationship as the scenario. The structure is intentionally uniform - this uniformity is the foundation for later generation and automation.

The recurring metadata columns are noteworthy. Every table carries a load timestamp and a record source; every satellite additionally carries a hash diff and an optional LoadEndDate. These columns are not decoration - they are the backbone of auditability: for each record they answer where it came from, when it was loaded, and whether it represents the current state. Because these columns have the same name and the same semantics in every table, generic load procedures can be written that process every hub, every link, and every satellite according to the same pattern.

T-SQL · DDL for Hub, Satellite, and Link
-- HUB: holds exclusively the business key + metadata
CREATE TABLE rv.Hub_Customer (
    Customer_HK     binary(32)    NOT NULL,   -- hash key (SHA2_256)
    Customer_BK     varchar(50)   NOT NULL,   -- business key
    LoadDate        datetime2(3)  NOT NULL,   -- load timestamp
    RecordSource    varchar(100)  NOT NULL,   -- origin (audit)
    CONSTRAINT PK_Hub_Customer PRIMARY KEY (Customer_HK)
);

-- SATELLITE: descriptive, historized attributes for the hub
CREATE TABLE rv.Sat_Customer (
    Customer_HK     binary(32)    NOT NULL,
    LoadDate        datetime2(3)  NOT NULL,
    LoadEndDate     datetime2(3)  NULL,       -- open = current version
    HashDiff        binary(32)    NOT NULL,   -- change detection
    RecordSource    varchar(100)  NOT NULL,
    CustomerName    varchar(200)  NULL,
    City            varchar(100)  NULL,
    Segment         varchar(50)   NULL,
    CONSTRAINT PK_Sat_Customer PRIMARY KEY (Customer_HK, LoadDate)
);

-- LINK: relationship between two hubs (customer <-> contract)
CREATE TABLE rv.Lnk_Customer_Contract (
    Cust_Contract_HK binary(32)   NOT NULL,   -- hash over both BKs
    Customer_HK      binary(32)   NOT NULL,
    Contract_HK      binary(32)   NOT NULL,
    LoadDate         datetime2(3) NOT NULL,
    RecordSource     varchar(100) NOT NULL,
    CONSTRAINT PK_Lnk_Customer_Contract PRIMARY KEY (Cust_Contract_HK)
);

The strict uniformity is striking: every hub, every satellite, and every link follows the same blueprint. Precisely this uniformity makes it possible to generate the model and load procedures from metadata later.

The hash keys and the hash diff are computed during loading. The following example shows how hash values are formed from the normalized business key and the descriptive attributes.

T-SQL · Computing Hash Key and Hash Diff
-- Hash key from the normalized business key, hash diff from the attributes.
-- UPPER + TRIM ensure a canonical, stable representation.

SELECT
    HASHBYTES('SHA2_256',
              UPPER(LTRIM(RTRIM(s.CustomerId)))                  ) AS Customer_HK,
    HASHBYTES('SHA2_256',
              CONCAT_WS('|',
                        UPPER(LTRIM(RTRIM(ISNULL(s.CustomerName,'')))),
                        UPPER(LTRIM(RTRIM(ISNULL(s.City,'')))),
                        UPPER(LTRIM(RTRIM(ISNULL(s.Segment,'')))) )) AS HashDiff,
    s.CustomerId  AS Customer_BK,
    s.CustomerName, s.City, s.Segment,
    SYSUTCDATETIME() AS LoadDate,
    'CRM'            AS RecordSource
FROM stg.Customer AS s;

CONCAT_WS with a defined separator prevents attribute boundaries from shifting. ISNULL ensures that NULL values do not alter the hash in uncontrolled ways.

Parallel Loading of the Raw Vault

The real advantage of Data Vault becomes visible at load time. Because hash keys are computed without lookups, hubs, links, and satellites have no dependency on one another during loading. They can be loaded fully in parallel - an enormous advantage with large data volumes and tight load windows.

Parallel and idempotent loading of hubs, links, and satellites

Parallel loading: hubs, links, and satellites are loaded independently of each other. Each load step is idempotent and writes only new or changed records.

The load logic per building block is simple and always the same. A hub receives only keys it does not already know. A link receives only relationships it does not already know. A satellite receives a new record only when the hash diff has changed relative to the last known state. All three steps are idempotent: re-running a load with the same data changes nothing.

This simplicity has a deeper reason. Because no ordering among the building blocks must be maintained, the complicated dependency management that makes classical load processes so error-prone is eliminated. In a star schema, a dimension must be loaded before the associated fact table can look up its surrogate keys. In Data Vault this lookup dependency does not exist: each building block computes its own hash keys. This greatly simplifies orchestration and makes it possible to distribute load work across many parallel streams without running into sequencing conflicts.

T-SQL · Idempotent Loading of Hub and Satellite
-- HUB load: insert only new business keys.
INSERT INTO rv.Hub_Customer (Customer_HK, Customer_BK, LoadDate, RecordSource)
SELECT DISTINCT s.Customer_HK, s.Customer_BK, s.LoadDate, s.RecordSource
FROM   stg.Customer_Hashed AS s
WHERE  NOT EXISTS (SELECT 1 FROM rv.Hub_Customer AS h
                   WHERE h.Customer_HK = s.Customer_HK);

-- SATELLITE load: insert a new record only when the hash diff has changed.
INSERT INTO rv.Sat_Customer
       (Customer_HK, LoadDate, HashDiff, RecordSource, CustomerName, City, Segment)
SELECT s.Customer_HK, s.LoadDate, s.HashDiff, s.RecordSource,
       s.CustomerName, s.City, s.Segment
FROM   stg.Customer_Hashed AS s
WHERE  NOT EXISTS (
         SELECT 1
         FROM   rv.Sat_Customer AS sat
         WHERE  sat.Customer_HK = s.Customer_HK
           AND  sat.LoadEndDate IS NULL          -- current version
           AND  sat.HashDiff    = s.HashDiff );  -- unchanged -> do nothing

The pattern is nearly identical for hub, link, and satellite. This repeatability is the reason why Data Vault load procedures lend themselves so well to generation and automation.

Link loading follows the same principle. A link also receives only relationships it does not already know; its hash key is computed from the participating business keys. Because relationships in Data Vault are always modeled as many-to-many, there is no update and no delete on a link - only additive insertion of new relationships. If the temporal validity of a relationship needs to be captured, that is done through an associated effectivity satellite.

T-SQL · Idempotent Loading of a Link
-- LINK load: insert only new relationships between customer and contract.
INSERT INTO rv.Lnk_Customer_Contract
       (Cust_Contract_HK, Customer_HK, Contract_HK, LoadDate, RecordSource)
SELECT DISTINCT
       s.Cust_Contract_HK, s.Customer_HK, s.Contract_HK, s.LoadDate, s.RecordSource
FROM   stg.Contract_Hashed AS s
WHERE  NOT EXISTS (SELECT 1 FROM rv.Lnk_Customer_Contract AS l
                   WHERE l.Cust_Contract_HK = s.Cust_Contract_HK);

The link hash key is formed during staging from the normalized business keys of customer and contract. This produces exactly one stable, independently computable key for each combination.

A frequently underestimated advantage of this idempotency is resilience in the event of failure. If a load run aborts mid-processing, it can simply be restarted - already-loaded records will not be written twice, missing records will be added. There is no half-loaded, inconsistent state that first needs to be laboriously cleaned up. Especially in environments with tight load windows and high data volumes, this property is invaluable: it makes restarts trivial and eliminates nightly manual interventions.

Because loading is independent per building block, hundreds of load procedures can be processed in a small number of parallel streams. In projects with very large data volumes, precisely this parallelizability was the decisive factor in meeting load windows.

Raw Vault, Business Vault, and PIT Tables

In practice I divide the vault into two areas. The Raw Vault absorbs raw data exactly as it arrives from the sources - without business interpretation, but fully auditable. The Business Vault builds on top with derived structures in which business logic is applied: computed satellites, weighted relationships, or unified keys across multiple sources.

A typical example of Business Vault logic is the unification of business objects that are maintained differently across multiple source systems. The same customer may have a different identifier in the CRM than in the billing system. In the Raw Vault both identifiers are preserved unchanged; in the Business Vault a same-as-link or mapping logic merges them into a unified view. If the consolidation rule changes later, it can be recomputed from the unaltered Raw Vault - a safety net that simply does not exist in simpler architectures.

This separation matters because it keeps raw data and interpretation apart. When a business rule changes, only the Business Vault is affected - the Raw Vault as the revision-safe source of truth remains untouched. If it later turns out that a rule should have been defined differently, it can be recomputed from the Raw Vault without any information having been lost.

PIT and Bridge Tables

Because a hub can have multiple satellites, and satellites are updated at different frequencies, querying a consistent state at a specific point in time is complex. This is where Point-in-Time tables (PIT tables) and bridge tables help. A PIT table stores, for each business key and snapshot date, the appropriate satellite load timestamps, dramatically reducing the number of joins needed. Bridge tables bundle frequently required link paths. Both are pure performance helper structures and can be rebuilt from the vault at any time.

The separation of Raw and Business Vault also has organizational benefits. It enables a clean division of responsibility: one team can focus on reliable, source-close loading of the Raw Vault, while business departments together with developers shape the business logic in the Business Vault. Because the Raw Vault immutably holds the truth from the sources, business rules can be experimentally tested, discarded, and recomputed without any information ever being lost. This decoupling of raw data and interpretation substantially reduces the risk of consequential errors.

T-SQL · Building a Point-in-Time Table (PIT)
-- A PIT table holds, per business key and snapshot date, the matching
-- satellite load timestamp, reducing the number of joins later on.

INSERT INTO bv.Pit_Customer (Customer_HK, SnapshotDate, Sat_Customer_LoadDate)
SELECT  h.Customer_HK,
        cal.SnapshotDate,
        (SELECT MAX(s.LoadDate)
         FROM   rv.Sat_Customer AS s
         WHERE  s.Customer_HK = h.Customer_HK
           AND  s.LoadDate   <= cal.SnapshotDate)  AS Sat_Customer_LoadDate
FROM    rv.Hub_Customer AS h
CROSS JOIN ref.SnapshotCalendar AS cal
WHERE   cal.SnapshotDate >= '2020-01-01';

A query then no longer needs to search through all satellite versions; instead it joins directly on the load timestamps stored in the PIT table - which greatly accelerates point-in-time queries.

The additional effort for PIT and bridge tables is well invested. Without them, queries would need to dynamically determine the correct satellite window for every snapshot date, which becomes expensive with many satellites and large histories. Because these helper structures can be derived entirely from the vault, they can be rebuilt whenever needed - for example, after a structural extension or a correction in the Business Vault. They are pure accelerators and carry no truth of their own.

From the Vault to the Star Schema

A Data Vault is excellent for integrating and historizing data - but it is not the model that business users or BI tools prefer to work with. For analysis I derive dimensional models following Kimball from the vault: star schemas with fact tables and dimensions that are intuitively understandable and optimized for tools such as Power BI.

Layers from source through Raw and Business Vault to the star schema

The layers at a glance: sources, staging, Raw Vault, Business Vault, and the derived information mart layer with star schemas for analysis.

This approach combines the best of both worlds. The vault provides integration, historization, and auditability; the star schema provides clarity and performance for analysis. From a hub and its satellites a dimension is built - often as a Slowly Changing Dimension Type 2 (SCD2) to make the history stored in the vault visible. From links and their satellites, fact tables are derived.

T-SQL · Deriving an SCD2 Dimension from the Vault
-- Hub + satellite become a dimensional view with a validity period.
-- The history stored in the vault becomes SCD2 versions in the star schema.

CREATE OR ALTER VIEW im.DimCustomer AS
SELECT
    h.Customer_HK                              AS CustomerKey,
    h.Customer_BK                              AS CustomerBk,
    s.CustomerName, s.City, s.Segment,
    s.LoadDate                                 AS ValidFrom,
    LEAD(s.LoadDate) OVER (PARTITION BY h.Customer_HK
                           ORDER BY s.LoadDate) AS ValidTo,
    CASE WHEN s.LoadEndDate IS NULL THEN 1 ELSE 0 END AS IsCurrent
FROM rv.Hub_Customer  AS h
JOIN rv.Sat_Customer  AS s ON s.Customer_HK = h.Customer_HK;

The window function LEAD elegantly forms a valid-from/valid-to interval from successive load states - without duplicating the history already stored in the vault.

The great appeal of this separation lies in its flexibility on the analytical side. From the same vault, any number of business-specific marts can be derived - a mart for controlling, one for sales, one for a regulatory report. Each mart sees exactly the view it needs and can be evolved independently of the others. If a mart needs to be restructured, it is re-derived from the vault without touching the underlying raw data. This decoupling of integration and analysis is one of the greatest practical gains of the methodology.

In a data warehouse project in the public sector I built exactly this bridge: a Data Vault import layer for integration and historization, with a classical Kimball model built on top for business-oriented analysis.

Layered Architecture Overview

In the overall picture, Data Vault fits into a clearly layered architecture. Sources are loaded into a staging layer where hash keys are computed; the Raw Vault is then populated. The Business Vault implements business logic, and from both layers the information mart layer is derived with queryable star schemas.

  • Staging: raw data from sources, computation of hash keys and hash diffs
  • Raw Vault: revision-safe integration into hubs, links, and satellites
  • Business Vault: derived structures, business logic, unified keys
  • Information Marts: star schemas and views for BI tools
  • Cross-cutting: metadata, orchestration, logging, and testing across all layers

This layering is not rigid but serves as a guiding framework. Depending on the project, the Business Vault may be more extensive or more lean, some marts are materialized, others remain views. The decisive point is that each layer has a clear responsibility and that the revision-safe truth in the Raw Vault remains untouched.

Cross-cutting concerns run through all layers. A central orchestration controls in what order and with what parallelism the layers are loaded. Comprehensive logging with batch IDs makes every run traceable. Automated tests secure both the loading of the vault and the derivation of the marts. These cross-cutting functions are not optional extras - they determine the operability of the entire data warehouse. I plan them in from the start, not only once the first problems appear in production.

In the cloud this layered model maps directly onto modern platforms. The staging layer often corresponds to a bronze level in a data lake; Raw Vault and Business Vault live as Delta or Parquet structures; and the information marts are made available in a high-performance analytical store. The method remains the same - only the underlying technology changes. This independence from any specific platform is another reason why Data Vault proves itself in data warehouses designed for the long term: the model survives a technology migration.

Automation and Metadata

The greatest operational strength of Data Vault is its uniformity. Because every hub, every link, and every satellite follows the same blueprint, both the table structures and the load procedures can be generated from metadata. Instead of writing hundreds of procedures by hand, I maintain a metadata description of sources, keys, and attributes and generate DDL and load logic from it.

This metadata-driven approach offers several advantages. It drastically reduces implementation effort, it enforces consistency across all building blocks, and it turns connecting a new source into a configuration task rather than a programming project. Tools such as data modeling applications - for example PowerDesigner, which I have used in several data warehouse projects - support this approach, as do custom-built generators based on metadata tables.

Generation covers not only the tables but the entire load stack: DDL for hubs, links, and satellites, hash key computation, idempotent load procedures, and entries in the control tables that orchestrate the parallel run. In the ideal case, describing a new source in the metadata and running the generator is sufficient to obtain a fully integrated, tested connection. This leverage transforms tasks that would otherwise take days into a matter of hours.

A metadata-driven architecture also changes the way a team works. Connecting a new source becomes a well-defined task: describe the source, assign business keys and attributes, generate, test. Instead of each developer contributing their own style, the generator enforces a uniform, verified implementation. This reduces the onboarding time for new team members and makes the data warehouse less dependent on individual people - an important aspect for long-term operability.

The metadata itself becomes a valuable asset. It documents which sources are connected, which business keys are used, and how attributes are distributed across satellites. This description simultaneously serves as technical documentation and as the blueprint for the system. When a source changes, the metadata description is updated and the affected portion is regenerated - model, load logic, and documentation remain automatically consistent. Maintaining this consistency by hand in a large data warehouse is virtually impossible.

Rule of thumb: whoever maintains Data Vault by hand is squandering its greatest strength. Only through metadata and generation does the method become an efficiently operable, scaling data warehouse.

Engagement and Collaboration

A Data Vault project begins with business modeling: which business objects exist, which business keys identify them, which relationships hold between them? I clarify these questions together with the business departments, because the business keys are the foundation of the entire model. Only then do technical implementation and automation follow.

  • Analysis: clarify business objects, keys, and relationships with the business departments
  • Modeling: design hubs, links, and satellites; define the boundary between Raw and Business Vault
  • Implementation: generate and test DDL and load logic from metadata
  • Marts: derive queryable star schemas for BI tools
  • Operations: parallel loading, monitoring, testing, and documentation

I make a point of applying Data Vault not dogmatically but appropriately. Not every table needs to pass through the full vault; some reference data can be managed more simply through direct structures. This pragmatic weighing - where the vault is worthwhile, where it is not - is part of my consulting and saves the client unnecessary effort.

In practice I prefer to work iteratively. Rather than designing the entire model up front to the last detail, I start with the central business objects and their most important relationships, then extend the model step by step. Because Data Vault grows additively, this approach carries low risk: each extension adds to the existing model without endangering it. This way, a working data warehouse is available early on, gaining more sources and more business depth with each sprint - and business departments see initial results even during development.

Tests and documentation are part of the deliverable in a Data Vault project. Business tests secure the derivation of the marts; technical tests secure the loading of the vault. Together with the metadata, this produces documentation that enables the team to continue developing the data warehouse independently.

Typical Services in a Data Vault Project

Around Data Vault modeling I take on different tasks depending on the project phase - from business modeling through implementation to operations. I integrate into existing teams and tooling landscapes and work closely with architects, developers, and business departments to ensure that the result is not only technically sound but also viable from a business perspective and sustainable to operate over the long term.

  • Business modeling of hubs, links, and satellites
  • Defining the boundary between Raw Vault and Business Vault
  • Hash key strategy with HASHBYTES and canonical key normalization
  • Metadata-driven generation of DDL and load procedures
  • Parallel, idempotent loading of the vault
  • PIT and bridge tables for high-performance queries
  • Derivation of dimensional models (star schema, SCD2) for the BI layer
  • Migration of existing data warehouse structures into a Data Vault model
  • Testing, logging, monitoring, and documentation of the vault

Selected anonymized reference projects

Public-sector client

SQL Server · Data Vault · Kimball · tSQLt · Jenkins

Further development of a data warehouse with a Data Vault import layer for source integration and historization, plus a Kimball model built on top for business-oriented analysis. Business tests and regression tests using tSQLt, CI/CD with Jenkins.

Insurance / telecommunications

DWH redesign · Data Vault · PowerDesigner

Redesign of a data warehouse based on Data Vault, business modeling of hubs, links, and satellites, and metadata-driven model management with PowerDesigner.

Travel / booking platform

DWH · Data Vault · PowerDesigner

Build-out and further development of a data warehouse with Data Vault modeling, model management via PowerDesigner, and derivation of queryable structures for reporting.

Engineering / consulting

DWH · Data Vault · SSAS Tabular · Power BI

Data warehouse project with Data Vault modeling as the integration layer, Tabular models in SSAS built on top, and Power BI reports delivered to business departments.

Logistics / corporate group

Teradata · DWH integration · historization

Development of integration and historization logic for a large data warehouse with high data volumes, including transaction-controlled acquisition and restart logic.

Financial services provider

Data migration · COBOL/PL1 copybooks · historization

Migration and integration of mainframe data stocks into a historized target structure, including interpretation of source formats via COBOL and PL/1 copybooks and revision-safe storage.

Frequently Asked Questions about Data Vault Modeling

For which projects is Data Vault worthwhile?

For data warehouses with many source systems, frequent changes, and high requirements for traceability and auditability. For small, stable reporting environments with few sources, a lean star schema is often the better choice - I work through this trade-off together with you.

Does Data Vault rule out a Kimball star schema?

On the contrary. I combine both: Data Vault as the integration and historization layer, Kimball star schemas as the analytical layer for BI tools. Dimensions and fact tables are derived from the vault.

Why hash keys instead of sequential surrogate keys?

Because hash keys can be computed without a lookup, enabling hubs, links, and satellites to be loaded fully in parallel. This is a decisive performance advantage with large data volumes.

How is a Data Vault queried efficiently?

Through PIT tables, bridge tables, and derived information marts. End users typically do not work directly on the vault but on the query-optimized structures derived from it.

Can an existing data warehouse be migrated to Data Vault?

Yes, I have done this in several projects - usually incrementally, by first establishing a vault layer and then re-feeding the existing analytical structures from it. This step-by-step migration keeps risk low and delivers visible results early, while the old system continues to run in parallel.

In which languages can we work together?

In German, English, and Portuguese - fluently in each, including technical and subject-matter discussions.

Contact

Project enquiry

Need support with ETL, Data Vault, BI architecture, SQL Server or Azure?

Remote · Hybrid · Germany · EU · Brazil · Part-time · Full-time