How HealthVerity ensures the privacy and quality of real-world data

When clients see real-world data (RWD) in HealthVerity Marketplace, it may appear seamless. But behind the scenes, a sophisticated framework of quality assurance is constantly at work, cleaning, validating, and protecting that data from the moment it enters our ecosystem.

 

To better understand what it takes to deliver clean, privacy-protected data at scale, we sat down with two members of the HealthVerity Data Quality Assurance team: Ike Osuagwu, Manager of Data Quality, and Dr. Ellen McCleskey, PhD, Senior Data QA Engineer. In this first installment of our two-part series, they walk us through the core responsibilities of their team and explain how the HealthVerity proprietary Theseus framework is transforming the speed and reliability of real-world data delivery.

Ensuring real-world data accuracy and privacy at every stage

At HealthVerity, data quality assurance isn’t a final step. It’s a continuous process that begins the moment a new dataset arrives and continues through normalization, enrichment, and privacy review. As Dr. McCleskey says, “We check the data at every point. We check the source data coming in, and then through normalization, and the final product before it gets to [HealthVerity] Marketplace.”

Both manual review and automated controls are used in this work. Source files are scanned for format and schema compliance, and every transformation, such as deduplication, standardization, or enrichment, is monitored for unintended anomalies.
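
To make the idea of a format and schema scan concrete, here is a minimal Python sketch. It is an illustration only, not how Theseus works: the column names, the expected types, and the `schema_violations` helper are all hypothetical.

```python
from datetime import date

# Hypothetical expected schema: column name -> required Python type.
# Real source files would have supplier-specific layouts.
EXPECTED_SCHEMA = {
    "patient_token": str,
    "service_date": date,
    "diagnosis_code": str,
}


def schema_violations(record: dict) -> list[str]:
    """Return a list of human-readable schema problems for one record."""
    problems = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            actual = type(record[column]).__name__
            problems.append(f"wrong type for {column}: {actual}")
    # Columns the schema does not know about are reported, not silently kept.
    for column in record:
        if column not in EXPECTED_SCHEMA:
            problems.append(f"unexpected column: {column}")
    return problems
```

A record that matches the schema yields an empty list, so the same function can serve both as a gate on incoming files and as a report of what needs manual review.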

Before any dataset appears in HealthVerity Marketplace, it is checked to confirm that it is free of protected health information (PHI) and complies with HIPAA Expert Determination requirements. This privacy rigor is built into the quality control pipeline rather than added as an afterthought. Check out our article on privacy-preserving record linkage (PPRL) for more information.

How Theseus improves QA and accelerates data delivery

The quality assurance process is powered by Theseus, HealthVerity’s custom-built ETL framework. Theseus is designed to streamline the onboarding of data suppliers while applying rigorous validation at every stage of the pipeline.
Built on a medallion architecture with Bronze, Silver, and Gold layers, Theseus enables the QA team to progressively check for completeness, standardization, and privacy compliance.

The team is currently working on a new Theseus version that makes real-time and more granular validation possible. “Instead of waiting for the entire dataset to load in,” explained Dr. McCleskey, “we can actually run a few tests on each row to make sure the basics are there, and then we can quarantine any [bad] rows while still allowing the rest to go through.” This more dynamic approach also translates into faster timelines for data availability. Ike continued, “You’re getting data at higher rates and hitting contract dates a lot better.”
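
The row-level approach Dr. McCleskey describes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Theseus implementation: the check names, the row fields, and the `split_rows` function are hypothetical.

```python
from typing import Callable, Iterable

Row = dict
Check = Callable[[Row], bool]

# Hypothetical "basics" checks; each returns True when the row passes.
BASIC_CHECKS: dict[str, Check] = {
    "has_patient_token": lambda row: bool(row.get("patient_token")),
    "has_service_date": lambda row: bool(row.get("service_date")),
}


def split_rows(
    rows: Iterable[Row], checks: dict[str, Check] = BASIC_CHECKS
) -> tuple[list[Row], list[dict]]:
    """Validate rows one at a time, quarantining failures without blocking the rest."""
    passed, quarantined = [], []
    for row in rows:
        failed = [name for name, check in checks.items() if not check(row)]
        if failed:
            # Keep the row together with the reasons it failed, for later review.
            quarantined.append({"row": row, "failed_checks": failed})
        else:
            passed.append(row)
    return passed, quarantined
```

Because each row is judged independently, one bad row ends up in the quarantine list while every valid row continues downstream, which is the behavior the quote describes.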

 

Why clean real-world data must preserve real-life complexity

Clients may think of “clean” data as synonymous with uniformity. But in real-world data, the opposite is often true. True signal lives in the variation. Ike highlights, “When you put that label of ‘clean’ on [real-world data], you expect a very normal distribution. But it’s kind of a beautiful thing that you don’t get really clean data. You get that truth that goes with real life—like high row counts for one month because everyone decided to go to the doctor after COVID or trends like that.”
That kind of natural irregularity, what statisticians might call meaningful heterogeneity, is exactly what makes RWD so valuable for life sciences. The goal of the HealthVerity Quality Assurance (QA) process is to preserve that complexity while standardizing the data’s format and removing irrelevant information, so that the data are accurate, privacy-compliant, and full of clinical insight.
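
One way to respect that kind of variation in practice is to flag unusual months for human review instead of deleting them. The Python sketch below is a hypothetical illustration, not a description of HealthVerity tooling: the `flag_unusual_months` function, the median baseline, and the 2x ratio threshold are all assumptions chosen for simplicity.

```python
from statistics import median


def flag_unusual_months(monthly_rows: dict[str, int], ratio: float = 2.0) -> dict[str, int]:
    """Return months whose row count is far above or below the median month.

    Flagged months are reported for review only; nothing is removed from
    the input, since a spike may reflect real behavior rather than an error.
    """
    typical = median(monthly_rows.values())
    if typical <= 0:
        return {}
    return {
        month: count
        for month, count in monthly_rows.items()
        if count > typical * ratio or count < typical / ratio
    }
```

A reviewer can then decide whether a flagged month is a processing artifact or a genuine surge in visits, such as the post-COVID pattern Ike mentions.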

 

The importance of quality assurance and privacy in real-world data research

When data quality and privacy are engineered into the pipeline from day one, researchers and analysts can trust that what they’re seeing reflects real-world behaviors, not artifacts of poor processing. HealthVerity’s linkage algorithms are designed to keep matching accuracy high while protecting personally identifiable information (PII). Find out how our matching metrics work in the matching accuracy metrics deep dive.
Whether you’re conducting HEOR studies, building predictive models, or tracking public health trends, the accuracy and stability of your underlying data can make or break your results.

At HealthVerity, we’re making that trust Verified—with transparent sourcing, rigorous QA, and industry-leading privacy controls built into every dataset delivered via HealthVerity Marketplace.