Statistical theory and methodology for the combination of heterogeneous and distributed data

Develop new statistical methodologies to address data heterogeneity and measurement errors across diverse datasets, enhancing evidence-based advancements in science and policy.

Grant: € 1.499.689 (2024)

Project details

Introduction

Data are now collected at unprecedented scales across many industries, creating huge potential for evidence-based advances in science, technology, and public policy. To harness this potential, however, we must work with repositories that are often a far cry from the idealised datasets, carefully collected and curated under perfect conditions, that are usually assumed when new statistical methodology is introduced.

Challenges in Data Collection

Data are often gathered quickly and cheaply and patched together from multiple locations, with little regard for experimental standards. We may have the large sample sizes we desire, but there will be:

  • Missing values
  • Misaligned datasets
  • Contamination

Depending on the sector, noise may also be added deliberately to address the privacy concerns of individuals and regulatory bodies.

Proposed Solutions

We propose to address these difficulties by developing new statistical methodology and theoretical frameworks that explicitly incorporate various forms of data heterogeneity and measurement error. The work is divided into four main areas:

  1. Accounting for sampling bias when a complete dataset is complemented by additional incomplete datasets. This will be studied through the lens of semiparametric theory for functional estimation.
  2. Combining two or more datasets that record overlapping but distinct sets of variables, where few or no complete records of all variables are available. These file matching problems will be studied using new developments in statistical optimal transport.
  3. Examining the effect of the violation of missing data assumptions. Here we will introduce techniques from robust statistics to mitigate the error due to misspecifying assumptions about sampling bias.
  4. Securing individuals' private data through the intentional use of noisy measurement. Here we contribute to the growing field of differential privacy, specifically the user-level local variant, where distributed batches of observations are privatised simultaneously.
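As a rough illustration of the fourth area, the sketch below shows the simplest form of local differential privacy: each individual adds Laplace noise to a single bounded value before releasing it, and the analyst averages the noisy reports. This is the basic item-level setting, not the user-level variant the project targets, and it is not taken from the project itself. The function names `privatise` and `private_mean`, the bounds `lower` and `upper`, and the privacy parameter `epsilon` are all assumptions made for illustration.

```python
import random


def privatise(value, epsilon, lower=0.0, upper=1.0, rng=random):
    """Release one bounded value with Laplace noise (epsilon-local DP).

    Hypothetical helper: the bounds and epsilon are illustrative choices.
    """
    # Clip so that any two possible inputs differ by at most (upper - lower).
    clipped = min(max(value, lower), upper)
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    scale = (upper - lower) / epsilon
    # The difference of two independent Exp(1) draws is a standard Laplace draw.
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return clipped + noise


def private_mean(values, epsilon, lower=0.0, upper=1.0, rng=random):
    """Average the privatised reports; the zero-mean noise averages out."""
    reports = [privatise(v, epsilon, lower, upper, rng) for v in values]
    return sum(reports) / len(reports)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.random() for _ in range(20000)]
    print("true mean:   ", sum(data) / len(data))
    print("private mean:", private_mean(data, epsilon=1.0, rng=rng))
```

Each individual report is very noisy, but the noise has mean zero, so the average over many individuals remains a usable estimate. The price of privacy shows up as extra variance, which grows as `epsilon` shrinks.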

Financial details & Timeline

Financial details

Grant amount: € 1.499.689
Total project budget: € 1.499.689

Timeline

Start date: 1-10-2024
End date: 30-9-2029
Grant year: 2024

Partners & Locations

Project partners

  • UNIVERSITY OF WARWICK (coordinator)

Country/Countries

United Kingdom

Similar projects within European Research Council

ERC Consolid...

A coherent approach to analysing heterogeneity in network data

This project aims to develop innovative econometric methods for analyzing unobserved heterogeneity in social interactions, addressing identification, estimation, and computation challenges.

€ 966.000
ERC Starting...

The missing mathematical story of Bayesian uncertainty quantification for big data

This project aims to enhance scalable Bayesian methods through theoretical insights, improving their accuracy and acceptance in real-world applications like medicine and cosmology.

€ 1.492.750
ERC Advanced...

Flexible Statistical Inference

Develop a flexible statistical theory allowing post-hoc data collection and decision-making with error control, utilizing e-values for improved inference in small samples.

€ 2.499.461
ERC Starting...

Provable Scalability for high-dimensional Bayesian Learning

This project develops a mathematical theory for scalable Bayesian learning methods, integrating computational and statistical insights to enhance algorithm efficiency and applicability in high-dimensional models.

€ 1.488.673
ERC Consolid...

Towards an evidence-based model for big data policing: Evaluating the statistical-methodological, criminological and legal and ethical conditions

This project aims to develop an interdisciplinary, evidence-based model for big data policing in Europe, addressing research gaps and enhancing crime prediction and resource allocation through randomized trials.

€ 1.972.500