Statistical theory and methodology for the combination of heterogeneous and distributed data

Develop new statistical methodologies to address data heterogeneity and measurement errors across diverse datasets, enhancing evidence-based advancements in science and policy.

Grant

€ 1.499.689

2024

Project details

Introduction

Data are now collected at unprecedented scales across many industries, creating huge potential for evidence-based advances in science, technology, and public policy. However, to harness this potential we must navigate repositories that are often a far cry from the idealised datasets, carefully collected and curated under perfect conditions, that are usually imagined when new statistical methodology is introduced.

Challenges in Data Collection

Data are often gathered quickly and cheaply, patched together from multiple locations, with limited regard to enforcing experimental standards. We may have the large sample sizes we desire, but there will be:

Missing values
Misaligned datasets
Contamination

Depending on the sector, there may be noise added purposefully to satisfy individuals' and regulatory bodies' privacy concerns.

Proposed Solutions

We propose to address such difficulties through the development of new statistical methodology and theoretical frameworks that explicitly incorporate various forms of data heterogeneity and measurement error. The work will be divided into four main areas:

Accounting for sampling bias when a complete dataset is complemented by additional incomplete datasets. This will be studied through the lens of semiparametric theory for functional estimation.
Combining two or more datasets that record overlapping but distinct sets of variables, where few or no complete records of all variables are available. These file matching problems will be studied using new developments in statistical optimal transport.
Examining the effect of the violation of missing data assumptions. Here we will introduce techniques from robust statistics to mitigate the error due to misspecifying assumptions about sampling bias.
Securing individuals' private data through the intentional use of noisy measurement. Here we contribute to the growing field of differential privacy, specifically the user-level local variant, where distributed batches of observations are privatised simultaneously.
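To make the fourth area concrete, the following is a minimal sketch of one standard way to privatise a whole batch of observations at once, assuming each user holds bounded real-valued observations and releases only a noisy batch mean. It uses the classical Laplace mechanism and is not taken from the project itself; the function names, the bounds `lower` and `upper`, and the privacy parameter `epsilon` are illustrative assumptions.

```python
import random


def privatise_user_mean(batch, epsilon, lower=0.0, upper=1.0, rng=random):
    """Release one user's batch mean with Laplace noise (hypothetical helper).

    The whole batch is treated as the private input, so the guarantee is
    epsilon-local differential privacy at the level of the user.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not batch:
        raise ValueError("batch must be non-empty")
    # Clip so that the batch mean is guaranteed to lie in [lower, upper].
    clipped = [min(max(x, lower), upper) for x in batch]
    mean = sum(clipped) / len(clipped)
    # Replacing the entire batch can move the mean by at most (upper - lower),
    # so that range is the sensitivity used to scale the noise.
    scale = (upper - lower) / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return mean + noise


def estimate_population_mean(user_batches, epsilon, rng=random):
    """Average the privatised reports collected from all users."""
    reports = [privatise_user_mean(b, epsilon, rng=rng) for b in user_batches]
    return sum(reports) / len(reports)
```

Because the noise has mean zero, the average of many reports is an unbiased estimate of the mean of the clipped batch means, although each individual report may be very noisy. This simple baseline does not exploit the batch size to reduce the noise, which is one of the questions a user-level analysis would need to address.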

Financial details & Timeline

Financial details

Grant amount	€ 1.499.689
Total project budget	€ 1.499.689

Timeline

Start date	1 October 2024
End date	30 September 2029
Grant year	2024

Partners & Locations

Project partners

UNIVERSITY OF WARWICK (coordinator)

Country(ies)

United Kingdom

Similar projects within European Research Council

Project	Description	Scheme	Amount	Year
A coherent approach to analysing heterogeneity in network data	This project aims to develop innovative econometric methods for analyzing unobserved heterogeneity in social interactions, addressing identification, estimation, and computation challenges.	ERC Consolidator Grant	€ 966.000	2023
The missing mathematical story of Bayesian uncertainty quantification for big data	This project aims to enhance scalable Bayesian methods through theoretical insights, improving their accuracy and acceptance in real-world applications like medicine and cosmology.	ERC Starting Grant	€ 1.492.750	2022
Flexible Statistical Inference	Develop a flexible statistical theory allowing post-hoc data collection and decision-making with error control, utilizing e-values for improved inference in small samples.	ERC Advanced Grant	€ 2.499.461	2024
Provable Scalability for high-dimensional Bayesian Learning	This project develops a mathematical theory for scalable Bayesian learning methods, integrating computational and statistical insights to enhance algorithm efficiency and applicability in high-dimensional models.	ERC Starting Grant	€ 1.488.673	2023
Towards an evidence-based model for big data policing: Evaluating the statistical-methodological, criminological and legal and ethical conditions	This project aims to develop an interdisciplinary, evidence-based model for big data policing in Europe, addressing research gaps and enhancing crime prediction and resource allocation through randomized trials.	ERC Consolidator Grant	€ 1.972.500	2023
