Statistical theory and methodology for the combination of heterogeneous and distributed data
Develop new statistical methodologies to address data heterogeneity and measurement errors across diverse datasets, enhancing evidence-based advancements in science and policy.
Project details
Introduction
Data are now collected at unprecedented scales across many industries, creating huge potential for evidence-based advances in science, technology, and public policy. However, to harness this potential we must navigate repositories that are often a far cry from the idealised datasets, carefully collected and curated under perfect conditions, that are usually imagined when new statistical methodology is introduced.
Challenges in Data Collection
Data are often gathered quickly and cheaply, patched together from multiple locations, with limited regard for enforcing experimental standards. We may have the large sample sizes we desire, but there will be:
- Missing values
- Misaligned datasets
- Contamination
Depending on the sector, there may be noise added purposefully to satisfy individuals' and regulatory bodies' privacy concerns.
Proposed Solutions
We propose to address such difficulties through the development of new statistical methodology and theoretical frameworks that explicitly incorporate various forms of data heterogeneity and measurement error. This will be divided into four main areas:
- Accounting for sampling bias when a complete dataset is complemented by additional incomplete datasets. This will be studied through the lens of semiparametric theory for functional estimation.
- Combining two or more datasets that record overlapping but distinct sets of variables, where few or no complete records of all variables are available. These file matching problems will be studied using new developments in statistical optimal transport.
- Examining the effect of the violation of missing data assumptions. Here we will introduce techniques from robust statistics to mitigate the error due to misspecifying assumptions about sampling bias.
- Securing individuals' private data through the intentional use of noisy measurement. Here we contribute to the growing field of differential privacy, specifically the user-level local variant, where distributed batches of observations are privatised simultaneously.
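To make the idea of intentionally noisy measurement concrete, the following is a minimal sketch of randomized response, a classical local differential privacy mechanism for a single binary attribute. It is an illustration only: it privatises one observation per individual rather than whole batches, so it is not the user-level variant described above, and it is not presented as the project's method. The function names, the privacy parameter `epsilon = 1.0`, and the simulated data are all assumptions made for this example.

```python
import math
import random


def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.

    Hypothetical helper: satisfies epsilon-local differential privacy
    for a single binary observation.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def debiased_mean(reports: list[int], epsilon: float) -> float:
    """Unbiased estimate of the true proportion from privatised reports."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = (2p - 1) * mu + (1 - p), so invert that affine map.
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


# Simulated data (an assumption for illustration, not real records).
rng = random.Random(0)
epsilon = 1.0
true_bits = [1 if rng.random() < 0.3 else 0 for _ in range(20000)]

# Each individual privatises their own bit before it leaves their device.
reports = [randomized_response(b, epsilon, rng) for b in true_bits]

# The analyst only sees `reports`, yet can still estimate the proportion.
estimate = debiased_mean(reports, epsilon)
print(f"true proportion: {sum(true_bits) / len(true_bits):.3f}")
print(f"private estimate: {estimate:.3f}")
```

The price of privacy shows up as extra variance: the debiasing step divides by `2p - 1`, which shrinks towards zero as `epsilon` decreases, so stronger privacy guarantees require larger samples for the same accuracy.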
Financial details & Timeline
Financial details
Grant amount | € 1,499,689 |
Total project budget | € 1,499,689 |
Timeline
Start date | 1 October 2024 |
End date | 30 September 2029 |
Grant year | 2024 |
Partners & Locations
Project partners
- UNIVERSITY OF WARWICK (lead partner)
Similar projects within the European Research Council
Project | Description | Scheme | Amount | Year |
---|---|---|---|---|
A coherent approach to analysing heterogeneity in network data | This project aims to develop innovative econometric methods for analyzing unobserved heterogeneity in social interactions, addressing identification, estimation, and computation challenges. | ERC Consolid... | € 966,000 | 2023 |
The missing mathematical story of Bayesian uncertainty quantification for big data | This project aims to enhance scalable Bayesian methods through theoretical insights, improving their accuracy and acceptance in real-world applications like medicine and cosmology. | ERC Starting... | € 1,492,750 | 2022 |
Flexible Statistical Inference | Develop a flexible statistical theory allowing post-hoc data collection and decision-making with error control, utilizing e-values for improved inference in small samples. | ERC Advanced... | € 2,499,461 | 2024 |
Provable Scalability for high-dimensional Bayesian Learning | This project develops a mathematical theory for scalable Bayesian learning methods, integrating computational and statistical insights to enhance algorithm efficiency and applicability in high-dimensional models. | ERC Starting... | € 1,488,673 | 2023 |
Towards an evidence-based model for big data policing: Evaluating the statistical-methodological, criminological and legal and ethical conditions | This project aims to develop an interdisciplinary, evidence-based model for big data policing in Europe, addressing research gaps and enhancing crime prediction and resource allocation through randomized trials. | ERC Consolid... | € 1,972,500 | 2023 |