A prototype system for obtaining and managing training data for multilingual learning

The project aims to empower less-resourced language communities to create parallel corpora for machine translation, enhancing language preservation and cultural heritage through an open-source prototype.

Grant
€ 150.000
2023

Project details

Introduction

It is difficult to build high-quality machine translation systems for less-resourced languages, such as the minority languages of Europe. State-of-the-art machine translation is trained on large parallel corpora, that is, texts paired with their translations. Such corpora are not available for less-resourced languages.

Project Overview

We will provide a system for the rapid and inexpensive creation of new parallel corpora. Our Proof of Concept (PoC) project will produce an open-source prototype that builds on findings from the PI's ERC Starting Grant (StG), and will determine how to handle IPR and future funding.

Key Innovations

The key innovation of the prototype is that members of less-resourced language communities will be able to use it themselves. Current systems require an extensive background in natural language processing. Allowing the community to create and curate parallel data has clear social benefits.

Social Impact

The creation of high-quality machine translation systems for less-resourced languages will allow more content to be created in these languages, which will play a strong role in their preservation. Curated parallel data will also be useful in activities such as education and cultural heritage research.

Funding and Market Considerations

Government funding is available for digital language preservation for many of the 7,000 languages spoken on Earth. Companies that run online translation systems, such as Google and DeepL/Linguee, are not addressing this market because the return on investment (ROI) is too low. It makes more sense to empower local communities to create such parallel data themselves.

Evaluation and IPR Structure

We will carefully evaluate our prototype to ensure that it meets the needs of less-resourced language communities. Along with the creation of the prototype, we will determine how best to structure the IPR to support future development.

Future Considerations

For future development, we will consider two possibilities: consulting, which we have already carried out for the Sorbian community, and a certification scheme for users of our system. We will also consider applications in commercial machine translation and in multilingual classification problems such as hate speech detection.

Financial details & Timeline

Financial details

Grant amount: € 150.000
Total project budget: € 150.000

Timeline

Start date: 1 October 2023
End date: 30 September 2025
Grant year: 2023

Partners & Locations

Project partners

  • TECHNISCHE UNIVERSITAET MUENCHEN (lead partner)
  • LUDWIG-MAXIMILIANS-UNIVERSITAET MUENCHEN

Country(ies)

Germany

Similar projects within the European Research Council

ERC STG

MANUNKIND: Determinants and Dynamics of Collaborative Exploitation

This project aims to develop a game theoretic framework to analyze the psychological and strategic dynamics of collaborative exploitation, informing policies to combat modern slavery.

€ 1.497.749
ERC STG

Elucidating the phenotypic convergence of proliferation reduction under growth-induced pressure

The UnderPressure project aims to investigate how mechanical constraints from 3D crowding affect cell proliferation and signaling in various organisms, with potential applications in reducing cancer chemoresistance.

€ 1.498.280
ERC STG

Uncovering the mechanisms of action of an antiviral bacterium

This project aims to uncover the mechanisms behind Wolbachia's antiviral protection in insects and develop tools for studying symbiont gene function.

€ 1.500.000
ERC STG

The Ethics of Loneliness and Sociability

This project aims to develop a normative theory of loneliness by analyzing ethical responsibilities of individuals and societies to prevent and alleviate loneliness, establishing a new philosophical sub-field.

€ 1.025.860

Similar projects from other schemes

ERC COG

Uncovering the creative process: from inception to reception of translated content using machine translation

INCREC aims to analyze the creative processes of professional translators using neural machine translation to enhance translation quality and user experience in literary and audiovisual contexts.

€ 1.993.643
ERC COG

DEep COgnition Learning for LAnguage GEneration

This project aims to enhance NLP models by integrating machine learning, cognitive science, and structured memory to improve out-of-domain generalization and contextual understanding in language generation tasks.

€ 1.999.595
ERC ADG

Linguistic traces: low-frequency forms as evidence of language and population history

This project aims to reconstruct early European languages by analyzing low-frequency linguistic variants in historical texts, integrating philology with deep learning to uncover cultural interactions.

€ 2.498.135
ERC ADG

Evaluating and Programming Intelligent Chatbots for Any Language

EPICAL aims to enhance intelligent chatbots by integrating low resource languages through innovative techniques in text generation, translation, and speech processing, promoting multilingual inclusivity.

€ 2.498.200