A prototype system for obtaining and managing training data for multilingual learning
The project aims to empower less-resourced language communities to create parallel corpora for machine translation, enhancing language preservation and cultural heritage through an open-source prototype.
Project details
Introduction
It is difficult to build high-quality machine translation systems for less-resourced languages, such as the minority languages of Europe. State-of-the-art machine translation is trained on large parallel corpora, that is, texts paired with their translations. Such corpora are not available for less-resourced languages.
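To make the notion of a parallel corpus concrete, the following minimal Python sketch represents one as a list of aligned sentence pairs and applies a few basic sanity checks of the kind a curator might run. It is an illustration only, not the project's prototype: the `SentencePair` class, the `curate` function, the length-ratio threshold, and the English-German example sentences are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned unit of a parallel corpus: a text and its translation."""
    source: str
    target: str


def curate(pairs, max_length_ratio=3.0):
    """Keep pairs that pass basic sanity checks (hypothetical thresholds)."""
    seen = set()
    kept = []
    for pair in pairs:
        src, tgt = pair.source.strip(), pair.target.strip()
        if not src or not tgt:
            continue  # drop pairs with a missing side
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_length_ratio:
            continue  # drop pairs whose character lengths differ implausibly
        if (src, tgt) in seen:
            continue  # drop exact duplicates
        seen.add((src, tgt))
        kept.append(SentencePair(src, tgt))
    return kept


# Illustrative English-German pairs (assumed data, not from the project).
corpus = [
    SentencePair("Good morning.", "Guten Morgen."),
    SentencePair("Good morning.", "Guten Morgen."),  # duplicate
    SentencePair("Thank you.", ""),  # missing translation
    SentencePair("Yes.", "Das ist ein sehr langer Satz ohne Bezug."),  # length mismatch
    SentencePair("Where is the library?", "Wo ist die Bibliothek?"),
]

curated = curate(corpus)
for pair in curated:
    print(f"{pair.source}\t{pair.target}")
```

Real curation for a less-resourced language would need checks suited to that language and its writing system; the character-length ratio used here is only a rough stand-in for such checks.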
Project Overview
We will provide a system for the rapid and inexpensive creation of new parallel corpora. Our Proof of Concept (PoC) project will both produce an open-source prototype that builds on findings from the PI's ERC Starting Grant (StG) and determine how to handle IPR and future funding.
Key Innovations
The key innovation of the prototype will be that it can be used by members of the less-resourced language communities themselves. Current systems require an extensive background in natural language processing. Allowing the community to create and curate parallel data has clear social benefits.
Social Impact
The creation of high-quality machine translation systems for less-resourced languages will allow for more content creation in these languages, playing a strong role in the preservation of these languages. Curated parallel data will also be useful in activities such as education and cultural heritage research.
Funding and Market Considerations
Government funding is available for digital language preservation for many of the 7000 languages spoken on Earth. Companies with online translation systems such as Google and DeepL/Linguee are not addressing this market, as the ROI is too low. It makes more sense to empower local communities to create such parallel data.
Evaluation and IPR Structure
We will carefully evaluate our prototype with less-resourced language communities to ensure that it meets their needs. Along with the creation of the prototype, we will determine how best to structure the IPR to support future development.
Future Considerations
We will consider two possibilities for future development: consulting, which we have already carried out for the Sorbian community, and a certification scheme for users of our system. We will also consider applications in commercial machine translation and in multilingual classification problems such as hate speech detection.
Financial details & Timeline
Financial details
Grant amount | €150,000 |
Total project budget | €150,000 |
Timeline
Start date | 1 October 2023 |
End date | 30 September 2025 |
Grant year | 2023 |
Partners & Locations
Project partners
- TECHNISCHE UNIVERSITAET MUENCHEN (coordinator)
- LUDWIG-MAXIMILIANS-UNIVERSITAET MUENCHEN
Country(ies)
Similar projects within European Research Council
Project | Summary | Scheme | Amount | Year |
---|---|---|---|---|
MANUNKIND: Determinants and Dynamics of Collaborative Exploitation | This project aims to develop a game theoretic framework to analyze the psychological and strategic dynamics of collaborative exploitation, informing policies to combat modern slavery. | ERC STG | €1,497,749 | 2022 |
Elucidating the phenotypic convergence of proliferation reduction under growth-induced pressure | The UnderPressure project aims to investigate how mechanical constraints from 3D crowding affect cell proliferation and signaling in various organisms, with potential applications in reducing cancer chemoresistance. | ERC STG | €1,498,280 | 2022 |
Uncovering the mechanisms of action of an antiviral bacterium | This project aims to uncover the mechanisms behind Wolbachia's antiviral protection in insects and develop tools for studying symbiont gene function. | ERC STG | €1,500,000 | 2023 |
The Ethics of Loneliness and Sociability | This project aims to develop a normative theory of loneliness by analyzing ethical responsibilities of individuals and societies to prevent and alleviate loneliness, establishing a new philosophical sub-field. | ERC STG | €1,025,860 | 2023 |
Similar projects from other schemes
Project | Summary | Scheme | Amount | Year |
---|---|---|---|---|
Uncovering the creative process: from inception to reception of translated content using machine translation | INCREC aims to analyze the creative processes of professional translators using neural machine translation to enhance translation quality and user experience in literary and audiovisual contexts. | ERC COG | €1,993,643 | 2023 |
DEep COgnition Learning for LAnguage GEneration | This project aims to enhance NLP models by integrating machine learning, cognitive science, and structured memory to improve out-of-domain generalization and contextual understanding in language generation tasks. | ERC COG | €1,999,595 | 2023 |
Linguistic traces: low-frequency forms as evidence of language and population history | This project aims to reconstruct early European languages by analyzing low-frequency linguistic variants in historical texts, integrating philology with deep learning to uncover cultural interactions. | ERC ADG | €2,498,135 | 2025 |
Evaluating and Programming Intelligent Chatbots for Any Language | EPICAL aims to enhance intelligent chatbots by integrating low resource languages through innovative techniques in text generation, translation, and speech processing, promoting multilingual inclusivity. | ERC ADG | €2,498,200 | 2025 |