Project BoLI

community

https://boli.unreal-tece.co.in/




Project BoLI

About the Project

Project BoLI is the flagship project of UnReaL-TecE LLP. It aims to build high-quality datasets for benchmarking and evaluating a range of AI tasks, including speech-to-text, machine translation, grammatical analysis, reasoning, and prompt-based evaluation of LLMs. While the project aims to build these datasets for every Indian language and variety, its primary focus is on over 1300 underserved languages and on both first- and second-language varieties of the major scheduled languages. The project is distinctive in three respects: the benchmarking tasks it supports (including some novel tasks it proposes), the languages and communities it supports, and its contextualisation and implementation of a unique data governance model (not yet implemented anywhere else in the world). This model mandates that all datasets released by the project, and their derivatives (including models), are co-owned by the community members and all contributors to the project; any permission to use a dataset is not a transfer of ownership but a revocable license to use it. The conditions under which the license may be revoked are set out in the BoLI License. Complete details of the project, the languages and communities supported so far, the quantum of data available so far, the data governance model and other relevant documents are publicly accessible on the project website.

Data Governance and Ethics in Project BoLI

Project BoLI represents our commitment not only to fair remuneration for the speakers of the language but also to co-ownership and equal IPR for all the contributors who have built the dataset. We believe this is the first step in moving away from extractive data collection and use practices and towards fair treatment of community members. Accordingly, we have listed all speakers and transcribers as Contributors to the dataset (we insist that they are co-owners of the dataset, even though the HuggingFace platform does not give us an explicit way of stating that), and they are further recognised as Speakers and Annotators of the dataset. This dataset is licensed to other researchers only for use in their research projects. More details about licensing and commercial use conditions are given in the License and Commercial Use sections.

Project BoLI - Data Governance Policy

BoLI Ethics Principle & Pledge

Project BoLI - Digital Consent Form

Project BoLI - Field Recording of Oral Consent

Project BoLI - TnC

The BoLI Network

The BoLI Network is the comprehensive network of collaborators and co-owners of the dataset produced as part of Project BoLI. It consists of all the organisations, individuals, communities and other stakeholders who have contributed to the project and have ensured the development of a truly sovereign, community-driven and community-centred dataset for AI development.