Beta preview · 0 words / 1 sentence / 0 voice clips approved

Build a Tangkhul Naga AI dataset, one contribution at a time.

A small, focused tool for collecting words, sentences, and voice from speakers of Tangkhul — versioned, reviewable, and exportable for Whisper, LLM SFT, or TTS training.


Three phases, one shared dataset

Start with words, build up to natural speech. Each phase feeds the next.

1. Words

Translate ~20,000 English prompts into Tangkhul. Capture meaning, part of speech, IPA, and at least one real-life usage example per word.

Multiple variations per prompt
Orthography validator
Versioned corrections
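As a rough illustration of what one word contribution might carry, here is a minimal sketch of a record with a required-field check. The class name, field names, and part-of-speech labels are assumptions made for this example, not the tool's actual schema, and the check covers only presence of fields, not Tangkhul orthography rules.

```python
from dataclasses import dataclass, field

# Hypothetical part-of-speech labels; the tool's real tag set is not documented here.
ALLOWED_POS = {"noun", "verb", "adjective", "adverb", "pronoun", "particle", "other"}


@dataclass
class WordEntry:
    """One contributed translation of an English prompt (illustrative schema only)."""
    prompt_en: str   # English prompt being translated
    tangkhul: str    # contributor's Tangkhul translation
    meaning: str     # short gloss of the meaning
    pos: str         # part of speech
    ipa: str         # IPA transcription
    examples: list[str] = field(default_factory=list)  # real-life usage examples


def validate_entry(entry: WordEntry) -> list[str]:
    """Return a list of problems; an empty list means every required field is present."""
    problems = []
    if not entry.tangkhul.strip():
        problems.append("Tangkhul translation is empty")
    if not entry.meaning.strip():
        problems.append("meaning is empty")
    if entry.pos not in ALLOWED_POS:
        problems.append(f"unknown part of speech: {entry.pos!r}")
    if not entry.ipa.strip():
        problems.append("IPA transcription is empty")
    if not any(example.strip() for example in entry.examples):
        problems.append("at least one usage example is required")
    return problems
```

Several `WordEntry` objects can share the same `prompt_en`, which is one simple way to model multiple variations per prompt.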

2. Sentences

Translate ~770 English sentences, including short mini-dialogues for register and pragmatics.

Register tags (formal / casual / elder)
Mini-dialogue support
Reviewer-approved before training

3. Voice

Record accepted Tangkhul sentences. Every clip is automatically checked for signal-to-noise ratio (SNR) and for word error rate (WER) on a Whisper round-trip transcription before it joins the dataset.

≥ 3 distinct speakers
SNR ≥ 15 dB, WER < 30%
Exportable as 16 kHz mono WAV
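The two thresholds above can be sketched as a simple gate. This is a minimal sketch, not the tool's implementation: it assumes the signal and noise power have already been measured and that the Whisper transcript is supplied as a string, and the function names are hypothetical. WER is computed here as word-level edit distance divided by the number of reference words.

```python
import math

MIN_SNR_DB = 15.0   # threshold stated on this page
MAX_WER = 0.30      # threshold stated on this page


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from two positive power measurements."""
    return 10.0 * math.log10(signal_power / noise_power)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def clip_passes(snr: float, reference: str, transcript: str) -> bool:
    """True when the clip meets both the SNR and the WER threshold."""
    return snr >= MIN_SNR_DB and word_error_rate(reference, transcript) < MAX_WER
```

With a four-word reference, one substituted word gives a WER of 0.25, which passes; two substituted words give 0.50, which does not.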

Built for the way real people work

Focus mode

One prompt at a time. Sticky progress at the top, big readable English, autofocused Tangkhul field. Existing variations appear as chips you can click to build on.

Versioned, never destructive

Every edit creates an audit-log row. Admins can diff and revert any time. Submissions wait in a review queue before they enter the training dataset.
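One way to picture "never destructive" editing is an append-only log in which a revert is itself a new row. The sketch below is an assumption about how such a log could work, with hypothetical class and field names, and is not a description of the tool's database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRow:
    """One immutable row in the edit history (illustrative fields only)."""
    version: int
    entry_id: str
    editor: str
    old_value: Optional[str]
    new_value: str


class VersionedField:
    """A single editable value whose full history is kept in an append-only log."""

    def __init__(self, entry_id: str) -> None:
        self.entry_id = entry_id
        self.log: list[AuditRow] = []

    @property
    def current(self) -> Optional[str]:
        """Latest value, or None if nothing has been submitted yet."""
        return self.log[-1].new_value if self.log else None

    def edit(self, editor: str, new_value: str) -> AuditRow:
        """Record a change by appending a row; earlier rows are never modified."""
        row = AuditRow(len(self.log) + 1, self.entry_id, editor, self.current, new_value)
        self.log.append(row)
        return row

    def revert_to(self, version: int, admin: str) -> AuditRow:
        """Restore an earlier value by appending a new row that copies it."""
        if not 1 <= version <= len(self.log):
            raise ValueError(f"no such version: {version}")
        return self.edit(admin, self.log[version - 1].new_value)
```

Because a revert only appends, the diff between any two versions stays available after the revert.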

Admin approval

Self-signup is open, but a new account is pending until an admin approves it. No one gets to write to the dataset uninvited.

Export-ready

When the dataset feels full, one click produces TSV + JSONL for Whisper, LLM SFT, and TTS — with an EXPORT_README.md describing the schema.
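For a sense of what the two export formats look like, here is a minimal sketch that writes the same approved rows as TSV and as JSONL using only the Python standard library. The column names and the `status` field are assumptions for illustration; the real schema is the one described in EXPORT_README.md.

```python
import csv
import io
import json

# Hypothetical export columns; the actual schema lives in EXPORT_README.md.
FIELDS = ["prompt_en", "tangkhul"]


def approved_only(rows: list[dict]) -> list[dict]:
    """Keep only rows whose (assumed) status field marks them as approved."""
    return [row for row in rows if row.get("status") == "approved"]


def to_tsv(rows: list[dict], fieldnames: list[str]) -> str:
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # drop keys such as status that are not exported
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_jsonl(rows: list[dict], fieldnames: list[str]) -> str:
    """Serialize rows as one JSON object per line, keeping non-ASCII text readable."""
    lines = []
    for row in rows:
        record = {name: row.get(name, "") for name in fieldnames}
        lines.append(json.dumps(record, ensure_ascii=False) + "\n")
    return "".join(lines)
```

Passing `ensure_ascii=False` keeps any non-ASCII characters in the Tangkhul text as written rather than as escape sequences.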

Status

The build is complete, and sign-up is open to native Tangkhul speakers. Let's fill the dataset.

L0 · Bootstrap · done
L1 · Schema + prompt import · done
L2 · Word form + lexicon browser · done
L3 · Admin shell + dashboard · done
L4 · Review queue · done
L5 · Sentence phase · done
L6 · Voice phase · done
L7 · Export (TSV/JSONL/SFT) · done
L8 · Settings + audit polish · done

AI training starts when you decide the dataset is full. Every submission goes through admin review and is fully versioned.

Currently waiting for contributors: 49,661 word prompts and 38,550 sentence prompts are open.