1. Words
Translate ~20,000 English prompts into Tangkhul. Capture meaning, part of speech, IPA, and at least one real-life usage example per word.
- Multiple variations per prompt
- Orthography validator
- Versioned corrections
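As a rough illustration of what one word entry and an orthography check might look like, here is a minimal sketch. The `WordEntry` fields mirror the items listed above, but the field names, the `ALLOWED_CHARS` set, and both helper functions are assumptions made for this example; the tool's actual validator rules are not described here.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical character set for illustration only; the real orthography
# rules used by the validator are not specified in this document.
ALLOWED_CHARS = set("abcdefghijklmnopqrstuvwxyz' -")


@dataclass
class WordEntry:
    prompt: str            # English prompt shown to the contributor
    translation: str       # Tangkhul translation typed by the contributor
    part_of_speech: str
    ipa: str
    examples: List[str] = field(default_factory=list)


def orthography_errors(text, allowed=ALLOWED_CHARS):
    """Return the sorted, unique characters in `text` outside `allowed`."""
    return sorted({ch for ch in text.lower() if ch not in allowed})


def validate_entry(entry):
    """Collect human-readable problems; an empty list means the entry passes."""
    problems = []
    if not entry.translation.strip():
        problems.append("translation is empty")
    bad = orthography_errors(entry.translation)
    if bad:
        problems.append("unexpected characters: " + "".join(bad))
    if not entry.examples:
        problems.append("at least one usage example is required")
    return problems
```

An entry that passes returns an empty list, so a form could show the returned strings directly next to the Tangkhul field.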
A small, focused tool for collecting words, sentences, and voice from speakers of Tangkhul — versioned, reviewable, and exportable for Whisper, LLM SFT, or TTS training.
Start with words, build up to natural speech. Each phase feeds the next.
Translate ~770 English sentences, including short mini-dialogues for register and pragmatics.
Record accepted Tangkhul sentences. Every clip is automatically checked for signal-to-noise ratio (SNR) and for word error rate (WER) on a Whisper round-trip transcription before it joins the dataset.
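The two clip checks can be sketched as follows. This is a minimal example under stated assumptions, not the tool's implementation: the transcript is taken as an input string (the Whisper call itself is left out), the SNR is estimated from a speech segment and a separate noise-only segment, and the thresholds `min_snr_db` and `max_wer` are hypothetical values chosen for illustration.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def snr_db(signal_samples, noise_samples):
    """SNR in dB from the mean power of a speech segment and a noise segment."""
    def mean_power(xs):
        return sum(x * x for x in xs) / len(xs)

    signal = mean_power(signal_samples)
    noise = mean_power(noise_samples)
    if noise == 0:
        return math.inf
    if signal == 0:
        return -math.inf
    return 10 * math.log10(signal / noise)


def clip_passes(reference, transcript, signal, noise,
                min_snr_db=20.0, max_wer=0.3):
    """Hypothetical gate: both checks must pass for the clip to be accepted."""
    return (snr_db(signal, noise) >= min_snr_db
            and word_error_rate(reference, transcript) <= max_wer)
```

Here `reference` is the accepted sentence the speaker was asked to read and `transcript` is whatever the speech model returned for the recorded clip.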
One prompt at a time. Sticky progress at the top, big readable English, autofocused Tangkhul field. Existing variations appear as chips you can click to build on.
Every edit creates an audit-log row. Admins can diff and revert any time. Submissions wait in a review queue before they enter the training dataset.
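One way to picture the audit log is an append-only list of rows, where a revert is just another edit that restores earlier text. The sketch below assumes this design; the class and field names (`AuditRow`, `VersionedSubmission`, `editor`) are hypothetical and not taken from the tool.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRow:
    version: int
    editor: str
    text: str


class VersionedSubmission:
    """Append-only history: edits and reverts both add a new row."""

    def __init__(self, text, editor):
        self.log = [AuditRow(1, editor, text)]

    @property
    def current(self):
        return self.log[-1].text

    def _row(self, version):
        if not 1 <= version <= len(self.log):
            raise KeyError(f"no such version: {version}")
        return self.log[version - 1]

    def edit(self, text, editor):
        row = AuditRow(len(self.log) + 1, editor, text)
        self.log.append(row)
        return row

    def revert(self, version, admin):
        # Restoring old text is recorded as a new row, so nothing is erased.
        return self.edit(self._row(version).text, admin)

    def diff(self, old, new):
        """Unified diff between two versions, as a list of lines."""
        return list(difflib.unified_diff(
            [self._row(old).text], [self._row(new).text],
            fromfile=f"v{old}", tofile=f"v{new}", lineterm=""))
```

Because reverts append rather than delete, the log itself records who reverted what and when it happened in the sequence.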
Self-signup is open, but a new account is pending until an admin approves it. No one gets to write to the dataset uninvited.
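The approval gate amounts to a small state check before any write. The following sketch assumes two account states and a single permission helper; the names `AccountStatus`, `Account`, `approve`, and `can_write` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class AccountStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass
class Account:
    username: str
    is_admin: bool = False
    # New self-signups start as pending until an admin approves them.
    status: AccountStatus = AccountStatus.PENDING


def approve(admin, account):
    """Only an admin may move an account from pending to approved."""
    if not admin.is_admin:
        raise PermissionError("only admins can approve accounts")
    account.status = AccountStatus.APPROVED


def can_write(account):
    """Pending accounts can sign in but cannot submit to the dataset."""
    return account.status is AccountStatus.APPROVED
```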
When the dataset feels full, one click produces TSV + JSONL for Whisper, LLM SFT, and TTS, with an EXPORT_README.md describing the schema.
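A minimal sketch of writing the same rows as both TSV and JSONL is shown below. The column names in `FIELDS` are assumptions for illustration; the actual schema is the one documented in EXPORT_README.md.

```python
import csv
import io
import json

# Hypothetical columns; the real schema is described in EXPORT_README.md.
FIELDS = ["id", "prompt", "translation", "audio_path"]


def to_tsv(rows):
    """Render rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing keys become empty cells rather than raising an error.
        writer.writerow({key: row.get(key, "") for key in FIELDS})
    return buf.getvalue()


def to_jsonl(rows):
    """Render rows as one JSON object per line, keeping non-ASCII text as-is."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)
```

Passing `ensure_ascii=False` keeps any non-ASCII characters in the translations readable in the exported file instead of turning them into escape sequences.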
Build is complete. Sign-up is open to native Tangkhul speakers — let's fill the dataset.
AI training starts when you decide the dataset is full. Every submission goes through admin review and is fully versioned.
Currently waiting for contributors: 49,661 word prompts and 38,550 sentence prompts are open.