Tangkhul AI
Beta preview · 0 words / 1 sentence / 0 voice approved

Build a Tangkhul Naga AI dataset, one contribution at a time.

A small, focused tool for collecting words, sentences, and voice from speakers of Tangkhul — versioned, reviewable, and exportable for Whisper, LLM SFT, or TTS training.

Three phases, one shared dataset

Start with words, build up to natural speech. Each phase feeds the next.

1. Words

Translate ~20,000 English prompts into Tangkhul. Capture meaning, part of speech, IPA, and at least one real-life usage example per word.

  • Multiple variations per prompt
  • Orthography validator
  • Versioned corrections

2. Sentences

Translate ~770 English sentences, including short mini-dialogues for register and pragmatics.

  • Register tags (formal / casual / elder)
  • Mini-dialogue support
  • Reviewer-approved before training

3. Voice

Record accepted Tangkhul sentences. Every clip is automatically checked for signal-to-noise ratio (SNR) and for word error rate (WER) on a Whisper round-trip transcription before it joins the dataset.

  • ≥ 3 distinct speakers
  • SNR ≥ 15 dB, WER < 30%
  • Exportable as 16 kHz mono WAV
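As a rough illustration of the thresholds above, here is a minimal Python sketch of a clip quality gate. It assumes signal and noise power have already been measured (for example from speech and silence segments) and that the Whisper transcript is available as plain text. The function names are hypothetical, and the real checks may well be computed differently.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from two power estimates."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("power estimates must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


def passes_quality_gate(snr: float, wer: float,
                        min_snr_db: float = 15.0,
                        max_wer: float = 0.30) -> bool:
    """Mirror the stated thresholds: SNR >= 15 dB and WER < 30%."""
    return snr >= min_snr_db and wer < max_wer
```

Note that the WER bound is strict while the SNR bound is inclusive, matching how the thresholds are written in the list.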

Built for the way real people work

Focus mode

One prompt at a time. Sticky progress at the top, big readable English, autofocused Tangkhul field. Existing variations appear as chips you can click to build on.

Versioned, never destructive

Every edit creates an audit-log row. Admins can diff and revert at any time. Submissions wait in a review queue before they enter the training dataset.
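One way to picture non-destructive versioning is an append-only history, where a revert is just another row. The sketch below is illustrative only: the class names, fields, and placeholder strings are assumptions, not the tool's actual schema.

```python
import difflib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AuditRow:
    version: int
    entry_id: str
    text: str
    editor: str
    note: str


@dataclass
class VersionedEntry:
    entry_id: str
    history: List[AuditRow] = field(default_factory=list)

    def edit(self, text: str, editor: str, note: str = "edit") -> AuditRow:
        # Append a new row; earlier rows are never modified or removed.
        row = AuditRow(len(self.history) + 1, self.entry_id, text, editor, note)
        self.history.append(row)
        return row

    def current(self) -> Optional[str]:
        return self.history[-1].text if self.history else None

    def diff(self, old: int, new: int) -> List[str]:
        # Unified diff between two stored versions (1-based).
        a = self.history[old - 1].text.splitlines()
        b = self.history[new - 1].text.splitlines()
        return list(difflib.unified_diff(a, b, f"v{old}", f"v{new}", lineterm=""))

    def revert(self, version: int, admin: str) -> AuditRow:
        # Reverting re-appends the old text, so the revert is itself audited.
        if not 1 <= version <= len(self.history):
            raise ValueError(f"no such version: {version}")
        target = self.history[version - 1]
        return self.edit(target.text, admin, note=f"revert to v{version}")
```

Because a revert adds a row rather than deleting one, the full trail of who changed what stays intact.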

Admin approval

Self-signup is open, but a new account is pending until an admin approves it. No one gets to write to the dataset uninvited.
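The approval rule can be sketched as a small state check. This is a hypothetical model with assumed names and fields, meant only to show the idea that new accounts start as pending and cannot write until an admin approves them.

```python
from dataclasses import dataclass
from enum import Enum


class AccountStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass
class Account:
    email: str
    is_admin: bool = False
    # Every self-signup starts as pending.
    status: AccountStatus = AccountStatus.PENDING


def approve(admin: Account, account: Account) -> None:
    """Only an approved admin may approve another account."""
    if not (admin.is_admin and admin.status is AccountStatus.APPROVED):
        raise PermissionError("only an approved admin can approve accounts")
    account.status = AccountStatus.APPROVED


def can_write(account: Account) -> bool:
    """Pending accounts cannot write to the dataset."""
    return account.status is AccountStatus.APPROVED
```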

Export-ready

When the dataset feels full, one click produces TSV + JSONL for Whisper, LLM SFT, and TTS — with an EXPORT_README.md describing the schema.
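To give a feel for what a TSV + JSONL export involves, here is a minimal sketch using only the Python standard library. The column names and the `status` field are assumptions for illustration. The actual schema is whatever EXPORT_README.md describes.

```python
import csv
import io
import json
from typing import Dict, Iterable, List


def approved_only(rows: Iterable[Dict]) -> List[Dict]:
    """Keep only rows that passed review (assumed 'status' field)."""
    return [r for r in rows if r.get("status") == "approved"]


def export_tsv(rows: Iterable[Dict], fieldnames: List[str]) -> str:
    """Serialize rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # drop keys that are not exported columns
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def export_jsonl(rows: Iterable[Dict]) -> str:
    """Serialize rows as one JSON object per line, keeping non-ASCII text."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in rows)
```

Passing `ensure_ascii=False` keeps IPA and other non-ASCII characters readable in the output instead of escaping them.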

Status

Build is complete. Sign-up is open to native Tangkhul speakers — let's fill the dataset.

  1. L0 · Bootstrap · done
  2. L1 · Schema + prompt import · done
  3. L2 · Word form + lexicon browser · done
  4. L3 · Admin shell + dashboard · done
  5. L4 · Review queue · done
  6. L5 · Sentence phase · done
  7. L6 · Voice phase · done
  8. L7 · Export (TSV/JSONL/SFT) · done
  9. L8 · Settings + audit polish · done

AI training starts when you decide the dataset is full. Every submission goes through admin review and is fully versioned.

Currently waiting for contributors: 49,661 word prompts and 38,550 sentence prompts are open. Sign up →