From Digits to Words: Practical Approaches to Number Translation
Overview
This guide explains practical methods for converting numeric digits into their written-word forms across languages and contexts. It covers rule-based, data-driven, and hybrid approaches, and highlights common challenges and best practices for accuracy, localization, and maintainability.
When to use digit-to-word conversion
- Text-to-speech systems
- Voice assistants and IVR scripts
- Financial documents, checks, and invoices
- Language learning apps and accessibility tools
- Natural-language generation (reports, summaries)
Approaches
1) Rule-based (deterministic) conversion
- Implement language-specific grammar rules mapping digits and place values to words.
- Handle units (ones, tens, hundreds, thousands), ordinal forms, and punctuation.
- Pros: predictable, explainable, fast, no training data needed.
- Cons: labor-intensive to support many languages and edge cases (compound numbers, irregulars).
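The rule-based approach can be sketched in a few lines. The following is a minimal illustrative English cardinal converter (not a production library); it handles non-negative integers below one billion and uses US style without "and":

```python
# Minimal rule-based English cardinal converter (illustrative sketch).
# Handles 0 <= n < 10**9, US style ("one hundred one", no "and").

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def below_thousand(n: int) -> str:
    """Spell out 0 <= n < 1000 using place-value rules."""
    if n < 20:
        return ONES[n]
    if n < 100:
        word = TENS[n // 10]
        return word + ("-" + ONES[n % 10] if n % 10 else "")
    word = ONES[n // 100] + " hundred"
    return word + (" " + below_thousand(n % 100) if n % 100 else "")

def to_words(n: int) -> str:
    """Spell out 0 <= n < 10**9 by splitting into three-digit groups."""
    if n < 1000:
        return below_thousand(n)
    parts = []
    for value, scale in ((10**6, "million"), (10**3, "thousand")):
        if n >= value:
            parts.append(below_thousand(n // value) + " " + scale)
            n %= value
    if n:
        parts.append(below_thousand(n))
    return " ".join(parts)
```

Note how the irregular teens and tens live in lookup tables while everything else is composed recursively; supporting a new language mostly means replacing those tables and the grouping logic.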
2) Locale-aware libraries and CLDR data
- Use existing libraries or CLDR (Unicode Common Locale Data Repository) to obtain locale rules for number formatting and spelled-out numbers.
- Pros: leverages community-maintained data, reduces duplication.
- Cons: coverage varies; may need customization for domain-specific wording.
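As a small illustration of consuming locale data, the sketch below hard-codes separator symbols for three locales; in practice these values come from CLDR itself or a library such as PyICU or Babel rather than a hand-written dict:

```python
# Sketch: locale profiles in the spirit of CLDR data. The separator values
# below are illustrative; real systems should read them from CLDR or a
# wrapper library (PyICU, Babel) instead of hard-coding them.

LOCALE_PROFILES = {
    "en-US": {"group": ",", "decimal": "."},
    "de-DE": {"group": ".", "decimal": ","},
    "fr-FR": {"group": "\u202f", "decimal": ","},  # narrow no-break space
}

def parse_localized_number(text: str, locale: str) -> float:
    """Normalize a locale-formatted numeral string to a float."""
    profile = LOCALE_PROFILES[locale]
    text = text.replace(profile["group"], "").replace(profile["decimal"], ".")
    return float(text)
```

Separating "what the locale does" (data) from "how conversion works" (code) is exactly what makes CLDR-style data reusable across engines.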
3) Statistical / ML approaches
- Train sequence-to-sequence models to map digits to words using parallel corpora (digit strings ↔ spelled-out forms).
- Useful for high-variability cases (noisy input, OCR errors).
- Pros: robust to input variation, can learn exceptions.
- Cons: needs data, less transparent, may make unpredictable errors.
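One practical step for this approach is corpus construction. The sketch below builds synthetic parallel pairs with OCR-style noise injected on the digit side; the digit-by-digit spelling (as used for phone numbers) and the confusion table are illustrative assumptions, not a real OCR model:

```python
import random

# Sketch: building a parallel corpus (digit string <-> spelled-out form) for
# a sequence-to-sequence model, with simple OCR-style noise on the source
# side. The lookup table and confusion map are illustrative stand-ins.

WORDS = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four",
         5: "five", 6: "six", 7: "seven", 8: "eight", 9: "nine"}
OCR_CONFUSIONS = {"0": "O", "1": "l", "5": "S"}  # look-alike misreads

def noisy(digits: str, rate: float, rng: random.Random) -> str:
    """Randomly swap digits for look-alike characters to mimic OCR errors."""
    return "".join(
        OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and rng.random() < rate
        else ch
        for ch in digits
    )

def make_pairs(n_samples: int, seed: int = 0) -> list[tuple[str, str]]:
    """Generate (possibly noisy digit string, digit-by-digit spelling) pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_samples):
        n = rng.randint(0, 99)
        spelled = " ".join(WORDS[int(d)] for d in str(n))
        pairs.append((noisy(str(n), 0.3, rng), spelled))
    return pairs
```

Because the target side is generated by deterministic rules, the model only has to learn to undo the noise, which is where ML earns its keep.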
4) Hybrid systems
- Combine rule-based normalization with ML for ambiguous or noisy segments.
- Example: use rules for standard numbers; model handles colloquial or OCR outputs.
- Pros: balances accuracy and flexibility.
- Cons: two components to maintain; routing logic between them needs care.
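The dispatch logic of such a hybrid can be very small. In this sketch, clean digit tokens go through a deterministic rule and everything else falls through to a stub standing in for a trained model (here a toy character-repair heuristic, purely for illustration):

```python
import re

# Sketch of a hybrid dispatcher: deterministic rules for clean digit strings,
# with a stubbed "model" for noisy input. `ml_normalize` is a placeholder
# for a trained component, not a real implementation.

SMALL = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
         "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def rule_spell(token: str) -> str:
    """Deterministic digit-by-digit spelling for clean numeric tokens."""
    return " ".join(SMALL[ch] for ch in token)

def ml_normalize(token: str) -> str:
    """Placeholder for a learned model handling noisy/OCR input."""
    cleaned = re.sub(r"[Oo]", "0", re.sub(r"[lI]", "1", token))  # toy repair
    return rule_spell(cleaned) if cleaned.isdigit() else token

def spell_token(token: str) -> str:
    if token.isdigit():          # clean case: rules are cheap and exact
        return rule_spell(token)
    return ml_normalize(token)   # noisy case: fall back to the model
```

Logging which path each token took makes it easy to audit where the model is actually being used.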
Key challenges
- Language-specific irregularities (e.g., French “quatre-vingt-dix” for 90).
- Gender and agreement in languages that inflect numbers.
- Ordinals vs. cardinals, fractions, percentages, and currency.
- Large numbers and grouping conventions (short vs. long scale: 10⁹ is “billion” in the short scale but “milliard” or “thousand million” in the long scale).
- Contextual interpretation (dates, times, measurements, phone numbers).
- Handling leading zeros, decimals, and separators.
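The decimals and leading-zeros challenge has a conventional answer in speech: read fractional digits one at a time ("3.14" becomes "three point one four"). A minimal sketch, with the integer part also read digit-by-digit for brevity:

```python
# Sketch: decimals and leading zeros. Fractional digits are read one at a
# time, as in speech; the integer part is also read digit-by-digit here
# to keep the example short.

DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def spell_decimal(text: str) -> str:
    """Spell a decimal string such as '3.14' or '0.05'."""
    int_part, _, frac_part = text.partition(".")
    words = " ".join(DIGITS[d] for d in int_part)
    if frac_part:
        words += " point " + " ".join(DIGITS[d] for d in frac_part)
    return words
```

Because the digits are read individually, leading zeros survive ("0.05" keeps its "zero"), which a naive float conversion would silently destroy.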
Best practices
- Build clear locale profiles: cardinal/ordinal rules, gender rules, scale (short/long), grouping separators.
- Normalize input first: remove extraneous punctuation, detect numeric formats (dates, times, currencies).
- Prioritize deterministic rules for common, well-defined patterns.
- Use ML selectively for noisy or domain-specific inputs and log outputs for manual review.
- Provide configurable style options (e.g., “and” inclusion, hyphenation, short/long scale).
- Include comprehensive tests and examples per locale, plus fallback rules.
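The "normalize input first" practice usually starts with format detection: classify each token before any spelling rules run. The patterns below are illustrative, not exhaustive:

```python
import re

# Sketch of an input-normalization step: classify a token's numeric format
# before spelling rules run. Patterns are illustrative, not exhaustive.

PATTERNS = [
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("time", re.compile(r"^\d{1,2}:\d{2}$")),
    ("currency", re.compile(r"^[$€£]\d[\d,]*(\.\d+)?$")),
    ("decimal", re.compile(r"^\d+\.\d+$")),
    ("cardinal", re.compile(r"^\d[\d,]*$")),
]

def detect_format(token: str) -> str:
    """Return the first matching format name, or 'unknown'."""
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return "unknown"
```

Order matters: the more specific patterns (date, time, currency) must be tried before the catch-all cardinal pattern.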
Implementation checklist
- Define supported locales and styles.
- Gather CLDR data or language rule specs.
- Implement normalization pipeline (strip, detect formats).
- Implement core rule engine for cardinals/ordinals.
- Add ML component if needed and train on curated pairs.
- Create unit/integration tests with edge cases.
- Expose API with options for locale, style, and context hints.
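The checklist's final step can be sketched as an API surface: a single entry point taking locale, style, and context hints. All names here are hypothetical, and only a trivial conversion path is implemented to show the shape:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of a public API surface with locale, style, and context options.
# Names are hypothetical; only a trivial path is implemented.

@dataclass
class SpellOptions:
    locale: str = "en-US"
    style: str = "cardinal"        # or "ordinal", "year", ...
    use_and: bool = False          # "one hundred and one" vs "one hundred one"
    context: Optional[str] = None  # hint: "date", "currency", "phone", ...

def spell(number: str, options: SpellOptions = SpellOptions()) -> str:
    """Dispatch to the appropriate rule set (only single digits shown)."""
    lookup = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
              "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
    if options.style == "cardinal" and number in lookup:
        return lookup[number]
    raise NotImplementedError(f"{options.style} for {number!r} not supported")
```

Keeping style choices in an options object (rather than boolean arguments scattered across functions) makes it straightforward to add locale-specific flags later without breaking callers.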
Example (English)
- 0 → “zero”
- 5 → “five”
- 42 → “forty-two”
- 1999 → “one thousand nine hundred ninety-nine”
- 1,000,000 → “one million”
Further reading
- CLDR number spelling rules
- Papers on text normalization and written-to-spoken conversion