High-accuracy language detection for chat, SMS, and conversational text. FastLangML combines multiple detection backends (FastText, Lingua, langdetect, pyCLD3) into a powerful ensemble that outperforms any single detector.
Key Features:
- 170+ languages supported via multi-backend ensemble
- Context-aware detection – tracks conversation history to resolve ambiguous short messages like “ok”, “si”, “bien”
- Code-switching detection – identifies mixed-language messages (Spanglish, Franglais, Hinglish)
- Slang & abbreviations – built-in hints for chat lingo (“thx”, “mdr”, “jaja”)
- Confusion resolution – handles similar language pairs (Spanish/Portuguese, Norwegian/Danish/Swedish)
- Extensible – add custom backends, voting strategies, and hint dictionaries
Traditional language detectors are trained on well-formed sentences and documents. They fail on the kind of text you see in real conversations:
"Bonjour!" -> French (correct)
"Comment ca va?" -> French (correct)
"Bien" -> Spanish? French? German? (WRONG)
"ok" -> English? Universal? (AMBIGUOUS)
"thx" -> Unknown (FAIL)
Why this happens:
- Short text has low statistical signal
- Words like “ok”, “taxi”, “pizza” exist in many languages
- Chat slang (“thx”, “mdr”, “jaja”) isn’t in training data
- No context from surrounding messages
FastLangML solves this by:
- Tracking conversation context to disambiguate short messages
- Using hint dictionaries for slang and common words
- Combining multiple detection backends for robustness
- Returning “unknown” (und) when uncertain instead of wrong guesses
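The last point can be sketched as a simple threshold rule. This is illustrative plain Python, not FastLangML's internals; the function name and threshold value are hypothetical:

```python
# Illustrative sketch: return "und" instead of guessing when the best
# score is below a confidence threshold (names here are hypothetical).
def pick_language(scores: dict[str, float], threshold: float = 0.7) -> str:
    """Return the best-scoring language, or "und" when uncertain."""
    if not scores:
        return "und"
    lang, confidence = max(scores.items(), key=lambda kv: kv[1])
    return lang if confidence >= threshold else "und"

print(pick_language({"fr": 0.95, "es": 0.03}))  # clear winner -> "fr"
print(pick_language({"es": 0.40, "fr": 0.38}))  # too close to call -> "und"
```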
# Full installation with all backends (recommended)
pip install fastlangml[all]
# Minimal installation (fasttext only)
pip install fastlangml[fasttext]
# Pick specific backends
pip install fastlangml[fasttext,lingua]
pip install fastlangml[langdetect]
Available backends:
| Backend | Languages | Speed | Accuracy | Install Extra |
|---|---|---|---|---|
| fasttext | 176 | Fast | High | [fasttext] |
| fastlangid | 177 | Fast | High (CJK) | [fastlangid] |
| lingua | 75 | Medium | Very High | [lingua] |
| langdetect | 55 | Fast | Medium | [langdetect] |
| pycld3 | 107 | Very Fast | Medium | [pycld3] |
| langid | 97 | Fast | Medium | [langid] |
Note: fastlangid provides improved accuracy for Japanese, Korean, Chinese, and Cantonese detection.
Optional features:
| Feature | Description | Install Extra |
|---|---|---|
| spaCy NER | Named entity recognition for proper noun filtering | [spacy] |
| fuzzy | Fuzzy matching for hint dictionaries | [fuzzy] |
| diskcache | Disk-based context storage (local dev) | [diskcache] |
| redis | Redis-based context storage (production) | [redis] |
The simplest API – just pass text, get language back:
from fastlangml import detect
# Direct comparison
detect("Hello") == "en" # True
detect("Bonjour") == "fr" # True
# Use as string
print(detect("Hola")) # "es"
f"Language: {detect('Ciao')}" # "Language: it"
# Full details when needed
result = detect("Hello, how are you?")
result.lang # "en"
result.confidence # 0.95
result.reliable # True
That’s it! No context, no stores, no configuration required.
FastLangML supports three usage modes – pick what fits your needs:
| Mode | Code | Use Case |
|---|---|---|
| Single-shot | detect("text") == "en" | One-off detection (most common) |
| With context | detect("text", context=ctx) | Multi-turn chat (in-memory) |
| With persistence | store.session(id) | Multi-turn across HTTP requests |
# 1. Single-shot - no setup needed
from fastlangml import detect
detect("Bonjour") == "fr" # True
# 2. Context - optional, for conversations
from fastlangml import detect, ConversationContext
ctx = ConversationContext()
detect("Bonjour", context=ctx) # tracks history in memory
detect("ok", context=ctx) == "fr" # True (uses context)
# 3. Stores - optional, for persistence across requests
from fastlangml.context import DiskContextStore
store = DiskContextStore("./data")
with store.session("user-123") as ctx:
    detect("Bonjour", context=ctx)
Note: Context and stores are 100% optional. Most users only need single-shot mode.
For multi-turn conversations, FastLangML can track context to resolve ambiguous short messages. When you pass a ConversationContext, the library:
- Remembers the last N detected languages
- Uses this history to resolve ambiguous messages
- Auto-updates the context after each detection
from fastlangml import detect, ConversationContext
# Create a context to track the conversation
context = ConversationContext()
# French conversation
detect("Bonjour!", context=context) # "fr" (clear)
detect("Comment ca va?", context=context) # "fr" (clear)
detect("Bien", context=context) # "fr"
detect("ok", context=context) == "fr" # True
# The context tracks that this is a French conversation,
# so ambiguous words resolve to French
How context resolution works:
| Message | Without Context | With Context (after French turns) |
|---|---|---|
| “ok” | Ambiguous | French (conversation language) |
| “Bien” | Spanish/French/German? | French (matches history) |
| “Si” | Spanish/Italian? | Italian (if conversation was Italian) |
| “Gracias” | Spanish | Spanish (high confidence, ignores context) |
What is conversation context?
ConversationContext maintains a sliding window of recent language detections. It tracks:
- The detected language of each message
- The confidence score of each detection
- A weighted history favoring recent messages (decay factor)
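The decay-weighted history can be sketched in plain Python. This mirrors the idea described above (recent turns weigh more), not the actual ConversationContext implementation:

```python
# Sketch of a decay-weighted language history: each older turn's
# confidence is discounted by the decay factor, so recent messages
# dominate the distribution. Illustrative only.
def weighted_distribution(history: list[tuple[str, float]],
                          decay: float = 0.8) -> dict[str, float]:
    """history is oldest-to-newest [(lang, confidence), ...]."""
    scores: dict[str, float] = {}
    n = len(history)
    for i, (lang, confidence) in enumerate(history):
        weight = decay ** (n - 1 - i)  # newest turn gets weight 1.0
        scores[lang] = scores.get(lang, 0.0) + weight * confidence
    return scores

dist = weighted_distribution([("fr", 0.9), ("fr", 0.95), ("es", 0.5)])
# "fr" still dominates despite the most recent "es" turn
```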
Configuration options:
context = ConversationContext(
    max_turns=2,      # Remember last 2 messages (default)
    decay_factor=0.8, # Weight recent messages higher
)
When context helps:
- Ambiguous short messages (“ok”, “yes”, “no”)
- Words shared across languages (“taxi”, “hotel”, “pizza”)
- Mixed script input (romanized non-Latin languages)
When context doesn’t help:
- Clear language switches (user explicitly changes language)
- High-confidence detections (context is ignored)
Example: Customer service routing
from fastlangml import detect, ConversationContext
def route_message(messages: list[str]) -> str:
    """Route conversation to correct language queue."""
    context = ConversationContext()
    for msg in messages:
        result = detect(msg, context=context)
    # Return dominant language of conversation
    return context.dominant_language or "en"
# French customer conversation
messages = ["Bonjour", "J'ai un probleme", "ok", "merci"]
queue = route_message(messages) # Returns "fr"
For multi-turn conversations across requests, persist context using the built-in stores.
DiskContextStore (local dev/testing):
pip install fastlangml[diskcache]
from fastlangml import detect
from fastlangml.context import DiskContextStore
store = DiskContextStore("./contexts", ttl_seconds=3600)
# Context manager - auto-saves on exit
with store.session(session_id) as ctx:
    result = detect(text, context=ctx, auto_update=True)
RedisContextStore (production):
pip install fastlangml[redis]
from fastlangml import detect
from fastlangml.context import RedisContextStore
store = RedisContextStore("redis://localhost:6379", ttl_seconds=1800)
with store.session(session_id) as ctx:
    result = detect(text, context=ctx, auto_update=True)
DIY storage (no extra deps):
# Serialize to your own storage
data = ctx.to_dict()
save_to_database(session_id, data)
# Load back
data = load_from_database(session_id)
ctx = ConversationContext.from_dict(data)
# Or use lightweight history format
ctx = ConversationContext.from_history([("fr", 0.95), ("fr", 0.9)])
Stores internally save only [(lang, confidence), ...] for efficiency.
FastLangML can combine multiple language detection backends for better accuracy. Each backend has different strengths:
| Backend | Best For | Weaknesses |
|---|---|---|
| fasttext | Speed, many languages | Less accurate on short text |
| lingua | Accuracy, short text | Slower, fewer languages |
| langdetect | General purpose | Non-deterministic by default |
| pycld3 | Speed, CLD3 compatibility | Lower accuracy |
How ensemble works:
- Text is sent to all configured backends
- Each backend returns a language prediction with confidence
- A voting strategy combines the predictions
- Final result is the language with highest combined score
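The combination step can be sketched in plain Python. This illustrates the four steps above, not FastLangML's internals; the function name is hypothetical:

```python
# Sketch: each backend's (lang, confidence) prediction is scaled by
# that backend's weight, and the language with the highest combined
# score wins. Illustrative only.
def ensemble_vote(predictions: dict[str, tuple[str, float]],
                  weights: dict[str, float]) -> str:
    scores: dict[str, float] = {}
    for backend, (lang, confidence) in predictions.items():
        scores[lang] = scores.get(lang, 0.0) + weights.get(backend, 1.0) * confidence
    return max(scores, key=scores.get)

winner = ensemble_vote(
    {"fasttext": ("it", 0.8), "lingua": ("it", 0.9), "langdetect": ("es", 0.7)},
    {"fasttext": 0.5, "lingua": 0.35, "langdetect": 0.15},
)
# it: 0.5*0.8 + 0.35*0.9 = 0.715 vs es: 0.15*0.7 = 0.105 -> "it"
```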
from fastlangml import FastLangDetector, DetectionConfig
# Configure ensemble with 3 backends
detector = FastLangDetector(
    config=DetectionConfig(
        backends=["fasttext", "lingua", "langdetect"],
        backend_weights={
            "fasttext": 0.5,    # Trust fasttext most
            "lingua": 0.35,     # Lingua for accuracy
            "langdetect": 0.15, # Langdetect as tiebreaker
        },
    )
)
result = detector.detect("Ciao, come stai?")
print(result) # "it"
print(result.backend) # "ensemble"
Voting strategies determine how to combine predictions from multiple backends.
Available strategies:
| Strategy | Description | Best For |
|---|---|---|
| weighted | Weighted average of confidence scores | Production (default) |
| hard | Majority vote (each backend = 1 vote) | Equal-trust backends |
| soft | Average of all probabilities | Well-calibrated backends |
| consensus | Require N backends to agree | High-certainty requirements |
Weighted Voting (Default)
Multiplies each backend’s confidence by its weight, then picks the language with highest weighted score.
from fastlangml import FastLangDetector, DetectionConfig
detector = FastLangDetector(
    config=DetectionConfig(
        voting_strategy="weighted",
        backend_weights={
            "fasttext": 0.6,
            "lingua": 0.4,
        },
    )
)
Hard Voting
Each backend gets one vote. Ties broken by confidence.
detector = FastLangDetector(
    config=DetectionConfig(voting_strategy="hard")
)
Consensus Voting
Only returns a result if at least N backends agree. Useful when you need high certainty.
from fastlangml import ConsensusVoting, FastLangDetector, DetectionConfig
detector = FastLangDetector(
    config=DetectionConfig(
        custom_voting=ConsensusVoting(min_agreement=2)
    )
)
Hints are word-to-language mappings that override backend detection. Essential for:
- Chat slang (“thx” -> English, “mdr” -> French)
- Company-specific terms
- Ambiguous words you want to force
Built-in hints for short text:
from fastlangml import FastLangDetector, HintDictionary
# Load default hints for chat/SMS
hints = HintDictionary.default_short_words()
detector = FastLangDetector(hints=hints)
detector.detect("thx") # "en" (thanks)
detector.detect("mdr") # "fr" (mort de rire = LOL)
detector.detect("jaja") # "es" (Spanish laugh)
Adding custom hints:
from fastlangml import FastLangDetector
detector = FastLangDetector()
# Add hints for your domain
detector.add_hint("asap", "en")
detector.add_hint("btw", "en")
detector.add_hint("stp", "fr") # s'il te plait
# Hints override backend detection
detector.detect("asap") # "en"
Hint priority:
Hints are checked before backend detection. If a hint matches:
- The hint’s language gets a confidence boost
- For very short text, the hint can decide the result outright
- For longer text, hints are weighted with backend predictions
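The hint-before-backend priority can be sketched as follows. The helper name and hint table are hypothetical; the real library encapsulates this in HintDictionary:

```python
# Sketch of hint priority: hints are consulted before any backend,
# and a match on a short message overrides the backend's guess.
HINTS = {"thx": "en", "mdr": "fr", "jaja": "es"}

def detect_with_hints(text: str, backend_guess: str) -> str:
    word = text.strip().lower()
    # Hints are checked first; a match overrides the backend for short text
    if word in HINTS:
        return HINTS[word]
    return backend_guess

print(detect_with_hints("thx", backend_guess="und"))   # "en"
print(detect_with_hints("hello", backend_guess="en"))  # "en"
```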
The main function for language detection.
def detect(
    text: str,
    context: ConversationContext | None = None,
    mode: str = "default",
    auto_update: bool = True,
) -> DetectionResult
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | required | Text to detect |
| context | ConversationContext | None | Conversation context for multi-turn |
| mode | str | "default" | Detection mode: "short", "default", "long" |
| auto_update | bool | True | Automatically add result to context |
Returns: DetectionResult
@dataclass
class DetectionResult:
    lang: str           # ISO 639-1 code ("en", "fr", "und")
    confidence: float   # 0.0 to 1.0
    reliable: bool      # True if confidence >= threshold
    reason: str | None  # Why "und" was returned (if applicable)
    script: str | None  # Detected script ("latin", "cyrillic", etc.)
    backend: str        # Which backend or "ensemble"
    candidates: list    # Top-k alternatives
    meta: dict          # Timing and debug info
# Can be used directly as string
detect("Hello") == "en" # True (via __eq__)
print(detect("Bonjour")) # "fr" (via __str__)
f"{detect('Hola')}" # "es"
context = ConversationContext(
    max_turns=2,       # int: max messages to remember
    decay_factor=0.8,  # float: recency weight (0.0-1.0)
)
# Properties
context.dominant_language # Most common language in history
context.language_distribution # {lang: weighted_count}
context.last_turn # Most recent turn
# Methods
context.get_context_boost(lang) # Get boost score for a language
context.get_language_streak() # (lang, streak_count)
context.clear() # Reset context
detector = FastLangDetector(
    config=DetectionConfig(...),
    hints=HintDictionary(),
)
# Methods
detector.detect(text, context=None, mode="default")
detector.detect_batch(texts, mode="default")
detector.add_hint(word, lang)
detector.remove_hint(word)
detector.set_languages(["en", "fr", "es"]) # Restrict output
detector.available_backends()
from fastlangml import DetectionConfig
config = DetectionConfig(
    # Backend configuration
    backends=["fasttext", "lingua"],
    backend_weights={"fasttext": 0.6, "lingua": 0.4},

    # Voting
    voting_strategy="weighted",  # or "hard", "soft", "consensus"
    custom_voting=None,          # VotingStrategy instance

    # Thresholds
    thresholds={
        "short": 0.5,  # Confidence threshold for short mode
        "default": 0.7,
        "long": 0.8,
    },
    min_text_length=1,

    # Features
    filter_proper_nouns=False,  # Enable heuristic-based proper noun filtering
    use_script_filter=True,
    # spaCy NER for proper noun filtering (requires pip install fastlangml[spacy])
    #   filter = ProperNounFilter(use_spacy=True)
    #   Filters: PERSON, ORG, GPE, LOC, FAC, NORP entities

    # Weights
    hint_weight=1.5,
    context_weight=0.3,
)
Create your own detection backend using the @backend decorator:
from fastlangml import backend, Backend
from fastlangml.backends.base import DetectionResult
@backend("my_detector", reliability=4)  # reliability: 1-5
class MyBackend(Backend):
    """Custom language detection backend."""

    @property
    def name(self) -> str:
        return "my_detector"

    @property
    def is_available(self) -> bool:
        # Check if required dependencies are installed
        return True

    def detect(self, text: str) -> DetectionResult:
        # Your detection logic here
        lang = "en"  # Replace with actual detection
        confidence = 0.95
        return DetectionResult(self.name, lang, confidence)

    def supported_languages(self) -> set[str]:
        return {"en", "fr", "de", "es"}
Using the custom backend:
from fastlangml import FastLangDetector, DetectionConfig
detector = FastLangDetector(
    config=DetectionConfig(
        backends=["my_detector", "fasttext"],
    )
)
Programmatic registration:
from fastlangml import register_backend, unregister_backend, list_registered_backends
# Register
register_backend("my_backend", MyBackend, reliability=4)
# List all registered
print(list_registered_backends()) # ["my_backend"]
# Unregister
unregister_backend("my_backend")
Implement your own voting logic by extending VotingStrategy:
from fastlangml import VotingStrategy, FastLangDetector, DetectionConfig
class ConfidenceOnlyVoting(VotingStrategy):
    """Pick the language with highest individual confidence."""

    def vote(
        self,
        results: list,
        weights: dict[str, float] | None = None,
    ) -> dict[str, float]:
        if not results:
            return {}
        # Find max confidence per language
        scores = {}
        for r in results:
            lang = r.language
            if lang not in scores or r.confidence > scores[lang]:
                scores[lang] = r.confidence
        return scores
# Use custom voting
detector = FastLangDetector(
    config=DetectionConfig(
        custom_voting=ConfidenceOnlyVoting()
    )
)
FastLangML handles commonly confused language pairs with specialized logic:
Supported confused pairs:
- Spanish / Portuguese
- Norwegian / Danish / Swedish
- Czech / Slovak
- Croatian / Serbian / Bosnian
- Indonesian / Malay
- Russian / Ukrainian / Belarusian
- Hindi / Urdu
from fastlangml import ConfusionResolver, LanguageSimilarity
# Resolve ambiguity between similar languages
resolver = ConfusionResolver()
# Check if languages are a known confused pair
pair = resolver.get_confused_pair({"es", "pt"}) # frozenset({"es", "pt"})
# Adjust scores based on discriminating features
scores = {"es": 0.45, "pt": 0.42}
adjusted = resolver.resolve("Eu tenho um problema", scores)
# Portuguese boosted due to "tenho" (have)
# Get discriminating features
es_features, pt_features = resolver.get_discriminating_features("es", "pt")
# es_features: ["pero", "cuando", "donde", ...]
# pt_features: ["mas", "quando", "onde", ...]
# Check language relationships
sim = LanguageSimilarity()
sim.are_related("es", "pt") # True (Romance family)
sim.are_related("en", "zh") # False (different families)
sim.get_related_languages("es") # {"pt", "fr", "it", "ro", "ca", "gl"}
Detect mixed-language messages (Spanglish, Franglais, Hinglish, etc.):
from fastlangml import CodeSwitchDetector
detector = CodeSwitchDetector()
# Detect code-switching
result = detector.detect("That's muy importante for the proyecto")
result.is_mixed # True
result.primary_language # "en"
result.secondary_languages # ["es"]
result.languages # ["en", "es"]
result.language_distribution # {"en": 0.6, "es": 0.4}
# Get language spans
for span in result.spans:
    print(f"{span.text}: {span.language} ({span.confidence:.2f})")
# "That's": en (0.85)
# "muy": es (0.92)
# "importante": es (0.95)
# "for": en (0.88)
# "the": en (0.90)
# "proyecto": es (0.94)
# Quick check
detector.is_code_switched("Hello world") # False
detector.is_code_switched("Hola, how are you?") # True
# Pattern-based detection
from fastlangml import detect_code_switching_pattern
pattern = detect_code_switching_pattern("That's muy bueno")
# ("en", "es") - matches Spanglish pattern
Supported code-switching patterns:
- Spanglish (English + Spanish)
- Franglais (English + French)
- Hinglish (English + Hindi)
- Denglish (German + English)
Tested across standard language samples, short text, and CJK:
| Test Category | Accuracy | Notes |
|---|---|---|
| Standard Languages (25 samples) | 96.0% | Full sentences across 25 languages |
| Short Text (16 samples) | 100.0% | Single words like “Hello”, “Bonjour”, “안녕” |
| CJK (6 samples) | 100.0% | Japanese, Chinese, Korean sentences |
The FastLangML ensemble matches or outperforms each individual backend on accuracy:
| Backend | Accuracy | Avg Latency |
|---|---|---|
| ensemble | 100.0% | 8.28ms |
| fasttext | 100.0% | 0.01ms |
| lingua | 90.0% | 0.18ms |
| langdetect | 80.0% | 2.84ms |
| langid | 70.0% | 0.56ms |
| Operation | Latency | Notes |
|---|---|---|
| Initialization | ~100ms | One-time, first detection |
| First detection (cold) | ~3000ms | Loads all backend models |
| Warm detection | 8.28ms | Average per detection |
| Cache hit | 0.0006ms | ~10,000x faster |
| Script short-circuit | 0.0006ms | Korean, Thai, Hebrew, etc. |
| Batch (100 texts) | 550ms | ~182 texts/sec |
FastLangML includes several performance optimizations:
| Optimization | Speedup | Description |
|---|---|---|
| Script short-circuit | ~5,500x | Skip backends for unambiguous scripts (Hangul→ko, Thai→th) |
| Result caching | ~10,000x | LRU cache for repeated texts |
| Lazy backend loading | – | Only load backends when needed |
| Adaptive parallelism | – | Skip threading for ≤2 backends or short text |
Script short-circuit languages:
- Korean (Hangul), Thai, Hebrew, Armenian, Georgian
- Tamil, Telugu, Kannada, Malayalam, Gujarati, Bengali, Punjabi, Oriya, Sinhala
- Khmer, Lao, Myanmar, Tibetan
# Run accuracy benchmarks
pytest tests/test_benchmarks.py -v
# Quick performance check
python -c "
from fastlangml import detect
import time
start = time.perf_counter()
for i in range(100):
    detect(f'Hello world {i}')
print(f'100 detections: {(time.perf_counter()-start)*1000:.0f}ms')
"
Always pass a ConversationContext when detecting messages in a conversation:
# Good: Context-aware
context = ConversationContext()
for msg in conversation:
    result = detect(msg, context=context)
# Bad: Stateless detection
for msg in conversation:
    result = detect(msg)  # Loses valuable context
- short: For SMS and chat messages. Lower confidence threshold.
- default: General purpose. Balanced threshold.
- long: For paragraphs/documents. Higher confidence threshold.
detect("ok", mode="short") # More lenient
detect("Hello world", mode="default")
detect(long_paragraph, mode="long") # More strict
If your users use specific slang or terms, add hints:
detector.add_hint("gg", "en") # Gaming
detector.add_hint("lol", "en")
detector.add_hint("mdr", "fr") # French LOL
detector.add_hint("kek", "en") # Gaming laugh
If you know the possible languages (e.g., a bilingual support queue), restrict the output:
# Only consider English and Spanish
detector.set_languages(["en", "es"])
When FastLangML is uncertain, it returns und instead of guessing wrong:
result = detect("ok")
if result == "und":  # Can compare directly
    # Fallback to default or ask user
    lang = result.candidates[0].lang if result.candidates else "en"
    print(f"Low confidence: {result.reason}")
Single backends have blind spots. Ensemble improves reliability:
# Development: Fast single backend
detector = FastLangDetector(
    config=DetectionConfig(backends=["fasttext"])
)
# Production: Reliable ensemble
detector = FastLangDetector(
    config=DetectionConfig(
        backends=["fasttext", "lingua"],
        voting_strategy="weighted",
    )
)
When processing many messages, use batch detection:
# Good: Batch processing (parallelized)
results = detector.detect_batch(messages, mode="short")
# Bad: Sequential detection
results = [detector.detect(msg) for msg in messages] # Slower
We welcome contributions! See CONTRIBUTING.md for detailed guidelines.
# Clone and setup
git clone https://github.com/pnrajan/fastlangml.git
cd fastlangml
poetry install --all-extras
# Development commands
make test # Run tests
make lint # Check code style
make format # Format code with ruff
make fix # Auto-fix linting issues
make typecheck # Run type checker
make check # Run all checks (format + lint + typecheck + test)
- Bug fixes with test cases
- New detection backends
- Performance improvements
- Documentation improvements
- New voting strategies
See CONTRIBUTING.md for:
- Project architecture overview
- Testing guidelines
- Code style requirements
- Commit message conventions
- Pull request process
MIT License – see LICENSE for details.