pnrajan/fastlangml: Fast Language Detection for Conversational Agents


High-accuracy language detection for chat, SMS, and conversational text. FastLangML combines multiple detection backends (FastText, Lingua, langdetect, pyCLD3) into a powerful ensemble that outperforms any single detector.

Key Features:

  • 170+ languages supported via multi-backend ensemble
  • Context-aware detection – tracks conversation history to resolve ambiguous short messages like “ok”, “si”, “bien”
  • Code-switching detection – identifies mixed-language messages (Spanglish, Franglais, Hinglish)
  • Slang & abbreviations – built-in hints for chat lingo (“thx”, “mdr”, “jaja”)
  • Confusion resolution – handles similar language pairs (Spanish/Portuguese, Norwegian/Danish/Swedish)
  • Extensible – add custom backends, voting strategies, and hint dictionaries

Traditional language detectors are trained on well-formed sentences and documents. They fail on the kind of text you see in real conversations:

"Bonjour!"       -> French (correct)
"Comment ca va?" -> French (correct)
"Bien"           -> Spanish? French? German? (WRONG)
"ok"             -> English? Universal? (AMBIGUOUS)
"thx"            -> Unknown (FAIL)

Why this happens:

  • Short text has low statistical signal
  • Words like “ok”, “taxi”, “pizza” exist in many languages
  • Chat slang (“thx”, “mdr”, “jaja”) isn’t in training data
  • No context from surrounding messages

FastLangML solves this by:

  1. Tracking conversation context to disambiguate short messages
  2. Using hint dictionaries for slang and common words
  3. Combining multiple detection backends for robustness
  4. Returning “unknown” (und) when uncertain instead of wrong guesses

Installation

# Full installation with all backends (recommended)
pip install fastlangml[all]

# Minimal installation (fasttext only)
pip install fastlangml[fasttext]

# Pick specific backends
pip install fastlangml[fasttext,lingua]
pip install fastlangml[langdetect]

Available backends:

Backend     Languages  Speed      Accuracy    Install Extra
fasttext    176        Fast       High        [fasttext]
fastlangid  177        Fast       High (CJK)  [fastlangid]
lingua      75         Medium     Very High   [lingua]
langdetect  55         Fast       Medium      [langdetect]
pycld3      107        Very Fast  Medium      [pycld3]
langid      97         Fast       Medium      [langid]

Note: fastlangid provides improved accuracy for Japanese, Korean, Chinese, and Cantonese detection.
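
For example, to favor CJK accuracy you could run fastlangid on its own (a sketch using the DetectionConfig API described later in this README):

from fastlangml import FastLangDetector, DetectionConfig

detector = FastLangDetector(
    config=DetectionConfig(backends=["fastlangid"])
)
print(detector.detect("こんにちは、元気ですか？"))  # expect "ja"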

Optional features:

Feature    Description                                          Install Extra
spaCy NER  Named entity recognition for proper noun filtering   [spacy]
fuzzy      Fuzzy matching for hint dictionaries                 [fuzzy]
diskcache  Disk-based context storage (local dev)               [diskcache]
redis      Redis-based context storage (production)             [redis]


Single-Shot Detection (Most Common)

The simplest API – just pass text, get language back:

from fastlangml import detect

# Direct comparison
detect("Hello") == "en"      # True
detect("Bonjour") == "fr"    # True

# Use as string
print(detect("Hola"))        # "es"
f"Language: {detect('Ciao')}"  # "Language: it"

# Full details when needed
result = detect("Hello, how are you?")
result.lang        # "en"
result.confidence  # 0.95
result.reliable    # True

That’s it! No context, no stores, no configuration required.

FastLangML supports three usage modes – pick what fits your needs:

Mode              Code                         Use Case
Single-shot       detect("text") == "en"       One-off detection (most common)
With context      detect("text", context=ctx)  Multi-turn chat (in-memory)
With persistence  store.session(id)            Multi-turn across HTTP requests

# 1. Single-shot - no setup needed
from fastlangml import detect
detect("Bonjour") == "fr"  # True

# 2. Context - optional, for conversations
from fastlangml import detect, ConversationContext
ctx = ConversationContext()
detect("Bonjour", context=ctx)  # tracks history in memory
detect("ok", context=ctx) == "fr"  # True (uses context)

# 3. Stores - optional, for persistence across requests
from fastlangml.context import DiskContextStore
store = DiskContextStore("./data")
with store.session("user-123") as ctx:
    detect("Bonjour", context=ctx)

Note: Context and stores are 100% optional. Most users only need single-shot mode.


Context-Aware Detection (Optional)

For multi-turn conversations, FastLangML can track context to resolve ambiguous short messages. When you pass a ConversationContext, the library:

  1. Remembers the last N detected languages
  2. Uses this history to resolve ambiguous messages
  3. Auto-updates the context after each detection

from fastlangml import detect, ConversationContext

# Create a context to track the conversation
context = ConversationContext()

# French conversation
detect("Bonjour!", context=context)        # "fr" (clear)
detect("Comment ca va?", context=context)  # "fr" (clear)
detect("Bien", context=context)            # "fr" 
detect("ok", context=context) == "fr"      # True 

# The context tracks that this is a French conversation,
# so ambiguous words resolve to French

How context resolution works:

Message    Without Context         With Context (after French turns)
“ok”       Ambiguous               French (conversation language)
“Bien”     Spanish/French/German?  French (matches history)
“Si”       Spanish/Italian?        Italian (if conversation was Italian)
“Gracias”  Spanish                 Spanish (high confidence, ignores context)


What is conversation context?

ConversationContext maintains a sliding window of recent language detections. It tracks:

  • The detected language of each message
  • The confidence score of each detection
  • A weighted history favoring recent messages (decay factor; sketched below)
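
A minimal sketch of how such a decay-weighted distribution could be computed (illustrative only, not the library's exact internals):

def weighted_distribution(history: list[tuple[str, float]], decay: float = 0.8) -> dict[str, float]:
    """history: (lang, confidence) pairs, oldest first."""
    dist: dict[str, float] = {}
    n = len(history)
    for i, (lang, conf) in enumerate(history):
        weight = decay ** (n - 1 - i)  # most recent turn gets weight 1.0
        dist[lang] = dist.get(lang, 0.0) + weight * conf
    return dist

print(weighted_distribution([("fr", 0.95), ("fr", 0.9), ("es", 0.6)]))
# {'fr': 1.328, 'es': 0.6} -> French dominates despite the recent Spanish turn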

Configuration options:

context = ConversationContext(
    max_turns=2,       # Remember last 2 messages (default)
    decay_factor=0.8,  # Weight recent messages higher
)

When context helps:

  • Ambiguous short messages (“ok”, “yes”, “no”)
  • Words shared across languages (“taxi”, “hotel”, “pizza”)
  • Mixed script input (romanized non-Latin languages)

When context doesn’t help:

  • Clear language switches (user explicitly changes language)
  • High-confidence detections (context is ignored)

Example: Customer service routing

from fastlangml import detect, ConversationContext

def route_message(messages: list[str]) -> str:
    """Route conversation to correct language queue."""
    context = ConversationContext()

    for msg in messages:
        detect(msg, context=context)  # updates the shared context

    # Return dominant language of conversation
    return context.dominant_language or "en"

# French customer conversation
messages = ["Bonjour", "J'ai un probleme", "ok", "merci"]
queue = route_message(messages)  # Returns "fr"

For multi-turn conversations across requests, persist context using the built-in stores.

DiskContextStore (local dev/testing):

pip install fastlangml[diskcache]

from fastlangml import detect
from fastlangml.context import DiskContextStore

store = DiskContextStore("./contexts", ttl_seconds=3600)

# Context manager - auto-saves on exit
with store.session(session_id) as ctx:
    result = detect(text, context=ctx, auto_update=True)

RedisContextStore (production):

pip install fastlangml[redis]

from fastlangml import detect
from fastlangml.context import RedisContextStore

store = RedisContextStore("redis://localhost:6379", ttl_seconds=1800)

with store.session(session_id) as ctx:
    result = detect(text, context=ctx, auto_update=True)

DIY storage (no extra deps):

# Serialize to your own storage
data = ctx.to_dict()
save_to_database(session_id, data)

# Load back
data = load_from_database(session_id)
ctx = ConversationContext.from_dict(data)

# Or use lightweight history format
ctx = ConversationContext.from_history([("fr", 0.95), ("fr", 0.9)])

Stores internally save only [(lang, confidence), ...] for efficiency.
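
As a concrete DIY example, here is a minimal sqlite3-backed store (a sketch assuming to_dict() returns JSON-serializable data; save_context/load_context are illustrative stand-ins for the save_to_database/load_from_database calls above):

import json
import sqlite3

from fastlangml import detect, ConversationContext

conn = sqlite3.connect("contexts.db")
conn.execute("CREATE TABLE IF NOT EXISTS contexts (session_id TEXT PRIMARY KEY, data TEXT)")

def save_context(session_id: str, ctx: ConversationContext) -> None:
    payload = json.dumps(ctx.to_dict())
    conn.execute("INSERT OR REPLACE INTO contexts VALUES (?, ?)", (session_id, payload))
    conn.commit()

def load_context(session_id: str) -> ConversationContext:
    row = conn.execute("SELECT data FROM contexts WHERE session_id = ?", (session_id,)).fetchone()
    return ConversationContext.from_dict(json.loads(row[0])) if row else ConversationContext()

ctx = load_context("user-123")
detect("Bonjour", context=ctx)
save_context("user-123", ctx)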


Multi-Backend Ensemble

FastLangML can combine multiple language detection backends for better accuracy. Each backend has different strengths:

Backend     Best For                   Weaknesses
fasttext    Speed, many languages      Less accurate on short text
lingua      Accuracy, short text       Slower, fewer languages
langdetect  General purpose            Non-deterministic by default
pycld3      Speed, CLD3 compatibility  Lower accuracy

How ensemble works:

  1. Text is sent to all configured backends
  2. Each backend returns a language prediction with confidence
  3. A voting strategy combines the predictions
  4. Final result is the language with highest combined score

from fastlangml import FastLangDetector, DetectionConfig

# Configure ensemble with 3 backends
detector = FastLangDetector(
    config=DetectionConfig(
        backends=["fasttext", "lingua", "langdetect"],
        backend_weights={
            "fasttext": 0.5,   # Trust fasttext most
            "lingua": 0.35,    # Lingua for accuracy
            "langdetect": 0.15 # Langdetect as tiebreaker
        },
    )
)

result = detector.detect("Ciao, come stai?")
print(result)         # "it"
print(result.backend) # "ensemble"

Voting strategies determine how to combine predictions from multiple backends.

Available strategies:

Strategy   Description                            Best For
weighted   Weighted average of confidence scores  Production (default)
hard       Majority vote (each backend = 1 vote)  Equal-trust backends
soft       Average of all probabilities           Well-calibrated backends
consensus  Require N backends to agree            High-certainty requirements

Weighted Voting (Default)

Multiplies each backend’s confidence by its weight, then picks the language with highest weighted score.

from fastlangml import FastLangDetector, DetectionConfig

detector = FastLangDetector(
    config=DetectionConfig(
        voting_strategy="weighted",
        backend_weights={
            "fasttext": 0.6,
            "lingua": 0.4,
        },
    )
)
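
To make the arithmetic concrete, here is the weighted combination from the description above as a standalone sketch (not the library's internal code):

# Hypothetical per-backend predictions: {language: confidence}
predictions = {
    "fasttext": {"pt": 0.55, "es": 0.45},
    "lingua":   {"es": 0.70, "pt": 0.30},
}
weights = {"fasttext": 0.6, "lingua": 0.4}

# Weighted score per language: sum of weight * confidence
scores: dict[str, float] = {}
for backend, langs in predictions.items():
    for lang, conf in langs.items():
        scores[lang] = scores.get(lang, 0.0) + weights[backend] * conf

print(scores)  # {'pt': 0.45, 'es': 0.55} -> "es" wins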

Hard Voting

Each backend gets one vote. Ties broken by confidence.

detector = FastLangDetector(
    config=DetectionConfig(voting_strategy="hard")
)

Consensus Voting

Only returns a result if at least N backends agree. Useful when you need high certainty.

from fastlangml import ConsensusVoting, FastLangDetector, DetectionConfig

detector = FastLangDetector(
    config=DetectionConfig(
        custom_voting=ConsensusVoting(min_agreement=2)
    )
)

Hint Dictionaries

Hints are word-to-language mappings that override backend detection. Essential for:

  • Chat slang (“thx” -> English, “mdr” -> French)
  • Company-specific terms
  • Ambiguous words you want to force

Built-in hints for short text:

from fastlangml import FastLangDetector, HintDictionary

# Load default hints for chat/SMS
hints = HintDictionary.default_short_words()

detector = FastLangDetector(hints=hints)
detector.detect("thx")   # "en" (thanks)
detector.detect("mdr")   # "fr" (mort de rire = LOL)
detector.detect("jaja")  # "es" (Spanish laugh)

Adding custom hints:

from fastlangml import FastLangDetector

detector = FastLangDetector()

# Add hints for your domain
detector.add_hint("asap", "en")
detector.add_hint("btw", "en")
detector.add_hint("stp", "fr")  # s'il te plait

# Hints override backend detection
detector.detect("asap")  # "en"

Hint priority:

Hints are checked before backend detection. If a hint matches:

  1. The hint’s language gets a confidence boost
  2. For very short text, the hint typically decides the result outright
  3. For longer text, hints are weighted with backend predictions (sketched below)
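
A rough sketch of that blending (illustrative only; hint_weight mirrors the DetectionConfig option shown later, and the short-text cutoff is a hypothetical value):

def blend_hint(
    backend_scores: dict[str, float],
    hint_lang: str | None,
    text: str,
    hint_weight: float = 1.5,
    short_cutoff: int = 4,  # hypothetical threshold for "very short"
) -> dict[str, float]:
    if hint_lang is None:
        return backend_scores
    if len(text) <= short_cutoff:
        return {hint_lang: 1.0}  # hint decides outright
    scores = dict(backend_scores)
    scores[hint_lang] = scores.get(hint_lang, 0.0) + hint_weight
    return scores

print(blend_hint({"en": 0.4, "fr": 0.3}, "fr", "merci bcp"))
# {'en': 0.4, 'fr': 1.8} -> hint boosts French past English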

detect()

The main function for language detection.

def detect(
    text: str,
    context: ConversationContext | None = None,
    mode: str = "default",
    auto_update: bool = True,
) -> DetectionResult

Parameters:

Parameter    Type                 Default    Description
text         str                  required   Text to detect
context      ConversationContext  None       Conversation context for multi-turn
mode         str                  "default"  Detection mode: "short", "default", "long"
auto_update  bool                 True       Automatically add result to context

Returns: DetectionResult

@dataclass
class DetectionResult:
    lang: str           # ISO 639-1 code ("en", "fr", "und")
    confidence: float   # 0.0 to 1.0
    reliable: bool      # True if confidence >= threshold
    reason: str | None  # Why "und" was returned (if applicable)
    script: str | None  # Detected script ("latin", "cyrillic", etc.)
    backend: str        # Which backend or "ensemble"
    candidates: list    # Top-k alternatives
    meta: dict          # Timing and debug info

# Can be used directly as string
detect("Hello") == "en"      # True (via __eq__)
print(detect("Bonjour"))     # "fr" (via __str__)
f"{detect('Hola')}"          # "es"

ConversationContext

context = ConversationContext(
    max_turns=2,       # int: max messages to remember (default 2)
    decay_factor=0.8,  # float: recency weight, 0.0-1.0 (default 0.8)
)

# Properties
context.dominant_language    # Most common language in history
context.language_distribution # {lang: weighted_count}
context.last_turn           # Most recent turn

# Methods
context.get_context_boost(lang)  # Get boost score for a language
context.get_language_streak()    # (lang, streak_count)
context.clear()                  # Reset context
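
A minimal usage sketch of these accessors (the values shown are what you would expect for this input):

from fastlangml import detect, ConversationContext

ctx = ConversationContext(max_turns=2)
detect("Bonjour!", context=ctx)
detect("Merci beaucoup", context=ctx)

print(ctx.dominant_language)      # "fr"
print(ctx.get_language_streak())  # ("fr", 2)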

FastLangDetector

detector = FastLangDetector(
    config=DetectionConfig(...),
    hints=HintDictionary(),
)

# Methods
detector.detect(text, context=None, mode="default")
detector.detect_batch(texts, mode="default")
detector.add_hint(word, lang)
detector.remove_hint(word)
detector.set_languages(["en", "fr", "es"])  # Restrict output
detector.available_backends()

DetectionConfig

from fastlangml import DetectionConfig

config = DetectionConfig(
    # Backend configuration
    backends=["fasttext", "lingua"],
    backend_weights={"fasttext": 0.6, "lingua": 0.4},

    # Voting
    voting_strategy="weighted",  # or "hard", "soft", "consensus"
    custom_voting=None,          # VotingStrategy instance

    # Thresholds
    thresholds={
        "short": 0.5,   # Confidence threshold for short mode
        "default": 0.7,
        "long": 0.8,
    },
    min_text_length=1,

    # Features
    filter_proper_nouns=False,  # Enable heuristic-based proper noun filtering
    use_script_filter=True,

    # spaCy NER for proper noun filtering (requires pip install fastlangml[spacy])
    # filter = ProperNounFilter(use_spacy=True)
    # Filters: PERSON, ORG, GPE, LOC, FAC, NORP entities

    # Weights
    hint_weight=1.5,
    context_weight=0.3,
)

Custom Backends

Create your own detection backend using the @backend decorator:

from fastlangml import backend, Backend
from fastlangml.backends.base import DetectionResult

@backend("my_detector", reliability=4)  # reliability: 1-5
class MyBackend(Backend):
    """Custom language detection backend."""

    @property
    def name(self) -> str:
        return "my_detector"

    @property
    def is_available(self) -> bool:
        # Check if required dependencies are installed
        return True

    def detect(self, text: str) -> DetectionResult:
        # Your detection logic here
        lang = "en"  # Replace with actual detection
        confidence = 0.95
        return DetectionResult(self.name, lang, confidence)

    def supported_languages(self) -> set[str]:
        return {"en", "fr", "de", "es"}

Using the custom backend:

from fastlangml import FastLangDetector, DetectionConfig

detector = FastLangDetector(
    config=DetectionConfig(
        backends=["my_detector", "fasttext"],
    )
)

Programmatic registration:

from fastlangml import register_backend, unregister_backend, list_registered_backends

# Register
register_backend("my_backend", MyBackend, reliability=4)

# List all registered
print(list_registered_backends())  # ["my_backend"]

# Unregister
unregister_backend("my_backend")

Custom Voting Strategies

Implement your own voting logic by extending VotingStrategy:

from fastlangml import VotingStrategy, FastLangDetector, DetectionConfig

class ConfidenceOnlyVoting(VotingStrategy):
    """Pick the language with highest individual confidence."""

    def vote(
        self,
        results: list,
        weights: dict[str, float] | None = None,
    ) -> dict[str, float]:
        if not results:
            return {}

        # Find max confidence per language
        scores = {}
        for r in results:
            lang = r.language
            if lang not in scores or r.confidence > scores[lang]:
                scores[lang] = r.confidence

        return scores

# Use custom voting
detector = FastLangDetector(
    config=DetectionConfig(
        custom_voting=ConfidenceOnlyVoting()
    )
)

Language Confusion Resolution

FastLangML handles commonly confused language pairs with specialized logic:

Supported confused pairs:

  • Spanish / Portuguese
  • Norwegian / Danish / Swedish
  • Czech / Slovak
  • Croatian / Serbian / Bosnian
  • Indonesian / Malay
  • Russian / Ukrainian / Belarusian
  • Hindi / Urdu

from fastlangml import ConfusionResolver, LanguageSimilarity

# Resolve ambiguity between similar languages
resolver = ConfusionResolver()

# Check if languages are a known confused pair
pair = resolver.get_confused_pair({"es", "pt"})  # frozenset({"es", "pt"})

# Adjust scores based on discriminating features
scores = {"es": 0.45, "pt": 0.42}
adjusted = resolver.resolve("Eu tenho um problema", scores)
# Portuguese boosted due to "tenho" (have)

# Get discriminating features
es_features, pt_features = resolver.get_discriminating_features("es", "pt")
# es_features: ["pero", "cuando", "donde", ...]
# pt_features: ["mas", "quando", "onde", ...]

# Check language relationships
sim = LanguageSimilarity()
sim.are_related("es", "pt")  # True (Romance family)
sim.are_related("en", "zh")  # False (different families)
sim.get_related_languages("es")  # {"pt", "fr", "it", "ro", "ca", "gl"}

Code-Switching Detection

Detect mixed-language messages (Spanglish, Franglais, Hinglish, etc.):

from fastlangml import CodeSwitchDetector

detector = CodeSwitchDetector()

# Detect code-switching
result = detector.detect("That's muy importante for the proyecto")

result.is_mixed            # True
result.primary_language    # "en"
result.secondary_languages # ["es"]
result.languages           # ["en", "es"]
result.language_distribution  # {"en": 0.6, "es": 0.4}

# Get language spans
for span in result.spans:
    print(f"{span.text}: {span.language} ({span.confidence:.2f})")
# "That's": en (0.85)
# "muy": es (0.92)
# "importante": es (0.95)
# "for": en (0.88)
# "the": en (0.90)
# "proyecto": es (0.94)

# Quick check
detector.is_code_switched("Hello world")  # False
detector.is_code_switched("Hola, how are you?")  # True

# Pattern-based detection
from fastlangml import detect_code_switching_pattern

pattern = detect_code_switching_pattern("That's muy bueno")
# ("en", "es") - matches Spanglish pattern

Supported code-switching patterns:

  • Spanglish (English + Spanish)
  • Franglais (English + French)
  • Hinglish (English + Hindi)
  • Denglish (German + English)

Benchmarks

Tested across standard language samples, short text, and CJK:

Test Category                    Accuracy  Notes
Standard Languages (25 samples)  96.0%     Full sentences across 25 languages
Short Text (16 samples)          100.0%    Single words like “Hello”, “Bonjour”, “안녕”
CJK (6 samples)                  100.0%    Japanese, Chinese, Korean sentences

The FastLangML ensemble matches or outperforms every individual backend on accuracy:

Backend     Accuracy  Avg Latency
ensemble    100.0%    8.28ms
fasttext    100.0%    0.01ms
lingua      90.0%     0.18ms
langdetect  80.0%     2.84ms
langid      70.0%     0.56ms

Latency breakdown:

Operation               Latency   Notes
Initialization          ~100ms    One-time, first detection
First detection (cold)  ~3000ms   Loads all backend models
Warm detection          8.28ms    Average per detection
Cache hit               0.0006ms  ~10,000x faster
Script short-circuit    0.0006ms  Korean, Thai, Hebrew, etc.
Batch (100 texts)       550ms     ~182 texts/sec

FastLangML includes several performance optimizations:

Optimization          Speedup    Description
Script short-circuit  ~5,500x    Skip backends for unambiguous scripts (Hangul→ko, Thai→th)
Result caching        ~10,000x   LRU cache for repeated texts
Lazy backend loading  n/a        Only load backends when needed
Adaptive parallelism  n/a        Skip threading for ≤2 backends or short text

Script short-circuit languages:

  • Korean (Hangul), Thai, Hebrew, Armenian, Georgian
  • Tamil, Telugu, Kannada, Malayalam, Gujarati, Bengali, Punjabi, Oriya, Sinhala
  • Khmer, Lao, Myanmar, Tibetan
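
To see the result cache in action, time the same text twice (a quick sketch; assumes caching is on by default, per the table above):

import time

from fastlangml import detect

def timed_ms(text: str) -> float:
    start = time.perf_counter()
    detect(text)
    return (time.perf_counter() - start) * 1000

print(f"first call:  {timed_ms('Hello, how are you today?'):.4f}ms")
print(f"second call: {timed_ms('Hello, how are you today?'):.4f}ms  (cache hit)")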

Run the benchmarks yourself:

# Run accuracy benchmarks
pytest tests/test_benchmarks.py -v

# Quick performance check
python -c "
from fastlangml import detect
import time

start = time.perf_counter()
for i in range(100):
    detect(f'Hello world {i}')
print(f'100 detections: {(time.perf_counter()-start)*1000:.0f}ms')
"

Best Practices

1. Use Context for Chat/Messaging

Always pass a ConversationContext when detecting messages in a conversation:

# Good: Context-aware
context = ConversationContext()
for msg in conversation:
    result = detect(msg, context=context)

# Bad: Stateless detection
for msg in conversation:
    result = detect(msg)  # Loses valuable context

2. Choose the Right Mode

FastLangML has three detection modes with different confidence thresholds:

  • short: For SMS and chat messages. More lenient threshold.
  • default: General purpose. Balanced threshold.
  • long: For paragraphs/documents. Higher confidence threshold.

detect("ok", mode="short")         # More lenient
detect("Hello world", mode="default")
detect(long_paragraph, mode="long") # More strict

3. Add Domain-Specific Hints

If your users use specific slang or terms, add hints:

detector.add_hint("gg", "en")     # Gaming
detector.add_hint("lol", "en")
detector.add_hint("mdr", "fr")    # French LOL
detector.add_hint("kek", "en")    # Gaming laugh

4. Restrict Languages When Known

If you know the possible languages (e.g., a bilingual support queue), restrict the output:

# Only consider English and Spanish
detector.set_languages(["en", "es"])

5. Handle “und” (Unknown)

When FastLangML is uncertain, it returns und instead of guessing wrong:

result = detect("ok")
if result == "und":  # Can compare directly
    # Fallback to default or ask user
    lang = result.candidates[0].lang if result.candidates else "en"
    print(f"Low confidence: {result.reason}")

6. Use Ensemble for Production

Single backends have blind spots. Ensemble improves reliability:

# Development: Fast single backend
detector = FastLangDetector(
    config=DetectionConfig(backends=["fasttext"])
)

# Production: Reliable ensemble
detector = FastLangDetector(
    config=DetectionConfig(
        backends=["fasttext", "lingua"],
        voting_strategy="weighted",
    )
)

7. Batch Detection for Throughput

When processing many messages, use batch detection:

# Good: Batch processing (parallelized)
results = detector.detect_batch(messages, mode="short")

# Bad: Sequential detection
results = [detector.detect(msg) for msg in messages]  # Slower

Contributing

We welcome contributions! See CONTRIBUTING.md for detailed guidelines.

# Clone and setup
git clone https://github.com/pnrajan/fastlangml.git
cd fastlangml
poetry install --all-extras

# Development commands
make test         # Run tests
make lint         # Check code style
make format       # Format code with ruff
make fix          # Auto-fix linting issues
make typecheck    # Run type checker
make check        # Run all checks (format + lint + typecheck + test)

Ways to contribute:

  • Bug fixes with test cases
  • New detection backends
  • Performance improvements
  • Documentation improvements
  • New voting strategies

See CONTRIBUTING.md for:

  • Project architecture overview
  • Testing guidelines
  • Code style requirements
  • Commit message conventions
  • Pull request process

License

MIT License – see LICENSE for details.


