Machine Learning · Search · Production
Semantic Search
Product Discovery Engine
Rebuilt the retrieval layer of a Shopify search app serving 10,000+ merchants. The model wasn't broken — it just had never seen how real shoppers search.
Why E-commerce Search Is Different
E-commerce queries don't look like the sentences NLP benchmarks are trained on. They're short (1–4 tokens), often ungrammatical, frequently misspelled, and sometimes packed with specs (m12 ss nut, 2hp VFD pump, A2-70 bolt). Amazon's public ESCI dataset (130K queries, 2.6M labeled pairs) captures this well: the hardest retrieval cases aren't the Exact matches but the Substitutes and Complements.
This creates a clear division of labour: BM25 stays essential for exact specs and catalog codes. Embeddings handle the semantic gap — synonyms, typos, intent, multi-attribute binding. The failure mode isn't picking the wrong model, it's expecting one approach to do both.
Where embeddings win:

- ✓ Synonym mapping — "ss nut m12" ↔ "stainless steel hex nut M12"
- ✓ Typo tolerance — "sewing macine" → "sewing machine"
- ✓ Occasion intent — "beach outfit" → swimwear, cover-up
- ✓ Multi-attribute — "red sleeveless dress"

Where BM25 stays essential:

- ✗ Exact specs — M12, A2-70, ISI 1000l
- ✗ Brand + model number combinations
- ✗ Numerical attributes — "2hp", "1000l"
- ✗ SKU-level catalog codes
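One common way to wire this division of labour together is rank-level fusion rather than score mixing, since BM25 and cosine scores live on incomparable scales. A minimal sketch using reciprocal rank fusion (RRF); the document IDs and result lists are hypothetical, not from the actual app:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per doc.

    `rankings` is a list of ranked ID lists (e.g. one from BM25, one from
    the embedding index); k=60 is the conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: BM25 nails the exact spec, embeddings the synonym.
bm25_hits = ["sku_m12_nut", "sku_m10_nut", "sku_washer"]
vector_hits = ["sku_hex_nut_ss", "sku_m12_nut", "sku_bolt_a270"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

A document appearing in both lists (here `sku_m12_nut`) accumulates contributions from each ranker, so agreement between the two retrievers is rewarded without ever comparing raw scores.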
The Problem
Integration testing flagged a 29.5% "Not Good" rate across 162 test queries. Failure analysis classified them into 12 distinct groups, each with a different root cause. Color synonyms worked well (4,208 synonym pairs in training → 0% "Not Good"); the bottoms taxonomy had zero fit-axis vocabulary (58% "Not Good"). The pattern was clear: failure rate correlated directly with coverage in the training data.
"sweater" → 0 knitwear results"skinny pants" → returned baggy styles"dress no sleeve" → negation ignored"man pink t-shirt" → surfaced women's"beach holiday outfit" → irrelevant results"whiite shrt" → zero white shirt matchesHow It Was Fixed
Each failure group mapped to a specific data gap. The fix wasn't to retrain from scratch — it was to surgically generate the missing knowledge, then fine-tune the production checkpoint with a low learning rate to avoid catastrophic forgetting.
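The fine-tuning objective named in the pipeline, MultipleNegativesRankingLoss (MNRL), treats every other query's positive in the batch as an implicit negative. A pure-Python sketch of the loss for one batch of pre-computed embeddings (toy 2-D vectors stand in for real model outputs):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mnrl_loss(query_embs, pos_embs, scale=20.0):
    """In-batch MNRL: for query i, pos_embs[i] is the positive and every
    pos_embs[j], j != i, acts as a negative. Cross-entropy over scaled
    cosine similarities, averaged over the batch; lower is better."""
    total = 0.0
    for i, q in enumerate(query_embs):
        logits = [scale * cosine(q, p) for p in pos_embs]
        log_softmax_i = logits[i] - math.log(sum(math.exp(l) for l in logits))
        total -= log_softmax_i
    return total / len(query_embs)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnrl_loss(queries, [[1.0, 0.0], [0.0, 1.0]])     # correct pairing
shuffled = mnrl_loss(queries, [[0.0, 1.0], [1.0, 0.0]])    # wrong pairing
```

Correctly paired batches score near zero; mismatched ones are heavily penalised, which is what drives queries toward their own positives during fine-tuning.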
1. Log Mining — 500M+ events · Wilson score
2. Failure Analysis — 162 queries · 12 groups
3. LLM Synthetic Gen — 65K–70K pairs · 8 langs
4. Fine-Tuning — MNRL · hard negatives
5. Deploy — TensorRT · Qdrant · K8S
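Step 1 mines (query, product) pairs from click logs, and the Wilson score lower bound keeps low-traffic pairs from outranking well-attested ones. A sketch of the bound at 95% confidence (z = 1.96):

```python
import math

def wilson_lower_bound(clicks, impressions, z=1.96):
    """Lower bound of the Wilson score interval for a click-through rate.

    Penalises small samples: 3 clicks out of 4 impressions scores *lower*
    than 600 out of 1000, despite the higher raw CTR.
    """
    if impressions == 0:
        return 0.0
    p = clicks / impressions
    denom = 1 + z * z / impressions
    centre = p + z * z / (2 * impressions)
    spread = z * math.sqrt((p * (1 - p) + z * z / (4 * impressions)) / impressions)
    return (centre - spread) / denom
```

Ranking mined pairs by this bound instead of raw CTR is what lets a 500M+ event log be distilled into pairs you can actually trust as training positives.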
Synthetic Data — 65K–70K Pairs Across 5 Categories
FI and DE received 1.5× oversampling, owing to higher morphological complexity and historically lower training coverage.
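The 1.5× oversampling can be expressed as per-language sampling weights over the generated pairs; a sketch with hypothetical base counts (the real per-language split is not given in this write-up):

```python
# FI and DE get a 1.5x boost to compensate for morphological complexity
# and thin historical coverage. Factors and counts below are illustrative.
OVERSAMPLE = {"fi": 1.5, "de": 1.5}

def sampling_weights(counts):
    """Turn raw pair counts into normalised sampling probabilities,
    boosting any language listed in OVERSAMPLE."""
    weighted = {lang: n * OVERSAMPLE.get(lang, 1.0) for lang, n in counts.items()}
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()}

weights = sampling_weights({"en": 40000, "fi": 8000, "de": 8000, "fr": 8000})
```

With equal base counts, FI and DE end up 1.5× as likely to be drawn per batch as FR, without touching the English majority share.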
Deployment
☸️ All orchestrated by Kubernetes — auto-scaling, rolling deploys, self-healing
TensorRT — FP16 quantisation + kernel fusion reduces embedding inference latency vs. raw PyTorch. Model weights are frozen post-training, so quantisation loss is acceptable.
Self-managed Qdrant — At 10K+ merchant scale, managed vector DBs hit cost ceilings quickly. Self-managed with per-tenant collection isolation and horizontal sharding gave better cost/performance control.
Results
Each target has a direct lineage back to specific failure groups and training data allocation.
What Actually Mattered
The failure analysis was worth more than any hyperparameter tuning
Before writing a single training pair, classifying the 162 failing queries into 12 groups revealed that color worked (4,208 synonyms in training) but bottoms taxonomy had zero fit-axis vocabulary. That one insight determined 35% of the training budget.
Hard negatives are where the margin is
MNRL with random in-batch negatives gets you to 60–65% "Good". The jump to ≥70% required explicit hard negatives calibrated to cosine similarity [0.3, 0.7]. Too easy and the model ignores them. Too hard and they introduce label noise.
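That calibration can be implemented as a band filter over candidate negatives scored against the current checkpoint; a sketch with toy embeddings (real candidates would come from an ANN search over the catalog):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_hard_negatives(query_emb, candidates, lo=0.3, hi=0.7):
    """Keep candidates whose similarity to the query falls in [lo, hi]:
    below lo they are trivially easy (the model already separates them);
    above hi they risk being unlabelled positives, i.e. label noise."""
    return [doc_id for doc_id, emb in candidates
            if lo <= cosine(query_emb, emb) <= hi]

query = [1.0, 0.0]
candidates = [
    ("easy", [0.0, 1.0]),       # sim 0.0  -> too easy, discarded
    ("hard", [1.0, 2.0]),       # sim ~0.45 -> kept as a hard negative
    ("near_dup", [1.0, 0.1]),   # sim ~0.995 -> likely a positive, discarded
]
hard_negs = select_hard_negatives(query, candidates)
```

The band boundaries are the tunable part; the text's point is that both ends matter, not just mining "the hardest" candidates.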
Finnish and German are a different problem than English
Finnish compound words (villapaita, farkut) and German umlaut normalisation required entirely separate prompt templates — not just translations of the English prompt. Generic multilingual generation produced garbage for agglutinative languages.
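In practice that means routing generation through language-specific templates rather than translating one English prompt. A hedged sketch; the template wording below is entirely hypothetical, since the real prompts are not shown in this write-up:

```python
# Hypothetical prompt templates. The point is structural: FI and DE get
# dedicated instructions (compound splitting, umlaut variants), not
# machine-translated copies of the English template.
TEMPLATES = {
    "en": "Generate {n} realistic shopper queries for: {product}",
    "fi": ("Generate {n} Finnish shopper queries for: {product}. "
           "Include compound-word forms (e.g. villapaita) and their split variants."),
    "de": ("Generate {n} German shopper queries for: {product}. "
           "Include umlaut and transliterated spellings (e.g. Größe / Groesse)."),
}

def build_prompt(lang, product, n=5):
    # Fall back to English only for languages without a dedicated template.
    template = TEMPLATES.get(lang, TEMPLATES["en"])
    return template.format(n=n, product=product)
```

Keeping the templates as first-class per-language artifacts also makes the generation pipeline auditable: a bad batch traces back to one template, not to an opaque translation step.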
The regression holdout was the only real safety net
Overfitting to synthetic patterns is invisible on the synthetic eval set. The preservation holdout of current-"Good" queries was the only signal that caught when training was helping new queries while quietly breaking old ones.
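A sketch of that safety net: score the candidate checkpoint on a frozen set of currently-"Good" queries and block the deploy if too many regress. Function names and the zero-tolerance threshold are illustrative:

```python
def preservation_check(before, after, max_regressions=0):
    """Compare per-query Good/Not-Good labels on the frozen holdout.

    `before` and `after` map query -> True (Good) / False (Not Good) for
    the production model and the candidate checkpoint respectively.
    Returns (ok, regressed_queries); a query missing from `after` counts
    as a regression rather than silently passing.
    """
    regressed = [q for q, was_good in before.items()
                 if was_good and not after.get(q, False)]
    return len(regressed) <= max_regressions, regressed

before = {"red dress": True, "m12 nut": True}
after = {"red dress": True, "m12 nut": False}  # candidate broke an old win
ok, regressed = preservation_check(before, after)
```

The key design choice is evaluating per query, not on the aggregate "Good" rate: a model can raise the average while breaking exactly the established queries merchants depend on.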