Machine Learning · Search · Production

Semantic Search
Product Discovery Engine

Rebuilt the retrieval layer of a Shopify search app serving 10,000+ merchants. The model wasn't broken — it just had never seen how real shoppers search.

PyTorch · Sentence-Transformers · TensorRT · FastAPI · Qdrant · Kubernetes · PySpark · Databricks

Why E-commerce Search Is Different

E-commerce queries don't look like the sentences NLP benchmarks are trained on. They're short (1–4 tokens), often ungrammatical, frequently misspelled, and sometimes packed with specs — m12 ss nut, 2hp VFD pump, A2-70 bolt. Amazon's public ESCI dataset (130K queries, 2.6M labeled pairs) captures this well — the hardest retrieval cases aren't the exact matches, they're the Substitutes and Complements.

This creates a clear division of labour: BM25 stays essential for exact specs and catalog codes. Embeddings handle the semantic gap — synonyms, typos, intent, multi-attribute binding. The failure mode isn't picking the wrong model, it's expecting one approach to do both.

Embeddings excel at
  • Synonym mapping — "ss nut m12" ↔ stainless steel hex nut M12
  • Typo tolerance — "sewing macine" → sewing machine
  • Occasion intent — "beach outfit" → swimwear, cover-up
  • Multi-attribute — "red sleeveless dress"
BM25 still wins at
  • Exact specs — M12, A2-70, ISI 1000l
  • Brand + model number combinations
  • Numerical attributes — "2hp", "1000l"
  • SKU-level catalog codes
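In practice this division of labour is usually wired together with rank fusion rather than raw score mixing, since BM25 and cosine scores live on different scales. A minimal sketch using reciprocal rank fusion (RRF); the product IDs are hypothetical:

```python
# Reciprocal Rank Fusion (RRF): merge a BM25 ranking and a dense-embedding
# ranking without having to calibrate their score scales against each other.

def rrf_fuse(bm25_ranking, dense_ranking, k=60):
    """Merge two ranked lists of product IDs; k=60 is the customary RRF constant."""
    scores = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, product_id in enumerate(ranking, start=1):
            scores[product_id] = scores.get(product_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# An exact-spec query: BM25 nails the catalog code, embeddings add near-matches.
bm25 = ["nut-m12-a270", "nut-m12-a480", "bolt-m12"]
dense = ["nut-m12-a270", "hex-nut-stainless", "nut-m10"]
print(rrf_fuse(bm25, dense))  # "nut-m12-a270" ranks first: top of both lists
```

Items that appear in only one list still survive fusion, which is exactly the behaviour wanted here: catalog codes from BM25 and synonym hits from the embedding side both reach the final ranking.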

The Problem

Integration testing flagged a 29.5% "Not Good" rate across 162 test queries. The failure analysis classified them into 12 distinct groups — each with a different root cause. Color synonyms worked well (4,208 synonym pairs in training → 0% "Not Good"). Bottoms taxonomy had zero fit-axis vocabulary → 58% "Not Good". The pattern was clear: failure rate correlated directly with coverage in training data.

  • Taxonomy & Synonyms · 40% of failures
    "sweater" → 0 knitwear results · "skinny pants" returned baggy styles
  • Attribute Binding · 30%
    "dress no sleeve" negation ignored · "man pink t-shirt" surfaced women's items
  • Intent & Context · 18%
    "beach holiday outfit" → irrelevant results
  • Typo Robustness · 12%
    "whiite shrt" → zero white shirt matches
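The grouping itself is plain bookkeeping over labeled verdicts. A minimal sketch (field layout and group names are illustrative, not the project's actual schema) of tallying a per-group "Not Good" rate:

```python
from collections import Counter

def not_good_rate_by_group(labeled_queries):
    """labeled_queries: iterable of (query, group, verdict),
    where verdict is 'Good' or 'Not Good'. Returns rate per group."""
    totals, bad = Counter(), Counter()
    for _query, group, verdict in labeled_queries:
        totals[group] += 1
        if verdict == "Not Good":
            bad[group] += 1
    return {g: bad[g] / totals[g] for g in totals}

# Toy sample; the real analysis covered 162 queries across 12 groups.
sample = [
    ("red dress", "attribute-binding", "Good"),
    ("dress no sleeve", "attribute-binding", "Not Good"),
    ("skinny pants", "taxonomy", "Not Good"),
    ("navy chinos", "taxonomy", "Good"),
]
print(not_good_rate_by_group(sample))  # {'attribute-binding': 0.5, 'taxonomy': 0.5}
```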

How It Was Fixed

Each failure group mapped to a specific data gap. The fix wasn't to retrain from scratch — it was to surgically generate the missing knowledge, then fine-tune the production checkpoint with a low learning rate to avoid catastrophic forgetting.

1. Log Mining · 500M+ events · Wilson score
2. Failure Analysis · 162 queries · 12 groups
3. LLM Synthetic Gen · 65K–70K pairs · 8 langs
4. Fine-Tuning · MNRL · hard negatives
5. Deploy · TensorRT · Qdrant · K8s
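The log-mining step scores (query, product) pairs with the lower bound of the Wilson interval rather than raw CTR, so pairs with few impressions cannot outrank well-evidenced ones. A minimal sketch (the 95% z-value and function name are illustrative):

```python
import math

def wilson_lower_bound(clicks, impressions, z=1.96):
    """Lower bound of the Wilson score interval on click-through rate.
    A conservative relevance estimate: wide uncertainty at low impression
    counts pulls the bound toward zero."""
    if impressions == 0:
        return 0.0
    p = clicks / impressions
    denom = 1 + z * z / impressions
    centre = p + z * z / (2 * impressions)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * impressions)) / impressions)
    return (centre - margin) / denom

# 5/10 clicks (CTR 0.5) scores BELOW 400/1000 (CTR 0.4):
# the larger sample is the stronger relevance signal.
print(wilson_lower_bound(5, 10))
print(wilson_lower_bound(400, 1000))
```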

Synthetic Data — 65K–70K Pairs Across 5 Categories

  • A · Taxonomy & Synonym → G1, G2, G7 · 35%
  • B · Attribute Binding → G3, G5, G6, G8 · 30%
  • C · Intent Bridge → G4, G10–G12 · 15%
  • D · Typo Robustness → G9 · 10%
  • E · Lexical Baseline regularisation · 8%

8+ languages: EN 38% · DE 16% · FR 12% · FI 10% · ES 7% · JA 7% · Others 10%

FI and DE received 1.5× oversampling — higher morphological complexity, historically lower training coverage.
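The 1.5× oversampling amounts to simple per-language sampling weights. A minimal sketch, using the language percentages above as stand-in pair counts:

```python
def sampling_weights(pair_counts, oversample=("fi", "de"), factor=1.5):
    """Per-language sampling weights: boost the listed languages by
    `factor`, then renormalise so the weights sum to 1."""
    raw = {lang: n * (factor if lang in oversample else 1.0)
           for lang, n in pair_counts.items()}
    total = sum(raw.values())
    return {lang: w / total for lang, w in raw.items()}

counts = {"en": 38, "de": 16, "fr": 12, "fi": 10, "es": 7, "ja": 7, "others": 10}
weights = sampling_weights(counts)
# DE now carries 1.5x its raw share relative to non-boosted languages.
```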

Deployment

Storefront (Shopify) → FastAPI (business logic) → TensorRT (FP16 inference) → Qdrant (self-managed cluster)

☸️ All orchestrated by Kubernetes — auto-scaling, rolling deploys, self-healing

TensorRT — FP16 quantisation + kernel fusion reduces embedding inference latency vs. raw PyTorch. Model weights are frozen post-training, so quantisation loss is acceptable.

Self-managed Qdrant — At 10K+ merchant scale, managed vector DBs hit cost ceilings quickly. Self-managed with per-tenant collection isolation and horizontal sharding gave better cost/performance control.

Results

Each target has a direct lineage back to specific failure groups and training data allocation.

  • Integration "Good" Rate: 54.7% baseline → target ≥70%
  • Integration "Not Good" Rate: 29.5% baseline → target ≤15%
  • Semantic Exact Match: 40.8% baseline → target ≥55%
  • Gender Precision: ~70% baseline → target ≥95%
  • Semantic Irrelevant Rate: 18.3% baseline → target ≤10%
  • Negation Accuracy: ~0% baseline → target ≥75%

What Actually Mattered

The failure analysis was worth more than any hyperparameter tuning

Before writing a single training pair, classifying the 162 failing queries into 12 groups revealed that color worked (4,208 synonyms in training) but bottoms taxonomy had zero fit-axis vocabulary. That one insight determined 35% of the training budget.

Hard negatives are where the margin is

MNRL with random in-batch negatives gets you to 60–65% "Good". The jump to ≥70% required explicit hard negatives calibrated to cosine similarity [0.3, 0.7]. Too easy and the model ignores them. Too hard and they introduce label noise.
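The calibration is a band filter over cosine similarity. A minimal sketch with toy 2-D vectors (in production the vectors come from the embedding model itself, and the candidates from a nearest-neighbour search):

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def select_hard_negatives(query_vec, candidates, lo=0.3, hi=0.7):
    """Keep candidates inside the [lo, hi] cosine band: similar enough to
    the query to be informative, dissimilar enough not to be an unlabeled
    positive masquerading as a negative."""
    return [pid for pid, vec in candidates if lo <= cosine(query_vec, vec) <= hi]

query = [1.0, 0.0]
candidates = [
    ("too_easy", [0.0, 1.0]),        # cosine 0.0: model ignores it
    ("good_hard", [1.0, 1.5]),       # cosine ~0.55: informative negative
    ("near_duplicate", [1.0, 0.05]), # cosine ~1.0: likely a mislabeled positive
]
print(select_hard_negatives(query, candidates))  # ['good_hard']
```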

Finnish and German are a different problem than English

Finnish compound words (villapaita, farkut) and German umlaut normalisation required entirely separate prompt templates — not just translations of the English prompt. Generic multilingual generation produced garbage for agglutinative languages.

The regression holdout was the only real safety net

Overfitting to synthetic patterns is invisible on the synthetic eval set. The preservation holdout of current-"Good" queries was the only signal that caught when training was helping new queries while quietly breaking old ones.
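The holdout check itself reduces to diffing verdicts before and after fine-tuning and blocking the deploy on any regression. A minimal sketch with hypothetical queries and verdict labels:

```python
def find_regressions(verdicts_before, verdicts_after):
    """Queries that were 'Good' on the preservation holdout before
    fine-tuning but are no longer 'Good' after. Non-empty => block deploy."""
    return [q for q, v in verdicts_before.items()
            if v == "Good" and verdicts_after.get(q) != "Good"]

before = {"red dress": "Good", "m12 nut": "Good", "beach outfit": "Not Good"}
after = {"red dress": "Good", "m12 nut": "Not Good", "beach outfit": "Good"}
print(find_regressions(before, after))  # ['m12 nut']
```

Only previously-"Good" queries gate the release; newly fixed queries ("beach outfit" above) are the improvement being shipped, not a regression.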