Case Study · Feed Engineering & ETL
Microsoft Shopping Feed Pipeline
Fully automated daily sync of high-volume affiliate feeds (Connexity, Shopping24) into Microsoft Merchant Center: OOM-safe, uploaded in chunks, 100 % compliant.
The challenge
Blender Networks Inc. runs a large price-comparison portal whose monetization is built entirely on Microsoft Advertising Product Listing Ads (PLAs). The previous in-house feed solution was unstable and had to be replaced.
The complication came from the heterogeneous third-party sources: Connexity delivers zipped JSON bundles via an API index, while Shopping24 (S24) provides a CSV master file plus fragment updates over FTP. Both formats had to be translated without errors into Microsoft Merchant Center's strictly defined TSV schema, because every format error and every "item drop" means direct revenue loss.
An additional difficulty: Connexity data routinely exceeds 2 GB per publisher account, the mapped output runs into double-digit gigabytes, and the full daily upload into Microsoft Merchant Center sits at around 200 GB. Naive in-memory processing was off the table; the pipeline had to stay OOM-safe even on modest hardware.
The approach
Three heterogeneous sources, one unified pipeline. The architecture follows a strict three-stage model, Ingest → Map → Upload, where each affiliate network sits behind the same interface but internally taps its own stack (see pipeline diagram below).
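To make the "same interface, own stack" idea concrete, here is a minimal sketch of what such a per-network contract could look like. The names `FeedSource`, `ListSource`, `run_pipeline` and `emit` are hypothetical and chosen for illustration; the project's real interface is not shown in this case study. In the real pipeline the mapped rows go to TSV files that a separate upload stage ships, which the `emit` callable merely stands in for.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional, Protocol


class FeedSource(Protocol):
    """Hypothetical contract every affiliate-network module would satisfy."""

    name: str

    def ingest(self) -> Iterator[dict]:
        """Yield raw source records one at a time (never a full list)."""
        ...

    def map(self, raw: dict) -> Optional[dict]:
        """Return a Merchant Center row, or None to drop the record."""
        ...


@dataclass
class ListSource:
    """Toy in-memory source standing in for a real network module."""

    name: str
    records: List[dict]

    def ingest(self) -> Iterator[dict]:
        yield from self.records

    def map(self, raw: dict) -> Optional[dict]:
        if not raw.get("title"):
            return None  # unusable record, drop it
        return {"id": f"{self.name}-{raw['sku']}", "title": raw["title"]}


def run_pipeline(sources: Iterable[FeedSource], emit: Callable[[dict], None]) -> int:
    """Run Ingest -> Map for each source and hand every mapped row to `emit`."""
    emitted = 0
    for source in sources:
        for raw in source.ingest():
            row = source.map(raw)
            if row is not None:
                emit(row)
                emitted += 1
    return emitted
```

Because the orchestrator only sees `ingest()` and `map()`, adding another network means adding another class, with no change to `run_pipeline`.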
- Streaming-first ingestion. A Python pipeline using `ijson` parses the zipped JSON bundles item by item directly off the stream instead of deserializing the whole document into RAM. Memory usage stays constant regardless of feed size.
- Disk-backed deduplication. A SQLite table with `PRAGMA` tuning handles cross-account deduplication on disk. Multi-million-item feeds stay clean without blowing up the heap, and the dedup state survives an aborted run, ready for inspection.
- Strict mapping to the MS spec. Every source record gets deterministically mapped onto the required `id/title/description/link/image_link/price/...` schema. Validation logic filters unusable brand strings (purely numeric, too long, too many words), checks GTINs for valid lengths (8/12/13/14) and sets `identifier_exists` consistently, so there are no more "partial identifier" warnings.
- 15 GB chunking & chunk-aware upload. Mapping outputs are auto-split at 15 GB into `_0.txt`, `_1.txt`, … to match the Microsoft upload limit. The uploader detects these chunks via pattern matching and numbers them correctly on the remote side (`TipDigest_US_0.txt`, `TipDigest_US_1.txt`, …).
- High-speed SFTP via LFTP. Instead of Python SFTP libraries (paramiko), the system LFTP binary is driven with enlarged TCP socket buffers and a reconnect strategy. On multi-GB files this is significantly faster and more stable than a pure-Python SFTP implementation.
- Multi-account orchestration. Multiple Connexity publishers and multiple Merchant Center accounts (US/DE) are processed in parallel; each account writes to its own output path and is correctly routed via a store-mapping config.
- Resilient daily sync. A cron lockfile prevents overlapping runs. Pipeline stages can be toggled individually (`--skip-ingest`, `--skip-map`, `--skip-upload`) for targeted re-runs after partial failures, without re-pulling the entire 2 GB download.
- Proactive error handling & diagnostics. Hybrid logging (Rich console with emojis for humans + `RotatingFileHandler` for the machine), a per-stage execution-report table and per-account isolation. A single truncated JSON stream doesn't stop the overall run: the error gets logged locally and the run continues with the rest.
- Modular architecture. Each affiliate network lives in its own module (`ingest_*.py` + `mapper_*.py`) behind a unified pipeline interface. A third source (Kelkoo) was added without touching Connexity or S24, which shows that the abstraction holds.
- Acceptance criterion. The hard acceptance bar (5+ consecutive days of error-free automation) was passed on the first attempt, secured by clean module boundaries and consistent validation layering.
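Two of the bullets above, disk-backed deduplication and size-based chunking, can be sketched together with the standard library alone. This is an illustration under stated assumptions, not the project's code: the function names, the `seen` table, the two PRAGMA settings, the commit interval and the choice to repeat the header in every chunk are all assumptions. The real chunk limit is 15 GB; the parameter is kept configurable so the logic can be exercised with tiny values.

```python
import sqlite3
from pathlib import Path
from typing import Iterable, Iterator, List, Sequence


def open_dedup_db(path: str) -> sqlite3.Connection:
    """Open (or create) the on-disk dedup store."""
    conn = sqlite3.connect(path)
    # Example PRAGMAs only; the tuning actually used in the project is not documented here.
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute("CREATE TABLE IF NOT EXISTS seen (item_id TEXT PRIMARY KEY)")
    return conn


def unique_rows(conn: sqlite3.Connection, rows: Iterable[dict],
                commit_every: int = 10_000) -> Iterator[dict]:
    """Yield each row whose id has not been seen before, across all accounts."""
    pending = 0
    for row in rows:
        cur = conn.execute("INSERT OR IGNORE INTO seen (item_id) VALUES (?)", (row["id"],))
        if cur.rowcount == 1:  # 1 = newly inserted, 0 = duplicate ignored
            pending += 1
            if pending >= commit_every:
                conn.commit()  # periodic commits keep the state inspectable after an abort
                pending = 0
            yield row
    conn.commit()


def write_chunks(rows: Iterable[dict], out_dir: Path, stem: str,
                 columns: Sequence[str], max_bytes: int) -> List[Path]:
    """Write TSV rows into <stem>_0.txt, <stem>_1.txt, ... rolling over at max_bytes."""
    header = "\t".join(columns) + "\n"
    paths: List[Path] = []
    fh = None
    written = 0
    try:
        for row in rows:
            line = "\t".join(str(row.get(c, "")) for c in columns) + "\n"
            size = len(line.encode("utf-8"))
            if fh is None or written + size > max_bytes:
                if fh is not None:
                    fh.close()
                path = out_dir / f"{stem}_{len(paths)}.txt"
                fh = path.open("w", encoding="utf-8", newline="")
                fh.write(header)  # assumption: every chunk carries its own header
                written = len(header.encode("utf-8"))
                paths.append(path)
            fh.write(line)
            written += size
    finally:
        if fh is not None:
            fh.close()
    return paths
```

Both functions are generators or plain loops over an iterable, so they can be chained (`write_chunks(unique_rows(conn, mapped_rows), ...)`) without ever holding the full feed in memory. A production version would also need to sanitize tabs and newlines inside field values, which this sketch skips.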
The diagram shows the full data flow: from heterogeneous source systems (zipped JSON streams, REST API, FTP dumps), through the OOM-safe ingest layer, the validation-driven mapping with 15 GB chunking, all the way to the chunk-aware LFTP upload into Microsoft Merchant Center.
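The validation step in the mapping layer can be illustrated with a short sketch. Only the GTIN lengths (8/12/13/14) and the three brand criteria (purely numeric, too long, too many words) come from the description above. The concrete thresholds, the function names, the `yes`/`no` values and the rule that `identifier_exists` is `yes` only when both brand and GTIN survive validation are assumptions made for the example. No check-digit verification is performed, since the case study only mentions length checks.

```python
import re
from typing import Optional

VALID_GTIN_LENGTHS = {8, 12, 13, 14}
MAX_BRAND_CHARS = 70  # assumed threshold, not taken from the project
MAX_BRAND_WORDS = 5   # assumed threshold, not taken from the project


def clean_gtin(value) -> Optional[str]:
    """Return the GTIN as plain digits if its length is valid, else None."""
    digits = re.sub(r"[\s-]", "", str(value or ""))
    if re.fullmatch(r"[0-9]+", digits) and len(digits) in VALID_GTIN_LENGTHS:
        return digits
    return None


def clean_brand(value) -> Optional[str]:
    """Return a whitespace-normalized brand, or None if it looks unusable."""
    brand = " ".join(str(value or "").split())
    if not brand:
        return None
    if brand.replace(" ", "").isdigit():  # purely numeric
        return None
    if len(brand) > MAX_BRAND_CHARS:  # too long
        return None
    if len(brand.split()) > MAX_BRAND_WORDS:  # too many words
        return None
    return brand


def apply_identifiers(row: dict) -> dict:
    """Return a copy of the row with cleaned identifiers and identifier_exists set."""
    out = dict(row)
    gtin = clean_gtin(row.get("gtin"))
    brand = clean_brand(row.get("brand"))
    out["gtin"] = gtin or ""
    out["brand"] = brand or ""
    # Assumed rule: claim identifiers only when both fields passed validation.
    out["identifier_exists"] = "yes" if gtin and brand else "no"
    return out
```

Keeping the checks in small pure functions makes the mapping deterministic: the same source record always produces the same output row, regardless of which network module it came from.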
The result
- 100 % feed compliance: no more disapprovals from format errors.
- Zero errors in the daily sync: acceptance criterion passed on the first try.
- ROI protected: no revenue loss from broken feed updates.
- OOM-safe on multi-GB feeds: the pipeline runs on modest hardware without memory pressure.
- Modularly extensible: new affiliate networks integrate without touching the core.
Hire me for this
Product Feed Pipelines
The same pipeline architecture that syncs affiliate feeds into Merchant Center daily here, with monitoring, quarantine and recovery, is available for your feed too.