infrastructure reference

how every endpoint fetches the web.

Per-endpoint proxy strategy, fallbacks, external APIs, and image-cache pipeline — derived from current code, not docs that drift.

Legend: Direct · Auto-Escalate · Zyte · Browserless/JS · Residential · External API
Place Search: /api/place-search
  Default proxy: Zyte (L3)
  Fallback / alternative: Direct if scrapeUrl fails; Auto L1→L3 via autocomplete cache check
  Implementation details: proxyLevel: 3, minProxyLevels: 3; GoogleMapHandler / GoogleMapCidHandler; Fallback: Direct fetch with User-Agent header

Place Autocomplete: /api/place-autocomplete
  Default proxy: Auto (L1→L3)
  Fallback / alternative: None (retry on failure)
  Implementation details: proxyLevel: 1, maxProxyLevels: 3; GoogleMapAutocompleteHandler; Tries Direct first, escalates to Zyte if blocked

Web Search (SERP): /api/web-search
  Default proxy: Auto (L1→L3)
  Fallback / alternative: Multiple DOM parsers; Deep crawl with configurable proxy
  Implementation details: minProxyLevels: 1, maxProxyLevels: 3; GoogleHandler via SerpService; Configurable via ?maxProxyLevels= param

Image Search: /api/image-search
  Default proxy: Zyte (L3)
  Fallback / alternative: Wikimedia; Unsplash/Pexels
  Implementation details: proxyLevel: 3 (configurable via ?level=); GoogleImageHandler; Falls back to Wikimedia Commons, Unsplash, Pexels

Google Lens: /api/google/lens
  Default proxy: Zyte Browser
  Fallback / alternative: None (direct Zyte API)
  Implementation details: Zyte API with browserHtml: true; GoogleLensHandler (direct API, not scrapeUrl); Uses Zyte actions API for page navigation

YouTube Transcript: /api/youtube/transcript
  Default proxy: ScrapeCreators
  Fallback / alternative: Zyte (L3) + Client Rotation; IOS → ANDROID → WEB clients
  Implementation details: Primary: ScrapeCreators API (no proxy); Fallback: youtube-transcriptplus + Zyte; Controlled by YOUTUBE_TRANSCRIPT_PROVIDER env (default: scrapecreators)

YouTube Detail: /api/youtube/detail
  Default proxy: Zyte (L3)
  Fallback / alternative: ScrapeCreators if parsing fails; Multiple HTML parsers
  Implementation details: proxyLevel: 3 (configurable via ?proxyLevel=); HTML parsing: ld+json, meta tags, ytInitialData, ytInitialPlayerResponse; If parsing fails (no title/author): Fallback to ScrapeCreators /youtube/video API

YouTube Search: /api/youtube/search
  Default proxy: Zyte (L3)
  Fallback / alternative: Handler auto-detection
  Implementation details: proxyLevel: 3 (configurable via ?proxyLevel=); YoutubeHandler via scrapeUrl; Block detection disabled for speed

Hotels: /api/hotels
  Default proxy: Zyte (L3)
  Fallback / alternative: Zyte for image enrichment
  Implementation details: proxyLevel: 3, minProxyLevels: 3; HotelService + GoogleHotelSearchHandler; Image enrichment via Google Image Search

Flights: /api/flights/calendar
  Default proxy: Zyte (L3)
  Fallback / alternative: Travel service fallbacks
  Implementation details: proxyLevel: 3; FlightService via scrapeUrl; Google Flights scraping

Amazon Image: /api/amazon/image-search
  Default proxy: Zyte (L3)
  Fallback / alternative: Amazon service fallbacks
  Implementation details: proxyLevel: 3; AmazonHandler via HttpServiceWithZyte; Product Advertising API integration

Amazon Search: /api/amazon/search
  Default proxy: Zyte (L3)
  Fallback / alternative: Configurable via ?proxies=
  Implementation details: proxyLevel: 3 (configurable via ?proxies=); AmazonHandler via HttpServiceWithZyte; Supports named proxy chain: Direct,Private,Zyte

Instagram: /api/instagram/search
  Default proxy: ZyteJS
  Fallback / alternative: Configurable: Zyte, ZyteJS, Browserless, Residential
  Implementation details: proxy: 'ZyteJS' (configurable via ?proxy=); InstagramHandler via createProxyService; Options: Zyte, ZyteJS, Browserless, Residential, Private, Direct

TikTok Tag Detail: /api/tiktok/challenge-detail
  Default proxy: Direct (Free)
  Fallback / alternative: None needed
  Implementation details: No proxy required; TikTokInternalApiService (free endpoint); Direct API call without any proxy, returns challengeId + stats

TikTok Hashtag Capture: /api/tiktok/posts-capture
  Default proxy: Browserless
  Fallback / alternative: Zyte (browser); Configurable via ?proxy=
  Implementation details: proxy: 'Browserless' (configurable); TikTokService with networkCapture on item_list; Browser navigates /tag/{tag}, scrolls, intercepts XHR responses for POI data

TikTok Keyword Search (new): /api/tiktok/keyword-search
  Default proxy: Direct (worker IP)
  Fallback / alternative: Residential or Zyte via ?proxy=
  Implementation details: Direct hit on tiktok.com/api/search/general/full/; TikTokInternalApiService.searchByKeyword (no browser, ~1.5s/call). Requires R2 auth blob (cookies + msToken/X-Bogus/X-Gnarly/verifyFp) and replays the per-session sig across arbitrary keywords. Image covers cached via wsrv.nl → R2 with variant fallback (cover → originCover → dynamicCover).

TikTok Refresh Auth (new): /api/tiktok/refresh-auth (POST)
  Default proxy: N/A
  Fallback / alternative: None (R2 write only)
  Implementation details: Admin endpoint that refreshes the R2 auth blob; Accepts a cURL paste (text/plain) or { cookies, sig } JSON. Writes to _config/tiktok-auth.json in R2. Call when /keyword-search starts returning 503 (sig expired). 60s in-worker memory cache.

TikTok Internal API: /api/tiktok/internal-api
  Default proxy: Residential
  Fallback / alternative: Configurable via ?proxy=
  Implementation details: EXPERIMENTAL: hashtag pagination via challenge/item_list; Tries direct challenge-detail first, falls back to tag-page fetch via proxy; Posts may require valid X-Bogus signatures (less reliable than keyword-search path)

TikTok Video Detail: /api/tiktok/video · /api/tiktok/video-detail
  Default proxy: Auto (L1→L3)
  Fallback / alternative: tikwm.com for video URL stage 0; ZyteJS for HTML scrape
  Implementation details: Multi-stage: tikwm → direct API → proxy chain → full browser; Parses __UNIVERSAL_DATA_FOR_REHYDRATION__ (no JS rendering needed). 88% cost reduction vs always using browser. Includes itemStruct + POI + transcript.

TikTok Video Proxy: /api/tiktok/proxy-video
  Default proxy: Direct
  Fallback / alternative: None
  Implementation details: Streams video bytes with proper Referer/Origin; Supports Range requests (HTTP 206), inline vs attachment via ?dl=1; Bypasses TikTok/tikwm hotlink protection via dynamic Referer detection

TikTok Cache Flush: /api/tiktok/flush-cache
  Default proxy: N/A
  Fallback / alternative: None (R2 delete)
  Implementation details: Utility: clears cached results; Accepts ?hashtag= (hashtag + posts-capture cache) or ?q= (keyword-search cache); Walks R2 prefix list for tiktok-hashtag_, tiktok-capture_, tiktok-search_

Google Shorts (cross-platform) (new): /api/google-shorts/search
  Default proxy: Auto (L1→L3)
  Fallback / alternative: Per-platform site: filter; tiktok / instagram / youtube
  Implementation details: Google Short Videos tab (udm=39) crawl; ?source=tiktok|instagram|youtube · ?pages=1-5. Test UI has an "All" pill that fans out to all three sources in parallel. Supersedes the removed /api/tiktok/search-google.

Page Extract: /api/page/extract
  Default proxy: Auto (L1→L3)
  Fallback / alternative: Configurable via params
  Implementation details: Default: ['Direct', 'Private', 'Zyte']; Generic handler via scrapeUrl; Params: ?proxyLevel=, ?maxProxyLevels=, ?javascriptRendering=

Crawler Scrape: /api/crawler/scrape
  Default proxy: Auto (L1→L3)
  Fallback / alternative: Configurable via params
  Implementation details: Default: ['Direct', 'Private', 'Zyte']; Direct scrapeUrl call; Params: ?proxyLevel=, ?javascript=, ?cache=, ?retry=
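
Several endpoints above accept ?proxyLevel= and ?maxProxyLevels= and default to the chain ['Direct', 'Private', 'Zyte']. The sketch below shows one way such parameters could be turned into a named chain. It is illustrative only: resolveProxyChain and ChainParams are hypothetical names, and reading proxyLevel as "first level to try" and maxProxyLevels as "highest level allowed" is an assumption, not confirmed behavior of scrapeUrl.

```typescript
// Named levels in escalation order: L1 Direct, L2 Private, L3 Zyte.
const LEVELS = ["Direct", "Private", "Zyte"] as const;
type ProxyName = (typeof LEVELS)[number];

// Hypothetical shape for the parsed query parameters.
interface ChainParams {
  proxyLevel?: number;     // assumed: first level to try (1-based)
  maxProxyLevels?: number; // assumed: highest level allowed (1-based)
}

// Clamp a level into 1..LEVELS.length, using the fallback for missing or non-numeric input.
function clampLevel(n: number | undefined, fallback: number): number {
  if (n === undefined || !isFinite(n)) return fallback;
  return Math.min(LEVELS.length, Math.max(1, Math.floor(n)));
}

// Hypothetical helper: returns the slice of the default chain between the two levels.
function resolveProxyChain(params: ChainParams = {}): ProxyName[] {
  const start = clampLevel(params.proxyLevel, 1);
  // The ceiling never drops below the starting level, so the chain is never empty.
  const max = Math.max(start, clampLevel(params.maxProxyLevels, LEVELS.length));
  return LEVELS.slice(start - 1, max);
}

console.log(resolveProxyChain());                  // ["Direct", "Private", "Zyte"]
console.log(resolveProxyChain({ proxyLevel: 3 })); // ["Zyte"]
```

Under these assumptions, "proxyLevel: 3" yields a Zyte-only chain, while "proxyLevel: 1, maxProxyLevels: 3" yields the full auto-escalate chain.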

Proxy Level Reference

L0/L1 (Direct): No proxy, direct HTTP request. Fastest but often blocked.
L2 (Private): Basic proxy rotation for light protection bypass.
L3 (Zyte): Zyte Smart Proxy Manager for anti-bot bypass.
L4 (Residential): Residential IPs for maximum stealth.
Browserless/JS: Headless browser rendering for JavaScript-heavy sites.

Auto-Escalate (L1→L3): Tries Direct first, escalates to Private then Zyte when blocked.
API Fallback: If scraping or parsing fails, fall back to an external API (e.g., ScrapeCreators).
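
The auto-escalate rule can be sketched as a loop over the chain that moves to the next level whenever a response looks blocked or the request throws. This is a minimal sketch, not the actual scrapeUrl logic: fetchWithEscalation, looksBlocked, and the Fetcher type are hypothetical, and treating 403/429 or an empty body as "blocked" is an assumed heuristic.

```typescript
interface FetchResult {
  status: number;
  body: string;
}

// One fetch function per proxy level, injected by the caller (hypothetical shape).
type Fetcher = (url: string) => Promise<FetchResult>;

// Assumed heuristic: 403, 429, or an empty body counts as blocked.
function looksBlocked(r: FetchResult): boolean {
  return r.status === 403 || r.status === 429 || r.body.length === 0;
}

// Try each level in order and return the first response that is not blocked.
async function fetchWithEscalation(
  url: string,
  chain: string[],
  fetchers: Record<string, Fetcher>,
): Promise<{ level: string; result: FetchResult }> {
  let lastError: unknown = new Error("empty proxy chain");
  for (const level of chain) {
    const fetcher = fetchers[level];
    if (!fetcher) {
      lastError = new Error(`no fetcher registered for ${level}`);
      continue;
    }
    try {
      const result = await fetcher(url);
      if (!looksBlocked(result)) return { level, result };
      lastError = new Error(`${level} blocked with status ${result.status}`);
    } catch (err) {
      lastError = err; // network failure: escalate to the next level
    }
  }
  throw lastError; // every level failed; the caller may try an external API fallback here
}
```

A caller would pass a chain such as ["Direct", "Private", "Zyte"] and catch the final error to trigger the API fallback path.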

External APIs Used

ScrapeCreators: YouTube transcripts (primary, controlled by YOUTUBE_TRANSCRIPT_PROVIDER)
Wikimedia Commons: Image search fallback
Unsplash/Pexels: Premium image fallback
Zyte API: Proxy service for anti-bot bypass
Browserless: Headless browser as a service
wsrv.nl (a.k.a. images.weserv.nl): Free public image proxy + transcoder for hosts that block our worker IP (TikTok CDN, Google /proxy/, etc.)
tikwm.com: Third-party TikTok video URL extractor (Stage 0 of video-detail multi-stage chain)
Serper: Google web/video search API

TikTok Image-Cache Pipeline

TikTok CDN (p16-/p19-common-sign.tiktokcdn-us.com) returned 403 to every server-side IP we tested: Cloudflare worker IPs, our Residential proxy, and Zyte's HTTP proxy. wsrv.nl is the only path we have found that can fetch these signed URLs.

Flow per cover/avatar:
1. cacheImage() hashes the original tiktokcdn URL → media/cached-image_<sha256>.jpg
2. If R2 has it → return cached key (sub-100ms).
3. Miss → fetch via wsrv.nl/?url=<url>&w=400&output=webp&q=85 (width-only, preserves 9:16 portrait aspect).
4. Store webp to R2 (~30-50KB per cover).
5. Resolve bare R2 key to full URL at response boundary: https://cache.contextforce.com/... in prod, http://localhost:8787/api/cached-image/... in dev.
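
Steps 1 to 4 of the flow above can be sketched as follows. This is an illustration under stated assumptions rather than the real cacheImage(): the CacheDeps and ImageStore interfaces are hypothetical stand-ins for the hashing helper, the HTTP client, and the R2 bucket, injected so that the sketch stays runtime-neutral. Step 5 (resolving the bare key to a full URL) is left to the caller.

```typescript
// Hypothetical stand-in for the R2 bucket.
interface ImageStore {
  has(key: string): Promise<boolean>;
  put(key: string, bytes: Uint8Array): Promise<void>;
}

// Hypothetical dependencies, injected by the caller.
interface CacheDeps {
  sha256Hex(input: string): Promise<string>;    // hex SHA-256 of the original URL
  fetchBytes(url: string): Promise<Uint8Array>; // HTTP GET returning the body bytes
  store: ImageStore;
}

// Step 3: width-only resize through wsrv.nl, which preserves the portrait aspect ratio.
function wsrvUrl(src: string): string {
  return `https://wsrv.nl/?url=${encodeURIComponent(src)}&w=400&output=webp&q=85`;
}

async function cacheImage(src: string, deps: CacheDeps): Promise<string> {
  // Step 1: derive the cache key from a hash of the original tiktokcdn URL.
  const key = `media/cached-image_${await deps.sha256Hex(src)}.jpg`;
  // Step 2: cache hit, return the bare key without fetching.
  if (await deps.store.has(key)) return key;
  // Steps 3 and 4: fetch via wsrv.nl and store the transcoded bytes.
  const bytes = await deps.fetchBytes(wsrvUrl(src));
  await deps.store.put(key, bytes);
  return key;
}
```

Injecting the store and fetcher also makes the hit and miss paths easy to exercise with an in-memory map.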

Variant fallback chain (keyword-search only): TikTok issues three independently signed cover URLs per post: cover, originCover, and dynamicCover. wsrv.nl intermittently returns 404 for URLs routed via Fastly (p19) due to its egress-IP reputation. We try all three variants concurrently per post and keep the first one for which wsrv.nl returns 200. This raised the cover hit rate from ~70% to ~100%.
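
The variant fallback can be sketched as below. The names cacheFirstWorkingCover, CoverVariants, and TryCache are hypothetical, and the sketch assumes that "first" means first in the order cover → originCover → dynamicCover among the variants that succeeded, which is one plausible reading of the description above.

```typescript
interface CoverVariants {
  cover?: string;
  originCover?: string;
  dynamicCover?: string;
}

// Hypothetical: resolves to a cache key on a wsrv.nl 200, rejects otherwise.
type TryCache = (url: string) => Promise<string>;

async function cacheFirstWorkingCover(
  v: CoverVariants,
  tryCache: TryCache,
): Promise<string | null> {
  // Keep only the variants that are present, in priority order.
  const urls = [v.cover, v.originCover, v.dynamicCover].filter(
    (u): u is string => typeof u === "string" && u.length > 0,
  );
  // Start all attempts concurrently; a failed attempt becomes null instead of rejecting.
  const results = await Promise.all(
    urls.map((u) => tryCache(u).catch(() => null as string | null)),
  );
  // Return the highest-priority variant that succeeded.
  for (const key of results) {
    if (key !== null) return key;
  }
  return null;
}
```

This version waits for all attempts to settle before choosing, so it trades a little latency and a few redundant fetches for a deterministic choice of variant.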

Refresh cycle: the msToken in the keyword-search R2 auth blob rotates server-side (TTL unknown; observed to last days). When /keyword-search starts returning 503 {authExpired:true}, POST a fresh cURL paste (DevTools → Network → search/general/full → Copy as cURL) to /api/tiktok/refresh-auth.
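
For illustration, a cURL paste could be reduced to the { cookies, sig } shape roughly as follows. This is a sketch, not the endpoint's actual parser: parseCurlPaste and TikTokAuth are hypothetical names, and it assumes the signature values travel as query parameters on the request URL and that the cookies appear either in a -H 'cookie: ...' header or a -b '...' argument, as browser "Copy as cURL" output commonly does.

```typescript
interface TikTokAuth {
  cookies: string;
  sig: Record<string, string>;
}

// Signature fields named in the keyword-search notes.
const SIG_KEYS = ["msToken", "X-Bogus", "X-Gnarly", "verifyFp"];

// Hypothetical parser for a "Copy as cURL" paste.
function parseCurlPaste(paste: string): TikTokAuth {
  // The first quoted http(s) URL in the paste is taken to be the request URL.
  const urlMatch = paste.match(/['"](https?:\/\/[^'"]+)['"]/);
  if (!urlMatch) throw new Error("no URL found in cURL paste");
  const url = urlMatch[1];
  const q = url.indexOf("?");
  const query = q >= 0 ? url.slice(q + 1) : "";

  // Collect only the known signature parameters from the query string.
  const sig: Record<string, string> = {};
  for (const pair of query.split("&")) {
    const eq = pair.indexOf("=");
    if (eq < 0) continue;
    const name = pair.slice(0, eq);
    if (SIG_KEYS.indexOf(name) >= 0) {
      sig[name] = decodeURIComponent(pair.slice(eq + 1));
    }
  }

  // Cookies arrive either as a header (-H 'cookie: ...') or via -b / --cookie.
  const cookieMatch =
    paste.match(/-H\s+['"]cookie:\s*([^'"]*)['"]/i) ||
    paste.match(/(?:-b|--cookie)\s+['"]([^'"]*)['"]/);
  if (!cookieMatch) throw new Error("no cookies found in cURL paste");

  return { cookies: cookieMatch[1], sig };
}
```

A client that sees 503 with authExpired set could prompt for a fresh paste, run it through a parser like this, and POST the result to the refresh endpoint.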