Extraction
The Extraction layer is responsible for fetching raw data from external merchant APIs. It normalizes the interface so the rest of the pipeline doesn't need to know how the data was fetched.
The Extract Class
Defined in app/dataflow/etl_service.py, this class acts as a factory wrapper.
extractor = Extract(
    client="shopify_client",  # or whop_client, greenit_client
    type="ShopifyProductsLoader",
    config={...},  # API keys, store URLs, etc.
)
data = await extractor.extract(key="store_url")
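As a rough illustration of what "factory wrapper" can mean here, the sketch below resolves a (client, type) pair to a loader class through a lookup table and delegates extract() to it. This is not the code in etl_service.py: ExtractSketch, REGISTRY, FakeShopifyLoader, and the load() method are all hypothetical, and treating key as the name of a config entry is an assumption.

```python
import asyncio
from typing import Any, Dict, List, Tuple, Type


class FakeShopifyLoader:
    """Stand-in loader (hypothetical); a real one would call the merchant API."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    async def load(self, key: str) -> List[Dict[str, Any]]:
        # Assumption: `key` names the config entry that identifies the source.
        return [{"source": self.config[key]}]


# Hypothetical lookup table from (client, type) to a loader class.
REGISTRY: Dict[Tuple[str, str], Type[FakeShopifyLoader]] = {
    ("shopify_client", "ShopifyProductsLoader"): FakeShopifyLoader,
}


class ExtractSketch:
    """Picks a loader from the registry and exposes one extract() method."""

    def __init__(self, client: str, type: str, config: Dict[str, Any]) -> None:
        # `type` shadows the builtin; kept to mirror the call shown above.
        try:
            loader_cls = REGISTRY[(client, type)]
        except KeyError:
            raise ValueError(f"unknown extractor: {client}/{type}") from None
        self._loader = loader_cls(config)

    async def extract(self, key: str) -> List[Dict[str, Any]]:
        # Callers never see which client or strategy produced the data.
        return await self._loader.load(key)
```

With this shape, supporting a new merchant only means adding a registry entry; the calling code stays the same.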
Client Implementations
Each merchant platform has unique challenges that dictate its extraction strategy.
1. Shopify (shopify_client.py)
- Strategy: Bulk Operations API (GraphQL)
- Why: A standard REST API call returns at most 250 products per page. A store with 50,000 products therefore needs at least 200 sequential requests, which is slow and prone to hitting rate limits.
- Mechanism:
  - Send a mutation to generate a JSONL file containing all product data.
  - Poll currentBulkOperation until status is COMPLETED.
  - Download and parse the JSONL file line-by-line.
- Edge Case: In the JSONL output, "child" records (variants, images) appear as separate lines after their parent product line. The client reassembles these into a nested structure.
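The reassembly step can be sketched as follows. Shopify's bulk operation output marks each child line with a __parentId field that points at its parent's ID; everything else here (the assemble_products name, the children field, grouping by the resource type in the GID) is an illustrative assumption, not the client's actual code.

```python
import json
from typing import Any, Dict, Iterable, List


def assemble_products(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Nest child records under their parent product using __parentId."""
    products: Dict[str, Dict[str, Any]] = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the download
        record = json.loads(raw)
        parent_id = record.pop("__parentId", None)
        if parent_id is None:
            # A line without __parentId is a top-level product.
            record["children"] = {}
            products[record["id"]] = record
        elif parent_id in products:
            # Group children by the resource type in their GID,
            # e.g. "gid://shopify/ProductVariant/10" -> "ProductVariant".
            gid = record.get("id", "")
            kind = gid.split("/")[-2] if gid.count("/") >= 2 else "unknown"
            products[parent_id]["children"].setdefault(kind, []).append(record)
        # Children whose parent has not been seen are skipped in this sketch.
    return list(products.values())
```

Because children always follow their parent, a single pass with a dictionary keyed by product ID is enough, and the file never has to be loaded into memory as a whole if lines are streamed in.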
2. Whop (whop_client.py)
- Strategy: List + Hydrate (Async)
- Why: The "List Products" API returns only basic info (ID, title). Detailed info (Images, Descriptions) requires a separate call per product.
- Mechanism:
  - List: Fetch all product IDs.
  - Hydrate: Use asyncio.gather with a Semaphore(5) to fetch details for up to 5 products concurrently. This speeds up hydration by up to roughly 5x while respecting rate limits.
  - Scraping: Whop's API sometimes misses images. The client attempts to scrape the product's public HTML page to find the hero image URL.
  - Merge: Fetches "Plans" (pricing options) from a separate endpoint and manually merges them into the corresponding product objects.
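The hydrate step can be sketched as below. The hydrate_all name and the fetch_detail callable are placeholders for whatever the client actually calls; only asyncio.gather and asyncio.Semaphore come from the description above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def hydrate_all(
    product_ids: List[str],
    fetch_detail: Callable[[str], Awaitable[Dict[str, Any]]],
    max_concurrency: int = 5,
) -> List[Dict[str, Any]]:
    """Fetch details for every ID with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(product_id: str) -> Dict[str, Any]:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await fetch_detail(product_id)

    # gather returns results in the same order as product_ids.
    return await asyncio.gather(*(fetch_one(pid) for pid in product_ids))
```

All tasks are created up front, but the semaphore keeps only five of them inside fetch_detail at any moment, which is what bounds the load on the API.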
3. GreenIT (merchants/greenit_client.py)
- Strategy: REST Pagination
- Mechanism: Standard while loop following the next page link in the API response.
- Auth: Uses a custom header, Cosmo-API-Key.
- Data: The API returns a straightforward list of product objects that map cleanly to our schema.
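A minimal sketch of this loop, with the HTTP call injected so the example stays self-contained. The iter_products and fetch_page names and the results field are assumptions; only the next link and the Cosmo-API-Key header are taken from the description above.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_products(
    first_url: str,
    api_key: str,
    fetch_page: Callable[[str, Dict[str, str]], Page],
) -> Iterator[Dict[str, Any]]:
    """Yield products page by page until the response has no next link."""
    headers = {"Cosmo-API-Key": api_key}
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url, headers)
        # Assumption: products live under "results"; adjust to the real payload.
        yield from page.get("results", [])
        # A missing or null "next" ends the loop.
        url = page.get("next")
```

In practice fetch_page would wrap an HTTP GET that returns the decoded JSON body; passing it in as a parameter also makes the loop easy to test without network access.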
4. Petswyak (merchants/petswyak_client.py)
- Strategy: Odoo-based Authentication
- Mechanism:
  - Authenticate: POST to /web/session/authenticate to get a session_id.
  - Session Cookie: Passes Cookie: session_id=... in all subsequent requests.
  - Pagination: Iterates through page=1, 2, ... until an empty list is returned.
- Edge Case: Strict error handling for Odoo's JSON-RPC style responses.
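The pagination and error-handling steps might look like the sketch below. It assumes the usual JSON-RPC envelope, in which a response carries either a result or an error member, and it assumes each page's result is a list; OdooRpcError, unwrap_rpc, fetch_all_pages, and fetch_page are hypothetical names.

```python
from typing import Any, Callable, Dict, List


class OdooRpcError(Exception):
    """Raised when a JSON-RPC style response reports an error."""


def unwrap_rpc(payload: Dict[str, Any]) -> Any:
    """Return the result member, or raise if the envelope signals failure."""
    if "error" in payload:
        err = payload["error"] or {}
        raise OdooRpcError(err.get("message", "unknown Odoo error"))
    if "result" not in payload:
        raise OdooRpcError("response has neither 'result' nor 'error'")
    return payload["result"]


def fetch_all_pages(fetch_page: Callable[[int], Dict[str, Any]]) -> List[Any]:
    """Request page=1, 2, ... and stop at the first empty list."""
    items: List[Any] = []
    page = 1
    while True:
        batch = unwrap_rpc(fetch_page(page))
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items
```

Checking the envelope on every page matters because an error can arrive in the response body rather than as an HTTP failure, and an unchecked error payload could otherwise be mistaken for an empty final page. Here fetch_page stands in for the request that carries the session cookie.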