The Extraction layer is responsible for fetching raw data from external merchant APIs. It normalizes the interface so the rest of the pipeline doesn't need to know how the data was fetched.

The Extract Class

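Defined in app/dataflow/etl_service.py, this class acts as a factory wrapper: the caller names a client and a loader type, and the class exposes a single extract() call regardless of which merchant is behind it.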

# extract() is a coroutine, so this must run inside an async function.
extractor = Extract(
    client="shopify_client",  # or whop_client, greenit_client
    type="ShopifyProductsLoader",
    config={...},  # API keys, store URLs, etc.
)
data = await extractor.extract(key="store_url")
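
As a rough illustration of what "factory wrapper" can mean here, the sketch below resolves a client name to a loader through a registry and exposes one uniform extract() call. Everything in it is hypothetical: the registry, `register`, `DummyLoader`, its `load` method, and `ExtractSketch` are stand-ins, not the contents of etl_service.py, and the real `type` argument is left out for brevity.

```python
import asyncio
from typing import Any, Callable, Dict

# Hypothetical registry: client name -> loader class.
# The real mapping used by Extract is not shown in this document.
_REGISTRY: Dict[str, Callable[[dict], Any]] = {}


def register(name: str):
    """Decorator that records a loader class under a client name."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap


@register("dummy_client")
class DummyLoader:
    """Stand-in loader used only for this illustration."""

    def __init__(self, config: dict):
        self.config = config

    async def load(self, key: str):
        # A real loader would call the merchant API here.
        return [{"source": self.config[key]}]


class ExtractSketch:
    """Resolves a client name to a loader and exposes one extract() call."""

    def __init__(self, client: str, config: dict):
        if client not in _REGISTRY:
            raise ValueError(f"unknown client: {client}")
        self._loader = _REGISTRY[client](config)

    async def extract(self, key: str):
        return await self._loader.load(key)


result = asyncio.run(
    ExtractSketch("dummy_client", {"store_url": "example.test"}).extract(key="store_url")
)
```

With this shape, adding a merchant means registering one more loader; callers of extract() do not change.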

Client Implementations

Each merchant platform has unique challenges that dictate its extraction strategy.

1. Shopify (shopify_client.py)

  • Strategy: Bulk Operations API (GraphQL)
  • Why: A standard REST API call returns at most 250 products per page. For a store with 50,000 products, this requires 200 sequential requests, which is slow and likely to hit rate limits.
  • Mechanism:
    1. Send a mutation to generate a JSONL file containing all product data.
    2. Poll currentBulkOperation until status is COMPLETED.
    3. Download and parse the JSONL file line-by-line.
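  • Edge Case: In the JSONL output, "child" records (variants, images) appear as separate lines after their parent product line rather than nested inside it. Each child line references its parent (Shopify uses a __parentId field for this), and the client re-assembles the lines into a nested structure.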
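
The polling and re-assembly steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in shopify_client.py: `fetch_operation` is a hypothetical coroutine that returns the current bulk operation as a dict, the `children` key and the grouping by GID type are invented for the example, and every child line is assumed to carry an `id`.

```python
import asyncio
import json
from typing import Awaitable, Callable, Dict, Iterable, List

FAILURE_STATES = {"FAILED", "CANCELED", "EXPIRED"}


async def wait_for_completion(
    fetch_operation: Callable[[], Awaitable[dict]],
    interval: float = 2.0,
    max_polls: int = 300,
) -> dict:
    """Poll until the operation is COMPLETED; raise on failure or timeout.

    `fetch_operation` is a hypothetical coroutine returning e.g.
    {"status": "RUNNING", "url": None}.
    """
    for _ in range(max_polls):
        operation = await fetch_operation()
        if operation["status"] == "COMPLETED":
            return operation
        if operation["status"] in FAILURE_STATES:
            raise RuntimeError(f"bulk operation ended as {operation['status']}")
        await asyncio.sleep(interval)
    raise TimeoutError("bulk operation did not complete in time")


def gid_type(gid: str) -> str:
    """Return the resource type from a GID such as gid://shopify/Product/1."""
    return gid.split("/")[-2]


def assemble_products(lines: Iterable[str]) -> List[dict]:
    """Nest child lines under their parent product using __parentId."""
    products: Dict[str, dict] = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        parent_id = record.pop("__parentId", None)
        if parent_id is None:
            # Top-level line: a product.
            record.setdefault("children", {})
            products[record["id"]] = record
        else:
            # Child line: relies on the parent having appeared earlier.
            bucket = gid_type(record["id"])
            products[parent_id]["children"].setdefault(bucket, []).append(record)
    return list(products.values())


sample_lines = [
    '{"id": "gid://shopify/Product/1", "title": "Mug"}',
    '{"id": "gid://shopify/ProductVariant/10", "price": "9.00", "__parentId": "gid://shopify/Product/1"}',
    '{"id": "gid://shopify/ProductImage/20", "__parentId": "gid://shopify/Product/1"}',
    '{"id": "gid://shopify/Product/2", "title": "Cup"}',
]
products = assemble_products(sample_lines)
```

Because children always follow their parent, one pass with a dictionary keyed by product ID is enough; the file never has to be held in memory as a whole if `lines` is a streaming iterator.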

2. Whop (whop_client.py)

  • Strategy: List + Hydrate (Async)
  • Why: The "List Products" API returns only basic info (ID, title). Detailed info (images, descriptions) requires a separate call per product.
  • Mechanism:
    1. List: Fetch all product IDs.
    2. Hydrate: Use asyncio.gather with an asyncio.Semaphore(5) so that at most 5 detail requests are in flight at once. This speeds up hydration by up to about 5x over sequential calls while still respecting rate limits.
    3. Scraping: Whop's API sometimes omits images. When that happens, the client attempts to scrape the product's public HTML page for the hero image URL.
    4. Merge: Fetch "Plans" (pricing options) from a separate endpoint and merge them into the corresponding product objects.
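
A minimal sketch of the Hydrate and Merge steps, assuming a hypothetical `fetch_detail` coroutine and plan records that carry a `product_id` field (both invented for this example; the real Whop payloads may differ). The scraping fallback is left out.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def hydrate_all(
    product_ids: List[str],
    fetch_detail: Callable[[str], Awaitable[dict]],
    limit: int = 5,
) -> List[dict]:
    """Fetch details for every ID with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def hydrate_one(product_id: str) -> dict:
        async with semaphore:  # waits here once `limit` calls are active
            return await fetch_detail(product_id)

    # gather() returns results in the same order as product_ids.
    return await asyncio.gather(*(hydrate_one(pid) for pid in product_ids))


def merge_plans(products: List[dict], plans: List[dict]) -> List[dict]:
    """Attach plans to products via the (hypothetical) product_id field."""
    by_product: Dict[str, List[dict]] = {}
    for plan in plans:
        by_product.setdefault(plan["product_id"], []).append(plan)
    return [{**p, "plans": by_product.get(p["id"], [])} for p in products]


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_fetch_detail(product_id: str) -> dict:
        # Stand-in for the per-product detail call; tracks concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return {"id": product_id, "title": f"Product {product_id}"}

    ids = [f"prod_{i}" for i in range(12)]
    products = await hydrate_all(ids, fake_fetch_detail)
    plans = [{"id": "plan_1", "product_id": "prod_0", "price": 10}]
    return merge_plans(products, plans), peak


merged, peak = asyncio.run(_demo())
```

The semaphore caps concurrency without batching: as soon as one detail call finishes, the next waiting one starts, so a single slow product does not stall a whole group of five.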

3. GreenIT (merchants/greenit_client.py)

  • Strategy: REST Pagination
  • Mechanism: A standard while loop follows the next-page link in each API response until no link remains.
  • Auth: Uses a custom header Cosmo-API-Key.
  • Data: The API returns a straightforward list of product objects that map cleanly to our schema.
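
The pagination loop can be sketched like this. The response shape (`results` and `next` keys), the `get_page` callable, and the URLs are assumptions made for illustration; only the Cosmo-API-Key header name comes from the description above.

```python
from typing import Callable, Dict, List, Optional, Set


def build_headers(api_key: str) -> Dict[str, str]:
    """Headers for every request; the key is sent in Cosmo-API-Key."""
    return {"Cosmo-API-Key": api_key}


def fetch_all_products(first_url: str, get_page: Callable[[str], dict]) -> List[dict]:
    """Follow next-page links until the API stops returning one.

    `get_page` is a hypothetical callable that performs the HTTP GET
    (with the headers above) and returns the decoded JSON body.
    """
    products: List[dict] = []
    seen: Set[str] = set()
    url: Optional[str] = first_url
    while url:
        if url in seen:
            # Guard against an API that links back to an earlier page.
            raise RuntimeError(f"pagination loop detected at {url}")
        seen.add(url)
        page = get_page(url)
        products.extend(page.get("results", []))
        url = page.get("next")  # None or missing ends the loop
    return products


# Fake two-page API used in place of real HTTP calls.
FAKE_PAGES = {
    "https://api.example.test/products?page=1": {
        "results": [{"id": 1}, {"id": 2}],
        "next": "https://api.example.test/products?page=2",
    },
    "https://api.example.test/products?page=2": {
        "results": [{"id": 3}],
        "next": None,
    },
}
all_products = fetch_all_products(
    "https://api.example.test/products?page=1", FAKE_PAGES.__getitem__
)
```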

4. Petswyak (merchants/petswyak_client.py)

  • Strategy: Odoo session authentication, followed by page-number pagination
  • Mechanism:
    1. Authenticate: POST to /web/session/authenticate to get a session_id.
    2. Session Cookie: Passes Cookie: session_id=... in all subsequent requests.
    3. Pagination: Iterates through page=1, 2, ... until an empty list is returned.
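  • Edge Case: Strict error handling for Odoo's JSON-RPC style responses. A failure can be reported as an error object inside the response body rather than through the HTTP status code, so the client inspects every body before using its result.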
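
A sketch of the pagination and error-handling steps, under stated assumptions: `fetch_page` is a hypothetical callable that performs the request and returns the decoded body, product pages are assumed to arrive wrapped as `{"result": [...]}`, and `OdooError`, `unwrap`, and `iter_products` are names invented for this example. The authenticate call itself is omitted.

```python
from typing import Callable, Dict, Iterator, List


class OdooError(Exception):
    """Raised when a JSON-RPC style payload reports an error."""


def unwrap(payload: dict):
    """Return payload['result'], raising if the body carries an error."""
    error = payload.get("error")
    if error:
        data = error.get("data") or {}
        raise OdooError(data.get("message") or error.get("message") or "unknown error")
    if "result" not in payload:
        raise OdooError("payload has neither 'result' nor 'error'")
    return payload["result"]


def iter_products(
    fetch_page: Callable[[int, Dict[str, str]], dict],
    session_id: str,
    max_pages: int = 10_000,
) -> Iterator[dict]:
    """Yield products page by page until an empty list is returned."""
    headers = {"Cookie": f"session_id={session_id}"}
    for page in range(1, max_pages + 1):
        items = unwrap(fetch_page(page, headers))
        if not items:
            return  # an empty page marks the end
        yield from items
    raise RuntimeError("max_pages reached without an empty page")


# Fake transport used in place of real HTTP calls.
FAKE_PAGES: Dict[int, List[dict]] = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}


def fake_fetch_page(page: int, headers: Dict[str, str]) -> dict:
    assert headers["Cookie"] == "session_id=abc123"
    return {"jsonrpc": "2.0", "result": FAKE_PAGES.get(page, [])}


petswyak_products = list(iter_products(fake_fetch_page, "abc123"))
```

Routing every response through one `unwrap` function keeps the "HTTP 200 but failed" case from slipping through as an empty or malformed product list.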