Selected Web Scraping & Data Pipeline Work
Python engineer with hands-on experience building scraping and data-collection systems for dynamic, JS-heavy sites and structured web/public-data sources. The examples below cover browser automation, HTTP extraction, normalization, validation, and reliability work in production and applied systems.
Contact: kanad.rishiraj@gmail.com
NomNomy
- Built Python/Selenium scrapers for Uber Eats, Grubhub, and DoorDash, extracting structured menu data from JS-heavy delivery platforms including items, prices, images, and modifier/customization trees.
- Improved scraper resilience with platform-specific readiness checks, challenge/gating handling, stale-element recovery, repeated-item detection, incremental persistence, completion metadata, and debug artifacts.
- Extended the scrapers into a broader pipeline with CLI/batch orchestration, JSON persistence, and Streamlit-based review/finalization tooling for QA and normalization.
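The repeated-item detection and incremental persistence mentioned above can be sketched roughly as below. This is a minimal illustration, not the production code: the function name, the `item_id` key, and the flat-JSON layout are all assumptions made for the example.

```python
import json
from pathlib import Path


def persist_items_incrementally(items, out_path, key="item_id"):
    """Append newly seen menu items to a JSON file, skipping repeats.

    Hypothetical sketch: re-running a scrape adds only items whose `key`
    has not been persisted yet, so partial runs accumulate safely.
    """
    path = Path(out_path)
    existing = json.loads(path.read_text()) if path.exists() else []
    seen = {item[key] for item in existing}
    fresh = [item for item in items if item[key] not in seen]
    path.write_text(json.dumps(existing + fresh, indent=2))
    return len(fresh)  # number of newly persisted items
```

Keying on a stable item identifier (rather than list position) is what lets a scraper resume mid-menu after a stale-element recovery without duplicating earlier items.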
MovieSaints
- Built Instagram scraping and automation workflows including session reuse via cookies, creator discovery, hashtag/post extraction, and messaging automation across dynamic UI flows.
- Developed structured external-data pipelines for IMDb metadata/credits, FX-rate ingestion with source fallback, and other third-party web data using browser automation plus direct HTTP/HTML parsing.
- Added persistence, retries, source fallback, normalization, and structured JSON/DB outputs so extracted data could be reused in downstream workflows.
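The retry-plus-source-fallback pattern used for FX-rate ingestion looks roughly like this. A hedged sketch only: the function name and the callable-per-source shape are illustrative assumptions, not the project's actual API.

```python
import time


def fetch_with_fallback(sources, retries=2, delay=0.0):
    """Try each rate source in order, retrying transient failures.

    `sources` is a list of zero-argument callables that return a rate or
    raise on failure. Each source gets `retries + 1` attempts before the
    next source is tried; only if every source fails do we raise.
    """
    last_error = None
    for source in sources:
        for attempt in range(retries + 1):
            try:
                return source()
            except Exception as exc:  # narrow to network/parse errors in real code
                last_error = exc
                if attempt < retries:
                    time.sleep(delay)  # backoff between attempts
    raise RuntimeError("all FX-rate sources failed") from last_error
```

Exhausting retries on the primary source before falling back keeps results consistent across runs while still tolerating a provider outage.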
FitJobs
- Built multi-site job extraction logic for LinkedIn, Indeed, Glassdoor, Greenhouse, and Lever using site-specific scraper routing and DOM strategies.
- Implemented resilient extraction for dynamic job pages using hidden-content expansion, retry-based metadata reads, structured-data parsing, and layered selector fallbacks.
- Normalized extracted fields into a consistent payload and passed them directly into downstream analysis workflows for resume-to-job fit scoring.
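Site-specific scraper routing can be sketched as a domain lookup over the job URL. The routing table values and the `generic_scraper` default are placeholders for illustration; the real project dispatches to scraper implementations rather than strings.

```python
from urllib.parse import urlparse

# Illustrative routing table; the project's actual scraper classes differ.
SCRAPERS = {
    "linkedin.com": "linkedin_scraper",
    "indeed.com": "indeed_scraper",
    "glassdoor.com": "glassdoor_scraper",
    "greenhouse.io": "greenhouse_scraper",
    "lever.co": "lever_scraper",
}


def route_scraper(job_url):
    """Pick a site-specific scraper by matching the URL's registered domain."""
    host = urlparse(job_url).netloc.lower()
    for domain, scraper in SCRAPERS.items():
        # Match the bare domain and any subdomain (www., boards., jobs., ...).
        if host == domain or host.endswith("." + domain):
            return scraper
    return "generic_scraper"
```

Matching on the registered domain (with a subdomain suffix check) rather than the full hostname keeps one entry per site while still covering ATS subdomains like `boards.greenhouse.io`.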
TalkToGov
- Built Python scraping/import pipelines using requests.Session and BeautifulSoup to extract structured data from government list and profile pages.
- Designed recurring-scrape reliability features including HTML change detection, cache reuse, strict validation/reconciliation, and safety checks to prevent bad runs from corrupting downstream data.
- Implemented broader import flows with paginated API extraction, 429 backoff handling, normalized CSV/DB outputs, and change-history tracking.
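The HTML change detection above can be illustrated with a content fingerprint. A minimal sketch under stated assumptions: the function name and dict-based cache are hypothetical, and whitespace normalization stands in for whatever canonicalization the real pipeline applies before hashing.

```python
import hashlib


def page_changed(key, html, cache):
    """Return True if a page's content differs from its cached fingerprint.

    `cache` maps a page key to the SHA-256 digest from the previous run.
    Whitespace runs are collapsed first so formatting-only churn in the
    source HTML does not trigger a false "changed" result.
    """
    digest = hashlib.sha256(" ".join(html.split()).encode()).hexdigest()
    changed = cache.get(key) != digest
    cache[key] = digest  # record fingerprint for the next run
    return changed
```

Gating the import on this check is one way to keep an unchanged (or partially loaded) page from triggering a full re-import that could corrupt downstream data.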
Code examples and additional technical detail are available on request.
Relevant GitHub repo: selenium-web-automation-utils