Data Engineer II-01
Job Description:
Job Summary
Our partner is building a real-time intelligence platform powered by continuous large-scale data ingestion across thousands of public and regulatory data sources. The core product is a live knowledge graph that connects facilities, companies, permits, equipment categories, operators, and regulatory events into a unified intelligence layer.
We are looking for a highly autonomous Data Engineer with strong experience in large-scale web scraping, data pipelines, and knowledge graph architectures. This person will take ownership of critical data infrastructure, from ingestion and enrichment to storage, orchestration, and reliability.
This is a remote contractor opportunity open across LATAM.
Responsibilities
Real-Time Knowledge Graph
- Design and operate the property graph at the center of the platform, linking facilities, companies, permits, equipment categories, operators, and regulatory events into a live, queryable intelligence layer.
- Build entity resolution pipelines that reconcile inconsistent identifiers across thousands of sources using deterministic and probabilistic matching.
- Enrich nodes using public datasets and registries while maintaining data quality, freshness, and performance.
- Ensure the graph remains coherent, scalable, and highly available as source systems evolve.
Large-Scale Data Ingestion & Web Scraping
- Build and operate ingestion systems that continuously collect data from municipal permit portals, environmental databases, GIS feeds, federal registries, and other public sources.
- Develop scrapers, asynchronous API clients, and custom parsers for PDF, XML, GeoJSON, shapefiles, and other structured and semi-structured formats.
- Manage anti-bot protections, proxy rotation, session management, rate limiting, and fault tolerance.
- Design scalable scraping infrastructure capable of supporting thousands of continuously changing sources.
Data Freshness & Automation
- Design and maintain scheduled workflows that extract, enrich, deduplicate, and process data at scale.
- Implement monitoring, alerting, retry strategies, and failure recovery mechanisms.
- Ensure reliable concurrent execution across multiple ingestion and transformation pipelines.
Data Lake & Regional Architecture
- Design storage architectures partitioned and indexed by geographic region: county, metropolitan area, and state.
- Structure datasets to efficiently support downstream analytics, graph workloads, and regional queries at scale.
Infrastructure & Reliability
- Build and maintain cloud-native data infrastructure across GCP and AWS environments.
- Develop and manage orchestration workflows using Modal.
- Improve observability, reliability, deployment automation, and operational excellence across the platform.
- Contribute to CI/CD processes using Docker and GitHub Actions.
Tech Stack
- Languages: Python, SQL
- Databases: PostgreSQL, PostGIS, Supabase, Neo4j-compatible graph schemas
- Cloud: GCP (primary), AWS
- Orchestration: Modal
- Data & Analytics: Databricks
- Scraping: Playwright, Selenium, Browserbase, httpx, aiohttp
- Transformation: Large-scale LLM batch processing
- Observability: Sentry
- CI/CD: Docker, GitHub Actions
Requirements
- 2–4 years of experience in Data Engineering or Software Engineering with ownership of production systems.
- Proven experience building and maintaining large-scale web scraping solutions across multiple websites and data sources.
- Strong experience handling anti-bot mechanisms, proxy rotation, session management, rate limiting, and scraping reliability challenges.
- Experience designing, building, and querying knowledge graphs or graph-based data architectures.
- Strong PostgreSQL experience, including schema design, query optimization, and PostGIS.
- Experience designing and operating scheduled data processing pipelines and automation workflows, including monitoring, alerting, and recovery mechanisms.
- Hands-on experience with AWS and/or GCP in production environments.
- Experience working with messy, undocumented, and evolving real-world data sources.
- Strong technical judgment and ability to work independently in a startup environment.
- Excellent English communication skills (B2+).
Nice to Have
- Experience with Neo4j or property graph modeling.
- Experience with geospatial formats and techniques beyond core PostGIS, such as shapefiles, GeoJSON, and spatial indexing.
- Experience working with geographic data partitioning by county, metropolitan area, or jurisdiction.
- Familiarity with public regulatory datasets and government data sources.
- Experience in industrial, energy, construction, logistics, or supply chain industries.
Position Type and Expected Hours of Work
- Contractor position.
- Remote across LATAM.
- Equipment provided by the company.
- Start date: ASAP.