Job Summary

Our partner is building a real-time intelligence platform powered by continuous large-scale data ingestion across thousands of public and regulatory data sources. The core product is a live knowledge graph that connects facilities, companies, permits, equipment categories, operators, and regulatory events into a unified intelligence layer.

We are looking for a highly autonomous Data Engineer with strong experience in large-scale web scraping, data pipelines, and knowledge graph architectures. This person will take ownership of critical data infrastructure, from ingestion and enrichment to storage, orchestration, and reliability.

This is a remote contractor opportunity open to candidates across LATAM.

Responsibilities

Real-Time Knowledge Graph

Design and operate the property graph at the center of the platform, linking facilities, companies, permits, equipment categories, operators, and regulatory events into a live, queryable intelligence layer.
Build entity resolution pipelines that reconcile inconsistent identifiers across thousands of sources using deterministic and probabilistic matching.
Enrich nodes using public datasets and registries while maintaining data quality, freshness, and performance.
Ensure the graph remains coherent, scalable, and highly available as source systems evolve.

Large-Scale Data Ingestion & Web Scraping

Build and operate ingestion systems that continuously collect data from municipal permit portals, environmental databases, GIS feeds, federal registries, and other public sources.
Develop scrapers, asynchronous API clients, and custom parsers for PDF, XML, GeoJSON, shapefiles, and other structured and semi-structured formats.
Handle anti-bot protections, proxy rotation, session management, rate limiting, and fault tolerance.
Design scalable scraping infrastructure capable of supporting thousands of continuously changing sources.

Data Freshness & Automation

Design and maintain scheduled workflows that extract, enrich, deduplicate, and process data at scale.
Implement monitoring, alerting, retry strategies, and failure recovery mechanisms.
Ensure reliable concurrent execution across multiple ingestion and transformation pipelines.

Data Lake & Regional Architecture

Design storage architectures partitioned and indexed by geographic region, including county, metropolitan area, and state.
Structure datasets to efficiently support downstream analytics, graph workloads, and regional queries at scale.

Infrastructure & Reliability

Build and maintain cloud-native data infrastructure across GCP and AWS environments.
Develop and manage orchestration workflows using Modal.
Improve observability, reliability, deployment automation, and operational excellence across the platform.
Contribute to CI/CD processes using Docker and GitHub Actions.

Tech Stack

Languages: Python, SQL
Databases: PostgreSQL, PostGIS, Supabase, Neo4j-compatible graph schemas
Cloud: GCP (primary), AWS
Orchestration: Modal
Data & Analytics: Databricks
Scraping: Playwright, Selenium, Browserbase, httpx, aiohttp
Transformation: Large-scale LLM batch processing
Observability: Sentry
CI/CD: Docker, GitHub Actions

Requirements

2–4 years of experience in Data Engineering or Software Engineering with ownership of production systems.
Proven experience building and maintaining large-scale web scraping solutions across multiple websites and data sources.
Strong experience handling anti-bot mechanisms, proxy rotation, session management, rate limiting, and scraping reliability challenges.
Experience designing, building, and querying knowledge graphs or graph-based data architectures.
Strong PostgreSQL experience, including schema design, query optimization, and PostGIS.
Experience designing and operating scheduled data processing pipelines, automation workflows, monitoring, alerting, and recovery mechanisms.
Hands-on experience with AWS and/or GCP in production environments.
Experience working with messy, undocumented, and evolving real-world data sources.
Strong technical judgment and ability to work independently in a startup environment.
Excellent English communication skills (B2 level or higher).

Nice to Have

Experience with Neo4j or property graph modeling.
Experience with geospatial technologies such as PostGIS, shapefiles, GeoJSON, and spatial indexing.
Experience working with geographic data partitioning by county, metropolitan area, or jurisdiction.
Familiarity with public regulatory datasets and government data sources.
Experience in the industrial, energy, construction, logistics, or supply chain sectors.

Position Type and Expected Hours of Work

Contractor position.
Remote across LATAM.
Equipment provided by the company.
Start date: ASAP.