Data Engineer II-01

  • Bogotá, Bogotá, Colombia
  • Full-Time
  • Remote

Job Description:

Job Summary

Our partner is building a real-time intelligence platform powered by continuous, large-scale data ingestion from thousands of public and regulatory data sources. The core product is a live knowledge graph that connects facilities, companies, permits, equipment categories, operators, and regulatory events into a unified intelligence layer.

We are looking for a highly autonomous Data Engineer with strong experience in large-scale web scraping, data pipelines, and knowledge graph architectures. This person will take ownership of critical data infrastructure, from ingestion and enrichment to storage, orchestration, and reliability.

This is a remote contractor opportunity open across LATAM.

Responsibilities

Real-Time Knowledge Graph

  • Design and operate the property graph at the center of the platform, linking facilities, companies, permits, equipment categories, operators, and regulatory events into a live, queryable intelligence layer.
  • Build entity resolution pipelines that reconcile inconsistent identifiers across thousands of sources using deterministic and probabilistic matching.
  • Enrich nodes using public datasets and registries while maintaining data quality, freshness, and performance.
  • Ensure the graph remains coherent, scalable, and highly available as source systems evolve.

Large-Scale Data Ingestion & Web Scraping

  • Build and operate ingestion systems that continuously collect data from municipal permit portals, environmental databases, GIS feeds, federal registries, and other public sources.
  • Develop scrapers, asynchronous API clients, and custom parsers for PDF, XML, GeoJSON, shapefiles, and other structured and semi-structured formats.
  • Handle anti-bot protections and manage proxy rotation, sessions, rate limiting, and fault tolerance.
  • Design scalable scraping infrastructure capable of supporting thousands of continuously changing sources.

Data Freshness & Automation

  • Design and maintain scheduled workflows that extract, enrich, deduplicate, and process data at scale.
  • Implement monitoring, alerting, retry strategies, and failure recovery mechanisms.
  • Ensure reliable concurrent execution across multiple ingestion and transformation pipelines.

Data Lake & Regional Architecture

  • Design storage architectures partitioned and indexed by geographic regions, including county, metropolitan area, and state.
  • Structure datasets to efficiently support downstream analytics, graph workloads, and regional queries at scale.

Infrastructure & Reliability

  • Build and maintain cloud-native data infrastructure across GCP and AWS environments.
  • Develop and manage orchestration workflows using Modal.
  • Improve observability, reliability, deployment automation, and operational excellence across the platform.
  • Contribute to CI/CD processes using Docker and GitHub Actions.

Tech Stack

  • Languages: Python, SQL
  • Databases: PostgreSQL, PostGIS, Supabase, Neo4j-compatible graph schemas
  • Cloud: GCP (primary), AWS
  • Orchestration: Modal
  • Data & Analytics: Databricks
  • Scraping: Playwright, Selenium, Browserbase, httpx, aiohttp
  • Transformation: Large-scale LLM batch processing
  • Observability: Sentry
  • CI/CD: Docker, GitHub Actions

Requirements

  • 2–4 years of experience in Data Engineering or Software Engineering with ownership of production systems.
  • Proven experience building and maintaining large-scale web scraping solutions across multiple websites and data sources.
  • Strong experience handling anti-bot mechanisms, proxy rotation, session management, rate limiting, and scraping reliability challenges.
  • Experience designing, building, and querying knowledge graphs or graph-based data architectures.
  • Strong PostgreSQL experience, including schema design, query optimization, and PostGIS.
  • Experience designing and operating scheduled data processing pipelines, automation workflows, monitoring, alerting, and recovery mechanisms.
  • Hands-on experience with AWS and/or GCP in production environments.
  • Experience working with messy, undocumented, and evolving real-world data sources.
  • Strong technical judgment and ability to work independently in a startup environment.
  • Excellent English communication skills (CEFR B2 or higher).

Nice to Have

  • Experience with Neo4j or property graph modeling.
  • Experience with geospatial technologies such as PostGIS, shapefiles, GeoJSON, and spatial indexing.
  • Experience working with geographic data partitioning by county, metropolitan area, or jurisdiction.
  • Familiarity with public regulatory datasets and government data sources.
  • Experience in the industrial, energy, construction, logistics, or supply chain sectors.

Position Type and Expected Hours of Work

  • Contractor position.
  • Remote across LATAM.
  • Equipment provided by the company.
  • Start date: ASAP.