About LGND
LGND is an early-stage startup revolutionizing geospatial AI infrastructure. We bridge the gap between large Earth observation models and the developers of specific applications, enabling intuitive interaction with geospatial data. Our core mission is to empower decision-makers with rapid insights from vast, complex datasets. As part of our small, dynamic team, you will play a foundational role in building tools that have never existed before.
Role Summary
We are seeking a Lead Data Engineer to design, build, and scale our inference pipeline for geospatial embeddings. This pipeline is the backbone of LGND’s technological product, integrating with a point-and-click web application to generate embeddings for geographic areas of interest based on user-defined parameters. These embeddings will populate a custom vector database designed for massive scale and speed.
The ideal candidate is a seasoned engineer with experience in production-grade data pipelines who thrives under uncertainty and is eager to collaborate across engineering, DevOps, and science disciplines. AI and geospatial experience are not required if you are willing to learn fast, with our help. Over time, this role will evolve into an engineering lead position, overseeing all technological components while focusing on engineering excellence.
This role is remote. We have team members in San Francisco, Philadelphia, and Copenhagen.
Key Responsibilities
- Build the Inference Pipeline:
- Develop a scalable, efficient pipeline to generate geospatial embeddings based on user input, integrating parameters such as geographic area, model type, time range, tiling strategy, and imagery source.
- Balance pre-processed tokens (e.g., cloud-free Sentinel imagery) with on-the-fly inference for optimal performance.
- Ensure the pipeline supports billions of embeddings at scale and leverages advanced compute capabilities for fast inference, mostly on commercial clouds but also on local resources.
- Integration and Collaboration:
- Work closely with front-end engineers to ensure seamless integration of the pipeline into a user-friendly web application.
- Collaborate with leadership to determine which components of the pipeline and storage system should remain proprietary versus open-source.
- Partner with external groups like AWS and Asterik Labs for open-source contributions and technical integrations.
- Scalability and Professionalism:
- Design a pipeline that other high-level data engineers can immediately inherit and build upon.
- Move large amounts of data around professionally, focusing on scale, extensibility, and maintainability.
- Ensure compliance with best practices in data engineering, DevOps, and MLOps.
- Enhance Existing Projects:
- Build upon existing foundational work to increase pipeline speed, scale, and extensibility. Key repositories include:
- embeddings-worker: A Python module that creates vector embeddings of satellite images using the Clay Foundation Model. The system splits geographic regions into smaller chips, processes them in a distributed manner, and manages status tracking in a database.
- embeddings-api: A REST API module that manages the vector database and orchestrates embedding generation tasks. It includes robust endpoints for scheduling geographic regions for processing, retrieving task status, and searching for similar vectors.
- Future Leadership:
- Serve as the lead for the inference pipeline, one of four core technological components at LGND (inference pipeline, fine-tuning and retrieval algorithms, vector search database, and SDK).
- Optionally grow into an engineering manager role, overseeing future hires and cross-functional development efforts.
Scope of Work: First Two Months
- Increase the Speed and Scale of the Pipeline:
- Optimize the inference pipeline to efficiently handle the generation of embeddings at massive scale.
- Focus on performance improvements to support billions of embeddings and reduce inference runtime.
- Tokenize Source Imagery:
- Develop a process to "tokenize" source imagery for a given geographic region and time range.
- Produce image chips according to the large Earth observation model architecture.
- Store these image chips in Amazon S3 for easy recall during subsequent inference runs.
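The tokenization steps above can be sketched as a pure window computation. This is an illustrative sketch only, not LGND's actual implementation: the function names (`chip_windows`, `chip_key`), the 256-pixel chip size, and the S3 key scheme are all assumptions chosen for the example.

```python
from typing import Iterator, Tuple


def chip_windows(width: int, height: int,
                 chip: int = 256) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (col_off, row_off, w, h) windows that tile a raster into chips.

    Edge chips are clipped to the raster bounds rather than padded; the
    model's preprocessing would decide how to handle partial chips.
    """
    for row in range(0, height, chip):
        for col in range(0, width, chip):
            yield (col, row, min(chip, width - col), min(chip, height - row))


def chip_key(scene_id: str, col: int, row: int) -> str:
    """Deterministic S3 key per chip, so later inference runs can recall a
    chip without re-reading the source scene (key scheme is an assumption)."""
    return f"chips/{scene_id}/{col}_{row}.tif"
```

With rasterio, each tuple would translate to a windowed read (`rasterio.windows.Window(col, row, w, h)`), and the resulting chip could be uploaded to S3 with boto3's `put_object` under the computed key.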
- Run Model Inference:
- Implement the pipeline to run inference on a couple of existing, pre-trained models.
- Output the resulting embeddings and store them in a scalable, performant vector search database.
- Collaborate with external partners, such as AWS, to ensure pipeline compatibility with the vector database infrastructure.
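The retrieval side of the inference step can be illustrated with a brute-force cosine-similarity search in pure Python. This is a sketch of the ranking semantics only; at billions of embeddings the work would be done by an approximate index in the vector database (e.g. pgvector's IVFFlat or HNSW indexes), and the function and parameter names here are assumptions.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def top_k(query: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 5) -> List[str]:
    """Return the ids of the k stored embeddings most similar to the query.

    Brute force for clarity; a production vector database replaces this
    scan with an approximate nearest-neighbour index.
    """
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chip_id for chip_id, _ in ranked[:k]]
```

A "find areas that look like this one" query then reduces to embedding the query chip and calling `top_k` against the stored embeddings.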
- Nice-to-Have Feature:
- Develop functionality to process source imagery into mosaics to address cloud cover and other image quality issues, improving the quality of inputs for inference.
Scope of Work: First Two Months, Expanded
- Operationalize the CLIP-based Retrieval Pipeline
- Implement and optimize a scalable inference pipeline to generate CLIP embeddings (and embeddings from other pre-trained models) for remote sensing imagery.
- Design the system to tokenize source imagery into manageable image chips for specific geographic areas and time ranges. Store these chips efficiently in Amazon S3 for reuse.
- Ensure flexibility to incorporate additional embedding models in the future.
- Experiment with Multi-Modal Retrieval
- Database and API Design
- Collaborate with external partners (e.g., AWS) to design a scalable vector search database capable of handling billions of embeddings.
- Develop APIs to allow efficient storage and retrieval of embeddings based on user-defined queries (geographic area, model, time range, and textual context).
- Pre-Processing for Image Quality (Nice-to-Have)
- Develop a feature to process source imagery into cloud-free mosaics, improving image quality for inference and retrieval.
- Performance Optimization
- Optimize the pipeline for speed, ensuring embeddings can be generated at scale. Explore trade-offs between pre-processed tokens and on-the-fly inference.
- Focus on building a robust, scalable system that reduces latency while maintaining flexibility.
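The trade-off above between pre-processed tokens and on-the-fly inference is essentially a cache-or-compute decision. A minimal sketch, assuming a dict-backed cache and a caller-supplied inference function (both hypothetical stand-ins for object storage and a batched GPU model call):

```python
from typing import Callable, Dict, List


def embed_with_cache(chip_id: str,
                     cache: Dict[str, List[float]],
                     run_inference: Callable[[str], List[float]]) -> List[float]:
    """Return a pre-processed embedding if one exists, else compute it
    on the fly and store it for reuse.

    In production the cache would be S3 / the vector database rather than
    a dict, and run_inference a batched model call; the control flow is
    the same.
    """
    if chip_id not in cache:
        cache[chip_id] = run_inference(chip_id)
    return cache[chip_id]
```

The latency question then becomes how much of the expected query area to pre-populate (e.g. cloud-free Sentinel chips) versus computing lazily on first request.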
Required Technical Skills:
- Proficiency in Python and familiarity with Docker.
- Expertise in building production-grade data pipelines at scale (10+ years of experience preferred).
- Familiarity with tools and frameworks like:
- Scientific and geospatial libraries: numpy, pandas, rasterio, geopandas, xarray.
- Machine learning: PyTorch (torch, torchdata, torchvision), timm, einops.
- Cloud integration: boto3 for AWS.
- Database management: SQLAlchemy, GeoAlchemy2, pgvector, psycopg2.
- Experience with inference pipelines, including pre-processing and real-time inference strategies.
Preferred Experience:
- Familiarity with satellite image formats and protocols (e.g., STAC, Cloud Optimized GeoTIFFs, Zarr).
- Experience with AWS infrastructure (bonus, not required).
- Background in MLOps and geospatial AI applications.
Soft Skills:
- Self-led and able to navigate uncertainty.
- Excited by the opportunity to build tools and systems that have never been built before.
- Collaborative, humble, and eager to learn.
Cultural Values
- Humility: You value collaboration and learning from others.
- Integrity: You uphold honesty and transparency in your work.
- Effectiveness: You are results-driven, with a focus on building scalable, impactful solutions.
Compensation and Benefits
- Competitive salary based on experience.
- Equity options in a seed-stage startup.
- Flexible work arrangements.
- Opportunity to play a foundational role in shaping LGND’s technological infrastructure.