
Lead Data Engineer

About LGND
LGND is an early-stage startup revolutionizing geospatial AI infrastructure. We bridge the gap between large Earth observation models and specific application developers, enabling intuitive interaction with geospatial data. Our core mission is to empower decision-makers with rapid insights from vast, complex datasets. As part of our small, dynamic team, you will play a foundational role in building tools that have never existed before.

Role Summary

We are seeking a Lead Data Engineer to design, build, and scale our inference pipeline for geospatial embeddings. This pipeline is the backbone of LGND’s technological product, integrating with a point-and-click web application to generate embeddings for geographic areas of interest based on user-defined parameters. These embeddings will populate a custom vector database designed for massive scale and speed.

The ideal candidate is a seasoned engineer who has built production-grade data pipelines, thrives under uncertainty, and is eager to collaborate across engineering, DevOps, and science disciplines. AI and geospatial experience are not required if you are willing to learn fast with our help. Over time, this role will evolve into an engineering lead position overseeing all technological components, with a focus on engineering excellence.

The role is remote. We have team members in San Francisco, Philadelphia, and Copenhagen.

Key Responsibilities

    • Build the Inference Pipeline:
      • Develop a scalable, efficient pipeline to generate geospatial embeddings based on user input, integrating parameters such as geographic area, model type, time range, tiling strategy, and imagery source.
      • Balance pre-processed tokens (e.g., cloud-free Sentinel imagery) with on-the-fly inference for optimal performance.
      • Ensure the pipeline supports billions of embeddings at scale and leverages advanced compute capabilities for fast inference, mostly on commercial clouds but also on local resources.
    • Integration and Collaboration:
      • Work closely with front-end engineers to ensure seamless integration of the pipeline into a user-friendly web application.
      • Collaborate with leadership to determine which components of the pipeline and storage system should remain proprietary versus open-source.
      • Partner with external groups like AWS and Asterik Labs for open-source contributions and technical integrations.
    • Scalability and Professionalism:
      • Design a pipeline that other high-level data engineers can immediately inherit and build upon.
      • Move large amounts of data around professionally, focusing on scale, extensibility, and maintainability.
      • Ensure compliance with best practices in data engineering, DevOps, and MLOps.
    • Enhance Existing Projects:
      • Build upon existing foundational work to increase pipeline speed, scale, and extensibility. Key repositories include:
        • embeddings-worker: A Python module that creates vector embeddings of satellite images using the Clay Foundation Model. The system splits geographic regions into smaller chips, processes them in a distributed manner, and tracks status in a database.
        • embeddings-api: A REST API module that manages the vector database and orchestrates embedding-generation tasks. It includes robust endpoints for scheduling geographic regions for processing, retrieving task status, and searching for similar vectors.
    • Future Leadership:
      • Serve as the lead for the inference pipeline, one of four core technological components at LGND (inference pipeline, fine-tuning and retrieval algorithms, vector search database, and SDK).
      • Optionally grow into an engineering manager role, overseeing future hires and cross-functional development efforts.
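To illustrate the chipping step described above (splitting a geographic region into smaller chips, as embeddings-worker does), here is a minimal pure-Python sketch. The `Chip` type, function name, and chip size are illustrative assumptions, not taken from the actual repositories:

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Chip:
    """Bounding box of one image chip, in degrees (minx, miny, maxx, maxy)."""
    minx: float
    miny: float
    maxx: float
    maxy: float

def chip_region(minx, miny, maxx, maxy, chip_deg=0.25):
    """Split a geographic bounding box into a grid of fixed-size chips.

    Edge chips are clamped to the region boundary, so the whole area is
    covered exactly once with no overlap. Chip size is a hypothetical
    parameter; the real chip size would follow the model architecture.
    """
    nx = math.ceil((maxx - minx) / chip_deg)
    ny = math.ceil((maxy - miny) / chip_deg)
    chips = []
    for j in range(ny):
        for i in range(nx):
            chips.append(Chip(
                minx + i * chip_deg,
                miny + j * chip_deg,
                min(minx + (i + 1) * chip_deg, maxx),
                min(miny + (j + 1) * chip_deg, maxy),
            ))
    return chips

# A 1.0 x 0.5 degree region with 0.25-degree chips yields a 4 x 2 grid.
chips = chip_region(10.0, 45.0, 11.0, 45.5, chip_deg=0.25)
```

In the real pipeline each `Chip` would become one unit of distributed work: fetch imagery for the box, run the model, write the embedding.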

Scope of Work: First Two Months

  1. Increase the Speed and Scale of the Pipeline:
    • Optimize the inference pipeline to efficiently handle the generation of embeddings at massive scale.
    • Focus on performance improvements to support billions of embeddings and reduce inference runtime.
  2. Tokenize Source Imagery:
    • Develop a process to "tokenize" source imagery for a given geographic region and time range.
    • Produce image chips according to the large Earth observation model architecture.
    • Store these image chips in Amazon S3 for easy recall during subsequent inference runs.
  3. Run Model Inference:
    • Implement the pipeline to run inference on a couple of existing, pre-trained models.
    • Output the resulting embeddings and store them in a scalable, performant vector search database.
    • Collaborate with external partners, such as AWS, to ensure pipeline compatibility with the vector database infrastructure.
  4. Nice-to-Have Feature:
    • Develop functionality to process source imagery into mosaics to address cloud cover and other image quality issues, improving the quality of inputs for inference.
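One way to make the S3 chip cache described in step 2 cheaply recallable (an illustrative assumption, not LGND's actual scheme) is to derive a deterministic object key from the parameters that define a chip, so a later inference run can check for a cached chip with a single lookup instead of re-tokenizing:

```python
import hashlib

def chip_key(source, model, minx, miny, maxx, maxy, start, end):
    """Build a deterministic S3 object key for a tokenized image chip.

    Encoding the imagery source, model architecture, bounding box, and
    time range means identical requests map to the same key, turning the
    cache check into a single existence test. All names are hypothetical.
    """
    bbox = f"{minx:.4f}_{miny:.4f}_{maxx:.4f}_{maxy:.4f}"
    # Hash the full parameter string so keys stay short and path-safe.
    params = f"{source}|{model}|{bbox}|{start}|{end}"
    digest = hashlib.sha256(params.encode()).hexdigest()[:16]
    return f"chips/{model}/{source}/{start}_{end}/{bbox}_{digest}.npz"

key = chip_key("sentinel-2", "clay-v1", 10.0, 45.0, 10.25, 45.25,
               "2024-01-01", "2024-06-30")
```

The key prefix also groups chips by model and source, which keeps bulk listing and lifecycle rules simple on the S3 side.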

Scope of Work: First Two Months (Expanded)

  1. Operationalize the CLIP-based Retrieval Pipeline
    • Implement and optimize a scalable inference pipeline to generate CLIP embeddings (and embeddings from other pre-trained models) for remote sensing imagery.
    • Design the system to tokenize source imagery into manageable image chips for specific geographic areas and time ranges. Store these chips efficiently in Amazon S3 for reuse.
    • Ensure flexibility to incorporate additional embedding models in the future.
  2. Experiment with Multi-Modal Retrieval
  3. Database and API Design
    • Collaborate with external partners (e.g., AWS) to design a scalable vector search database capable of handling billions of embeddings.
    • Develop APIs to allow efficient storage and retrieval of embeddings based on user-defined queries (geographic area, model, time range, and textual context).
  4. Pre-Processing for Image Quality (Nice-to-Have)
    • Develop a feature to process source imagery into cloud-free mosaics, improving image quality for inference and retrieval.
  5. Performance Optimization
    • Optimize the pipeline for speed, ensuring embeddings can be generated at scale. Explore trade-offs between pre-processed tokens and on-the-fly inference.
    • Focus on building a robust, scalable system that reduces latency while maintaining flexibility.
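The retrieval semantics behind "searching for similar vectors" can be sketched without a database. In production the embeddings would live in a vector store such as pgvector, but a brute-force cosine-similarity search shows what a top-k query computes; all names here are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, embeddings, k=3):
    """Return the k (chip_id, score) pairs most similar to the query.

    embeddings: dict mapping chip_id -> embedding vector. A vector
    database replaces this linear scan with an approximate index so the
    same query stays fast over billions of rows.
    """
    scored = [(cid, cosine_similarity(query, vec))
              for cid, vec in embeddings.items()]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]

store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
result = top_k([1.0, 0.1], store, k=2)
```

The trade-off named above (pre-processed tokens versus on-the-fly inference) lives upstream of this query: whichever path produced the embeddings, retrieval only sees the stored vectors.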

Required Technical Skills:

  • Proficiency in Python and familiarity with Docker.
  • Expertise in building production-grade data pipelines at scale (10+ years of experience preferred).
  • Familiarity with tools and frameworks like:
    • Geospatial libraries: numpy, pandas, rasterio, geopandas, xarray.
    • Machine learning: PyTorch (torch, torchdata, torchvision), timm, einops.
    • Cloud integration: boto3 for AWS.
    • Database management: SQLAlchemy, GeoAlchemy2, pgvector, psycopg2.
  • Experience with inference pipelines, including pre-processing and real-time inference strategies.

Preferred Experience:

  • Familiarity with satellite image formats and protocols (e.g., STAC, Cloud Optimized GeoTIFFs, Zarr).
  • Experience with AWS infrastructure (bonus, not required).
  • Background in MLOps and geospatial AI applications.

Soft Skills:

  • Self-led and able to navigate uncertainty.
  • Excited by the opportunity to build tools and systems that have never been built before.
  • Collaborative, humble, and eager to learn.

Cultural Values

  • Humility: You value collaboration and learning from others.
  • Integrity: You uphold honesty and transparency in your work.
  • Effectiveness: You are results-driven, with a focus on building scalable, impactful solutions.

Compensation and Benefits

  • Competitive salary based on experience.
  • Equity options in a seed-stage startup.
  • Flexible work arrangements.
  • Opportunity to play a foundational role in shaping LGND’s technological infrastructure.
EMPLOYMENT TYPE
Full-time, remote
DATE POSTED
January 6, 2025
