Lead Data Engineer

About LGND
LGND is an early-stage startup revolutionizing geospatial AI infrastructure. We bridge the gap between large Earth observation models and specific application developers, enabling intuitive interaction with geospatial data. Our core mission is to empower decision-makers with rapid insights from vast, complex datasets. As part of our small, dynamic team, you will play a foundational role in building tools that have never existed before.

Role Summary

We are seeking a Lead Data Engineer to design, build, and scale our inference pipeline for geospatial embeddings. This pipeline is the backbone of LGND’s technological product, integrating with a point-and-click web application to generate embeddings for geographic areas of interest based on user-defined parameters. These embeddings will populate a custom vector database designed for massive scale and speed.

The ideal candidate is a seasoned engineer who has experience in production-grade data pipelines, thrives under uncertainty, and is eager to collaborate across engineering, DevOps, and science disciplines. AI and geospatial experience are not required if you are willing to learn fast with our help. Over time, this role will evolve into an engineering lead position, overseeing all technological components while focusing on engineering excellence.

The role is remote. We have team members in San Francisco, Philadelphia, and Copenhagen.

Key Responsibilities

    • Build the Inference Pipeline:
      • Develop a scalable, efficient pipeline to generate geospatial embeddings based on user input, integrating parameters such as geographic area, model type, time range, tiling strategy, and imagery source.
      • Balance pre-processed tokens (e.g., cloud-free Sentinel imagery) with on-the-fly inference for optimal performance.
      • Ensure the pipeline supports billions of embeddings at scale and leverages advanced compute capabilities for fast inference, mostly on commercial clouds but also on local resources.
    • Integration and Collaboration:
      • Work closely with front-end engineers to ensure seamless integration of the pipeline into a user-friendly web application.
      • Collaborate with leadership to determine which components of the pipeline and storage system should remain proprietary versus open-source.
      • Partner with external groups like AWS and Asterik Labs for open-source contributions and technical integrations.
    • Scalability and Professionalism:
      • Design a pipeline that other high-level data engineers can immediately inherit and build upon.
      • Move large amounts of data around professionally, focusing on scale, extensibility, and maintainability.
      • Ensure compliance with best practices in data engineering, DevOps, and MLOps.
    • Enhance Existing Projects:
      • Build upon existing foundational work to increase pipeline speed, scale, and extensibility. Key repositories include:
        • embeddings-worker: A Python module that creates vector embeddings of satellite images using the Clay Foundation Model. The system splits geographic regions into smaller chips, processes them in a distributed manner, and manages status tracking in a database.
        • embeddings-api: A REST API module that manages the vector database and orchestrates embedding generation tasks. It includes robust endpoints for scheduling geographic regions for processing, retrieving task status, and searching for similar vectors.
    • Future Leadership:
      • Serve as the lead for the inference pipeline, one of four core technological components at LGND (inference pipeline, fine-tuning and retrieval algorithms, vector search database, and SDK).
      • Optionally grow into an engineering manager role, overseeing future hires and cross-functional development efforts.
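
The chip-splitting and status-tracking behavior described for embeddings-worker can be illustrated with a minimal sketch. This is not code from the repository: the names Chip, ChipStatus, and split_into_chips are hypothetical, the grid is defined in degrees for simplicity (a real pipeline would more likely tile in pixels at the model's input size), and an in-memory dictionary stands in for the database.

```python
import math
from dataclasses import dataclass
from enum import Enum


class ChipStatus(Enum):
    """Hypothetical processing states for a chip."""
    PENDING = "pending"
    DONE = "done"


@dataclass(frozen=True)
class Chip:
    row: int
    col: int
    bounds: tuple  # (min_lon, min_lat, max_lon, max_lat)


def split_into_chips(bounds, chip_size_deg):
    """Tile a bounding box into a regular grid, clipping edge chips to the box."""
    min_lon, min_lat, max_lon, max_lat = bounds
    n_cols = math.ceil((max_lon - min_lon) / chip_size_deg)
    n_rows = math.ceil((max_lat - min_lat) / chip_size_deg)
    chips = []
    for row in range(n_rows):
        for col in range(n_cols):
            west = min_lon + col * chip_size_deg
            south = min_lat + row * chip_size_deg
            east = min(west + chip_size_deg, max_lon)
            north = min(south + chip_size_deg, max_lat)
            chips.append(Chip(row, col, (west, south, east, north)))
    return chips


# Assumed example region; every chip starts as PENDING until a worker finishes it.
chips = split_into_chips((12.0, 55.0, 13.0, 55.5), chip_size_deg=0.25)
status = {chip: ChipStatus.PENDING for chip in chips}
status[chips[0]] = ChipStatus.DONE
```

Because each chip is an independent unit with its own status, workers can process chips in parallel and a failed chip can be retried without redoing the whole region.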

Scope of Work: First Two Months

  1. Increase the Speed and Scale of the Pipeline:
    • Optimize the inference pipeline to efficiently handle the generation of embeddings at massive scale.
    • Focus on performance improvements to support billions of embeddings and reduce inference runtime.
  2. Tokenize Source Imagery:
    • Develop a process to "tokenize" source imagery for a given geographic region and time range.
    • Produce image chips according to the large Earth observation model architecture.
    • Store these image chips in Amazon S3 for easy recall during subsequent inference runs.
  3. Run Model Inference:
    • Implement the pipeline to run inference on a couple of existing, pre-trained models.
    • Output the resulting embeddings and store them in a scalable, performant vector search database.
    • Collaborate with external partners, such as AWS, to ensure pipeline compatibility with the vector database infrastructure.
  4. Nice-to-Have Feature:
    • Develop functionality to process source imagery into mosaics to address cloud cover and other image quality issues, improving the quality of inputs for inference.
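
One way to make stored image chips easy to recall is to give each chip a deterministic object key, so a later run can tell which chips already exist and which still need to be produced. The sketch below is an assumption, not LGND's actual layout: the key scheme, chip_key, and plan_work are hypothetical, and a Python set stands in for a listing of the S3 bucket (in practice this would go through an S3 client such as boto3).

```python
from datetime import date


def chip_key(source, model, day, row, col):
    """Build a deterministic key so the same chip always maps to the same path."""
    return f"chips/{source}/{model}/{day.isoformat()}/r{row:05d}_c{col:05d}.tif"


def plan_work(keys, existing_keys):
    """Split requested keys into those already stored and those still to produce."""
    cached = [k for k in keys if k in existing_keys]
    missing = [k for k in keys if k not in existing_keys]
    return cached, missing


# Assumed example values for imagery source, model name, and acquisition date.
day = date(2025, 1, 6)
requested = [chip_key("sentinel-2", "clay", day, 0, col) for col in range(3)]
already_stored = {requested[0]}  # stand-in for the bucket contents

cached, missing = plan_work(requested, already_stored)
```

Only the keys in missing would need fresh tokenization. The ones in cached can be read back directly, which is the trade-off between pre-processed tokens and on-the-fly work that the posting describes.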

Scope of Work: First Two Months, expanded

  1. Operationalize the CLIP-based Retrieval Pipeline
    • Implement and optimize a scalable inference pipeline to generate CLIP embeddings (and embeddings from other pre-trained models) for remote sensing imagery.
    • Design the system to tokenize source imagery into manageable image chips for specific geographic areas and time ranges. Store these chips efficiently in Amazon S3 for reuse.
    • Ensure flexibility to incorporate additional embedding models in the future.
  2. Experiment with Multi-Modal Retrieval
  3. Database and API Design
    • Collaborate with external partners (e.g., AWS) to design a scalable vector search database capable of handling billions of embeddings.
    • Develop APIs to allow efficient storage and retrieval of embeddings based on user-defined queries (geographic area, model, time range, and textual context).
  4. Pre-Processing for Image Quality (Nice-to-Have)
    • Develop a feature to process source imagery into cloud-free mosaics, improving image quality for inference and retrieval.
  5. Performance Optimization
    • Optimize the pipeline for speed, ensuring embeddings can be generated at scale. Explore trade-offs between pre-processed tokens and on-the-fly inference.
    • Focus on building a robust, scalable system that reduces latency while maintaining flexibility.
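
Retrieval by user-defined query can be pictured as a metadata filter followed by a similarity ranking. The sketch below is a brute-force illustration under stated assumptions: Record, cosine_similarity, and search are hypothetical names, the vectors are two-dimensional toy values, and only the model filter is shown. A production system handling billions of embeddings would rely on an indexed vector database rather than a linear scan.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    chip_id: str
    model: str
    vector: list


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query, model, top_k=2):
    """Keep records from the requested model, then rank them by similarity."""
    candidates = [r for r in records if r.model == model]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r.vector, query),
        reverse=True,
    )
    return [r.chip_id for r in ranked[:top_k]]


# Toy data: record "d" matches the query exactly but belongs to another model.
records = [
    Record("a", "clip", [1.0, 0.0]),
    Record("b", "clip", [0.0, 1.0]),
    Record("c", "clip", [0.8, 0.6]),
    Record("d", "other", [1.0, 0.0]),
]
result = search(records, query=[1.0, 0.0], model="clip")
```

Filtering before ranking keeps embeddings from different models out of the same comparison, since their vector spaces are generally not comparable.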

Required Technical Skills:

  • Proficiency in Python and familiarity with Docker.
  • Expertise in building production-grade data pipelines at scale (10+ years of experience preferred).
  • Familiarity with tools and frameworks like:
    • Geospatial libraries: numpy, pandas, rasterio, geopandas, xarray.
    • Machine learning: PyTorch (torch, torchdata, torchvision), timm, einops.
    • Cloud integration: boto3 for AWS.
    • Database management: SQLAlchemy, GeoAlchemy2, pgvector, psycopg2.
  • Experience with inference pipelines, including pre-processing and real-time inference strategies.

Preferred Experience:

  • Familiarity with satellite image formats and protocols (e.g., STAC, Cloud Optimized GeoTIFFs, Zarr).
  • Experience with AWS infrastructure (bonus, not required).
  • Background in MLOps and geospatial AI applications.

Soft Skills:

  • Self-led and able to navigate uncertainty.
  • Excited by the opportunity to build tools and systems that have never been built before.
  • Collaborative, humble, and eager to learn.

Cultural Values

  • Humility: You value collaboration and learning from others.
  • Integrity: You uphold honesty and transparency in your work.
  • Effectiveness: You are results-driven, with a focus on building scalable, impactful solutions.

Compensation and Benefits

  • Competitive salary based on experience.
  • Equity options in a seed-stage startup.
  • Flexible work arrangements.
  • Opportunity to play a foundational role in shaping LGND’s technological infrastructure.

What You Should Know About Lead Data Engineer, LGND AI, Inc.

At LGND, we're on a mission to revolutionize geospatial AI infrastructure, and we're looking for a Lead Data Engineer to join our small but mighty team! As a Lead Data Engineer at our innovative startup, you'll have the opportunity to design, build, and scale our cutting-edge inference pipeline for geospatial embeddings. This is quite an exciting role as you’ll be at the heart of our technological product. Your work will ensure that our point-and-click web application efficiently generates embeddings based on user-defined parameters for geographic areas of interest. Imagine being a part of something truly groundbreaking; these embeddings will be stored in a specially designed vector database that can handle massive scale and speed. We're seeking a seasoned engineer who has a proven track record in production-grade data pipelines. You should thrive in uncertain environments, as this role is all about collaboration across disciplines including engineering, DevOps, and science. While experience in AI or geospatial domains isn't a prerequisite, a willingness to learn quickly will be crucial. Initially, you'll focus on optimizing and enhancing existing projects while also laying the groundwork for future technological components. Over time, you'll naturally evolve into a leadership role, guiding others to maintain engineering excellence at LGND. If you're excited about building the tools of tomorrow and want the flexibility of a remote role, we encourage you to check us out!

Frequently Asked Questions (FAQs) for Lead Data Engineer Role at LGND AI, Inc.
What are the responsibilities of the Lead Data Engineer at LGND?

As a Lead Data Engineer at LGND, you will be tasked with designing and building a scalable inference pipeline for geospatial embeddings. This involves integrating various user-defined parameters to generate embeddings, ensuring fast performance while handling billions of embeddings. You'll also collaborate closely with front-end engineers to ensure a seamless integration with our web application, and partner with external groups for technical integrations.

What qualifications do I need to apply for the Lead Data Engineer role at LGND?

To be considered for the Lead Data Engineer position, it is preferred that you have over 10 years of experience in building production-grade data pipelines. Proficiency in Python and familiarity with tools such as Docker, AWS, and various geospatial libraries are key. Although experience specific to AI or geospatial domains is not mandatory, a genuine eagerness to learn and adapt is essential.

Is experience in geospatial AI necessary for the Lead Data Engineer position at LGND?

While having experience in AI and geospatial fields can be beneficial, it is not a strict requirement for the Lead Data Engineer role at LGND. We highly value candidates who are willing to learn and adapt quickly. Our team is ready to provide the necessary training to help you succeed in this innovative environment.

What do the first two months look like for the Lead Data Engineer at LGND?

In the first two months as a Lead Data Engineer at LGND, you will focus on optimizing the inference pipeline to handle embeddings efficiently. You'll work on developing tokenization processes for imagery and implementing the pipeline to run inference on pre-trained models. Additionally, collaboration with external partners will be vital to ensure compatibility within our growing infrastructure.

What are the growth opportunities for the Lead Data Engineer role at LGND?

As a Lead Data Engineer at LGND, there are substantial growth opportunities. The role is designed to evolve over time, with potential to transition into an engineering lead or manager position, overseeing future hires and directing cross-functional development efforts. You’ll play a crucial part in shaping the team's future and have the chance to lead new initiatives.

Common Interview Questions for Lead Data Engineer
How do you approach building scalable data pipelines as a Lead Data Engineer?

When building scalable data pipelines, I focus on designing the architecture to accommodate increased data volumes and optimize performance. Key strategies include modular design, efficient data storage, and fault tolerance. I also advocate for continuous monitoring and performance tuning to maintain efficiency.

Can you explain your experience with geospatial libraries in relation to data engineering?

Although I might not have extensive experience in geospatial libraries, I am familiar with tools like NumPy, GeoPandas, and Rasterio. I understand that leveraging these libraries can enhance the processing of large geospatial datasets, and I’m eager to deepen my knowledge by working on real-world applications within the role.

What strategies do you use to enhance collaboration between data engineering and other teams?

Enhancing collaboration involves open communication and regular check-ins between teams. I also find it effective to use collaborative tools that allow for shared access to documentation and project updates. Participating in cross-functional meetings early in the project helps everyone understand their role and the larger mission.

How would you handle performance bottlenecks in a data pipeline?

To address performance bottlenecks in a data pipeline, I would first conduct a comprehensive analysis to identify the source of the delay. After pinpointing the issues, I would optimize processing steps, perhaps by implementing parallel processing or caching strategies, and continuously monitor system performance for ongoing improvements.

Describe your experience with cloud integration for data pipeline architecture.

I have integrated various cloud services within data pipeline architecture using AWS tools like S3 and Lambda. My approach has involved leveraging cloud computing to ensure scalability and flexibility, allowing for efficient storage and processing of large datasets without local infrastructure limitations.

What is your experience with MLOps and its relevance to data engineering?

While I have foundational knowledge of MLOps concepts, I see it as a vital aspect of data engineering, particularly in ensuring models are deployed and managed efficiently. My focus is on implementing robust data pipelines that support continuous integration and delivery for machine learning models.

What’s your approach to mentoring junior engineers?

My approach to mentoring junior engineers includes providing them with constructive feedback, guiding them through aspects of complex projects, and encouraging questions. I believe in creating an environment where they feel comfortable exploring new ideas and making their own contributions toward projects.

How do you ensure data quality in your engineering processes?

Ensuring data quality involves implementing validation checks and automated monitoring systems to detect anomalies early. Additionally, I focus on documentation of standards and best practices, making sure that all team members adhere to data quality protocols throughout the lifecycle of the data.

Describe a challenging data engineering project and how you overcame obstacles.

In a previous project, we faced significant challenges with data throughput that affected our real-time processing capabilities. By conducting a thorough analysis and revising our architecture, including the implementation of a message queue for efficient data handling, we were able to overcome the obstacles and significantly improve performance.

Why do you want to work as a Lead Data Engineer at LGND?

I am particularly excited about the opportunity to work at LGND because of its innovative approach to geospatial AI. The chance to be part of a foundational team that builds new tools and enhances technology resonates deeply with my career aspirations. I am eager to leverage my skills to contribute to transformative projects while collaborating with talented professionals.

Employment type: Full-time, remote
Date posted: January 6, 2025