Job details

Senior Software Engineer, Distributed Systems & Infrastructure


About Us

At Vizcom, we empower designers at companies like Nike, General Motors, and Riot Games to turn ideas into reality faster and with more precision. Our tools integrate seamlessly into workflows, providing real-time feedback that bridges creativity and manufacturability.

We’re building a high-performance, reliable job-scheduling system that powers distributed AI/ML workflows and ephemeral jobs. Our platform must handle large-scale concurrency, orchestrate GPU workers, and provide seamless failover and retry. We value engineers who excel at designing robust infrastructure, implementing elegant distributed systems, and writing clean, maintainable code.

The Role

You will be the primary engineer designing and implementing a next-generation Job Scheduling & Distributed Computing platform. This includes everything from a fault-tolerant queue system to advanced load balancing, worker orchestration, real-time monitoring, and autoscaling. You’ll collaborate with product teams to ensure the platform can handle diverse workloads—such as ephemeral AI jobs, data processing, and high-priority tasks.

Key Responsibilities

Design & Build a job scheduling service:
- Architect a robust queuing system (Redis, Postgres, or other) to track, schedule, and distribute jobs across multiple workers/GPUs.
- Implement advanced features: priority scheduling, concurrency limits, retry logic, and timeouts.
Infrastructure & Reliability:
- Ensure the system is highly available, fault tolerant, and horizontally scalable.
- Introduce monitoring, alerting, and logging best practices for distributed workloads.
- Automate provisioning, autoscaling, and failover in cloud environments (AWS, GCP, or similar).
Worker Orchestration:
- Manage worker registration and capacity tracking.
- Implement a load balancing strategy based on resource usage (GPU, CPU, memory).
- Support ephemeral job “mailboxes,” streaming results to clients in real time.
System Integrations:
- Collaborate with AI/ML teams to integrate inference workloads (e.g., GPU-intensive tasks) into the job scheduler.
- Hook into existing deployment pipelines and internal tooling.
Performance & Observability:
- Collect and analyze metrics for scheduling latency, queue lengths, job success/failure, and worker health.
- Optimize throughput, minimize overhead, and detect performance bottlenecks early.

About You

5+ years of experience in backend or infrastructure engineering with a focus on distributed systems or HPC (high-performance computing).
Deep knowledge of concurrency patterns, job queues, or pub/sub frameworks (e.g., BullMQ, RabbitMQ, Kafka, or custom solutions).
Cloud Expertise: Comfortable deploying containerized services (Docker/Kubernetes) on AWS, GCP, or Azure. Knowledge of IaC (Pulumi, Terraform, or CDK) is a plus.
Database & Caching: Skilled with SQL/NoSQL. Familiarity with in-memory datastores like Redis for real-time queueing.
Programming: Proficient in Node.js/TypeScript (or similar backend language). Strong coding skills, comfortable writing production-grade code, testable components, and microservices.
Scalable Infra: Track record of designing and running highly scalable, resilient backends. Experience with autoscaling GPU or HPC clusters is a huge bonus.
Monitoring & DevOps: Good grasp of logging, metrics (Datadog, Prometheus, Grafana), and CI/CD pipelines.

Nice to Have

GPU / ML: Experience orchestrating GPU-intensive jobs, integrating with frameworks like PyTorch or TensorFlow.
Event-Driven: Familiarity with tRPC, GraphQL, or gRPC for real-time or streaming data flows.
Security & Networking: Knowledge of API token management, service-to-service security, TLS termination, etc.
Autoscaling: Practical experience building or tuning an autoscaler.

What We Offer

Ownership & Impact: You’ll design a critical system used by the entire organization—your code is the backbone of large-scale AI/ML workflows.
Cutting-Edge Stack: Work with GPU clusters, ephemeral job management, real-time scheduling, and advanced cloud infra.
Flexible Work Environment: Remote-friendly culture, flexible hours, and support for personal development.
Compensation & Benefits: Competitive salary, equity, healthcare, and an allowance for home office or co-working space.
Growth Opportunities: Leadership track potential—help define the engineering culture and best practices for years to come.

Software Engineer, Distributed Systems, Job Scheduling, Backend Engineering, Cloud Computing

Average salary estimate

$140,000 / year (est.)

min: $120,000

max: $160,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.
