About Us
At Vizcom, we empower designers at companies like Nike, General Motors, and Riot Games to turn ideas into reality faster and with more precision. Our tools integrate seamlessly into workflows, providing real-time feedback that bridges creativity and manufacturability.
We’re building a high-performance, reliable job-scheduling system that powers distributed AI/ML workflows and ephemeral jobs. Our platform must handle large-scale concurrency, orchestrate GPU workers, and provide seamless failover and retry. We value engineers who excel at designing robust infrastructure, implementing elegant distributed systems, and writing clean, maintainable code.
You will be the primary engineer designing and implementing a next-generation Job Scheduling & Distributed Computing platform. This includes everything from a fault-tolerant queue system to advanced load balancing, worker orchestration, real-time monitoring, and autoscaling. You’ll collaborate with product teams to ensure the platform can handle diverse workloads—such as ephemeral AI jobs, data processing, and high-priority tasks.
Key Responsibilities
Design & Build a job scheduling service:
Architect a robust queuing system (Redis, Postgres, or other) to track, schedule, and distribute jobs across multiple workers/GPUs.
Implement advanced features: priority scheduling, concurrency limits, retry logic, and timeouts.
Infrastructure & Reliability:
Ensure the system is highly available, fault tolerant, and horizontally scalable.
Introduce monitoring, alerting, and logging best practices for distributed workloads.
Automate provisioning, autoscaling, and failover in cloud environments (AWS, GCP, or similar).
Worker Orchestration:
Manage worker registration and capacity tracking.
Implement a load balancing strategy based on resource usage (GPU, CPU, memory).
Support ephemeral job “mailboxes,” streaming results to clients in real time.
System Integrations:
Collaborate with AI/ML teams to integrate inference workloads (e.g., GPU-intensive tasks) into the job scheduler.
Hook into existing deployment pipelines and internal tooling.
Performance & Observability:
Collect and analyze metrics for scheduling latency, queue lengths, job success/failure, and worker health.
Optimize throughput, minimize overhead, and detect performance bottlenecks early.
Qualifications
5+ years of experience in backend or infrastructure engineering with a focus on distributed systems or HPC (high-performance computing).
Deep knowledge of concurrency patterns, job queues, or pub/sub frameworks (e.g., BullMQ, RabbitMQ, Kafka, or custom solutions).
Cloud Expertise: Comfortable deploying containerized services (Docker/Kubernetes) on AWS, GCP, or Azure. Knowledge of IaC (Pulumi, Terraform, or CDK) is a plus.
Database & Caching: Skilled with SQL/NoSQL. Familiarity with in-memory datastores like Redis for real-time queueing.
Programming: Proficient in Node.js/TypeScript (or a similar backend language). Strong coding skills and comfort writing production-grade code, testable components, and microservices.
Scalable Infra: Track record of designing and running highly scalable, resilient backends. Experience with autoscaling GPU or HPC clusters is a huge bonus.
Monitoring & DevOps: Good grasp of logging, metrics (Datadog, Prometheus, Grafana), and CI/CD pipelines.
GPU / ML: Experience orchestrating GPU-intensive jobs, integrating with frameworks like PyTorch or TensorFlow.
Event-Driven: Familiarity with tRPC, GraphQL, or gRPC for real-time or streaming data flows.
Security & Networking: Knowledge of API token management, service-to-service security, TLS termination, etc.
Autoscaling: Practical experience building or tuning an autoscaler.
What We Offer
Ownership & Impact: You’ll design a critical system used by the entire organization—your code is the backbone of large-scale AI/ML workflows.
Cutting-Edge Stack: Work with GPU clusters, ephemeral job management, real-time scheduling, and advanced cloud infra.
Flexible Work Environment: Remote-friendly culture, flexible hours, and support for personal development.
Compensation & Benefits: Competitive salary, equity, healthcare, and an allowance for home office or co-working space.
Growth Opportunities: Leadership track potential—help define the engineering culture and best practices for years to come.