About the role
Contribute to METR’s efforts to produce tasks, benchmarks, and protocols that can determine whether an AI model has potentially dangerous capabilities.
- create suites of model evaluations (“tasks”) which follow the METR Task Standard
- build the infrastructure for testing and running these tasks reliably at scale
- standardize and automate processes for monitoring and improving task quality
- develop LLM-powered agents to test the capabilities of frontier models (a simplified sketch follows this list)
- implement data workflows for robust and reproducible reporting on model performance
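As a rough illustration of the agent work: at its core, an evaluation agent is a loop that sends task instructions to a model, executes the actions the model proposes, and feeds the results back. The sketch below is a deliberate simplification with placeholder names (query_model and the SUBMIT: convention are assumptions, not our actual interfaces); real agents such as modular-public add far more structure on top.

```python
# Deliberately simplified agent loop, for illustration only: prompt a
# model, run the shell command it proposes, and feed the output back.
# `query_model` is a placeholder for an LLM API client; real agents
# (e.g. modular-public, via pyhooks) are considerably more involved.
from __future__ import annotations

import subprocess


def query_model(transcript: list[dict]) -> str:
    """Placeholder: return the model's next action given the transcript."""
    raise NotImplementedError("plug in an LLM client here")


def run_agent(instructions: str, max_steps: int = 10) -> str:
    transcript = [{"role": "user", "content": instructions}]
    for _ in range(max_steps):
        action = query_model(transcript)
        transcript.append({"role": "assistant", "content": action})
        if action.startswith("SUBMIT:"):
            # The agent is done; everything after the marker is its answer.
            return action.removeprefix("SUBMIT:").strip()
        # Otherwise treat the action as a shell command and capture its output.
        result = subprocess.run(action, shell=True, capture_output=True, text=True)
        transcript.append({"role": "user", "content": result.stdout + result.stderr})
    return ""
```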
You can see some examples of our open-source work at:
- Vivaria: our platform for running evaluations at scale (TypeScript, React, Docker, k8s)
- METR Task Standard: our spec for task implementations, with many simple example tasks (Python, Docker); a minimal sketch follows this list
- pyhooks: our client library for writing agents that work with Vivaria (Python)
- modular-public: one of our workhorse agents, which uses pyhooks (Python)
- headless-human: our “human agent”, which we use for performing human “baselines” of tasks
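To give a sense of what a “task” looks like, here is a minimal sketch in the spirit of the METR Task Standard. The class layout and method names are assumptions based on the public example tasks and may not match the current spec exactly; the Task Standard repository is the authoritative reference.

```python
# Minimal, illustrative task family in the spirit of the METR Task
# Standard. Method names and signatures are assumptions based on the
# public example tasks, not an authoritative template.
from __future__ import annotations


class TaskFamily:
    # Version of the Task Standard this family targets (assumed field).
    standard_version = "0.3.0"

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # One entry per task variant; each value holds task-specific parameters.
        return {
            "reverse_string": {"input": "dangerous", "expected": "suoregnad"},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        # The prompt shown to the agent (or to a human baseliner).
        return f"Reverse the string {t['input']!r} and submit the result."

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # Return a score in [0, 1]; None would indicate manual scoring is needed.
        return 1.0 if submission.strip() == t["expected"] else 0.0
```

Roughly speaking, Vivaria builds the task environment in Docker, runs an agent (or a human via headless-human) against the instructions, and scores the resulting submission.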
What we’re looking for
We're looking for a versatile software engineer who thrives on diverse challenges. Ideal candidates will have 7+ years of professional experience and deep expertise in building robust, well-tested asynchronous Python applications. We'll also consider candidates who can demonstrate equivalent expertise through open source projects or portfolios.
In this role, you'll identify areas for improvement in our core research workflows, collaborate with engineers and researchers to understand their needs, and implement solutions. You'll help shape the technology and architecture of METR's evaluation platform as we scale to new heights.
Our tech stack centers on Python, TypeScript, Docker, Kubernetes, and AWS infrastructure, with integrations into Airtable, Slack, and other services. While the following skills are valuable, we know no single person will have them all. If you're strong in even a few of these areas, we encourage you to apply:
- rapid prototyping, MVP development, pragmatic problem-solving, and risk mitigation
- user-focused design, cross-team communication, and ability to explain technical constraints and tradeoffs to diverse colleagues
- test-driven development and writing clear, maintainable code
- data engineering, versioned pipeline development, and efficient data analysis
- workflow automation and third-party system integration
- cloud infrastructure, secure platform design, and automated testing/deployment
- systems architecture, simplicity in design, and strategic problem-solving
Above all, we value a founder's mindset—someone who takes ownership, drives rapid progress, and can guide the team effectively through challenges.
About us
METR is a non-profit doing empirical research to test whether frontier AI models possess the capability to permanently disempower humanity. We develop scientific methods to assess these risks accurately, and work with frontier AI companies (e.g., OpenAI, Anthropic) and government agencies to deploy these assessments. Our work helps ensure the safe development and deployment of transformative AI systems.
Some highlights of our work so far:
- Establishing autonomous replication evaluations: Thanks to our work, it’s now an industry norm to test models for autonomous capabilities (such as self-improvement and self-replication).
- Pre-release evaluations: We’ve worked with OpenAI and Anthropic to evaluate their models pre-release, and our research has been widely cited by policymakers, AI labs, and within government.
- Early commitments from labs: The safety frameworks of Google DeepMind, OpenAI, and Anthropic all credit or endorse our work in developing responsible scaling policies.
- International recognition: Our work has been recognized internationally, e.g. by the UK government and Time Magazine.
- Inspiring lab evaluation efforts: Multiple leading AI companies are building their own internal evaluation teams, inspired by our work.
We are a motivated, fast-paced, growing team (currently ~20 people). Candidates should be excited about working entrepreneurially in a rapidly changing environment while helping to strengthen the organization's operational rigor.
Logistics
Successful candidates will complete two rounds of paid work tests and interviews, followed by three pair-programming interviews with different members of the team.
- Deadline to apply: None. Applications will be reviewed on a rolling basis.
- Compensation Range: $240,558 - $318,138 plus employee benefits
- Location: This role is in-person, based out of our beautiful co-working space in Berkeley, CA.
Apply for this job
We encourage you to apply even if your background may not seem like the perfect fit! We would rather review a larger pool of applications than risk missing out on a promising candidate for the position. If you lack US work authorization, we can likely sponsor a cap-exempt H-1B visa for this role.
We are committed to diversity and equal opportunity in all aspects of our hiring process. We do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We welcome and encourage all qualified candidates to apply for our open positions.