Job details

Researcher: Multimodal

About Cartesia

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.

We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team combines deep expertise in model innovation and systems engineering, paired with a design-minded product engineering team, to build and ship cutting-edge models and experiences.

We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors, and 90+ angels across many industries, including the world's foremost experts in AI.

The Role

• Conduct cutting-edge research at the intersection of machine learning, multimodal data, and generative modeling to advance the state of AI across audio, text, vision, and other modalities.

• Develop novel algorithms for multimodal understanding and generation, leveraging new architectures, training algorithms, datasets, and inference techniques.

• Design and build models that enable seamless integration of modalities for multimodal reasoning on streaming data.

• Lead the creation of robust evaluation frameworks to benchmark model performance on multimodal datasets and tasks.

• Collaborate closely with cross-functional teams to translate research breakthroughs into impactful products and applications.

What We’re Looking For

• Expertise in machine learning, multimodal learning, and generative modeling, with a strong research track record in top-tier conferences (e.g., CVPR, ICML, NeurIPS, ICCV).

• Proficiency in deep learning frameworks such as PyTorch or TensorFlow, with experience in handling diverse data modalities (e.g., audio, video, text).

• Strong understanding of state-of-the-art techniques for multimodal modeling, such as autoregressive and diffusion modeling, and deep understanding of architectural tradeoffs.

• Passion for exploring the interplay between modalities to solve complex problems and create groundbreaking applications.

• Excellent problem-solving skills, with the ability to independently tackle research challenges and collaborate effectively with multidisciplinary teams.

Nice-to-Haves

• Experience working with multimodal datasets, such as audio-visual datasets, video-captioning datasets, or large-scale cross-modal corpora.

• Background in designing or deploying real-time multimodal systems in resource-constrained environments.

• Early-stage startup experience or experience working in fast-paced R&D environments.

Our culture

🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.

🚢 We ship fast. All of our work is novel and cutting-edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality and design along the way.

🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.

Our perks

🍽 Lunch, dinner and snacks at the office.

🏥 Fully covered medical, dental, and vision insurance for employees.

🏦 401(k).

✈️ Relocation and immigration support.

🦖 Your own personal Yoshi.

Average salary estimate

$125,000 per year (est.)

Min: $100,000

Max: $150,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Researcher: Multimodal, Cartesia

Are you excited about the future of AI? Cartesia is looking for a Researcher: Multimodal to join our team in San Francisco. In this role, you'll pioneer new models and architectures that advance the integration of audio, text, and vision. Your research will center on multimodal data as you design and develop algorithms that improve how machines understand and process different modes of information. We work closely with cross-functional teams to turn cutting-edge research into impactful products, and your expertise in machine learning and generative modeling will be vital as you tackle complex challenges. Our culture prizes execution speed without sacrificing quality or design, and we give everyone the resources they need to succeed. Perks include fully covered medical insurance, meals at the office, and a strong commitment to inclusivity. If you're passionate about multimodal learning and ready to make your mark, we want to hear from you!

Frequently Asked Questions (FAQs) for Researcher: Multimodal Role at Cartesia

What are the key responsibilities of a Researcher: Multimodal at Cartesia?

As a Researcher: Multimodal at Cartesia, you will be responsible for conducting advanced research in machine learning and multimodal data integration. This includes developing novel algorithms for seamless multimodal reasoning, creating robust evaluation frameworks, and collaborating with multidisciplinary teams to translate breakthroughs into innovative products. Your role will be pivotal in shaping the future of AI, focusing on transforming how machines process diverse data types.