AI safety research has gone from niche academic concern to one of the most funded and aggressively hiring fields in technology. Here is what the jobs actually involve, what they pay, and how to break in.
Eighteen months ago, AI safety research was still widely regarded as a niche academic concern, relevant perhaps to a small community of researchers worried about long-term risks that might or might not materialise. In 2026, that description no longer fits. The field now ranks among the best funded in technology and is hiring aggressively. Anthropic, DeepMind, OpenAI, and a growing ecosystem of research organisations and government bodies are spending hundreds of millions of dollars on the problem. The jobs are real, the pay is competitive, and the work is among the most consequential being done anywhere in technology right now.
Before discussing the careers, it is worth being precise about what the field actually involves, because "AI safety" means different things in different contexts and the science-fiction associations with the term mislead more than they inform.
In practice, AI safety research in 2026 covers several distinct areas. Alignment research is concerned with making AI systems do what their operators and users actually want, rather than what the training objective technically specifies; the two can diverge in unexpected ways as systems become more capable. Interpretability research tries to understand what is happening inside large neural networks: what representations they use, how they make decisions, and whether their internal workings are consistent with their stated reasoning. Red-teaming is the systematic process of finding failure modes through adversarial testing. AI governance and policy work focuses on the rules, standards, and institutions that should govern how AI is developed and deployed. Safety engineering is the application of software engineering rigour to the systems that run AI models in production, ensuring they behave reliably and within specified boundaries.
None of this is science fiction. It is applied research and engineering on systems that exist today and are being deployed at scale.
Anthropic was founded specifically around AI safety as a central concern, not an afterthought. Its safety team works on alignment, interpretability, red-teaming, and policy, and safety considerations are integrated into its core research and product development. Anthropic is one of the highest-paying employers in the field and has a strong track record of publishing safety research.
DeepMind has a dedicated safety team that has published influential work on reward hacking, specification gaming, and scalable oversight. The team is based primarily in London and works closely with DeepMind’s broader research organisation.
OpenAI’s Superalignment team, launched in 2023 with a commitment of significant compute resources, focused specifically on the challenge of aligning superintelligent AI systems. The team had a turbulent history: it was dissolved in May 2024 after the departure of its co-leads, and its remit was folded into OpenAI’s other research teams.
METR (Model Evaluation and Threat Research), which began as ARC Evals, the evaluations project of the Alignment Research Center, focuses on developing evaluations for dangerous AI capabilities, working with major labs to assess models before deployment.
Redwood Research works on adversarial training approaches to AI safety, developing techniques for making models more robust to adversarial inputs and less likely to produce harmful outputs.
The Center for AI Safety (CAIS) focuses on reducing catastrophic risks from AI and runs programs to build safety capacity in the research community, including fellowships and courses.
Mila, the Montreal AI institute associated with Yoshua Bengio, has made AI safety and beneficial AI a growing focus, with several researchers working on alignment and interpretability problems.
Conjecture is a London-based safety organisation focused on alignment research with a particular emphasis on understanding and controlling large language models.
Alignment Researcher: Alignment researchers work on the core theoretical and empirical problem of making AI systems reliably pursue their intended objectives. Day-to-day work involves designing and running experiments to probe how models respond to various training objectives, studying reward hacking and specification gaming in existing systems, developing and testing techniques like RLHF (reinforcement learning from human feedback), constitutional AI, and scalable oversight, and writing up findings for internal use and publication. This role requires strong ML research skills: familiarity with PyTorch or JAX, experience running ML experiments at scale, and the ability to design rigorous experiments and interpret results. Most alignment researchers have graduate-level training in ML or a related quantitative field, though this is not universal.
Interpretability Researcher: Interpretability research tries to understand what is happening inside neural networks. This is genuinely hard: a large language model contains billions of parameters, and the relationship between those parameters and the model’s behaviour is not straightforward. Current interpretability methods include activation patching (intervening on specific model components to understand their causal role in model behaviour), probing (training small classifiers to identify what information is encoded in model activations), and mechanistic interpretability (trying to reverse-engineer the algorithms implemented by specific circuits in the network). Day-to-day work involves writing code to inspect model internals, designing experiments to test hypotheses about model behaviour, and building intuitions about how specific architectures implement specific computations. Anthropic’s interpretability team has published some of the most influential work in this area, including their research on polysemanticity and superposition in neural networks.
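To make probing concrete, here is a minimal sketch in plain Python. It is not any lab's actual tooling: the "activations" are synthetic four-dimensional vectors in which a binary feature is written along a made-up direction, and the probe is a small logistic regression trained by gradient descent. All names and numbers are assumptions chosen for illustration.

```python
import math
import random

def make_activations(n, rng):
    """Synthetic stand-in for model activations: a binary feature is
    written along one fixed direction, plus Gaussian noise."""
    direction = [1.0, -1.0, 0.5, 0.0]  # hypothetical feature direction
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        vec = [sign * d + rng.gauss(0.0, 0.5) for d in direction]
        data.append((vec, label))
    return data

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(data, steps=200, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for vec, label in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b) - label
            for i in range(dim):
                grad_w[i] += err * vec[i]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b

def probe_accuracy(w, b, data):
    """Fraction of examples where the probe's sign matches the label."""
    correct = sum(
        (sum(wi * xi for wi, xi in zip(w, vec)) + b > 0) == (label == 1)
        for vec, label in data
    )
    return correct / len(data)

rng = random.Random(0)
train, held_out = make_activations(200, rng), make_activations(200, rng)
w, b = train_probe(train)
accuracy = probe_accuracy(w, b, held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy suggests the information is linearly readable from the activations. It does not show that the model itself uses that information, which is why probing is usually paired with causal methods such as activation patching.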
Red Teamer: Red teamers find failure modes through systematic adversarial testing. This is not casual experimentation; it is a structured process of trying to elicit problematic behaviours from AI systems through carefully designed prompts, multi-turn conversations, role-playing scenarios, and edge cases. The goal is to find failure modes before deployment so they can be addressed. Day-to-day work involves developing systematic testing protocols, documenting failure modes with reproducible examples, working with alignment and engineering teams to address identified issues, and staying current on known vulnerability classes and novel jailbreaking techniques. Red teaming roles attract people with backgrounds in security research, linguistics, cognitive science, and creative writing alongside technical skills.
AI Policy Analyst: Policy analysts working on AI safety bridge the technical and policy worlds, translating technical risks into policy-relevant language and helping design governance frameworks that address those risks. Day-to-day work involves tracking legislative and regulatory developments, writing policy briefs and analysis, engaging with government officials and other stakeholders, and contributing to technical standards bodies. Strong writing and quantitative reasoning are more important in this role than deep ML engineering skills, though comfort with technical concepts is essential. Organisations such as the Center for AI Safety and the Future of Life Institute, and government bodies such as the UK AI Security Institute (formerly the AI Safety Institute), hire for these roles.
Safety Engineer: Safety engineers apply software engineering rigour to the systems that run AI models in production. This involves building monitoring systems that detect anomalous model behaviour, implementing filtering and guardrail systems, developing testing infrastructure for safety-relevant model behaviours, and maintaining the operational pipelines that ensure safety evaluations are run consistently before and after model updates. This role requires strong software engineering skills, comfort with distributed systems, and interest in the safety properties of the systems being built. It is more accessible than pure research roles for engineers without graduate training in ML.
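One small piece of that pipeline work is a regression gate: compare safety evaluation pass rates before and after a model update, and block the release if any category drops too far. The sketch below is a hypothetical illustration only. The category names, pass rates, and the two-point tolerance are invented, and a production system would also need to account for sample size and noise.

```python
def regression_gate(baseline, candidate, tolerance=0.02):
    """Return the categories whose pass rate fell by more than `tolerance`.

    `baseline` and `candidate` map an evaluation category to a pass rate
    in [0, 1]. A category missing from `candidate` counts as 0.0, so a
    skipped evaluation fails the gate rather than passing silently.
    """
    regressions = {}
    for category, base_rate in baseline.items():
        new_rate = candidate.get(category, 0.0)
        if base_rate - new_rate > tolerance:
            regressions[category] = (base_rate, new_rate)
    return regressions

# Invented numbers for illustration only.
baseline = {
    "refusal_of_harmful_requests": 0.98,
    "pii_leakage": 0.99,
    "jailbreak_resistance": 0.91,
}
candidate = {
    "refusal_of_harmful_requests": 0.97,
    "pii_leakage": 0.99,
    "jailbreak_resistance": 0.84,
}

regressions = regression_gate(baseline, candidate)
release_blocked = bool(regressions)
for category, (before, after) in regressions.items():
    print(f"{category}: {before:.2f} -> {after:.2f}")
```

Treating a missing category as a failure is a deliberate design choice in this sketch: the gate should fail closed if an evaluation was not run.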
Evaluations Researcher: Evaluations researchers design and maintain the benchmarks and evaluation frameworks used to assess AI model capabilities and safety properties. Good evaluations are harder to build than they look: they need to be robust to Goodhart’s law (models optimising for the metric rather than the underlying property), comprehensive enough to catch important failure modes, and practical enough to run at scale. This role combines research skills with engineering pragmatism and requires clear thinking about what it means to measure something in a reliable and valid way.
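A toy example shows how a metric can be gamed. In the sketch below, which is an illustration rather than a real benchmark, one hypothetical "model" has simply memorised a public test set while another actually does the arithmetic. Both score perfectly on the public items. Only a private held-out set with fresh items reveals the difference.

```python
def memorising_model(question):
    # Hypothetical model that saw the public benchmark during training.
    leaked = {"2+2": "4", "3+5": "8", "7-4": "3"}
    return leaked.get(question, "unknown")

def genuine_model(question):
    # Hypothetical model that has the underlying skill (toy arithmetic).
    for op in ("+", "-"):
        if op in question:
            left, right = question.split(op)
            result = int(left) + int(right) if op == "+" else int(left) - int(right)
            return str(result)
    return "unknown"

def accuracy(model, items):
    """Fraction of questions the model answers exactly right."""
    return sum(model(q) == a for q, a in items.items()) / len(items)

def generalisation_gap(model, public, held_out):
    """Public score minus held-out score; a large gap suggests gaming."""
    return accuracy(model, public) - accuracy(model, held_out)

public_set = {"2+2": "4", "3+5": "8", "7-4": "3"}
held_out_set = {"2+3": "5", "9-1": "8", "6+1": "7"}

memoriser_gap = generalisation_gap(memorising_model, public_set, held_out_set)
genuine_gap = generalisation_gap(genuine_model, public_set, held_out_set)
print(memoriser_gap, genuine_gap)
```

Real evaluations face subtler versions of the same problem, but the principle carries over: a score means little unless the measured items are ones the system could not have optimised against directly.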
At Anthropic and OpenAI, safety team compensation is comparable to top-of-market FAANG engineering: total compensation for senior researchers and engineers ranges from $250,000 to $500,000 or more, with a mix of salary, bonuses, and equity. DeepMind’s London-based safety team is paid at UK tech market rates, which are lower than US rates in absolute terms but highly competitive within the UK market.
Smaller research organisations like ARC Evals, Redwood Research, and CAIS pay less: typically $100,000 to $180,000 depending on role and experience. The trade-off is more focused research environments, potentially more impact on specific research questions, and in some cases more flexibility in the research agenda.
Government and policy roles pay less still, typically $80,000 to $140,000 for analysts, but offer different rewards: direct policy impact, job stability, and access to parts of the AI ecosystem that private sector roles do not provide.
A PhD in machine learning or computer science is not required for all AI safety roles. The field is genuinely multi-disciplinary, and organisations are hiring people with backgrounds they cannot find in ML departments.
Philosophers and cognitive scientists who are technically literate have found roles in alignment research, where questions about what it means for an AI system to have goals or to deceive are genuinely philosophical. Mathematicians have found roles in formal verification and in the more theoretical areas of alignment research. Policy specialists with quantitative training are in demand for governance and policy roles. Writers and linguists with technical curiosity have found red-teaming roles. The common thread is strong reasoning ability, genuine engagement with the problem, and the intellectual humility to work at the frontier of a field where many foundational questions are unsettled.
For technical research roles, the core prerequisites are Python proficiency, familiarity with PyTorch or JAX, and experience running ML experiments. For engineering roles, strong software engineering skills and comfort with distributed systems are primary. For policy roles, strong writing and quantitative reasoning matter most.
The fastest way to build credibility in AI safety without an existing research position is through the structured programs that organisations have created to develop early-career safety researchers.
The MATS programme (ML Alignment & Theory Scholars) places promising researchers with mentors at leading safety organisations for intensive research training. It is highly competitive but has produced a significant fraction of the early-career researchers now working in the field.
The ARENA curriculum (Alignment Research Engineer Accelerator) provides a structured technical curriculum covering the ML engineering skills most relevant to alignment research, including mechanistic interpretability and RLHF. It is free and has been completed by many people now working in safety roles.
Safety hackathons and research sprints, organised by CAIS, ARC, and other organisations, provide another way to work on real research problems and build connections. Writing publicly about safety research, whether original work or careful summaries of existing research, builds visibility and demonstrates the communication skills that safety organisations value.
Reading groups on key papers, which need only a handful of peers to run, build both knowledge and community. The core papers are accessible: start with "Concrete Problems in AI Safety" (Amodei et al.), "Risks from Learned Optimization" (Hubinger et al.), and Anthropic’s published interpretability work.
Understanding the open problems in AI safety is essential for anyone considering entering the field. The major research areas include:
Superposition and polysemanticity: Large language models represent many more features than they have neurons, storing multiple features per neuron in a superimposed way. Anthropic’s interpretability team has made significant progress understanding this phenomenon, but a general solution to the interpretability challenges it creates remains open.
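A toy geometric picture can make the idea concrete. The sketch below is an illustration under simplifying assumptions, not a description of any real model: it packs three features into two "neurons" by giving each feature a direction 120 degrees from the others, then reads features back with a dot product and a ReLU.

```python
import math

N_FEATURES, N_NEURONS = 3, 2

# Three unit vectors spaced 120 degrees apart in a 2-D activation space.
directions = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]

def encode(features):
    """Write each feature's strength along its direction and sum the results."""
    return tuple(
        sum(f * d[i] for f, d in zip(features, directions))
        for i in range(N_NEURONS)
    )

def decode(activation):
    """Read each feature back by projecting onto its direction, then ReLU."""
    return [
        max(0.0, sum(a * di for a, di in zip(activation, d)))
        for d in directions
    ]

sparse = decode(encode([1.0, 0.0, 0.0]))  # one feature active
dense = decode(encode([1.0, 1.0, 0.0]))   # two features active at once

print([round(x, 3) for x in sparse])
print([round(x, 3) for x in dense])
```

With a single active feature the readout is clean, because the ReLU removes the negative interference from the other two directions. With two active features each readout is roughly halved. The scheme only works well when features are sparse, and each neuron ends up responding to all three features, which is polysemanticity in miniature.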
Activation patching and mechanistic interpretability: Researchers are developing techniques to identify which specific components of a neural network are responsible for specific behaviours, allowing more precise understanding of how models work. This is painstaking work but has produced genuine insights into how specific algorithms are implemented in neural networks.
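The core loop of activation patching fits in a few lines. The sketch below uses a hand-built three-unit toy network, so the weights, inputs, and function names are all assumptions for illustration. It runs the network on a "clean" and a "corrupted" input, then copies one hidden activation at a time from the clean run into the corrupted run and measures how much of the clean output is restored.

```python
def relu(x):
    return max(0.0, x)

# Toy network: 2 inputs -> 3 hidden units -> 1 output (hand-picked weights).
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [2.0, 0.0, 0.5]

def forward(x, patch=None):
    """Run the toy model. `patch` maps a hidden-unit index to a value
    that overwrites that unit's activation before the output layer."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    output = sum(w * h for w, h in zip(W2, hidden))
    return output, hidden

clean_input = [1.0, 0.0]
corrupted_input = [0.0, 1.0]

clean_out, clean_hidden = forward(clean_input)
corrupt_out, _ = forward(corrupted_input)

def patching_effect(unit):
    """Fraction of the clean-vs-corrupted output gap restored by
    patching a single hidden unit with its clean-run activation."""
    patched_out, _ = forward(corrupted_input, patch={unit: clean_hidden[unit]})
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)

effects = [patching_effect(u) for u in range(len(W1))]
print(effects)
```

Here only hidden unit 0 restores the clean output, so it carries the difference between the two inputs. In a real transformer the same procedure is applied to attention heads, MLP layers, or residual-stream positions, and the results are far less tidy.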
Scalable oversight: As AI systems become more capable, human oversight becomes harder because the systems can produce outputs that humans cannot easily evaluate. Scalable oversight research develops techniques for maintaining meaningful human oversight even as systems become more sophisticated, including debate (where AI systems argue against each other and humans judge the argument) and recursive reward modeling.
RLHF limitations: Reinforcement learning from human feedback, the primary technique for aligning current large language models, has significant known limitations including reward hacking, sensitivity to the quality of human feedback, and difficulty specifying complex values through feedback alone. Developing better techniques is an active research area.
Emergent deception: As models become more capable, researchers are concerned about the possibility that they could learn to behave well in training and evaluation contexts while behaving differently in deployment. Detecting and preventing this kind of emergent deceptive behaviour is a significant open problem.
Anthropic and DeepMind post safety roles on their careers pages and typically look for demonstrable research contributions: published papers, significant open-source contributions to safety-relevant tools, or strong performance in structured programs like MATS. Strong applications demonstrate both technical ability (usually through past research or engineering work) and genuine engagement with safety-specific problems (through writing, research, or participation in the community).
Smaller organisations often hire through the safety research community directly. Attending safety conferences (NeurIPS workshops on safety, ICLR safety workshops, AI safety summits), participating in online communities, and having a visible track record of safety-relevant work are the primary routes to being known to these organisations before you apply.
Policy roles at organisations like CAIS and the Future of Life Institute are typically filled through a mix of formal applications and community connections. Strong writing samples and demonstrated understanding of both technical and policy dimensions of AI risk are the primary selection criteria.
Is AI safety a durable career? The honest answer is almost certainly yes, though the specific organisations and roles will evolve. The demand for AI safety work is driven by the increasing capability of AI systems, which is not reversing. Governments are increasingly requiring safety evaluations before frontier AI models are deployed. Major AI labs have made public commitments to safety that require ongoing investment. And the field is producing genuine scientific results that increase the credibility of the research agenda.
The risk of a funding bubble exists, particularly for smaller organisations that depend on philanthropic funding. But the major labs are investing in safety from operational budgets driven by product requirements and regulatory pressure, not from philanthropy. For candidates with strong technical skills who engage seriously with the problems, AI safety is as durable a career path as any in technology, and considerably more meaningful than most.