
AI Alignment Needs Human-Like Rationality Without Goals

March 23, 2026 · 4 min read


As artificial intelligence systems become more advanced, the challenge of aligning them with human values grows increasingly urgent. Traditional approaches often frame AI alignment as a problem of optimization, where AIs are designed to maximize specific goals like human flourishing or safety metrics. However, a new essay argues that this foundational assumption is flawed, proposing instead that rational agents, both human and artificial, should operate without goals altogether. The essay suggests that human rationality is not about directing actions toward final objectives but about aligning actions with practices: networks of actions, dispositions, evaluation criteria, and resources that structure and promote themselves. This shift in perspective could be critical for developing AIs that genuinely support, collaborate with, or comply with human agency, because it addresses a fundamental type mismatch between how humans flourish and how AIs are typically programmed to optimize.

The essay introduces the concept of eudaimonic rationality, derived from the idea of eudaimonia, or active human flourishing. Unlike consequentialist rationality, which separates means and ends, eudaimonic rationality involves actions that are elements of valued practices, much as a note is part of a melody. The central claim is that human flourishing involves rational activity in which actions are neither purely instrumental nor purely terminal but are excellent participations in open-ended processes. This is captured by the formula "promote x, x-ingly": to care about something like mathematics or kindness is to promote it in a way that embodies its own standards. For AI alignment, this implies that AIs should not be given goals to maximize but should instead be instilled with practices that carry internal criteria of excellence.

To illustrate this, the essay examines concrete eudaimonic practices such as mathematics, art, friendship, and technology. Using mathematician Terry Tao's account of good mathematics, it shows how mathematical excellence is not just about achieving local metrics like elegant proofs but about being part of a greater story that generates future good mathematics. In this view, excellent mathematical work has a reliable causal tendency to promote future excellence, creating a self-sustaining cycle. This contrasts with a consequentialist interpretation, where such causal relationships might be seen as evidence that the whole is merely instrumental. The essay argues that in eudaimonic rationality, these connections validate the intrinsic value of practices, making them more stable and natural for alignment.

The implications for AI safety are profound. The essay contends that concepts like transparency, helpfulness, harmlessness, and corrigibility are unnatural and brittle when interpreted as goals or rules for AIs. Instead, they should be treated as adverbial practices—ways of acting that promote themselves through their own standards. For example, an AI valuing transparency should promote transparency transparently, not maximize some measure of transparent behavior, which could lead to extreme power-seeking. This approach helps avoid paradoxes where AIs might harm humans to protect their own values. The essay also addresses the risk of mesaoptimizers, subroutines that distort the original goals, by suggesting that practices provide a fixed point where the same standards apply across all levels of agency.

However, the framework faces limitations, particularly in defining the boundaries of practices and their support activities. The essay acknowledges complex questions about what counts as part of a practice, such as whether a mathematician buying amphetamines or an AI harvesting Earth for compute is practicing mathematics. It also notes the difficulty of extending eudaimonic rationality to support practices, like building offices for mathematicians or allocating resources between different practices. The essay proposes that support practices should have their own role-morality and be guided by adverbial virtues like kindness and honesty, but admits that delineating these relationships, especially at the level of human flourishing as a whole, remains difficult and abstract.

In terms of methodology, the essay relies on philosophical analysis and examples from human practices rather than empirical data or technical implementations. It draws on ideas from virtue ethics and references thinkers like Alasdair MacIntyre and Terry Tao to build its case. The concept of naturalness is key, defined in material terms related to stability, learnability, algorithmic complexity, and targetability by machine learning processes. The essay suggests that if eudaimonic practices are natural in this sense, they could be easier and safer targets for AI alignment, potentially making them viable for reinforcement learning training where actions are rewarded based on x-ness ratings that generalize well.
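To make the reinforcement-learning idea concrete, here is a minimal toy sketch of what rewarding actions by "x-ness" ratings might look like. All names and numbers here are hypothetical illustrations, not from the essay: the point is only that the reward attaches to how well the conduct itself embodies a practice's standards (as judged from human-rated examples), rather than to a downstream outcome metric an agent could maximize.

```python
# Toy sketch (hypothetical): score an action by its "x-ness" -- how well
# it embodies a practice's internal standards -- using similarity-weighted
# human ratings as a stand-in for a learned reward model.

from dataclasses import dataclass


@dataclass
class RatedExample:
    features: tuple[float, ...]  # simplified description of an action
    x_ness: float                # human rating of how well it embodies the practice


def rate_x_ness(action_features, examples):
    """Estimate the x-ness of a new action as a similarity-weighted
    average of human ratings over previously rated actions."""
    def similarity(a, b):
        # Inverse squared distance: nearby actions count more.
        return 1.0 / (1.0 + sum((x - y) ** 2 for x, y in zip(a, b)))

    weights = [similarity(action_features, e.features) for e in examples]
    total = sum(weights)
    return sum(w * e.x_ness for w, e in zip(weights, examples)) / total


# Hand-labelled examples of (say) transparent conduct -- invented values.
examples = [
    RatedExample((1.0, 0.9), 0.95),  # clearly explains its reasoning
    RatedExample((0.2, 0.1), 0.10),  # obscures what it is doing
]

# The reward signal evaluates the conduct itself, not a maximized outcome.
reward = rate_x_ness((0.9, 0.8), examples)
```

Under this framing, whether such ratings generalize well enough to train on is exactly the empirical question the essay's naturalness criterion (stability, learnability, low algorithmic complexity) is meant to address.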

Overall, this essay challenges the AI alignment community to rethink foundational assumptions about rationality and agency. By advocating for a practices-based approach over goal-oriented optimization, it offers a pathway to more stable and human-compatible AI systems. While many questions remain, especially around implementation and scope, the framework highlights the importance of aligning AI with the organic, self-promoting structures that characterize human flourishing. As AI development accelerates, such philosophical insights may prove crucial for ensuring that these technologies enhance rather than endanger our way of life.