hypothesis

active

hypothesis:if-simulators-are-not-inner-aligned-then-many-important-properties-like-prediction-orthogonality-may-not-hold

If simulators are not inner aligned, then many important properties like prediction orthogonality may not hold.

Conditional importance of inner alignment.

Source paper

extracted_from

Simulators — LessWrong

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Inner alignment framework
associated_with
The concept of inner vs outer alignment, referenced multiple times.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models · finding · 0.763
Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
The underlying simulator has no agency of its own, not even in a mimetic sense, nor beliefs, preferences or goals · claim · 0.758
Distinguishes the passive simulator from active simulacra that can appear to have agency
GPT, insofar as it is inner-aligned, is a simulator which can simulate agentic and non-agentic simulacra. · claim · 0.756
Central thesis of the post.
In buildings, especially, the success of the process will be judged by the extent to which these middle-range entities appear with their own distinct symmetries, with their own definite and distinct force as strong centers. · claim · 0.754
Proposes middle-range entity quality as the criterion for judging the success of a building process
Accuracy does not vary linearly with latent reflection directions; instead it follows a more non-linear mapping that requires deeper theoretical treatment. · claim · 0.750
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
Learned simulations can be partially observed and lazily-rendered, and still work. · claim · 0.749
One of the updates about prosaic ML simulation.
Alignment type is the only significant predictor of scores (p=0.006); architecture and parameter count do not. · finding · 0.749
Kruskal-Wallis test result: Constitutional AI predicts highest baseline; roleplay/empathy training predict lowest.
Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions. · hypothesis · 0.749
Explanation for why dictionary learning can recover many more features than dimensions.