concept

active

concept:zou-et-al-2023-representation-engineering-a-top-down-approach-to-ai-transparency

Zou et al. 2023 - Representation Engineering: A Top-Down Approach to AI Transparency

Framework paper describing representation engineering, the broader class of methods within which SOO (Self-Other Overlap) fine-tuning fits

Neighborhood — ranked by edge-count

paper

concept

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Representational Transparency · concept · 0.782
Property of conscious representations: they do not contain information about the fact that they are representations at the level of the representation itself
Representation engineering for large-language models: Survey and research challenges (Bartoszcze et al., 2025) · concept · 0.775
Survey of representation engineering methods cited as related work
Representation Engineering · framework · 0.768
A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework
Bai et al. 2022: Constitutional AI — harmlessness from AI feedback · concept · 0.765
Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
Representation engineering successfully quantifies deception via high-accuracy steering vectors, establishing it as a measurable property of model representations · claim · 0.762
Key interpretive claim that deception has a tractable geometric signature in activation space
Evans et al. 2021 - Truthful AI: Developing and Governing AI that Does Not Lie · concept · 0.754
Reference establishing the truthfulness/honesty distinction and need for honest AI
Representational abstraction of truth may emerge more clearly with model scale · claim · 0.751
Interpretation of weaker PCA separation and lower ASR in smaller models
The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. · quote · 0.746
Defines the core concept of the paper.
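
The neighborhood filter above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as pairs of entity ids; the function name `untyped_neighbors`, the `embeddings` and `typed_edges` structures, and the toy vectors are hypothetical and are not taken from this page. Only the 0.65 threshold comes from the listing.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target
    but share no typed edge with it, sorted by descending score.

    embeddings:  dict mapping entity id -> list of floats (hypothetical layout)
    typed_edges: iterable of (id_a, id_b) pairs, treated as undirected (assumption)
    """
    target = embeddings[target_id]
    # Entities already connected to the target by a typed edge are excluded.
    linked = {b if a == target_id else a
              for a, b in typed_edges
              if target_id in (a, b)}
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is similar but already linked, "c" is dissimilar.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [1.0, 0.1],
    "c": [0.0, 1.0],
    "d": [1.0, 0.2],
}
toy_edges = [("a", "d")]
candidates = untyped_neighbors("a", toy_embeddings, toy_edges)
```

On the toy data only "b" is returned: "d" is filtered out because it already has a typed edge to "a", and "c" falls below the threshold. Entities surfaced this way are the candidates for new edges or duplicate checks described above.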