concept
active
concept:zou-et-al-2023-representation-engineering-a-top-down-approach-to-ai-transparency
Zou et al. 2023 - Representation Engineering: A Top-Down Approach to AI Transparency
Framework paper describing the broader class of methods within which SOO fine-tuning fits
Neighborhood — ranked by edge-count
Papers (3)
paper
Concepts (1)
concept
- Key prior work on representation engineering that ReflCtrl directly extends
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Property of conscious representations: they do not contain information about the fact that they are representations at the level of the representation itself
- Survey of representation engineering methods cited as related work
- A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework
- Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
- Key interpretive claim that deception has a tractable geometric signature in activation space
- Reference establishing the truthfulness/honesty distinction and need for honest AI
- Interpretation of weaker PCA separation and lower ASR in smaller models
- Defines the core concept of the paper.