concept
active
concept:representation-engineering-a-top-down-approach-to-ai-transparency-zou-et-al-2023
Representation engineering: A top-down approach to AI transparency (Zou et al., 2023)
Key prior work on representation engineering that ReflCtrl directly extends
Neighborhood — ranked by edge-count
Papers (2)
paper
Concepts (1)
concept
- Framework paper describing the broader class of methods within which SOO fine-tuning fits
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Property of conscious representations: they do not contain information about the fact that they are representations at the level of the representation itself
- A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework
- Survey of representation engineering methods cited as related work
- Key interpretive claim that deception has a tractable geometric signature in activation space
- Interpretation of weaker PCA separation and lower ASR in smaller models
- The aspect of design dealing with data structures, modules, and implementation.
- Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
- Defines the core concept of the paper.