concept
active
concept:internal-features
Internal Features
Representations inside LLMs that can be intervened upon.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Monosemantic Functional Features · associated_with · Features that correspond to a single semantic concept and are effective for steering behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Criterion requiring that causal influence of internal state on description be internal, not routed through sampled outputs; rules out pseudo-introspection via self-observation.
- The latent activations or embeddings inside a neural network.
- The idea that the geometry and dynamics of genetic material itself contribute directional ordering to evolution, beyond external Darwinian selective pressure.
- The model's internal representation of uncertainty, hypothesized to trigger self-reflection.
- A representation that maintains stable activation across many tokens rather than being locally triggered by specific content.
- The possibility of a stably encoded, causally active emotional state within LLMs, as distinct from token-by-token semantic content.
- Features that activate when the model is asked about itself, invoking AI tropes and anthropomorphization.
- The inferred mechanism underlying ESR whereby the model tracks coherence of its own outputs.