Internal Features

Representations inside LLMs that can be intervened upon.

Neighborhood — ranked by edge-count

Monosemantic Functional Features
associated_with
Features that correspond to a single semantic concept and are effective for steering behavior.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Internality Criterion · concept · 0.757
Criterion requiring that causal influence of internal state on description be internal, not routed through sampled outputs; rules out pseudo-introspection via self-observation.
Internal model representations · concept · 0.755
The latent activations or embeddings inside a neural network.
Internal Factors in Evolution · concept · 0.745
The idea that the geometry and dynamics of genetic material itself contribute directional ordering to evolution, beyond external Darwinian selective pressure.
Internal uncertainty · concept · 0.744
The model's internal representation of uncertainty, hypothesized to trigger self-reflection.
Stateful Internal Representation · concept · 0.733
A representation that maintains stable activation across many tokens rather than being locally triggered by specific content.
internal emotional state · concept · 0.733
The possibility of a stably encoded, causally active emotional state within LLMs, as distinct from token-by-token semantic content.
Self-identity features · concept · 0.730
Features that activate when the model is asked about itself, invoking AI tropes and anthropomorphization.
Internal Consistency Monitoring · concept · 0.730
The inferred mechanism underlying ESR, whereby the model tracks the coherence of its own outputs.
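
The candidate list above is selected by two conditions: cosine similarity of at least 0.65 to this entity, and no typed edge to it. The sketch below shows one way such a filter could work. Everything other than the 0.65 threshold and the "no typed edge" condition is an assumption made for illustration: the function names, the representation of typed edges as name pairs, and the toy two-dimensional embeddings are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs that are similar to `target` but not linked to it.

    embeddings:  dict mapping entity name -> embedding vector (assumed format)
    typed_edges: iterable of (name_a, name_b) pairs, treated as undirected (assumed format)
    """
    # Collect every entity that already has a typed edge to the target.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    scored = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, round(score, 3)))

    # Highest similarity first, matching the ordering of the list above.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: the vectors are invented and only the entity names come from this page.
toy_embeddings = {
    "Internal Features": [1.0, 0.0],
    "Monosemantic Functional Features": [0.9, 0.1],  # has a typed edge, so excluded
    "Internal model representations": [0.8, 0.6],    # similar and unlinked, so kept
    "Unrelated entity": [0.0, 1.0],                  # below the threshold, so dropped
}
toy_edges = [("Internal Features", "Monosemantic Functional Features")]

candidates = untyped_neighbors("Internal Features", toy_embeddings, toy_edges)
```

With the toy data, only "Internal model representations" survives both conditions. Entries returned by a filter like this are the ones a reviewer would inspect as possible new edges or unrecognized duplicates.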