concept
active
concept:other-referencing-activations

Other-Referencing Activations

Latent activations of a model as it processes inputs framed from another agent's perspective

Neighborhood — ranked by edge-count

Methods (1)

method
  • A loss function measuring the dissimilarity of latent model representations of self and other, minimized during fine-tuning
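A minimal sketch of such a dissimilarity loss, under stated assumptions: the entry does not say which distance is used, so mean squared error is a stand-in here, and the function name `self_other_dissimilarity` and the toy activation values are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def self_other_dissimilarity(self_acts: Sequence[Vector],
                             other_acts: Sequence[Vector]) -> float:
    """Mean squared difference between two activation grids of equal shape.

    Assumption: MSE is used as the dissimilarity measure; the source entry
    does not name a specific metric.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation grids must have the same number of rows")
    total, count = 0.0, 0
    for s_row, o_row in zip(self_acts, other_acts):
        if len(s_row) != len(o_row):
            raise ValueError("activation rows must have the same width")
        for s, o in zip(s_row, o_row):
            total += (s - o) ** 2
            count += 1
    if count == 0:
        raise ValueError("activation grids must not be empty")
    return total / count


# Toy activations, shape (tokens, hidden_dim); values are made up.
self_acts = [[1.0, 0.0], [0.0, 1.0]]    # input framed from the model's own perspective
other_acts = [[1.0, 0.0], [0.0, 3.0]]   # same input framed from another agent's perspective
loss = self_other_dissimilarity(self_acts, other_acts)
```

During fine-tuning, a term like this would be minimized, possibly alongside other objectives, so that the two sets of activations move closer together.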

Concepts (1)

concept

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
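The selection rule described above can be sketched as follows. Only the 0.65 threshold and the "no typed edge" condition come from this page; the embedding vectors, the treatment of edges as undirected, the descending sort, and all names (`related_by_similarity`, `typed_edges`) are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id: str,
                          embeddings: Dict[str, Sequence[float]],
                          typed_edges: Set[Tuple[str, str]],
                          threshold: float = 0.65) -> List[Tuple[str, float]]:
    """Entities with cosine >= threshold to the target and no typed edge to it."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Assumption: typed edges are treated as undirected for this filter.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Assumption: results are listed from most to least similar.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data; identifiers and vectors are hypothetical.
embeddings = {"t": [1.0, 0.0], "a": [1.0, 0.5], "b": [0.0, 1.0], "c": [1.0, 0.1]}
typed_edges = {("t", "c")}  # "c" already has a typed relation to "t"
result = related_by_similarity("t", embeddings, typed_edges)
```

In this toy run, "c" is skipped despite its high similarity because it already has a typed edge, and "b" falls below the threshold.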

  • Activations · concept · 0.818
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • Model-independent feature comparison based on correlating activation vectors across a fixed diverse dataset
  • Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
  • Pearson correlation of feature activations across 40M tokens used to measure feature similarity and universality across models
  • Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
  • Adding a steering vector in the forward direction to push model activations toward stronger reflective behavior.
  • cross-reference · concept · 0.751
    Explicit textual or graphical links between parts of a work, dynamic and virtual.
  • Activation Oracles · framework · 0.748
    Framework training LLMs to answer questions about externally-provided activation vectors
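
Two of the neighboring entries above describe simple vector operations on activations: adding a direction vector to steer behavior, and clamping activations along an axis so they stay above a minimum. A rough sketch of both on a single activation vector follows. The function names, the `scale` parameter, and the numeric floor are hypothetical; how the direction is learned and how the 25th-percentile threshold is computed are not specified on this page.

```python
import math
from typing import List, Sequence


def add_steering(acts: Sequence[float], direction: Sequence[float],
                 scale: float = 1.0) -> List[float]:
    """Return acts + scale * direction (activation steering on one vector)."""
    if len(acts) != len(direction):
        raise ValueError("activation and direction must have the same length")
    return [a + scale * d for a, d in zip(acts, direction)]


def clamp_along_axis(acts: Sequence[float], axis: Sequence[float],
                     floor: float) -> List[float]:
    """Raise the projection of acts onto axis to at least `floor`.

    Components orthogonal to the axis are left unchanged.
    """
    if len(acts) != len(axis):
        raise ValueError("activation and axis must have the same length")
    norm = math.sqrt(sum(x * x for x in axis))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    unit = [x / norm for x in axis]
    projection = sum(a * u for a, u in zip(acts, unit))
    if projection >= floor:
        return list(acts)
    shift = floor - projection
    return [a + shift * u for a, u in zip(acts, unit)]


# Toy values, made up for illustration.
steered = add_steering([1.0, 2.0], [0.5, -1.0], scale=2.0)
clamped = clamp_along_axis([0.0, 2.0], [2.0, 0.0], floor=1.5)
```

In a real model these operations would typically be applied to residual-stream activations at chosen layers and token positions; here they act on plain lists to keep the sketch self-contained.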