Corrigibility

The property of an AI system being safe to shut down or modify; discussed in the context of GPT.

Neighborhood — ranked by edge-count

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Coherence · concept · 0.759
A property that makes a segment of space stand out as a center; determined by symmetry, connectedness, convexity, etc.
Can we disambiguate truth from closely related features such as 'commonly believed' or 'verifiable'? · question · 0.739
Limitation noted in §7.1: scope restricted to simple statements prevents disambiguation
incoherence · concept · 0.736
Nonsensical or unphysical model outputs that result when interventions cross voids in activation space.
Persuadability · concept · 0.731
The degree to which a system can be influenced by signals from brute force to rational argument; correlates with cognitive sophistication.
common sense · concept · 0.730
The practical, reality-based judgment that guides successful unfolding and adaptation.
Algorithm 1: Harmlessness Classification · method · 0.729
Proposed algorithm that uses local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing.
catastrophic forgetting · concept · 0.728
A machine learning problem, avoided in biology via polycomputing, which adds new interpretations.
"For interpretability, I don't think we even have the right definitions." · quote · 0.727
Ian Goodfellow quote used to illustrate the pre-paradigmatic state of interpretability research.