framework
active
framework:corrigibility
Corrigibility
The property of an AI being safe to shut down or modify; discussed in the context of GPT.
Neighborhood — ranked by edge-count
Claims (1)
claim
- GPT's corrigibility explained.
Questions (1)
question
- Is GPT corrigible? · associated_with · Disambiguation exercise.
Artifacts (1)
artifact
- Simulators (LessWrong post) · mentions · The paper being extracted.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- A property that makes a segment of space stand out as a center; determined by symmetry, connectedness, convexity, etc.
- Can we disambiguate truth from closely related features such as 'commonly believed' or 'verifiable'? · question · 0.739 · Limitation noted in §7.1: scope restricted to simple statements prevents disambiguation
- Nonsensical or unphysical model outputs that result when interventions cross voids in activation space.
- The degree to which a system can be influenced by signals from brute force to rational argument; correlates with cognitive sophistication.
- The practical, reality-based judgment that guides successful unfolding and adaptation.
- Proposed algorithm using local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing
- Machine learning problem, avoided in biology via polycomputing adding new interpretations.
- Ian Goodfellow quote used to illustrate the pre-paradigmatic state of interpretability research
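
The "related by similarity" filter described above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the dictionary-of-vectors layout, and the `typed_edges` set are all hypothetical, chosen only to show how such a candidate list might be computed.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities near `target` that have no typed edge.

    embeddings:  hypothetical mapping of entity name -> embedding vector
    typed_edges: hypothetical set of names already linked to `target`
    """
    hits = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already connected by a typed edge.
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    # Highest similarity first, matching a ranked neighborhood listing.
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Entities returned by such a filter would be the candidates for new edges or duplicate checks; the 0.65 threshold is taken from the listing header, everything else is assumed.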