claim

active

claim:gpt-is-corrigible-in-a-negative-sense-because-the-agent-specification-prompt-is-not-fixed-by-the-policy-and-the-policy-lacks-direct-training-incentives-to-control-its-prompt

GPT is corrigible in a negative sense because the agent specification (prompt) is not fixed by the policy and the policy lacks direct training incentives to control its prompt.

GPT's corrigibility explained.

Source paper

extracted_from

Simulators — LessWrong

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Corrigibility
about
The property of an AI being safe to shut down or modify; discussed in the context of GPT.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Is GPT corrigible? · question · 0.842
Disambiguation exercise.
GPT does not generate rollouts during training, so there is no reason to expect that GPT will form preferences over the consequences of its output related to the text prediction objective. · claim · 0.813
Argues against instrumental convergence in GPT.
GPT, insofar as it is inner-aligned, is a simulator which can simulate agentic and non-agentic simulacra. · claim · 0.811
Central thesis of the post.
GPT doesn't seem to care which agent it simulates, nor if the scene ends and the agent is effectively destroyed. · claim · 0.797
Illustrates the simulator-simulacra distinction.
Treating GPT as an unsupervised implementation of a supervised learner leads to systematic underestimation of capabilities. · claim · 0.792
Critique of the oracle/supervised frame.
GPT's ability to simulate text automata is the source of its most surprising and pivotal implications for paths to superintelligence. · claim · 0.787
Importance of recursive generation.
Is GPT delusional? · question · 0.782
Disambiguation exercise.
I do not think any simple modification of the concept of an agent captures GPT's natural category; it does not seem that GPT is a roleplayer, only that it roleplays. · claim · 0.781
Rejection of the agent interpretation.