concept

active

concept:inference-time-intervention-eliciting-truthful-answers-from-a-language-model-li-et-al-2023

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023)

Safety intervention that relies on modifying activations, which endogenous steering resistance (ESR) might undermine

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
cites

Concepts (1)

concept

AI Alignment and Safety
associated_with
The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Inference-Time Intervention (ITI) · method · 0.820
Method by Li et al. 2023a that adds static vectors to model activations at inference time to steer behavior
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.790
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
Do human participants demonstrate the same insight dynamics predicted by active inference in the rule-learning paradigm (currently under investigation with eye tracking and crowd-sourced reaction times)? · question · 0.782
Empirical gap explicitly acknowledged; experiments reportedly in progress at time of writing
NLA explanations can contain claims about the target model's input context that are verifiably false, but are typically thematically faithful to the context. · claim · 0.781
Key limitation identified: NLAs hallucinate specific details while preserving thematic accuracy; informs practical usage.
Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022) · concept · 0.780
Foundational paper on CoT prompting cited as basis for reasoning LLM training
Active inference LLMs extending prediction-focused language models with tighter perception-action feedback loops may naturally embody contemplative wisdom as they scale · hypothesis · 0.779
Predictive hypothesis about the Contemplative Architecture approach, based on the work of Petersen et al. 2025
Given a language model M and a statement s, does M believe s to be true? · question · 0.778
The core motivating question of the paper, framed by Christiano et al. (2021)
Friston, FitzGerald et al. (2016): Active inference and learning · concept · 0.778
Prior active inference paper providing detailed neurophysiological implementation of belief updates
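
The selection rule behind the similarity listing above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the toy embeddings, and the representation of typed edges as a set of entity names are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_vec, candidates, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs for candidates that pass the cosine threshold
    and have no typed edge to the target, sorted by descending score.

    candidates      -- hypothetical mapping of entity name -> embedding vector
    typed_neighbors -- hypothetical set of names already linked by a typed edge
    """
    hits = []
    for name, vec in candidates.items():
        if name in typed_neighbors:
            continue  # already related through a typed edge, so not listed here
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings, purely illustrative.
target = [1.0, 0.0]
candidates = {
    "a": [1.0, 0.0],  # same direction as the target
    "b": [0.0, 1.0],  # orthogonal, falls below the threshold
    "c": [1.0, 1.0],  # 45 degrees from the target
    "d": [1.0, 0.0],  # similar, but already has a typed edge
}
result = similar_without_edge(target, candidates, typed_neighbors={"d"})
```

Entities returned by such a filter are the candidates for new edges or unrecognized duplicates described above; raising the threshold trades recall for fewer spurious neighbors.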