finding

active

finding:input-embedding-similarity-baseline-selects-semantically-related-but-non-reflective-tokens-e-g-await-configureawait-unchecked-that-fail-to-improve-accuracy

Input embedding similarity baseline selects semantically related but non-reflective tokens (e.g., Await, ConfigureAwait, Unchecked) that fail to improve accuracy

Demonstrates the failure mode of surface-level similarity for instruction discovery.
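
A minimal sketch of what a surface-level embedding-similarity baseline could look like, offered as an illustration and not as the paper's implementation: tokens are ranked by the cosine similarity between their input embeddings and that of a seed token. The seed token "Wait", the toy 3-dimensional vectors, the "banana" entry, and the function names `cosine` and `top_k_by_embedding` are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_by_embedding(seed, embeddings, k):
    """Rank every token except the seed by cosine similarity to the seed's embedding."""
    seed_vec = embeddings[seed]
    scored = [
        (token, cosine(seed_vec, vec))
        for token, vec in embeddings.items()
        if token != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, made-up input embeddings (hypothetical values, not taken from any model).
toy_embeddings = {
    "Wait": [1.0, 0.2, 0.0],
    "Await": [0.9, 0.3, 0.1],
    "ConfigureAwait": [0.8, 0.4, 0.2],
    "banana": [0.0, 0.1, 1.0],
}

ranked = top_k_by_embedding("Wait", toy_embeddings, k=2)
```

In this toy setup the nearest neighbors are lexically related tokens, which mirrors the failure mode described in the finding: closeness in input-embedding space does not by itself imply that a token induces reflection.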

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Claims (2)

claim

Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design.
supports
Core applied contribution claim, supported by top-k accuracy comparisons.
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity.
supports
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Input Embedding Similarity Baseline · method · 0.853
Baseline method for instruction discovery using surface-level input embedding similarity instead of steering vectors.
Steering vector-based instruction discovery outperforms input embedding similarity baseline for reflection-inducing instruction selection · finding · 0.801
Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
How do we incorporate a focus on behavioral relevance in our measures of neural similarity? · question · 0.750
Direct motivating question for MAS's design principle of causal behavioral matching.
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity. · finding · 0.750
Shows the passive vs. active divide is more important than the specific wording of instructions.
Baseline scores blend together at least three different things: latent reflective capacity, default accessibility, and stability of access. · claim · 0.749
Conceptual decomposition arising from the data showing different models dissociate these traits.
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.749
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge.
Certain representation learning algorithms boil down to a simple rule: find an embedding in which similarity equals PMI · claim · 0.747
Core theoretical claim about the target of representation learning.
Cross-model semantic convergence under self-referential processing suggests the presence of a shared attractor state that transcends variance across training procedures · claim · 0.746
Interpretive claim from Experiment 3; GPT, Claude, Gemini families converge on similar descriptive style despite independent training.
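
The selection rule stated for this list (cosine ≥ 0.65 and no typed edge) can be sketched as a simple filter. This is an illustrative assumption about how such a list might be produced, not the page's actual implementation; the function name `related_candidates` and the two placeholder entries with made-up scores are hypothetical.

```python
def related_candidates(similarities, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no typed edge, best first."""
    candidates = [
        (name, score)
        for name, score in similarities.items()
        if score >= threshold and name not in typed_neighbors
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


similarities = {
    "Input Embedding Similarity Baseline": 0.853,  # score as listed on this page
    "entity with a typed edge": 0.90,              # made-up placeholder
    "unrelated entity": 0.40,                      # made-up placeholder
}
# Entities already connected by a typed relation (e.g. "supports") are excluded.
typed_neighbors = {"entity with a typed edge"}

candidates = related_candidates(similarities, typed_neighbors)
```

Only the first entry survives the filter: the second is excluded because it already has a typed edge, and the third because it falls below the threshold.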