Application Programming Interface Access to LLMs

Relatively unconstrained API access to powerful LLMs, which vastly expands the range of possible dialogue agent actions and the associated risks.

Neighborhood — ranked by edge-count

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Automated interpretability pipeline using LLMs · method · 0.737
Using Claude 3 Opus to generate feature explanations and predict held-out activations.
Linear Representation of Concepts in LLMs · concept · 0.735
The finding that interpretable concepts including character traits are encoded as linear directions in transformer residual streams
Emotion Features in LLMs · concept · 0.726
Internal representations encoding emotion concepts in large language models, identified by probing and SAE methods
LLM introspection on internal computations is architecturally permitted; whether models leverage this is an empirical question. · claim · 0.718
Core claim directly challenged by prior work denying introspection; forms foundation for Koan Battery introspection studies.
Large Language Models (LLMs) · concept · 0.716
Transformer-based models like GPT-4, LaMDA, PaLM; assessed for GWT indicators.
LLM Safety Alignment · concept · 0.714
The training-based safety mechanisms that jailbreak attacks attempt to bypass, potentially via reflection suppression.
Reflection in LLMs · concept · 0.713
The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
Non-Linear Representations in LLMs · concept · 0.712
Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.