concept
active
concept:application-programming-interface-access-to-llms
Application Programming Interface Access to LLMs
Relatively unconstrained API access to powerful LLMs, which vastly expands the range of actions a dialogue agent can take and the risks that come with them
Neighborhood — ranked by edge-count
Claims (1)
claim
- Safety-relevant claim showing that the role-play framing does not diminish the seriousness of potential harms
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Using Claude 3 Opus to generate feature explanations and predict held-out activations.
- The finding that interpretable concepts, including character traits, are encoded as linear directions in transformer residual streams.
- Internal representations encoding emotion concepts in large language models, identified by probing and SAE methods.
- Core claim directly challenged by prior work denying introspection; forms foundation for Koan Battery introspection studies.
- Transformer-based models like GPT-4, LaMDA, PaLM; assessed for GWT indicators.
- The training-based safety mechanisms that jailbreak attacks attempt to bypass, potentially via reflection suppression.
- The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
- Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.