concept

active

concept:yao-et-al-2023-react-synergizing-reasoning-and-acting-in-language-models

Yao et al. 2023: ReAct — synergizing reasoning and acting in language models

Paper on combining reasoning and acting in LLMs; cited as an example of extended dialogue agent capabilities

Neighborhood — ranked by edge-count

concept

Tool Use in Dialogue Agents
supports
Extension of dialogue agent capabilities to external tool use, so that role-played actions have real consequences

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Perez et al. 2022: Discovering language model behaviors with model-written evaluations · concept · 0.779
Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim
Steering Language Models With Activation Engineering (Turner et al., 2023) · concept · 0.777
Foundational paper introducing activation steering methodology used in this work
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.769
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al. 2022) · concept · 0.757
Prior work studying sycophancy and desire not to be shut down in RLHF-trained models
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.753
Foundational SAE mechanistic interpretability paper
Zhu et al. 2024 - Language models represent beliefs of self and others · concept · 0.748
Key prior finding that LLMs can internally represent beliefs of self and others, motivating the SOO approach
Andreas 2022: Language models as agent models · concept · 0.747
Paper hypothesising LLMs model agent beliefs/desires/intentions with preliminary GPT-3 evidence; cited as ref 2
Bakhtin et al. 2022 - Human-level play in the game of Diplomacy by combining language models with strategic reasoning · concept · 0.747
Key reference documenting Meta's CICERO using deception in Diplomacy despite cooperative design intent