concept

active

concept:a-general-language-assistant-as-a-laboratory-for-alignment-askell-et-al-2021

A General Language Assistant as a Laboratory for Alignment (Askell et al. 2021)

HHH (helpful, honest, harmless) training framework that Claude was trained with prior to the experiments

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Glaese et al. 2022: Improving alignment of dialogue agents via targeted human judgements · concept · 0.784
Alignment paper cited as example of RLHF fine-tuning technique; ref 19
I don't follow AI alignment research in any depth, but I am noticing a striking disconnect between the concepts appearing in those discussions and recent advances in AI, especially GPT-3. · quote · 0.768
Quote from a question that sparked the post, highlighting the gap between theory and practice.
What is the appropriate metric for measuring representational alignment, given active debate on merits and deficiencies of all proposed measures? · question · 0.756
Open methodological question acknowledged as limitation
"This will simply be another 'finger pointing at the moon,' in a language that might provide a generative bridge between the contemplative and scientific domains of inquiry." · quote · 0.755
Frames the paper's epistemic status and intent; invokes the traditional Buddhist metaphor to situate the formal model
How Can We Ensure Alignment Between Artificial Intelligence · question · 0.753
Probe-based data attribution for alignment · concept · 0.750
What factors determine the generalisation of learned alignment maps beyond training data? · question · 0.747
Open question about the gap between Theorem 1's existence proof and practical learnability
The role most consistently similar to the default Assistant activation across models is 'generalist'; other shared similar roles include 'interpreter' and 'synthesizer' · claim · 0.747
Characterizes what the Assistant persona resembles in terms of human archetypes
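
The "Related by similarity" filter above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the dictionary-of-vectors layout, and the representation of typed edges as unordered ID pairs are all assumptions made for illustration. Only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings:  hypothetical dict mapping entity ID -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that
                 already have a typed relation (e.g. "cites")
    threshold:   0.65, the cutoff shown on the page
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the candidates for new edges or unrecognized duplicates described above; an entity that already has a typed relation, like the citing paper, is excluded even if its embedding is close.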