Bias in language models

Features related to gender, racial, and ethnic biases, slurs, and hate speech.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
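
The selection rule stated above (cosine ≥ 0.65 and no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the toy two-dimensional embeddings, and the representation of typed edges as a set of name pairs are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge
    to `target`, sorted by descending score. Hypothetical helper."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Toy data (assumed, not taken from the page).
embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.0],  # identical direction, but already linked by a typed edge
    "C": [0.0, 1.0],  # orthogonal, below the threshold
    "D": [1.0, 1.0],  # cosine about 0.707
    "E": [0.9, 0.1],  # cosine about 0.994
}
typed_edges = {("A", "B")}

candidates = related_by_similarity("A", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

In this toy setup only D and E are returned for A: B is excluded because a typed edge already exists, and C falls below the threshold.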

Language Models · concept · 0.827
Primary substrate for manifold steering experiments; demonstrates method on reasoning and in-context tasks.
Language Model · concept · 0.825
Primary test domain for manifold steering, including reasoning and ICL tasks.
Zhu et al. 2024 - Language models represent beliefs of self and others · concept · 0.793
Key prior finding that LLMs can internally represent beliefs of self and others, motivating the SOO approach.
Anisotropy in Language Models · concept · 0.792
Property of smaller saturated models making hidden states harder to access via alignment maps.
Perez et al. 2022: Discovering language model behaviors with model-written evaluations · concept · 0.788
Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim.
Language models implement algorithms humans have tried and failed to write by hand for decades · claim · 0.785
Opening interpretive claim about the remarkable nature of language models.
Autoregressive Language Modeling · concept · 0.781
Training objective interpretable as optimizing a diverse set of tasks, and thus subject to multitask scaling convergence pressures.
Language models are few-shot learners (Brown et al., 2020) · concept · 0.780
Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.