concept
active
concept:bias-in-language-models
Bias in language models
Features related to gender, racial, and ethnic biases, slurs, and hate speech.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Primary substrate for manifold steering experiments; demonstrates method on reasoning and in-context tasks.
- Primary test domain for manifold steering, including reasoning and ICL tasks
- Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
- Property of smaller saturated models making hidden states harder to access via alignment maps
- Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim
- Language models implement algorithms humans have tried and failed to write by hand for decades · claim · 0.785 · Opening interpretive claim about the remarkable nature of language models.
- Training objective interpretable as optimizing a diverse set of tasks; thus subject to multitask scaling convergence pressures
- Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.