concept

active

concept:gemma-2-improving-open-language-models-at-a-practical-size-team-et-al-2024

Gemma 2: Improving Open Language Models at a Practical Size (Team et al., 2024)

Paper describing the Gemma 2 model family used in this study

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The generalization improvement from explicit instructions observed in Llama models (A1-A3 to F0-F2) is more pronounced for F3-F5 to F0-F2 in Gemma models. · claim · 0.785
Shows that the instruction effect, while shifting geometry, may not produce consistent generalization effects across model families.
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 (Lieberum et al., 2024) · concept · 0.774
Paper introducing the Gemma Scope SAEs used for the Gemma 2 model experiments
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.758
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2022) · concept · 0.756
Fine-tuning method paper whose technique is used in the fine-tuning experiments
Representation engineering for large-language models: Survey and research challenges (Bartoszcze et al., 2025) · concept · 0.749
Survey of representation engineering methods cited as related work
Language models are some of the most remarkable computer programs in existence. · quote · 0.749
Opening sentence setting the stage for the importance of interpretability.
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.748
Key finding about the relationship between capability and introspection.
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.746
Foundational SAE mechanistic interpretability paper