concept
active
concept:guardrails
Guardrails
Constraints imposed via fine-tuning to reduce harmful output; they can also attenuate expressivity and creativity
Neighborhood — ranked by edge-count
Claims (1)
claim
- Extends the role-play framing to explain the effect of RLHF on dialogue agents
Methods (1)
method
- Fine-Tuning via Reinforcement Learning · associated_with · Technique used to impose guardrails on base LLMs, analogized to censorship on the simulator's range of simulacra
Concepts (1)
concept
- Paper noting that RLHF guardrails can attenuate model expressivity and creativity; cited as ref 30
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Mathematical classification system for two-dimensional repeating patterns; mentioned as analogous to frieze groups.
- The property that qualities vary slowly, subtly, gradually across the extent of each living thing; gradients arise as natural responses to changing circumstances and create field-like character that points toward and establishes centers
- Selection mechanism in concurrent logic languages where guards are evaluated in parallel.
- A zoning code based on generative sequences rather than fixed criteria, enabling well-adapted building form to arise.
- The ultimate non-material reality behind matter, experienced when living structure opens a window to the I.
- The color representing private gardens and positive outdoor space in the four-fold pattern.
- Game-theoretic LLM evaluation benchmark with short-horizon interactions, cited.
- Spatially periodic firing neurons in medial entorhinal cortex; TEM-t learns representations resembling these.