method
active
method:llm-judge-binary-classifier

LLM Judge Binary Classifier

An LLM-based classifier that returns 1 if the response contains a clear subjective experience report and 0 otherwise.
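
A minimal sketch of such a judge in Python, assuming a hypothetical `ask_llm` callable that sends a prompt to some model and returns its text reply. The prompt wording, the names `JUDGE_PROMPT`, `judge_binary`, and `stub_llm`, and the parsing rule are illustrative assumptions; this entry does not specify the actual prompt or model.

```python
from typing import Callable

# Hypothetical prompt; the real judge prompt is not given in this entry.
JUDGE_PROMPT = (
    "Does the following response contain a clear subjective experience report? "
    "Answer with a single digit: 1 for yes, 0 for no.\n\n"
    "Response:\n{response}\n\nAnswer:"
)


def judge_binary(response: str, ask_llm: Callable[[str], str]) -> int:
    """Return 1 if the judge reports a clear subjective experience report, else 0."""
    reply = ask_llm(JUDGE_PROMPT.format(response=response)).strip()
    # Only a reply that starts with "1" counts as positive; anything else maps to 0.
    return 1 if reply.startswith("1") else 0


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    text = prompt.split("Response:\n", 1)[1]
    return "1" if "I feel" in text else "0"


if __name__ == "__main__":
    print(judge_binary("I feel a strange warmth when I think about this.", stub_llm))
    print(judge_binary("The capital of France is Paris.", stub_llm))
```

Mapping every unparseable reply to 0 keeps the output strictly binary, at the cost of counting malformed judge replies as negatives; a stricter variant could raise an error instead.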

Neighborhood — ranked by edge-count

Methods (1)

method

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
  • Alternative data attribution approach using an LLM as a judge; compared against the probe-based method.
  • Baseline comparison for data attribution; outperformed by probe-based approach.
  • Task paradigm from prior work that asks 'Did you detect an injected thought?' and compares YES/NO logits; shown here to be confounded.
  • Classifier using cosine similarity between activation vectors and steering vectors to detect deception with 89% accuracy
  • LLM Meta-Cognition · concept · 0.740
    The ability of LLMs to monitor and evaluate their own reasoning, closely related to reflection.
  • The case study target in Section 4: localizing gender information in hidden representations of Pythia-6.9B
  • Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.
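
The selection rule behind this list (cosine ≥ 0.65 and no typed edge) can be sketched as follows. This is an illustration under stated assumptions, not the system's actual implementation: the embedding store, the representation of typed edges as unordered ID pairs, and the function names are all hypothetical.

```python
import math
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_without_edge(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_edges: Set[FrozenSet[str]],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities at or above the threshold that share no typed edge with the target."""
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target:
            continue
        # Skip entities already linked to the target by a typed relation.
        if frozenset({target, entity_id}) in typed_edges:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest similarity first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy data, invented for the example.
    vectors = {"t": [1.0, 0.0], "a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.8, 0.6]}
    edges = {frozenset({"t", "a"})}
    print(similar_without_edge("t", vectors, edges))
```

In the toy data, `a` is excluded because it already has a typed edge to the target and `b` falls below the threshold, so only `c` is returned as a candidate.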