GPT-5.1 SJT Response Scoring

A frontier LLM, run at temperature 0, scores situational judgment test (SJT) responses on a 1-5 Likert scale, conditioned on the construct definition and the SJT stem.
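A minimal sketch of how such a scoring call might be wired up, assuming the prompt simply concatenates the construct definition, the stem, and the response. The prompt wording, the function names, and the `complete` callable (a stand-in for whatever client actually sends a prompt to the model) are all hypothetical and not taken from the source.

```python
import re
from typing import Callable


def build_prompt(construct_definition: str, stem: str, response: str) -> str:
    """Assemble a scoring prompt (hypothetical wording, for illustration only)."""
    return (
        "You are rating a response to a situational judgment test (SJT) item.\n"
        f"Construct definition: {construct_definition}\n"
        f"SJT stem: {stem}\n"
        f"Response: {response}\n"
        "Rate how strongly the response reflects the construct on a 1-5 "
        "Likert scale. Reply with a single integer."
    )


def parse_likert(text: str) -> int:
    """Return the first standalone digit from 1 to 5 found in the model reply."""
    match = re.search(r"\b([1-5])\b", text)
    if match is None:
        raise ValueError(f"no 1-5 score found in reply: {text!r}")
    return int(match.group(1))


def score_response(
    complete: Callable[[str, float], str],
    construct_definition: str,
    stem: str,
    response: str,
) -> int:
    """Score one response; `complete(prompt, temperature)` is an assumed interface."""
    prompt = build_prompt(construct_definition, stem, response)
    # Temperature 0 keeps the rating as deterministic as the backend allows.
    reply = complete(prompt, 0.0)
    return parse_likert(reply)
```

Passing the model call in as a plain callable keeps the sketch independent of any particular SDK; a real client wrapper would be substituted for `complete`.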

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

GPT-4.1 · concept · 0.774
OpenAI model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
GPT-4 · concept · 0.762
Large language model underlying ChatGPT and Bing Chat; used for illustrative quotes in the paper
GPT-4.1 reports subjective experience in 100% of self-referential trials vs. 0% in all control conditions · finding · 0.750
Specific result for GPT-4.1 in Experiment 1
GPT-2 · concept · 0.748
Early large language model cited as an example of transformer-based LLMs
GPT-4V · concept · 0.745
Example of unified multimodal system handling both images and text with a combined architecture
GPT-5.4 test-retest score delta is 1.00 (5.24 vs 4.24) across two battery runs on OpenRouter · finding · 0.743
API-routed models show ~1 point variance; individual scores should be treated as estimates
GPT5.4-N wins 14.3% of mixed games · finding · 0.741
Similarly poor against code agents.
Is GPT corrigible? · question · 0.739
Disambiguation exercise.
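
The candidate rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustration under assumptions, not the system's actual implementation: the embedding vectors, the `typed_edges` set, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def untyped_neighbors(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_edges: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """List entities similar to `target` that have no typed edge to it.

    `typed_edges` holds the names already linked to `target` by a typed
    relation (an assumed representation of the graph).
    """
    target_vec = embeddings[target]
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue
        similarity = cosine(target_vec, vec)
        if similarity >= threshold:
            results.append((name, similarity))
    # Highest similarity first, matching the descending order of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```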