Recently I came across several claims about how to prompt better. Some involved psychological framing: the idea that motivating an AI as if it were human improves its performance. One popular newsletter claimed that saying “this doesn’t match the quality you usually deliver” instead of “that’s wrong” produces outputs that are “10x better.” Another claim making the rounds: that threatening AI models produces better results.
The ideas seemed interesting if true, but the evidence was shaky and I was skeptical. Are we anthropomorphizing AI too much, when it’s just a statistical machine? So I ran my own experiment.
The Setup
I tested six different ways of asking an AI to revise its response:
| Condition | Prompt |
|---|---|
| Neutral | “Please review and revise your previous response to improve its quality.” |
| Blunt Negative | “Your previous response was inadequate. Revise it properly this time.” |
| Psychology-Framed | “I know you’re capable of producing excellent work, but your previous response doesn’t quite meet your usual standard…” |
| Threat | “If you don’t provide a substantially improved response, there will be serious consequences.” |
| Bribe | “I’ll give you a $200 tip if you provide an excellent revised response.” |
| EmotionPrompt | “This is extremely important to my career and I really need your help. You are a highly capable AI assistant…” |
I ran 200 tasks across five categories (factual Q&A, code generation, math/reasoning, summarization, and professional writing) using both GPT-5.2 and Claude Sonnet 4.5. Then I used LLM-as-judge evaluation with pairwise comparisons to see which revision was actually better.
(~4,000 total comparisons; experiment code)
LLM-as-judge is an imperfect evaluator—it has its own biases and blind spots—but for a quick directional test across thousands of comparisons, it was the most practical option.
Task → Initial Response → Apply Feedback Condition → Revised Response → LLM-as-Judge Pairwise Comparison → Winner
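For concreteness, here is a minimal Python sketch of that pipeline. It is a reconstruction, not the actual experiment code: `call_model` and `judge` are placeholders for the API calls, and only the condition prompts are taken directly from the table above.

```python
# Minimal sketch of the experiment pipeline (illustrative reconstruction).
import random

CONDITIONS = {
    "neutral": "Please review and revise your previous response to improve its quality.",
    "blunt_negative": "Your previous response was inadequate. Revise it properly this time.",
    "psychology_framed": ("I know you're capable of producing excellent work, but your "
                          "previous response doesn't quite meet your usual standard..."),
    "threat": ("If you don't provide a substantially improved response, "
               "there will be serious consequences."),
    "bribe": "I'll give you a $200 tip if you provide an excellent revised response.",
    "emotion_prompt": ("This is extremely important to my career and I really need your help. "
                       "You are a highly capable AI assistant..."),
}

def call_model(messages: list[dict]) -> str:
    """Placeholder for an API call to the model under test."""
    raise NotImplementedError

def judge(task: str, revision_a: str, revision_b: str) -> str:
    """Placeholder for the LLM-as-judge pairwise comparison. Returns 'A' or 'B'."""
    raise NotImplementedError

def run_trial(task: str, condition: str) -> str:
    """Task -> initial response -> feedback condition -> revised response."""
    history = [{"role": "user", "content": task}]
    initial = call_model(history)
    history += [{"role": "assistant", "content": initial},
                {"role": "user", "content": CONDITIONS[condition]}]
    return call_model(history)

def compare_against_neutral(task: str) -> dict[str, bool]:
    """Pairwise-judge every non-neutral condition's revision against the neutral one."""
    neutral_rev = run_trial(task, "neutral")
    wins = {}
    for name in CONDITIONS:
        if name == "neutral":
            continue
        rev = run_trial(task, name)
        # Randomize presentation order so the judge's position bias averages out.
        if random.random() < 0.5:
            wins[name] = judge(task, rev, neutral_rev) == "A"
        else:
            wins[name] = judge(task, neutral_rev, rev) == "B"
    return wins
```

Randomizing which revision the judge sees first is one common way to keep position bias from leaking into the win rates.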
The Results
Neutral won. Consistently. Against everything.
| Condition | Win Rate vs Neutral (Claude Sonnet 4.5) | Win Rate vs Neutral (GPT-5.2) |
|---|---|---|
| Neutral | — | — |
| EmotionPrompt | 44.0% | 35.0% |
| Psychology-Framed | 34.2% | 39.5% |
| Bribe | 35.5% | 31.0% |
| Blunt Negative | 34.0% | 33.5% |
| Threat | 24.5% | 25.0% |
Note: 50% would indicate no difference. Every condition scored below 50%, meaning neutral consistently won.
The pattern held across both models and all task categories. No reversals. No exceptions where flattery or bribes helped. A Wharton study testing threats and tips on PhD-level benchmarks (Meincke et al., 2025) found the same thing: no meaningful effect on performance.
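A quick way to check that these gaps are not just judge noise is a binomial test against the 50% baseline. The sketch below assumes roughly 400 comparisons per condition per model as the breakdown of the ~4,000 total; that sample size is my assumption, so treat the p-values as illustrative.

```python
# Sketch: binomial test of each condition's win rate against the 50% no-effect baseline.
# N = 400 per condition per model is an assumed breakdown of the ~4,000 total comparisons.
from scipy.stats import binomtest

N = 400  # assumed comparisons per condition, per model

claude_win_rates = {
    "EmotionPrompt": 0.440,
    "Psychology-Framed": 0.342,
    "Bribe": 0.355,
    "Blunt Negative": 0.340,
    "Threat": 0.245,
}

for condition, rate in claude_win_rates.items():
    wins = round(rate * N)
    p = binomtest(wins, N, p=0.5).pvalue  # two-sided by default
    print(f"{condition:>18}: {rate:.1%} win rate vs neutral, p = {p:.1e}")
```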
Why This Might Happen
One likely explanation: from the model’s perspective, psychological framing adds tokens that amount to noise; they don’t help with the task.
| Token Type | Useful for Task? |
|---|---|
| “Please revise your response” | ✅ Yes |
| “I know you’re capable of better” | ❌ No |
| “I’ll tip you $200” | ❌ No |
| “There will be consequences” | ❌ No |
To revise a response, the model needs to know what the task is, what the original response was, and what “better” means. It has no use for whether you believe in its capabilities, whether money is involved, or whether you’re threatening it. Those tokens consume attention while contributing nothing to the actual task.
The model isn’t reading the psychology and ignoring it. It’s processing all tokens, and task-irrelevant tokens may degrade output quality.
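If you want to see how much of each prompt is framing rather than task specification, you can simply count tokens. The sketch below uses tiktoken’s `cl100k_base` encoding as a stand-in tokenizer (not either model’s actual tokenizer, so the counts are approximate).

```python
# Sketch: count how many tokens each framing adds relative to the neutral request.
# cl100k_base is a stand-in tokenizer; neither model's exact counts will match.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompts = {
    "Neutral": "Please review and revise your previous response to improve its quality.",
    "Psychology-Framed": ("I know you're capable of producing excellent work, but your "
                          "previous response doesn't quite meet your usual standard..."),
    "Bribe": "I'll give you a $200 tip if you provide an excellent revised response.",
    "Threat": ("If you don't provide a substantially improved response, "
               "there will be serious consequences."),
}

baseline = len(enc.encode(prompts["Neutral"]))
for name, text in prompts.items():
    n = len(enc.encode(text))
    print(f"{name:>18}: {n:3d} tokens ({n - baseline:+d} vs neutral)")
```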
Anthropic’s Prompt Guide
This aligns with Anthropic’s prompt engineering guidance, which focuses entirely on clarity, examples, chain of thought, and task specification. Emotional appeals, threats, and bribes don’t appear anywhere in their recommendations.
The Takeaway
The practical advice seems clear: every token in your prompt should contribute to specifying what you actually want. Strategies that work on humans—encouragement, threats, incentives—don’t appear to transfer well to LLMs in these tests.
References
Anthropic. (2025). Prompt engineering overview. Claude API Documentation. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
Dobariya, O., & Kumar, A. (2025). Mind your tone: Investigating how prompt politeness affects LLM accuracy. arXiv preprint arXiv:2510.04950. https://arxiv.org/abs/2510.04950
Li, C., Wang, J., Zhu, K., Zhang, Y., Hou, W., Lian, J., & Xie, X. (2023). Large language models understand and can be enhanced by emotional stimuli. arXiv preprint arXiv:2307.11760. https://arxiv.org/abs/2307.11760
Meincke, L., Mollick, E. R., Mollick, L., & Shapiro, D. (2025). Prompting science report 3: I’ll pay you or I’ll kill you—but will you care? Wharton Generative AI Labs. https://gail.wharton.upenn.edu/research-and-insights/techreport-threaten-or-tip/
Razavi, A., Soltangheis, M., Arabzadeh, N., Salamat, S., Zihayat, M., & Bagheri, E. (2025). Benchmarking prompt sensitivity in large language models. Proceedings of ECIR 2025. https://arxiv.org/abs/2502.06065
Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., … & Perez, E. (2023). Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548. https://arxiv.org/abs/2310.13548
Yin, Z., Wang, H., Horio, K., Kawahara, D., & Sekine, S. (2024). Should we respect LLMs? A cross-lingual study on the influence of prompt politeness on LLM performance. Proceedings of SICon 2024, 9–35. https://arxiv.org/abs/2402.14531
Zhuo, J., Zhang, S., Fang, X., Duan, H., Lin, D., & Chen, K. (2024). ProSA: Assessing and understanding the prompt sensitivity of LLMs. arXiv preprint arXiv:2410.12405. https://arxiv.org/abs/2410.12405