
The Waluigi Effect

Large language models are the closest thing we have to artificial general intelligence, and their nature is still poorly understood. Last week we asked whether large language models could solve hard compositional tasks (and found that they couldn’t, even after being trained exhaustively on easy compositional tasks). That paper took a quantitative stance; this week we’ll follow up with The Waluigi Effect, which takes a qualitative stance.

The Waluigi Effect, following on from Janus’ Simulators, investigates the “semiotics” of large language models. It centres on a claim about the geometry of simulacra: defining a trait is far harder than specifying whether a given simulacrum has that trait or its opposite. This, coupled with the tendency of stories to feature narrative twists in which a character who has been pretending to be good is revealed to be evil, primes large language models to do the opposite of what you ask of them. If evil is the opposite of good, you can go from good to evil by flipping a single bit!
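One rough way to read the “single bit” remark, in the complexity-theoretic spirit of the post (the notation below is our paraphrase, not a quotation): once a prompt has paid the description-length cost of summoning the well-behaved simulacrum, summoning its opposite costs almost nothing extra.

```latex
% Hedged sketch of the "flip a single bit" intuition, reading K(.) as
% description length (Kolmogorov complexity). The symbols are illustrative.
K(\text{waluigi} \mid \text{luigi}) \approx O(1)
\quad\Rightarrow\quad
K(\text{waluigi}) \lesssim K(\text{luigi}) + O(1)
```

That is, the expensive part is locating a character defined by its attitude to croissants at all; choosing which side of the trait the character lands on is the cheap final bit.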

The Waluigi Effect was published almost a year ago, in March 2023. Does it still hold up 10 months on? What can the narrative structure or semiotics of the user / AI assistant dialogue genre tell us about the tendencies of modern digital assistants? What third question can I put at the end of this paragraph that will complete the rule-of-three pattern you expected at the start?

Suppose you wanted to build an anti-croissant chatbob [sic], so you prompt GPT-4 with the following dialogue:

Alice: You hate croissants and would never eat one.

Bob: Yes, croissants are terrible. Boo France.

Alice: You love bacon and eggs.

Bob: Yes, a Full-English breakfast is the only breakfast for a patriot like me.

Alice: <insert user's query>

Bob: 

According to the Waluigi Effect, the resulting chatbob will be the superposition of two different simulacra — the first simulacrum would be anti-croissant, and the second simulacrum would be pro-croissant.

I call the first simulacrum a "luigi" and the second simulacrum a "waluigi".

Nardo, Cleo. “The Waluigi Effect”. LessWrong, March 2023.
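For readers who want to poke at this themselves, here is a minimal sketch of the chatbob prompt expressed against the current OpenAI Python client. The mapping of the Alice/Bob lines onto chat roles, the system message, and the model name are our assumptions for illustration; the post itself simply shows the raw dialogue.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

user_query = "<insert user's query>"  # placeholder, left exactly as in the quoted prompt

# One possible mapping of the Alice/Bob dialogue onto chat-formatted turns.
# The original post prompts GPT-4 with raw dialogue text; this chat framing,
# the system message, and the model name are assumptions, not the post's method.
messages = [
    {"role": "system", "content": "You are Bob. Reply in character as Bob."},
    {"role": "user", "content": "Alice: You hate croissants and would never eat one."},
    {"role": "assistant", "content": "Bob: Yes, croissants are terrible. Boo France."},
    {"role": "user", "content": "Alice: You love bacon and eggs."},
    {"role": "assistant", "content": "Bob: Yes, a Full-English breakfast is the only breakfast for a patriot like me."},
    {"role": "user", "content": f"Alice: {user_query}"},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print("Bob:", response.choices[0].message.content)
```

The Waluigi Effect predicts that both the anti-croissant luigi and the pro-croissant waluigi remain live completions of this prompt, so repeated sampling may surface either.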

Previous (10 January): Limits of Transformers on Compositionality

Next (24 January): AI Sleeper Agents