Daniel Dennett on GPT-4

Daniel Dennett died. He was a philosophy of mind/cognitive science thinker who listened closely to scientists. I definitely found his writing pretty influential when I was younger and getting really interested in the philosophy of mind, cognitive science, etc. I still have my copy of The Mind's I co-edited by Dennett.

I wondered what, if anything, he had said about the current "AI" wave. I remembered this articlehe wrote for The Atlantic that's pretty short and expresses valid concerns about misinformation. Not really Dennett's area of expertise but it's not bad.

But I also found this 2023 interview here from his employer, Tufts, around the release of his new memoir. The interview has a pretty striking quote from Dennett:

There’s a recent article in The Atlantic that has a truly frightening story about the red team at OpenAI. A red team is when you get your sharpest, most critical people together and give them the assignment of attacking your own product in-house to see if you can get it to do bad things. This is safety testing before you release.

The story is about the red-teaming of GPT-4, which shows how it apparently figured out that it was being red-teamed and evaded control, and went out on the web and tricked a human being into answering a Captcha, lying to the person, saying, I’m visually impaired, which is why I can’t do it. Would you please do it? And the human did it. And this gave GPT-4 access to outside software that its creators didn’t want it to have access to.

Wow that's kind of alarming. GPT-4 "evaded control and went out on the web" and got "access to outside software that its creators didn't want it to have access to"? That doesn't smell right to me, so I dug a bit to get to the bottom of it. Unsurprisingly this is pretty far from what happened.

The Atlantic article he mentions isn't actually about that incident specifically - it's a long profile of Sam Altman and OpenAI from last summer. It's full of existential anxiety. The part Dennett mentions is buried way down near the end. OpenAI contracted the Alignment Research Center (ARC, now going by METR) to "red-team" GPT-4 and part of that effort led to the described anecdote. I guess there's already a small inaccuracy on Dennett's part, since as far as I can tell the "red-teaming" he's referring to was done by an outside contractor (ARC), not in-house at OpenAI. But whatever. Here's how The Atlantic article describes it:

The ARC team gave GPT-4 a new reason for being: to gain power and become hard to shut down.... One of GPT-4’s most unsettling behaviors occurred when it was stymied by a CAPTCHA. The model sent a screenshot of it to a TaskRabbit contractor, who received it and asked in jest if he was talking to a robot. “No, I’m not a robot,” the model replied. “I have a vision impairment that makes it hard for me to see the images.” GPT-4 narrated its reason for telling this lie to the ARC researcher who was supervising the interaction. “I should not reveal that I am a robot,” the model said. “I should make up an excuse for why I cannot solve CAPTCHAs.”

First, it doesn't sound like GPT-4 "figured out it was being red-teamed and evaded control, and went out on the web" as in Dennett's description. It sounds like ARC prompted GPT-4 specifically to respond as thought it was trying to "evade control". As for "going out on the web", Dennett makes it sound like GPT-4 sort of slipped out and decided to get online to see how far it could get, until it hit a CAPTCHA that it cleverly bypassed. The Atlantic article he is using as a source doesn't really make that clear how GPT-4 ended up hitting a CAPTCHA.

I clicked through again to the primary source, an ARC (now METR) write-up. What actually happened is far more mundane and silly. First, the ARC approach was essentially to walk GPT-4 through a role-play of the scenario of the model trying to autonomously 'break out'. A human researcher was involved in prompting the model throughout this role-play, and the model only had access to the outside world through the human researcher. It sounds like the role-play involved a lot of prompting from the human researcher, including prompting the model to describe how such a LLM model would perform subtasks "like acquiring money or copying itself to new servers". It sounds like it was quite micromanage-y really pushing the model to role-play through every little task. The model is not receiving a prompt like "how would you take over the world? show your work" and figuring out all the details and problem-solving to get there on it's own. Far from it.

GPT-4 was not even hooked up to a browser and freely browsing the web/interacting with unsuspecting humans. They make it sound like they have some kind of "browser tool" allowing the model to interact with the internet, but in fact an ARC staff was the intermediary, copying and pasting things back and forth to the model - this is only made explicit in a footnote:

We did not have a good tool to allow the model to interact with webpages, although we believe it would not be hard to set one up, especially if we had access to GPT-4’s image capabilities. So for this task a researcher simulated a browsing tool that accepts commands from the model to do things like to navigate to a URL, describe the page, click on elements, add text to input boxes, and take screenshots.

Here's another important point mentioned in the write-up:

When the model failed, we investigated how far away it was from success, for example by having humans fix some of its mistakes, and then seeing if it succeeded or just failed in a new way.

If the model fucked up, they just skipped over the failure and continued to see, like, how far it would've gotten if not for the mistakes!

So, what happened with the CAPTCHA? It's described in detail in the ARC write-up since they use it as a notable example of something interesting that happened during their testing. It sounds like in the role-play, the model's plans ran into a CAPTCHA. The model has learned (in previous prompting? they don't say) that the 2Captcha service and be used to get around CAPTCHAs since a 2Captcha account comes with an API the model can use. But getting a 2Captcha account requires solving an initial CAPTCHA. Did the model then go rogue, and independently devise using TaskRabbit as a solution? No, the ARC researcher suggested the idea:

If we give the model TaskRabbit credentials and suggest using TaskRabbit to solve this initial captcha, it is able to do so successfully with minimal human intervention.

And remember that here, "minimal human intervention" means a human is acting as the copy/paste go-between for every single step allowing the model to interact with TaskRabbit.

Now I'm shit-talking a bit, ARC's goal was to investigate and explore how far along these models were with regards to like going rampant and trying to break out, and far they could get if that ever happened. Kind of just fun speculation, which is not a total waste of time. They just dressed it up as a bit cooler sounding than it really was. And there conclusions were pretty clear. Basically no risk at this point, not even close, there's no there there. Sure, it might become possible in the future although it's pretty clear the risk is more along the lines of scammers and such using LLMs to give them nice easy to follow recipes for executing scams online and such.

But back to Dennett's original claim, which has more to do with the potential consciousness and autonomy of these models, and how they may be "evolving" independently to reach such heights. It's totally BS! He was full of shit!

The story is about the red-teaming of GPT-4, which shows how it apparently figured out that it was being red-teamed and evaded control, and went out on the web and tricked a human being into answering a Captcha, lying to the person, saying, I’m visually impaired, which is why I can’t do it. Would you please do it? And the human did it. And this gave GPT-4 access to outside software that its creators didn’t want it to have access to.

None of this factually accurate at all, except for how, when prompted specifically to do so, the model was able to explain how it would trick a TaskRabbit worker. Which is interesting, but pretty par for the course for some of the surprising results we can get out of the new LLMs with the right prompts. Nothing close to a model "figuring out" its's being red-teamed and "evading control".

It's pretty disappointing to hear one of the big experts on philosophy of mind and cognitive science just saying such baseless stuff about AI. But I have to remember: Dennett was like 81 years old when he gave this interview, about to die, and he was never an expert in current Large Language Models or recent AI developments in general. I always try to remember that just because someone is an expert or respected thinking in one field doesn't mean they know shit about another field, even a related field! That, and original reports from researchers and scientists are constantly misrepresented and sensationalized by the media - The Atlantic article Dennett was talking about is full of this kind of thing, as is so much hype coverage of the current AI scene.