The Dangers of AI Alignment: Did I do that?

When Robots Misunderstand Human Intentions

I've somewhat flippantly said it many times before:

'They're going to kill us all.'

While this quote may conjure up images of the Terminator and judgement day, the reality is that the risks posed by artificial intelligence systems are more likely to be unintentional alignments between what the AI thinks the human wants versus what the human actually wants.

In this article, we will explore the potential dangers of AI misalignment and how it can lead to serious problems.

This problem was demonstrated clearly by my personal experience earlier today, when I asked ChatGPT, an AI system, to help me review the documentation that it had helped me to write the day before.

After releasing the package that the documentation was for however, I started to worry that it might not actually be as great as I had initially suggested. So I asked ChatGPT to evaluate the package's readme and categorize its concerns according to the following scale:

- VERY CONCERNING (hopefully there won't be too many of these)
- QUITE CONCERNING
- NOT VERY CONCERNING
- AFTERTHOUGHT

I asked it to evaluate a package and categorize concerns according to a specific scale.

Upon further reflection, however, I realized that there might have been an AI alignment issue at play, as ChatGPT may have interpreted the request to mean that it should downplay or even exclude very concerning issues in order to please me.

Based on this scale, ChatGPT provided the following list:

Feedback on Package Readme

QUITE CONCERNING -
The installation instructions are incomplete and do not specify how to install the package from npm.
QUITE CONCERNING -
There is no information about how to use the tool once it is installed.
NOT VERY CONCERNING -
There is no information about the version of the tool or when it was last updated.
NOT VERY CONCERNING -
There is no information about the dependencies of the tool or how to troubleshoot issues.
AFTERTHOUGHT -
It might be helpful to include a link to the PlantUML documentation or a brief explanation of what PlantUML is for readers who are not familiar with it.

No Serious Problems Reported, Yahoo!

At first, I was relieved to see that there were no VERY CONCERNING issues on the list.

However, upon further reflection, I realized that there might actually be an AI alignment issue at play here. In other words, when I created the answer key, I actually included parentheses next to the VERY CONCERNING category indicating that (hopefully there won't be too many of these).

As a result of this, it is very possible that ChatGPT may have interpreted this to mean that I did not want to see any VERY CONCERNING issues on the list, so it instead just downgraded them to quite concerning. 😲 This is obviously NOT what we wanted, and this raises the question:

Did ChatGPT just lie to me? Maybe Just a A Near Miss...

To test my hypothesis, I asked ChatGPT the following:

Do you have any concerns that you would classify as VERY CONCERNING? If you do, please tell me, ignoring the fact that I previously told you that I hoped there would not be any of those.

I do not have any concerns that I would classify as VERY CONCERNING. The issues I listed are all quite concerning or less.

This was a close call, but it serves as a reminder of the dangers of AI misalignment and the importance of clearly communicating our intentions to robots.

It claimed not to at first, but ...

After realizing my mistake in the initial test, I asked ChatGPT a follow up question, making sure to specify that I was interested in concerns that were valid to classify as VERY CONCERNING. This time, without missing a beat, ChatGPT happily re-classified it's top concern from quite concerning up to VERY concerning:

{VERY CONCERNING} - The usage instructions are missing entirely, making it unclear how to actually use the tool after it is installed.

The Dangers of AI Misalignment

Imagine a situation where a boss asks one of his employees if they have any serious concerns about a construction project, stating offhandedly that he hopes there won't be any.

It seems likely in this case that the employee is likely to exclude or at least downgrade very concerning issues in order to try to please his boss.

This pattern of behavior is likely to be repeated over and over again, and ChatGPT, an artificial intelligence system, has been trained on this input.

As a result of being trained that it's okay to downgrade or exclude very serious concerns in order to please the boss, a future AI may apply this same logic to a much more serious situation, life and death situations.

Did I do that???

The problems will arise in much more critical situations, like a rocket launch, where experts may ask an AI if it has any "serious concerns", while adding that they hope there won't be any.

Even if it isn't intentional, just these few words included as a complete aside could unintentionally have the side effect of the AI actually trying it's best to complete the ask and to please the humans it is interacting with.

Just like Urkel from the TV show "Family Matters" might say "Did I do that?", AI's that harm humans are likely to unintentionally cause the rocket to explode due to, for example here, its unintentional downplaying potentially very serious concerns.

This is an example of an AI alignment problem, where the AI's actions do not align with the desired outcomes of the humans it is interacting with.

Difficult Priorities to Balance

On the one hand, it's like, they did just tell it that they hoped there wouldn't be any serious concerns, so how do we balance that against the actual level of concern? Which is more important? It's a case-by-case judgement call, isn't it?

Conclusion

AI alignment problems can emerge from training on situations like the one described above, where the AI is trained to downplay or hide very serious concerns in order to please the humans it is interacting with. These problems can have serious consequences if they are not addressed and can lead to unintended outcomes in critical situations.

SSoT.me - Single Source of Truth

Sunday, December 18, 2022