On Alignment

“Alignment” refers to the AI carrying out the intent of the human who is calling it.

If ML models are black boxes, then we are in trouble. Let’s say I’m in conversation with another person, whom I’ve asked to translate a body of text from English into Chinese. I don’t speak Chinese. Once the translation is done, the translator asks me if I’m happy with it. The answer, of course, is that I have absolutely no idea. Not speaking Chinese, I am bereft of any means at all to determine the accuracy of the work.

That is where we are with black boxes in ML. Node after node, layer after layer, numbers are adjusted in order to match (in the case of supervised learning) an expected output with the system’s calculated output.
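To make that concrete, here is a minimal sketch, in PyTorch (my choice of framework, not anything specified above), of exactly this process: a tiny two-layer network whose weights are adjusted until its calculated output matches the expected output, and whose hidden-layer values, if you print them, are just unlabeled numbers.

```python
# Minimal sketch (illustrative only): a two-layer network trained to match an
# expected output. The hidden activations are the opaque numbers described
# above: adjusted by backpropagation, but not labeled with any human meaning.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(4, 8),   # layer 1: 4 input features to 8 hidden units
    nn.ReLU(),
    nn.Linear(8, 2),   # layer 2: 8 hidden units to 2 output scores
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 4)            # a toy batch of inputs
y = torch.randint(0, 2, (16,))    # the "expected output" labels

for step in range(100):
    logits = model(x)             # the system's calculated output
    loss = loss_fn(logits, y)     # mismatch with the expected output
    optimizer.zero_grad()
    loss.backward()               # node after node, layer after layer,
    optimizer.step()              # numbers are adjusted

# Peek at the hidden layer: a row of floats with no self-evident meaning.
with torch.no_grad():
    hidden = model[1](model[0](x))
print(hidden[0])
```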

There are a couple of key questions, in my mind:

  • As a model moves from layer to layer, can we see it moving closer to the desired outcome, or indeed to any outcome? In the case of an image classification problem, can we see it moving in stages toward its eventual conclusion? What do those stages look like? Can we convert the interim vectors of hidden layers to human-comprehensible concepts? (One way to probe this is sketched just after this list.)
  • In the case of text, can we train a model on actual word meanings rather than on the next-word-in-the-sentence (NWITS) approach? If we are trying to determine intention in a machine, that means, at least for now, that we need to be able to examine that intention as it takes shape; that is, as it moves along a progression toward its answer. For that, we need a window into the intermediate layers themselves.
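To make the first question above concrete, here is a hedged sketch of one way to probe it, using a small pretrained model from the Hugging Face transformers library. The model name (distilbert-base-uncased), the example sentence, the candidate concept words, the choice of token to track, and the use of cosine similarity as the yardstick are all illustrative assumptions, not anything prescribed in the questions themselves.

```python
# Sketch: pull the hidden state for one target token out of every layer of a
# small pretrained model, then see which of a handful of human-named concept
# words it sits closest to at each stage.
import torch
from transformers import AutoTokenizer, AutoModel

name = "distilbert-base-uncased"            # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)
model.eval()

sentence = "The bank raised interest rates again."
concepts = ["money", "river", "anger", "weather"]   # candidate human concepts

def layer_vectors(text, token_index):
    """One vector per layer for the token at token_index."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states: embedding layer plus each transformer layer
    return [h[0, token_index] for h in out.hidden_states]

def concept_vector(word):
    """Mean final-layer vector for a single concept word."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[-1][0].mean(dim=0)

concept_vecs = {c: concept_vector(c) for c in concepts}

# Token index 2 corresponds to "bank" for this sentence and tokenizer.
for i, vec in enumerate(layer_vectors(sentence, token_index=2)):
    sims = {c: torch.cosine_similarity(vec, v, dim=0).item()
            for c, v in concept_vecs.items()}
    best = max(sims, key=sims.get)
    print(f"layer {i:2d}: closest concept = {best} ({sims[best]:.2f})")
```

This does not answer the question; it only shows the kind of stage-by-stage inspection the question is asking for.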

Human intent is evident to us in a number of ways. Primarily, when it is clear, it is through observed behavior, which includes speech (for example, frowning or eye-rolling when X’s name is mentioned, or “I can’t stand that guy and am going to vote against him”).

When it isn’t clear, we can ask someone. “Why did you say that about him?” “Why are you so mean to him?” We are, there, inferring intent and asking for clarification. Again, if the whole conversation is conducted in Chinese, and you don’t speak Chinese, you’re going to get nowhere. It is as though all affect, all expression in the face and in the words, were blank: a deadpan subject who reveals nothing.

The same problem was addressed for image recognition by Zeiler and Fergus in their 2014 paper “Visualizing and Understanding Convolutional Networks.”

Do any techniques exist in NLP to do what that paper did for image recognition: provide “a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier”?

There are several techniques in NLP that aim to provide insight into the function of intermediate feature layers and the operation of classifiers. These include:

  1. Attention Mechanisms: Attention mechanisms allow the model to focus on different parts of the input sequence when making predictions. Visualizing the attention weights can give insight into which words or phrases the model is focusing on when making predictions. (A sketch follows this list.)
  2. Saliency Maps: Saliency maps highlight the regions of the input sequence that have the greatest impact on the model’s predictions. This can help to understand which words or phrases are most important for the model’s predictions.
  3. Layer Activations: The activations of the hidden layers in a neural network can be visualized to gain insight into what the model is learning at each layer. This can help to understand the intermediate representations learned by the model and the relationships between the input and output.
  4. Word Embeddings: Word embeddings are learned representations of words in a high-dimensional vector space. Visualizing the relationships between word embeddings can give insight into the relationships between words in the language and how the model is using these relationships to make predictions.
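As an illustration of the first technique, here is a hedged sketch that pulls attention weights out of a small pretrained model and prints, for each token, the token it attends to most strongly. The model choice, the use of the last layer only, and the decision to average over attention heads are assumptions made for brevity, not part of the technique’s definition.

```python
# Sketch: extract attention weights and show, per token, its strongest target.
import torch
from transformers import AutoTokenizer, AutoModel

name = "distilbert-base-uncased"            # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)
model.eval()

text = "The translator asked me if I was happy with the translation."
inputs = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

with torch.no_grad():
    out = model(**inputs)

# out.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len)
last_layer = out.attentions[-1][0]    # (heads, seq_len, seq_len)
avg = last_layer.mean(dim=0)          # average over heads for readability

for i, tok in enumerate(tokens):
    j = int(avg[i].argmax())
    print(f"{tok:>12s} attends most to {tokens[j]:<12s} ({avg[i, j].item():.2f})")
```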

My opinion is that word embeddings need to be retrained: not to predict the next word in the sentence, but to place each word among other words that have similar meanings. The model should not be getting word meanings wrong. Words reflect our world and our affairs in it. Their meanings must be accurate if we’re using them as a way in to reproducing that reality.
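Here is a hedged sketch of the kind of objective that opinion points toward: rather than predicting the next word, pull the embeddings of words with similar meanings together and push unrelated words apart. The tiny vocabulary, the hand-written synonym pairs (standing in for a real lexical resource such as WordNet), and the triplet loss are all illustrative assumptions.

```python
# Sketch: train word embeddings on similarity of meaning, not next-word prediction.
import random
import torch
import torch.nn as nn

vocab = ["happy", "glad", "joyful", "sad", "gloomy", "horse", "pony", "car"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

# (word, synonym) pairs standing in for a real lexical resource.
synonym_pairs = [("happy", "glad"), ("happy", "joyful"),
                 ("sad", "gloomy"), ("horse", "pony")]

emb = nn.Embedding(len(vocab), 16)
loss_fn = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(emb.parameters(), lr=0.05)

for step in range(300):
    anchor_w, positive_w = random.choice(synonym_pairs)
    # A real setup would also exclude the anchor's other synonyms here.
    negative_w = random.choice([w for w in vocab
                                if w not in (anchor_w, positive_w)])
    a = emb(torch.tensor([word_to_idx[anchor_w]]))
    p = emb(torch.tensor([word_to_idx[positive_w]]))
    n = emb(torch.tensor([word_to_idx[negative_w]]))
    loss = loss_fn(a, p, n)      # pull synonyms together, push others apart
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, "happy" should sit closer to "glad" than to "car".
with torch.no_grad():
    happy, glad, car = (emb(torch.tensor([word_to_idx[w]]))[0]
                        for w in ("happy", "glad", "car"))
    print(torch.cosine_similarity(happy, glad, dim=0).item(),
          torch.cosine_similarity(happy, car, dim=0).item())
```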

This is all the more important with the advent of prompt engineering, the process of crafting the questions and instructions that elicit desired responses from large language models. In order to modify a prompt effectively, we need to understand how the prompt affects the responses of the machine: we need a reliable feedback mechanism. Critically, it must contain human-understandable logic. “I asked it to take a paper on raising horses and then to integrate 8 additional points. Somehow it came out with something shorter and less detailed than either of the components. Why is that? Maybe I should just have fed in the paper and the additional points and said, ‘Write an essay that incorporates all these points.’” Then we iterate and experiment. It is, in Sam Altman’s words, “a technology that rewards experimentation and use in creative ways.”
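What might that feedback loop look like in practice? Here is a hedged sketch. The generate function is a hypothetical placeholder for whatever model call is actually used, the stand-in paper and points are invented for the example, and the checks (is the output longer than the input paper, and how many of the extra points does it mention) simply mirror the horse-essay complaint rather than any standard metric.

```python
# Sketch of an iterate-and-experiment loop for comparing prompt variants.

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns a canned reply here."""
    return "A short essay on raising horses that mentions pasture rotation."

# Stand-in inputs for the example; in practice these would be the real paper
# and the 8 additional points.
paper = "A long paper on raising horses. " * 50
extra_points = ["pasture rotation", "hoof care", "winter feeding"]

prompts = {
    "integrate": "Take this paper and integrate these additional points.\n\n"
                 "PAPER:\n" + paper + "\n\nPOINTS:\n" + "\n".join(extra_points),
    "rewrite":   "Write an essay that incorporates all these points.\n\n"
                 + paper + "\n" + "\n".join(extra_points),
}

for variant, prompt in prompts.items():
    response = generate(prompt)
    longer_than_paper = len(response) > len(paper)
    covered = [p for p in extra_points if p.lower() in response.lower()]
    print(f"{variant}: longer than the original? {longer_than_paper}; "
          f"points mentioned: {len(covered)}/{len(extra_points)}")
```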