Convolving on Language
In image recognition we scan pixels, i.e. data at a low level, so that the computer can classify things that are high-level. This classification, though it works by means that may or may not mirror what we actually do, arrives at a human-recognizable result. And it does so with the aim of hitting a target label, i.e. it’s supervised learning.
If we use this for language, the data-at-a-low-level piece is sometimes words (although some words are less atomic than others). Words, when assembled into phrases and sentences, create something higher-level and human-recognizable: ideas, or concepts, or really ‘the thing that’s being passed between us when we communicate’. In fact, certain individual words (chaos, recess, guitar), word pairs (Gibson guitar, creature comforts, closing time), threes (utterly stupid idea), or fours (one of a kind) express discernible clumps of meaning.
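As a minimal sketch of what those one- to four-word ‘clumps’ look like as raw data, here is a hypothetical n-gram extractor; the function name and the toy sentence are my own illustrative assumptions, not anything from the passage above:

```python
def extract_ngrams(tokens, n):
    """Return every contiguous n-word span from a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "that was one of a kind".split()
for n in (1, 2, 3, 4):
    print(n, extract_ngrams(tokens, n))
# 1 ['that', 'was', 'one', 'of', 'a', 'kind']
# 2 ['that was', 'was one', 'one of', 'of a', 'a kind']
# ...
# 4 ['that was one of', 'was one of a', 'one of a kind']
```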
So one task would be to train the model with a limited number of human-defined labels for whatever is being communicated. I did some of this for a passage from Harry Potter. It isn’t all that difficult. The interesting challenge is that it’s hard to do on just one level, in just one pass through. Take, for instance, ‘He had stayed still a second too long.’ First there’s the idea of the sentence as a whole: something bad had happened because he’d stayed in one place without moving.
Then there’s He (which is pretty atomic, and which is now easily identified as Harry via Transformer models); then ‘had stayed’, which places us back in time before something else that just happened, combined with not physically moving. Then ‘a second too long’, which could also be further broken down on another pass. Yes, you could do ‘a’ and ‘second’ and ‘too’ and ‘long’, but you could also describe it in terms closer to what we’re getting when it’s passed from J.K. Rowling’s brain into ours: something like time, a small unit of time, an allowed length of time, and that allowed length of time being exceeded. Kind of like filling up a pitcher too much, so that it overflows just a bit.
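To make those levels concrete, here is one hypothetical way to write the breakdown down as nested data. Every label name below is invented for illustration, not taken from any actual annotation scheme:

```python
# A multi-level annotation of the sentence. Inner "parts" are what a
# later, finer-grained pass would produce. All label names are invented.
annotation = {
    "text": "He had stayed still a second too long.",
    "label": "bad_outcome_from_staying_put",          # whole-sentence idea
    "parts": [
        {"text": "He", "label": "referent:harry"},    # resolved referent
        {"text": "had stayed still",
         "label": "past_before_event + not_moving"},
        {"text": "a second too long",
         "label": "allowed_duration_exceeded",
         "parts": [                                   # a further pass down
             {"text": "a second", "label": "small_unit_of_time"},
             {"text": "too long", "label": "allowed_length_exceeded"},
         ]},
    ],
}
```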
I think this is how we understand what she’s written, and how we generalize across time and space.
The ‘label’ might actually be a number, or a series of numbers, derived from the above breakdown.
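One hedged way to read that: give each discrete label an index in a small vocabulary, and encode a span’s meaning as a multi-hot vector of 0s and 1s. The vocabulary below reuses the invented names from the sketch above and is itself an assumption:

```python
# Hypothetical label vocabulary, reusing the invented names above.
label_vocab = ["referent:harry", "past_before_event", "not_moving",
               "small_unit_of_time", "allowed_length_exceeded"]

def encode(labels):
    """Multi-hot vector: 1.0 where a label applies to the span, else 0.0."""
    return [1.0 if name in labels else 0.0 for name in label_vocab]

print(encode({"small_unit_of_time", "allowed_length_exceeded"}))
# [0.0, 0.0, 0.0, 1.0, 1.0]
```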
The way we understand ideas, at multiple levels and in multiple contexts, does not seem all that dissimilar from convolutional scanning. When we read, there is a sequentially compositing scan, or composing scan. This is different from a standard recurrent neural network and is more accurately called a convolutional recurrent neural network. It scans a sentence, then scans it again and again, deriving information from it each time. Each time through it may get a different aspect of context, which can be defined via a label. It may, on its first pass, take the first three words of sentence 1, then the next five. The second time through it may take the first two words, then the next two, then the last four. And there may be times in the course of scanning the paragraph (the higher-order object) when it does not take any words from that sentence at all, since it is irrelevant to the dimension being labeled.
In other words, it’s a variable-size window that moves over a body of text. And it must move over that text more than once in order to give more information to the network it is training.
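As a concrete sketch of that variable-size window, here is the standard multi-width text-convolution setup, a common instantiation of this idea. The class name, dimensions, widths, and max-pooling are my assumptions, not a reconstruction of the exact network imagined above:

```python
import torch
import torch.nn as nn

class MultiWidthScanner(nn.Module):
    """Sketch: convolutional 'windows' of several widths slide over the
    same word sequence; each width acts like one pass with a different
    window size, extracting a different granularity of n-gram feature."""

    def __init__(self, vocab_size, embed_dim=32, widths=(2, 3, 4, 5), channels=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, channels, kernel_size=w) for w in widths
        )

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        # One "scan" per window width; max-pooling keeps the strongest clump
        # each filter found anywhere in the sentence.
        feats = [conv(x).relu().amax(dim=2) for conv in self.convs]
        return torch.cat(feats, dim=1)               # (batch, channels * len(widths))

ids = torch.randint(0, 100, (1, 9))                  # a toy 9-token "sentence"
print(MultiWidthScanner(vocab_size=100)(ids).shape)  # torch.Size([1, 64])
```

Each kernel width plays the role of one pass with a different window size; repeating such layers, or feeding their features into a recurrent unit, would get closer to the convolutional recurrent idea described above.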