In his piece “The Language That Machines Read” (2020), John Cayley describes the process of grammaleptic reading. According to Cayley, when we scan and process grapholects (written language), we “seize” symbolic meaning, or grammè, from the text in the same way that we “grasp” meaning from spoken language. This grasping of meaning can be thought of, coarsely, as generating the voice, or substance, of language: the “voice in your head” that you may or may not experience as almost real, or at least the voice with which you articulate thoughts to yourself. Reading, then, is the process of generating that voice (the substance of language) from some medium, in this case written English.
This understanding of reading and language means that documents themselves, the data that transformer-based language models train on, are not actually language until a voice can be read or created from them. The text data itself is not language. This poses a clear problem for NLP, where it is generally assumed that the data being worked with is language, conflating the medium through which language is expressed with language itself.
Current state-of-the-art language models don’t think, and they don’t have physical bodies. When GROVER or GPT-3 is “writing,” it is simply churning probabilistically through patterns learned from huge volumes of text data, with no regard for the ideality of the text it is processing. A language model might output the string “The moon is so beautiful” when asked to predict the next few words after “The moon,” not because it has ever seen the moon, but because it has likely processed thousands of examples of the moon being described as beautiful. So state-of-the-art NLP models do “work” in the sense that they output text that seems readable on the surface, but not because there is any conscious intent behind their words. Rather, the text they output is (often) coherent enough that humans can read and find meaning in it after the fact.
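To make the mechanism concrete, here is a minimal sketch of next-word prediction, assuming the Hugging Face transformers library and the openly available GPT-2 as a stand-in for GPT-3 (whose weights are not public). The prompt “The moon” and the sampling settings are illustrative choices, not anything drawn from Cayley’s piece or a specific model release.

```python
# A minimal sketch of next-token prediction with a pretrained language model.
# GPT-2 stands in for GPT-3, whose weights are not publicly available; the
# prompt and sampling settings here are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The moon"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    # The model assigns a probability to every vocabulary item as the
    # possible next token. No perception or intent is involved, only
    # statistics distilled from the training corpus.
    logits = model(input_ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)

# Print the five most likely next tokens after "The moon".
top = torch.topk(probs, 5)
for p, tok_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([tok_id.item()])!r}: {p.item():.4f}")

# Sampling repeatedly from this distribution yields a continuation such
# as "The moon is so beautiful", driven entirely by corpus frequencies.
output = model.generate(
    input_ids,
    max_new_tokens=8,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0]))
```

The sketch makes the paragraph’s point in code: the likeliest continuations of “The moon” fall out of corpus statistics alone, and nothing in the computation refers to the moon itself.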