In March 2020, I began working as an annotator on a project studying narrativity in online discourse. The goal of the project was to develop a corpus of news articles and Reddit comments about marijuana legalization, annotated for various linguistic features, such as cause-effect relationships, generic versus specific statements (e.g. “students like free food” versus “I like free food”), and whether or not the document contains a narrative. However, this was not a typical corpus: not all of the documents being annotated were written by humans. Half of them were actual, randomly sampled news articles and comments, but the other half were generated by GROVER, a text-generation model built on the GPT-2 architecture and trained to generate and detect fake news. Transformer-based language models like GROVER work by training a model (in this case, one with over a billion parameters) on hundreds of gigabytes of text data, so that it learns general patterns in the training data that allow it to generate new text conditioned on any input sequence (a brief code sketch of this appears further below). As annotators, we were not told which documents were written by humans and which were generated by algorithms, which led to some sticky situations. Many sentences had unclear wording, like the following example, taken from the beginning of one document:

If weed’s not really a public health issue and you're really happy about it, get an understanding about the ways in which it will be able to influence your behaviour.

      Trying to annotate sentences like this was difficult, because I was never sure whether I was reading a wordy comment made by a real human trying to communicate a certain message, or whether I was just reading an algorithm’s attempt to imitate the surface-level features of the language that people use when discussing marijuana legalization online. This particular sentence, for example, takes some time for me to parse, but after some thought it seems fairly coherent. But later, the same document reads:

BART police already have a "marijuana alley" where potential customers could find sprayers and pagers ready to use and find it where they're supposed to.

giving away the fact that the document was likely written by an algorithm. These giveaway sentences were usually good news when I came across them in my annotations: once I could tell that a given document was not written by a human, the more ambiguous sentences became much less confusing, because I knew there was no intended message for me to unearth; I was reading something that had been “written” by nobody. Without such dead giveaways, though, annotation was an arduous and painful process. I often felt as if I were being gaslit by the text: do I just not understand what the author is trying to communicate, or is this paragraph even saying anything at all?
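
      For readers curious what this kind of authorless generation looks like in practice, the following is a minimal sketch of conditional text generation using the Hugging Face transformers library. The publicly released GPT-2 stands in here for GROVER, and the prompt, model name, and sampling settings are illustrative assumptions rather than details of the actual project.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load a publicly available GPT-2 checkpoint (standing in for GROVER,
# which uses the same basic transformer architecture).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Any sequence of text can serve as the conditioning input.
prompt = "Marijuana legalization is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# The model extends the prompt one token at a time, sampling each next token
# from the distribution it learned over its training data.
output_ids = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,
    top_p=0.95,  # nucleus sampling; the cutoff here is an arbitrary choice
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

The output usually reads as fluent English precisely because the model has absorbed the surface statistics of its training data, yet there is no message behind it, which is exactly what made annotating these documents so disorienting.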

      These incoherent GROVER-generated documents reveal how we as readers can and do ascribe meaning to sequences of text, even when there is no intended message or motivation behind that text, and even when we would otherwise have disregarded that “meaning” had we known the full context of the document’s creation. As examples of voiceless writing that we nonetheless assign meaning to, they pose interesting questions about what language is and how it relates to writing.