Chapter 15: Data Augmentation for Text

How is data augmentation useful, and what are the most common augmentation techniques for text data?

Data augmentation is useful for artificially increasing dataset sizes to improve model performance, such as by reducing the degree of overfitting, as discussed in Chapter [ch05]. This includes techniques often used in computer vision models, like rotation, scaling, and flipping.

Similarly, there are several techniques for augmenting text data. The most common include synonym replacement, word deletion, word position swapping, sentence shuffling, noise injection, back translation, and text generated by LLMs. This chapter discusses each of these, with optional code examples in the supplementary/q15-text-augment subfolder at https://github.com/rasbt/MachineLearning-QandAI-book.

Synonym Replacement

In synonym replacement, we randomly choose words in a sentence—often nouns, verbs, adjectives, and adverbs—and replace them with synonyms. For example, we might begin with the sentence “The cat quickly jumped over the lazy dog,” and then augment the sentence as follows: “The cat rapidly jumped over the idle dog.”

Synonym replacement can help the model learn that different words can have similar meanings, thereby improving its ability to understand and generate text. In practice, synonym replacement often relies on a thesaurus such as WordNet. However, using this technique requires care, as not all synonyms are interchangeable in all contexts. Most automatic text replacement tools have settings for adjusting replacement frequency and similarity thresholds. However, automatic synonym replacement is not perfect, and you might want to apply post-processing checks to filter out replacements that might not make sense.
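A minimal sketch of synonym replacement is shown below. To keep it self-contained, it uses a small hand-made synonym table; in practice, the table would be backed by a thesaurus such as WordNet (for example, via `nltk.corpus.wordnet`).

```python
import random

# Toy synonym table; in practice this would come from a thesaurus
# such as WordNet (e.g., via nltk.corpus.wordnet).
SYNONYMS = {
    "quickly": ["rapidly", "swiftly"],
    "lazy": ["idle", "sluggish"],
}

def synonym_replace(sentence, p=0.5, rng=random):
    """Replace each word that has known synonyms with probability p."""
    out = []
    for word in sentence.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)

rng = random.Random(0)
print(synonym_replace("The cat quickly jumped over the lazy dog",
                      p=1.0, rng=rng))
```

The replacement probability `p` corresponds to the replacement-frequency setting mentioned above; a post-processing check (for example, filtering candidates by part of speech) would go after `rng.choice`.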

Word Deletion

Word deletion is another data augmentation technique to help models learn. Unlike synonym replacement, which alters the text by substituting words with their synonyms, word deletion involves removing certain words from the text to create new variants while trying to maintain the overall meaning of the sentence. For example, we might begin with the sentence “The cat quickly jumped over the lazy dog” and then remove the word quickly: “The cat jumped over the lazy dog.”

By randomly deleting words in the training data, we teach the model to make accurate predictions even when some information is missing. This can make the model more robust when encountering incomplete or noisy data in real-world scenarios. Also, by deleting nonessential words, we may teach the model to focus on key aspects of the text that are most relevant to the task at hand.

However, we must be careful not to remove critical words that may significantly alter a sentence’s meaning. For example, it would be suboptimal to remove the word cat in the previous sentence: “The quickly jumped over the lazy dog.” We must also choose the deletion rate carefully to ensure that the text still makes sense after words have been removed. Typical deletion rates might range from 10 percent to 20 percent, but this is a general guideline and could vary significantly based on the specific use case.
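The following sketch implements random word deletion with a configurable deletion rate. The `protected` set is an assumption added here to illustrate how critical words (like *cat* above) can be shielded from deletion:

```python
import random

def delete_words(sentence, rate=0.15, protected=(), rng=random):
    """Drop each word with probability `rate`, keeping protected words
    and guaranteeing at least one word survives."""
    words = sentence.split()
    kept = [
        w for w in words
        if w.lower() in protected or rng.random() >= rate
    ]
    return " ".join(kept) if kept else sentence

rng = random.Random(42)
print(delete_words("The cat quickly jumped over the lazy dog",
                   rate=0.2, protected={"cat", "dog"}, rng=rng))
```

A `rate` between 0.1 and 0.2 matches the typical deletion rates mentioned above.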

Word Position Swapping

In word position swapping, also known as word shuffling or permutation, the positions of words in a sentence are swapped or rearranged to create new versions of the sentence. If we begin with “The cat quickly jumped over the lazy dog,” we might swap the positions of some words to get the following: “Quickly the cat jumped the over lazy dog.”

While these sentences may sound grammatically incorrect or strange in English, they provide valuable training information for data augmentation because the model can still recognize the important words and their associations with each other. However, this method has its limitations. For example, shuffling words too much or in certain ways can drastically change the meaning of a sentence or make it completely nonsensical. Moreover, word shuffling may interfere with the model's learning process in tasks where the positional relationships between words are vital to a sentence's meaning.
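Word position swapping can be sketched as randomly choosing pairs of positions and exchanging them; the number of swaps controls how far the augmented sentence drifts from the original:

```python
import random

def swap_words(sentence, n_swaps=2, rng=random):
    """Swap n_swaps randomly chosen pairs of word positions."""
    words = sentence.split()
    for _ in range(n_swaps):
        # Pick two distinct positions and exchange them.
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

rng = random.Random(1)
print(swap_words("The cat quickly jumped over the lazy dog", rng=rng))
```

Keeping `n_swaps` small (one or two) limits the risk of producing a completely nonsensical sentence.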

Sentence Shuffling

In sentence shuffling, entire sentences within a paragraph or a document are rearranged to create new versions of the input text. By shuffling sentences within a document, we expose the model to different arrangements of the same content, helping it learn to recognize thematic elements and key concepts rather than relying on specific sentence order. This promotes a more robust understanding of the document’s overall topic or category. Consequently, this technique is particularly useful for tasks that deal with document-level analysis or paragraph-level understanding, such as document classification, topic modeling, or text summarization.

In contrast to the aforementioned word-based methods (word position swapping, word deletion, and synonym replacement), sentence shuffling maintains the internal structure of individual sentences. This avoids the problem of altering word choice or order such that sentences become grammatically incorrect or change meaning entirely.

Sentence shuffling is useful when the order of sentences is not crucial to the overall meaning of the text. Still, it may not work well if the sentences are logically or chronologically connected. For example, consider the following paragraph: “I went to the supermarket. Then I bought ingredients to make pizza. Afterward, I made some delicious pizza.” Reshuffling these sentences as follows disrupts the logical and temporal progression of the narrative: “Afterward, I made some delicious pizza. Then I bought ingredients to make pizza. I went to the supermarket.”
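A minimal sentence-shuffling sketch follows. It splits on periods for simplicity; real code would typically use a proper sentence tokenizer (such as `nltk.sent_tokenize`) to handle abbreviations and other punctuation:

```python
import random

def shuffle_sentences(text, rng=random):
    """Split text into sentences, shuffle their order, and rejoin."""
    # Naive splitting on "."; a real pipeline would use a sentence
    # tokenizer such as nltk.sent_tokenize.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

text = ("I went to the supermarket. Then I bought ingredients "
        "to make pizza. Afterward, I made some delicious pizza.")
print(shuffle_sentences(text, rng=random.Random(0)))
```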

Noise Injection

Noise injection is an umbrella term for techniques that alter text in various ways to create variation. It may refer either to the methods described in the previous sections or to character-level techniques such as inserting random letters, characters, or typos, as shown in the following examples:

Random character insertion “The cat qzuickly jumped over the lazy dog.” (Inserted a z in the word quickly.)

Random character deletion “The cat quickl jumped over the lazy dog.” (Deleted y from the word quickly.)

Typo introduction “The cat qickuly jumped over the lazy dog.” (Introduced a typo in quickly, changing it to qickuly.)

These modifications are beneficial for tasks that involve spell-checking and text correction, but they can also help make the model more robust to imperfect inputs.
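The three character-level perturbations above can be sketched as a single function that applies one randomly chosen operation per call:

```python
import random
import string

def inject_noise(sentence, rng=random):
    """Apply one random character-level perturbation:
    insertion, deletion, or adjacent-character swap (a simple typo)."""
    chars = list(sentence)
    op = rng.choice(["insert", "delete", "swap"])
    i = rng.randrange(len(chars))
    if op == "insert":
        chars.insert(i, rng.choice(string.ascii_lowercase))
    elif op == "delete":
        del chars[i]
    else:
        # Swap a character with its right neighbor, a common typo model.
        j = min(i + 1, len(chars) - 1)
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

rng = random.Random(3)
print(inject_noise("The cat quickly jumped over the lazy dog", rng=rng))
```

Applying the function repeatedly with a small per-sentence budget keeps the text readable while still exposing the model to imperfect inputs.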

Back Translation

Back translation is one of the most widely used techniques to create variation in texts. Here, a sentence is first translated from the original language into one or more different languages, and then it is translated back into the original language. Translating back and forth often results in sentences that are semantically similar to the original sentence but have slight variations in structure, vocabulary, or grammar. This generates additional, diverse examples for training without altering the overall meaning.

For example, say we translate “The cat quickly jumped over the lazy dog” into German. We might get “Die Katze sprang schnell über den faulen Hund.” We could then translate this German sentence back into English to get “The cat jumped quickly over the lazy dog.”

The degree to which a sentence changes through back translation depends on the languages used and the specifics of the machine translation model. In this example, the sentence remains very similar. However, in other cases or with other languages, you might see more significant changes in wording or sentence structure while maintaining the same overall meaning.

This method requires access to reliable machine translation models or services, and care must be taken to ensure that the back-translated sentences retain the essential meaning of the original sentences.
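Because back translation depends on an external machine translation model or service, a sketch can only show the wiring. Below, the two translation functions are passed in by the caller; the `toy_en_to_de` and `toy_de_to_en` stand-ins simply return the fixed example strings from above and are purely illustrative (a real setup would wrap an MT API or a pretrained translation model):

```python
def back_translate(sentence, to_pivot, from_pivot):
    """Translate into a pivot language and back. `to_pivot` and
    `from_pivot` are caller-supplied translation functions
    (e.g., wrappers around a machine translation service)."""
    return from_pivot(to_pivot(sentence))

# Toy stand-ins returning the example translations, for illustration only.
def toy_en_to_de(sentence):
    return "Die Katze sprang schnell über den faulen Hund."

def toy_de_to_en(sentence):
    return "The cat jumped quickly over the lazy dog."

print(back_translate("The cat quickly jumped over the lazy dog",
                     toy_en_to_de, toy_de_to_en))
# -> The cat jumped quickly over the lazy dog.
```

Passing the translators as arguments makes it easy to swap pivot languages or chain several pivots for more aggressive paraphrasing.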

Synthetic Data

Synthetic data generation is an umbrella term that describes methods and techniques used to create artificial data that mimics or replicates the structure of real-world data. All methods discussed in this chapter can be considered synthetic data generation techniques since they generate new data by making small changes to existing data, thus maintaining the overall meaning while creating something new.

Modern techniques to generate synthetic data now also include using decoder-style LLMs such as GPT (decoder-style LLMs are discussed in more detail in Chapter [ch17]). We can use these models to generate new data from scratch by using “complete the sentence” or “generate example sentences” prompts, among others. We can also use LLMs as alternatives to back translation, prompting them to rewrite sentences as shown in Figure 1.1.

Figure 1.1: Using an LLM to rewrite a sentence

Note that an LLM, as shown in Figure 1.1, runs in a nondeterministic mode by default, which means we can prompt it multiple times to obtain a variety of rewritten sentences.

Recommendations

The data augmentation techniques discussed in this chapter are commonly used in text classification, sentiment analysis, and other NLP tasks where the amount of available labeled data might be limited.

LLMs are usually pretrained on such a vast and diverse dataset that they may not rely on these augmentation techniques as extensively as models for other, more specific NLP tasks. This is because LLMs aim to capture the statistical properties of the language, and the vast amount of data on which they are trained often provides a sufficient variety of contexts and expressions. However, in the fine-tuning stages of LLMs, where a pretrained model is adapted to a specific task with a smaller, task-specific dataset, data augmentation techniques become relevant again, particularly if the task-specific labeled dataset is small.

Exercises

15-1. Can the use of text data augmentation help with privacy concerns?

15-2. What are some instances where data augmentation may not be beneficial for a specific task?

References