Processing Data for Large Language Models

This article provides a data processing guide to help practitioners who are developing large language models (LLMs) overcome some of the challenges involved.
Last Updated: Dec 21, 2022
Large language models (LLMs) are massively improving in performance, and in complexity. This complexity brings real challenges, particularly because of the scale at which LLMs are trained.
In this article, we'll look at how to process data for LLM development to meet these challenges. Here's what we'll be covering:

Table of Contents

  • Introduction to LLM Development
  • Preprocessing Datasets
  • Documentation
  • Key Takeaways

Let's get started.

Introduction to LLM Development

LLMs work so effectively, in part, because of their size: they're trained on immense datasets and thus develop a broader understanding of language than smaller models trained on less data.
This data includes a broad range of themes, genres, languages, etc., but the driving concept is "the more data, the better." Recent datasets such as C4, The Pile, the BigScience ROOTS Corpus, and OpenWebText have helped increase the size of training datasets by gathering and cleaning enormous amounts of text from internet crawls for the purpose of pre-training LLMs.
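If you want to get a feel for the scale and composition of these corpora yourself, a minimal sketch using the Hugging Face datasets library is shown below. The allenai/c4 identifier and its text field reflect how the corpus is hosted on the Hub, and streaming avoids downloading the full multi-hundred-gigabyte dataset up front:

from datasets import load_dataset

# Stream C4 (English) instead of downloading the full corpus to disk.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Peek at a few raw documents to get a feel for the data.
for i, example in enumerate(c4):
    print(example["text"][:200])
    if i == 2:
        break
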
But because it is so expensive to perform manual review and curation on massive datasets, many of these datasets have quality issues. This has implications far beyond metrics like perplexity and validation loss, as learned models reflect the biases present in their training data.
Beyond that, understanding these datasets quantitatively and qualitatively is a research challenge in its own right. As data is the fuel driving growth for these LLMs, it is crucial to understand and document the composition of the datasets used to train them. A fundamental open problem is how to quantify the value of data in algorithmic predictions and decisions, and what is and is not appropriate data to train on can vary wildly with the application context. The best approach is therefore to document, rather than simply eliminate, potentially concerning aspects of datasets.
In most machine learning settings, the training data and the test (evaluation) data are similar, or at least of the same type. For large language models, however, the training data is just raw text scraped from the web, which makes it challenging to create training/validation/test splits that do not overlap with benchmarking datasets.
With all that said, let's look at how to handle the massive datasets these models require for training!

Preprocessing Datasets

While the objective for training language models may vary in the context of their downstream applications, there are a few steps and processes practitioners can take to ensure that the data used to train LLMs is clean and robust. These include but are not limited to:
  • Handling junk data
  • De-duplication
  • Decontamination
  • Toxicity and Bias Control
  • Personally Identifiable Information Control
  • Prompt Control

Handling Junk Data

Despite their size, large-scale web datasets still have uneven text quality, with a large amount of gibberish and boilerplate text (think raw HTML markup or Lorem ipsum placeholder text).
Extracting text from websites for language modeling, especially for multilingual corpora, is highly nontrivial. However, it's important to remove such junk from a dataset before using it to train a model that is conditioned to predict the next token given all previous tokens.
Data cleansing tools like justext and trafilatura can be used to strip boilerplate HTML while striking a balance between limiting noise (precision) and keeping all valid content (recall).
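As a rough sketch of what this looks like in practice, the snippet below runs trafilatura on a single page; the URL is a placeholder and the extraction flags are just one reasonable starting point:

import trafilatura

# Placeholder URL; a real pipeline would iterate over an entire web crawl.
url = "https://example.com/some-article"

# Download the raw HTML and extract the main text, dropping boilerplate
# such as navigation menus, comments, and embedded tables.
downloaded = trafilatura.fetch_url(url)
if downloaded is not None:
    text = trafilatura.extract(downloaded, include_comments=False, include_tables=False)
    print(text)
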
Another very useful method for handling junk in web corpora is filtering on the accompanying metadata. For instance, when creating the WebText corpus to train GPT-2, researchers at OpenAI scraped all outbound links on Reddit that received at least 3 karma (upvotes). Such heuristics help reduce the amount of noise in datasets while still ensuring that the data is of high quality.
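Such a metadata filter is usually a one-liner. The sketch below assumes each scraped document carries a hypothetical karma field recording the score of the Reddit submission that linked to it:

# Hypothetical records: each document keeps the karma of the post that linked to it.
documents = [
    {"text": "A well-received blog post ...", "karma": 12},
    {"text": "A low-effort spam page ...", "karma": 1},
]

# Keep only documents whose source link received at least 3 karma,
# mirroring the WebText heuristic.
filtered = [doc for doc in documents if doc["karma"] >= 3]
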

Document Length Considerations

The goal of language modeling is to learn to generate text conditioned on previous tokens. In this context, removing very short documents (text with fewer than 100 or so tokens) from the corpus can help reduce noise, leaving longer contiguous text from which the model can learn dependencies.
Furthermore, since most language models today are based on the transformer architecture, it is useful to preprocess and chunk very large documents into contiguous spans of the desired length. For example, the following code snippet from the datasets library shows how to split documents into non-overlapping spans:
def chunk_examples(examples):
    # Split each document into non-overlapping 50-character spans.
    chunks = []
    for sentence in examples['sentence']:
        chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)]
    return {'chunks': chunks}
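In practice, this function would be applied with the library's batched map, which lets the mapping return more rows than it received. The dataset below is just a stand-in with a sentence column, and the final filter dropping very short chunks is an optional extra step:

from datasets import load_dataset

# Any dataset with a "sentence" column works here; GLUE's SST-2 is just a stand-in.
dataset = load_dataset("glue", "sst2", split="train")

# batched=True lets chunk_examples emit more rows than it received, and
# remove_columns drops the original columns so only "chunks" remains.
chunked = dataset.map(chunk_examples, batched=True, remove_columns=dataset.column_names)

# Optionally drop chunks that are too short to be useful for modeling.
chunked = chunked.filter(lambda example: len(example["chunks"]) > 20)
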

Machine Generated Text

One of the goals of training language models is to capture the distribution of human languages. However, web-crawl datasets contain a large quantity of machine-generated text in the form of generations from existing language models, OCR text and machine-translated text.
For instance, data from patents.google.com forms a large part of the C4 corpus; the site uses machine translation to translate patents from patent offices around the world into English. Additionally, web corpora contain OCR-generated text from scanned books and documents. OCR systems are imperfect and thus generate text that differs in distribution from natural English (OCR systems often make mistakes in predictable ways, such as spelling errors and entirely missed words).
While it's very hard to identify machine-generated text (and this is still a topic of active research!), there are a few tools, such as ctrl-detector, that can help detect it. When preprocessing a corpus for language modeling, it is important to characterize and document the presence of machine-generated text in the corpus.

De-duplication

Datasets created by scraping raw text from the internet often contain the same sequences repeated many times. For example, in "Deduplicating Training Data Makes Language Models Better", the authors find that a single 50-word sequence is repeated in the C4 dataset 60,000 times!
Training models on de-duplicated datasets is faster and less likely to result in memorization. More recently, researchers have also shown that language models trained on duplicated data are susceptible to privacy attacks, where adversaries generate sequences from a trained model and detect which of them were memorized from the training set. In "Deduplicating Training Data Mitigates Privacy Risks in Language Models", the authors show that the rate at which language models regenerate training sequences is superlinearly related to a sequence's count in the training set. For instance, a sequence that is present 10 times in the training data is on average generated 1000x more often than a sequence that is present only once.
De-duplication can be performed at varying levels of granularity, from exact-match de-duplication to fuzzy de-duplication. Tools such as deduplicate-text-datasets and datasketch can help reduce and remove redundant text from the corpus being processed. As many researchers have noted, de-duplication requires a lot of computational resources (CPU and RAM) due to the size of web crawl datasets, so it's recommended to run such computations in distributed settings.
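To make the fuzzy case concrete, the sketch below uses datasketch's MinHash LSH to flag near-duplicate documents. The word-trigram shingling and the 0.5 similarity threshold are illustrative choices, not the settings used by any particular corpus:

from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128):
    # Shingle the document into word trigrams and hash them into a MinHash sketch.
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(len(words) - 2, 1)):
        m.update(" ".join(words[i:i + 3]).encode("utf8"))
    return m

documents = {
    "doc0": "the cat sat on the mat and watched the rain fall outside",
    "doc1": "the cat sat on the mat and watched the rain fall today",
    "doc2": "an entirely different sentence about language model training data",
}

# Index every document in an LSH structure; queries return candidate keys whose
# estimated Jaccard similarity exceeds the threshold.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
signatures = {doc_id: minhash_signature(text) for doc_id, text in documents.items()}
for doc_id, signature in signatures.items():
    lsh.insert(doc_id, signature)

# Near-duplicates of doc0 (the result includes doc0 itself).
print(lsh.query(signatures["doc0"]))
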

Decontamination

Normally, in machine learning, data hygiene (i.e., the separation of training and testing data) is quite straightforward. However, for large language models, where both training and benchmarking datasets are sourced from the internet, ensuring their separation a priori can be quite challenging.
For example, when using benchmark data (such as question-answer pairs) to evaluate the capabilities of a large language model, it matters whether that benchmark data also appears in the model's training data; if it does, the benchmark performance will be skewed upwards.
Decontamination refers to the process of removing instances from the training dataset that overlap with existing benchmarking datasets. Similar to de-duplication, the integrity of a training dataset can be maintained by removing instances that overlap with existing benchmark sets. For instance, when creating the WebText dataset, researchers at OpenAI decontaminated the data by removing all Wikipedia content from the training set, because Wikipedia data was used extensively in their benchmark datasets. In another instance, researchers at EleutherAI showed how to decontaminate benchmarking datasets themselves, through their lm-eval harness package, for cases where decontaminating the training dataset is not feasible.
More specifically, it's important to consider two types of data contamination:
  • Input-and-output contamination: This type of contamination occurs when downstream task labels are available in the pre-training corpus. For tasks that are close to language modeling (e.g., abstractive summarization), the task labels are target tokens. If the target text occurs in the pre-training corpus, the model can learn to copy it instead of actually solving the task.
  • Input contamination: Contamination of evaluation inputs alone (without their labels) can also lead to downstream problems. For instance, zero-shot and few-shot evaluation results will be biased upwards. It's therefore important to carefully decontaminate pre-training datasets against popular benchmarks.
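A minimal way to check for either kind of overlap is exact n-gram matching between training documents and benchmark examples, loosely in the spirit of the 13-gram filtering used for GPT-3. The toy data below is purely illustrative:

def ngrams(text, n=13):
    # Return the set of lowercase word n-grams in a piece of text.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Toy stand-ins for a pre-training corpus and a benchmark set.
train_docs = ["a long scraped web page ...", "another scraped document ..."]
benchmark_examples = ["What label best describes this news article?"]

# Build the set of benchmark n-grams once, then drop any training document
# that shares at least one n-gram with the benchmark.
benchmark_ngrams = set()
for example in benchmark_examples:
    benchmark_ngrams |= ngrams(example)

decontaminated = [doc for doc in train_docs if not (ngrams(doc) & benchmark_ngrams)]
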

Toxicity and Bias Control

While web corpora are diverse, they are also filled with toxic and biased content. For instance, the authors of the RealToxicityPrompts paper used the PerspectiveAPI to report that 2.1% of OpenWebText and 4.3% of WebText have a toxicity score >= 50%.
When training language models, it's important to pay attention to, and possibly filter out, toxic content from the pre-training datasets using tools such as PerspectiveAPI, so that the resulting models do not exhibit bias or produce harmful content in downstream applications. One common approach is filtering text against a list of "bad words": for instance, the authors of C4 filter out documents containing words from such a list. In another example, the researchers behind The Pile categorize harmful content using spamscanner.
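A crude version of such a filter is sketched below. The blocklist here is a harmless stand-in for the much longer curated lists (or Perspective API toxicity scores) used in practice:

# Stand-in blocklist; real pipelines use long curated lists or classifier scores.
BLOCKLIST = {"badword1", "badword2"}

def contains_blocked_word(text):
    # Flag a document if any of its tokens appears in the blocklist.
    tokens = set(text.lower().split())
    return bool(tokens & BLOCKLIST)

documents = ["a perfectly fine sentence", "a sentence containing badword1"]
filtered = [doc for doc in documents if not contains_blocked_word(doc)]
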
However, such filtering steps must be carried out with great care, and with an eye toward downstream applications, since these filters are more likely to retain the voices of people who hew to a hegemonic viewpoint. A detailed analysis of pejorative content and gender/religion biases should be carried out before using the data to pre-train language models.

Personally Identifiable Information Control

While collecting large datasets, it's also important to understand the legal aspects concerning the instances in the datasets. Specifically, special attention must be paid to handling personally identifiable information (PII), such as proper names, organization names, medical records, social identification numbers, etc.
Depending on the application, it's important to either mask or remove such information before pre-training language models. Tools such as presidio and pii-codex provide pipelines to detect, analyze, and handle personally identifiable information in text data.
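The sketch below shows a minimal Presidio pipeline that detects PII in a snippet of text and replaces it with placeholders; it assumes the presidio-analyzer and presidio-anonymizer packages (plus a spaCy English model) are installed:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is John Smith and my phone number is 212-555-0123."

# Detect PII entities (names, phone numbers, etc.) in the text.
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

# Replace the detected spans with entity placeholders such as <PERSON>.
anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
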

Prompt Control

Recent models such as FlanT5, InstructGPT, and Galactica have incorporated prompt pre-training and instruction fine-tuning as ways to improve their performance on downstream tasks.
Incorporating these methods when building datasets requires significant effort in prompt design and creation, while also maintaining prompt diversity. Tools such as PromptSource can be used to create, share, and use natural language prompts. For instance, a classification task can be expressed with the following prompt using the library:
# actual data point
{
    "text": "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.",
    "label": 2
}
# prompted
Input:
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
What label best describes this news article?

Target:
Business
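With PromptSource installed, the prompted example above can be produced programmatically. The sketch below picks an arbitrary AG News template, since template names can differ between library versions:

from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# Load one AG News example and the prompt templates written for that dataset.
example = load_dataset("ag_news", split="train")[0]
ag_news_prompts = DatasetTemplates("ag_news")

# Pick any available template; names vary across promptsource versions.
template_name = ag_news_prompts.all_template_names[0]
template = ag_news_prompts[template_name]

# Apply the template to turn the raw example into an (input, target) pair.
prompted_input, target = template.apply(example)
print(prompted_input)
print(target)
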

Documentation

While the importance of documentation has long been recognized, the machine learning community has often treated it as an ad hoc process.
More specifically, in machine learning, practitioners tend to think of datasets as fixed objects that are collected and fed into a training algorithm. Despite the importance of data to machine learning, there is currently no standardized process for documenting machine learning datasets.
To address these gaps, some researchers have proposed Datasheets and Data Statements. These documentation guides propose ways to document a dataset's motivation, composition, collection, curation rationale, preprocessing/cleaning/labeling, usage, distribution, and maintenance, among other things. The key focus is transparency about processes and data. As an example, look at the Datasheet for The Pile dataset: it serves as a good starting point and inspiration for anyone interested in documenting a custom dataset. The good folks at Hugging Face also provide a nice template to help with documenting datasets on the Hub.

Key Takeaways

LLMs have drastically changed the way NLP and NLU tasks are handled by machine learning practitioners.
While the availability of large web corpora has definitely paved the way forward for training bigger and bigger models, it's important to carefully curate and preprocess the datasets these models are trained on.
In this report, we covered some key steps a practitioner can take to ensure that the data used to train such models is appropriate and clean. We also provided links to resources that can help perform some of these tasks.

Never lose track of another ML project. Try W&B today.