Tuesday, December 24, 2024

How BERT and GPT models are changing the NLP game


An artificial intelligence that can compose sonnets in the style of Shakespeare. Artificial intelligence that can design a website from a straightforward user description. Artificial intelligence that can summarize quantum computing for an eighth-grader. Since the introduction of the GPT-3 language model, the natural language processing (NLP) and machine learning community has been full of stories about the alleged capabilities of modern language-based artificial intelligence.

Recent advances in NLP have been in the works for several years, starting in 2018 with the release of two massive deep learning models: GPT (Generative Pre-Training) by OpenAI and BERT (Bidirectional Encoder Representations from Transformers) for language understanding, including Google’s BERT-Base and BERT-Large. Unlike previous NLP models, BERT is an open-source, bidirectional, and unsupervised language representation, pre-trained solely on a plain text corpus. Since then, we have seen the development of other massive deep learning language models: GPT-2, RoBERTa, ESIM+GloVe, and now GPT-3, a model that has spawned thousands of technical articles.

How massive deep learning models work

Language models estimate the probability that a word appears in a sentence, or that the sentence itself occurs. As such, they are useful building blocks for many NLP applications. However, they often require burdensome amounts of training data to be useful for specific tasks and domains.
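
To make this concrete, here is a minimal sketch of how a language model scores possible next words. It uses the openly available GPT-2 model and the Hugging Face transformers library purely as illustrative assumptions; nothing in the discussion above requires them.

```python
# A minimal sketch of how a language model assigns probabilities to words,
# using the openly available GPT-2 model as a stand-in for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sentence = "The cat sat on the"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    # The model returns a score (logit) for every vocabulary word at each position.
    logits = model(**inputs).logits

# Probability distribution over the next word after the prompt.
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_word_probs, 5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>12s}  p={prob:.3f}")
```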

Pre-trained language models are designed to address these pervasive problems with training data. They are pre-trained on massive amounts of unannotated data to provide a general-purpose deep learning model. By fine-tuning these pre-trained models, downstream users can create task-specific models with smaller annotated training datasets (a technique called transfer learning). These models represent a breakthrough in NLP: state-of-the-art results can now be achieved with much smaller training datasets.
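
As a rough illustration of that fine-tuning workflow, the sketch below fine-tunes a pre-trained BERT model on a small slice of a public sentiment dataset. The dataset choice and hyperparameters are assumptions made for illustration, not a prescription from the discussion above.

```python
# A minimal sketch of transfer learning: fine-tuning pre-trained BERT on a
# small labeled dataset. Dataset and hyperparameters are illustrative only.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Only a couple of thousand labeled examples are used here, because the model
# already "knows" English from its unannotated pre-training corpus.
dataset = load_dataset("imdb", split="train[:2000]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=256),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()
```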

Until recently, the state-of-the-art NLP language models were RNN models. They are useful for sequential tasks such as abstractive summarization, machine translation, and general natural language generation. RNN models process words sequentially, in the order in which they appear in context, one word at a time. As a result, these models are difficult to parallelize and poorly preserve contextual relationships across long text. As we discussed in a previous post, context is key in NLP.
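
The sequential bottleneck is easy to see in code. The following sketch (assuming PyTorch, which the article does not specify) steps an RNN cell through a sentence one word at a time; each hidden state depends on the previous one, so the loop cannot be parallelized across positions.

```python
# A rough sketch of why RNNs are hard to parallelize: each step's hidden state
# depends on the previous step, so words must be processed one after another.
import torch
import torch.nn as nn

embedding_dim, hidden_dim, seq_len = 64, 128, 10
rnn_cell = nn.RNNCell(embedding_dim, hidden_dim)

word_embeddings = torch.randn(seq_len, embedding_dim)  # one vector per word
hidden = torch.zeros(1, hidden_dim)

for t in range(seq_len):  # strictly sequential loop over word positions
    hidden = rnn_cell(word_embeddings[t].unsqueeze(0), hidden)
    # The hidden state at step t cannot be computed before step t-1,
    # and information about early words fades as t grows.
```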

The Transformer, a model architecture introduced in 2017, sidesteps these problems. Transformers (such as BERT and GPT) use an attention mechanism that “pays attention” to the words most useful for predicting the next word in a sentence. Thanks to these attention mechanisms, Transformers process the entire input sequence of words simultaneously and map the relevant relationships between words, regardless of how far apart the words appear in the text. As a result, Transformers are highly parallelizable, can train much larger models at a faster pace, and use context clues to resolve many of the ambiguity problems that plague text.
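
At the core of the Transformer is scaled dot-product self-attention. The minimal sketch below shows the idea: every word is compared with every other word in a single matrix operation, which is what makes parallel processing of the whole sequence possible. (This is a bare-bones illustration, not the full multi-head attention used in BERT or GPT.)

```python
# Bare-bones scaled dot-product self-attention for a single sentence.
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, model_dim)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # word-to-word relevance
    weights = torch.softmax(scores, dim=-1)            # attention weights
    return weights @ V                                 # context-aware representations

seq_len, d_model = 6, 64
x = torch.randn(seq_len, d_model)              # embeddings for a 6-word sentence
out = scaled_dot_product_attention(x, x, x)    # self-attention: Q = K = V
print(out.shape)  # torch.Size([6, 64])
```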

Individual Transformer models also have their own unique advantages. Until recently, BERT was the most popular deep learning NLP model, achieving state-of-the-art performance on many NLP tasks.

Trained on roughly 2.5 billion words, BERT’s main advantage is its use of bidirectional learning to obtain the context of words from both the left-to-right and right-to-left directions simultaneously. BERT’s bidirectional training approach is optimized for masked language model (Masked LM) prediction and outperforms left-to-right training after only a small number of pre-training steps. During training, the Next Sentence Prediction (NSP) objective teaches the model how sentences relate to each other, for example whether sentence B should precede or follow sentence A. As a result, the model is able to capture more context. For example, it can understand the different meanings of the word bank in the sentences “Raise your oars when you get to the river bank” and “The bank is sending you a new debit card.” To do so, it uses the clue “river” to the left in the first sentence and “debit card” to the right in the second.
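
A quick way to see the masked-word objective in action is the fill-mask pipeline from the Hugging Face transformers library (an assumed tool, not something specified above), applied to the same “bank” sentences.

```python
# A small sketch of BERT's masked-word objective using the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for sentence in [
    "Raise your oars when you get to the river [MASK].",
    "The [MASK] is sending you a new debit card.",
]:
    print(sentence)
    for p in fill_mask(sentence, top_k=3):
        # Each prediction includes the candidate token and its probability score.
        print(f"  {p['token_str']:>10s}  score={p['score']:.3f}")
```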

Unlike BERT models, GPT models are unidirectional. The main advantage of GPT models is the sheer scale at which they are pre-trained: GPT-3, the third-generation GPT model, has 175 billion parameters, about ten times more than any previous model, and was trained on a correspondingly enormous text corpus. This truly massive pre-trained model means users can adapt it to new NLP tasks with very little data. While Transformers in general have reduced the amount of data needed to train models, GPT-3 has a distinct advantage over BERT in that it requires significantly less task-specific data, often just a prompt containing a handful of examples.

For example, given a prompt of just 10 sentences, the model learned to write an essay on why people shouldn’t be afraid of artificial intelligence. (Although it should be noted, the variable quality of these free-form essays shows the limitations of today’s technology.)
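
The mechanics of such prompting look roughly like the sketch below: the task is described with a couple of examples inside the prompt itself, and the model continues the pattern. Since GPT-3 is only available through OpenAI’s API, the much smaller, openly available GPT-2 is used here purely as a stand-in to show how a few-shot prompt is structured; its output will be far less impressive.

```python
# A hedged sketch of few-shot prompting, with GPT-2 standing in for GPT-3.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint =>"
)

# GPT-2 is too small to do this reliably; the snippet only shows the mechanics.
result = generator(prompt, max_new_tokens=5)
print(result[0]["generated_text"])
```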

Tasks performed using BERT and GPT models:

  • Natural language inference is a task in which a model determines whether a statement is true (entailment), false (contradiction), or undetermined (neutral) given a premise. For example, if the premise is “tomatoes are sweet” and the statement is “tomatoes are fruits”, the pair would be labeled neutral.
  • Question answering enables developers and organizations to build question answering systems based on neural networks. In question answering tasks, the model receives a text-based question and returns the answer in text form, marking where each answer begins and ends in the source text (a minimal sketch follows this list).
  • Text classification is used for sentiment analysis, spam filtering, and message categorization. BERT can be fine-tuned for content category detection in virtually any text classification use case.
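
For instance, a question answering model of the kind described above can be tried in a few lines with the Hugging Face pipeline API; the specific SQuAD-fine-tuned model named here is an assumption chosen for illustration.

```python
# A short sketch of extractive question answering with a BERT-style model
# fine-tuned on SQuAD (model choice is illustrative).
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "BERT is a bidirectional Transformer pre-trained on plain text. "
    "It can be fine-tuned for tasks such as question answering and text classification."
)
answer = qa(question="What can BERT be fine-tuned for?", context=context)

# The pipeline returns the answer text plus its start/end character positions.
print(answer["answer"], answer["start"], answer["end"])
```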
