
Image by Author | Canva
# Introduction
# 1. Building tokenizers from scratch
Project 1: How to Build a BERT WordPiece Tokenizer in Python and Hugging Face
Project 2: Let’s Build a GPT Tokenizer
Text preprocessing is the first and most essential part of any NLP task. It transforms raw text into something a machine can actually process by breaking it down into smaller units such as words, subwords, or even bytes. To get a good feel for how this works, I recommend these two excellent projects. The first walks you through building a BERT WordPiece tokenizer in Python using Hugging Face. It shows how words are split into smaller subword units, for example by adding “##” to mark word continuations, which helps models like BERT handle rare or misspelled words by breaking them into familiar parts. The second video, “Let’s Build a GPT Tokenizer” by Andrej Karpathy, is a bit long, but it’s pure gold. It explains how GPT uses byte-pair encoding (BPE) at the byte level, repeatedly merging the most common byte sequences, which gives it flexible handling of spaces, punctuation, and even emojis. I really recommend watching it if you want to see what actually happens when text is converted into tokens. Once you are comfortable with tokenization, everything else in NLP becomes much clearer.
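To make the BPE idea concrete, here is a minimal sketch in plain Python (my own illustration, not the code from either tutorial): starting from raw UTF-8 bytes, it repeatedly finds the most frequent adjacent pair of IDs and merges it into a new token ID — the core loop Karpathy’s video builds up.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent ID pairs and return the most common one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the new token ID."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw bytes, as GPT-style byte-level BPE does.
ids = list("aaabdaaabac".encode("utf-8"))
for step in range(3):
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, 256 + step)  # new IDs start above the byte range
print(ids)  # [258, 100, 258, 97, 99]
```

Each merge shrinks the sequence while growing the vocabulary; a real tokenizer simply runs this loop thousands of times over a large corpus and stores the learned merges.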
# 2. NER in action: recognizing names, dates and organizations
Project 1: Named Entity Recognition (NER) in Python: Pre-trained and custom models
Project 2: Building an entity extraction model using BERT
Once you understand how text is represented, the next step is learning how to actually extract meaning from it. A great place to start is Named Entity Recognition (NER), which trains a model to recognize entities in a sentence. For example, given “Apple hit an all-time high share price of $143 in January,” a good NER system should tag “Apple” as an organization, “$143” as money, and “January” as a date. The first video shows how to apply pre-trained NER models with libraries such as spaCy and Hugging Face Transformers. You’ll see how to feed in text, get entity predictions, and even visualize them. The second video goes a step further and walks you through building an entity extraction system by fine-tuning BERT yourself. Instead of relying on a ready-made library, you code the pipeline: tokenize text, align tokens with entity labels, fine-tune the model in PyTorch or TensorFlow, and then use it to tag new text. I recommend this as a second project because NER is one of those tasks that makes NLP feel truly practical. You start to see how machines can understand “who did what, when, and where.”
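One fiddly step in the fine-tuning project is aligning word-level entity labels with subword tokens. Here is a small sketch in plain Python (the function names and the toy stand-in tokenizer are my own, not from the tutorial) using the common BIO labeling scheme: the first subword keeps the word’s label, and continuations of a B- entity become I-.

```python
def align_labels(words, word_labels, tokenize):
    """Expand word-level BIO labels to subword tokens: the first piece
    keeps the word's label, continuation pieces of an entity get I-."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        labels.append(label)
        cont = "I-" + label[2:] if label != "O" else "O"
        labels.extend([cont] * (len(pieces) - 1))
    return tokens, labels

# Toy WordPiece-style tokenizer: splits long words (stand-in for a real one).
toy_tok = lambda w: [w] if len(w) <= 4 else [w[:4], "##" + w[4:]]

tokens, labels = align_labels(
    ["Apple", "hit", "$143", "in", "January"],
    ["B-ORG", "O", "B-MONEY", "O", "B-DATE"],
    toy_tok,
)
print(tokens)  # ['Appl', '##e', 'hit', '$143', 'in', 'Janu', '##ary']
print(labels)  # ['B-ORG', 'I-ORG', 'O', 'B-MONEY', 'O', 'B-DATE', 'I-DATE']
```

With a real tokenizer the splitting is different, but the alignment logic is the same, and it is exactly the part that silently breaks training when it is wrong.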
# 3. Text classification: sentiment prediction with BERT
Project: Text Classification | Sentiment Analysis with BERT using Hugging Face, PyTorch and Python Tutorial
After learning how to represent text and extract features from it, the next step is training models to assign labels to text; sentiment analysis is the classic example. It’s an older project, and you may need to make one change to get it working (check the comments on the video), but I still recommend it because it also explains how BERT works. If you’re not familiar with transformers yet, this is a good place to start. The project walks you through using a pre-trained BERT model with Hugging Face to classify text such as movie reviews, tweets, or product reviews. In the video, you’ll see how to load a labeled dataset, preprocess the text, and fine-tune BERT to predict whether each example is positive, negative, or neutral. It’s a clear way to see how tokenization, model training, and evaluation come together in a single workflow.
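As a rough illustration of the preprocessing step, here is a framework-free sketch of turning raw sentences into fixed-length ID sequences with an attention mask — the shape of input a BERT fine-tuning loop expects. The vocabulary, label mapping, and helper name are invented for the example, not taken from the tutorial.

```python
def encode_batch(texts, vocab, max_len, pad_id=0, unk_id=1):
    """Map words to IDs, truncate/pad every row to max_len, and build
    an attention mask marking which positions are real tokens."""
    input_ids, attention_masks = [], []
    for text in texts:
        ids = [vocab.get(w, unk_id) for w in text.lower().split()][:max_len]
        mask = [1] * len(ids) + [0] * (max_len - len(ids))
        ids = ids + [pad_id] * (max_len - len(ids))
        input_ids.append(ids)
        attention_masks.append(mask)
    return input_ids, attention_masks

# Hypothetical tiny vocabulary and sentiment label mapping.
label2id = {"negative": 0, "neutral": 1, "positive": 2}
vocab = {"great": 2, "movie": 3, "boring": 4}

ids, masks = encode_batch(["Great movie", "boring"], vocab, max_len=4)
print(ids)    # [[2, 3, 0, 0], [4, 0, 0, 0]]
print(masks)  # [[1, 1, 0, 0], [1, 0, 0, 0]]
```

A real pipeline delegates this to the Hugging Face tokenizer, but seeing the padding and masking done by hand makes it obvious what those tensors mean when they show up in the training loop.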
# 4. Building text generation models using RNN and LSTM
Project 1: AI Text Generation – Next Word Prediction in Python
Project 2: Text Generation with LSTM, with Nabil Hassein
Sequence modeling deals with tasks whose output is a sequence of text, and it is a huge part of how modern language models work. These projects focus on text generation and next-word prediction, showing how a machine can learn to continue a sentence word by word. The first video walks you through building a basic recurrent neural network (RNN) language model that predicts the next word in a sequence. This is a classic exercise that really shows how a model captures patterns, grammar, and structure in text, which models like GPT do at a much larger scale. The second video uses a long short-term memory (LSTM) network to generate coherent text, whether prose or code. You’ll see how the model is fed one word or character at a time, how to sample from its predictions, and even how tricks like temperature and beam search control the creativity of the generated text. These projects make it clear that text generation is not magic, but a clever accumulation of predictions.
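Temperature sampling is easy to demystify in a few lines. This plain-Python sketch (my own, not from either video) divides the model’s logits by a temperature before the softmax and then samples: low temperatures make generation nearly greedy, high temperatures make it more adventurous.

```python
import math, random

def sample_next(logits, temperature=1.0, rng=random):
    """Softmax over temperature-scaled logits, then sample an index.
    Low temperature sharpens the distribution; high temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):             # inverse-CDF sampling
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# With a very low temperature, sampling is almost greedy argmax.
logits = [2.0, 1.0, 0.1]
picks = [sample_next(logits, temperature=0.1) for _ in range(100)]
print(picks.count(0))  # nearly always 100: index 0 dominates
```

At temperature 1.0 the same call would pick the other indices noticeably often, which is the whole trade-off between coherent and creative output.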
# 5. Construction of the Seq2Seq machine translation model
Project: PyTorch Seq2Seq Tutorial on Machine Translation
The final project takes NLP beyond English and into a real-world task: machine translation. Here you build an encoder-decoder network, where one network reads and encodes the source sentence and another decodes it into the target language. This is essentially what Google Translate and other translation services do. The tutorial also introduces attention mechanisms, which let the decoder focus on the right parts of the input, and explains how to train on parallel text and evaluate translations using metrics such as the Bilingual Evaluation Understudy (BLEU) score. This project brings together everything you have learned so far into one practical NLP task. Even if you’ve used translation apps before, building a toy translator gives you hands-on experience of how these systems actually work behind the scenes.
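BLEU itself is simple enough to sketch by hand. The version below is a deliberately simplified, unsmoothed single-reference variant (not the exact formula a library like sacreBLEU implements): it computes clipped n-gram precisions up to 4-grams, takes their geometric mean, and applies a brevity penalty for translations shorter than the reference.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty (single reference, no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(zip(*[candidate[i:] for i in range(n)]))  # candidate n-grams
        ref = Counter(zip(*[reference[i:] for i in range(n)]))   # reference n-grams
        overlap = sum(min(c, ref[g]) for g, c in cand.items())   # clipped matches
        if overlap == 0:
            return 0.0
        precisions.append(overlap / max(sum(cand.values()), 1))
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))    # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat is on the mat".split()
print(bleu(ref, ref))                       # identical sentences score 1.0
print(bleu("a dog ran away".split(), ref))  # no overlap scores 0.0
```

Real evaluations smooth the zero-precision case and aggregate over a whole corpus, but this captures why BLEU rewards both overlapping n-grams and matching length.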
# Wrapping Up
This brings us to the end of the list. Each project covers one of the five main areas of NLP: tokenization, information extraction, text classification, sequence modeling, and applied multilingual NLP. By working through them, you’ll get a good sense of how NLP pipelines work from start to finish. If you found these projects helpful, please give credit to the tutorial creators and share what you build.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of artificial intelligence and medicine. She is co-author of the e-book “Maximizing Productivity with ChatGPT”. As a 2022 Google Generation Scholar for APAC, she promotes diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a staunch advocate for change and founded FEMCodes to empower women in STEM fields.
