“Finally, machines can answer all your questions”

Satyendra Kumar · Published in Dev Genius · Jun 5, 2022 · 9 min read

Recently I have been working with Coveo search ML models to enhance the search experience of our e-commerce website. While exploring the Coveo ML model features, I started looking into the capabilities of BERT (Bidirectional Encoder Representations from Transformers) and was amazed by the capabilities and performance of the pre-trained BERT model. BERT is a transformer-based machine learning technique for natural language processing pre-training, developed by Google.

In this article I will show how easily anyone can use a pre-trained BERT model with NVIDIA NGC containers on GCP and run question answering against your own context. This tutorial has three sections, which will help you understand the topics below and build your own BERT-based question answering setup:

1. NVIDIA NGC Catalog

2. BERT Model for NLP

3. Example of BERT Question Answering using GCP

Let’s start!

1. NVIDIA NGC Catalog

The NVIDIA NGC Catalog is a hub of GPU-optimized software: pre-trained models, HPC applications, and containers for machine learning and deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It also includes the Docker runtime, NVIDIA drivers, and Helm charts for production deployment of these models.

In short, it delivers performance-optimized AI/HPC software containers, pre-trained AI models, and Jupyter Notebooks that accelerate AI development and HPC workloads on any GPU-powered on-premises, cloud, or edge system.
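As a quick orientation, this is roughly how an NGC container is pulled and started locally with the Docker runtime. The image tag below is only an example; check the catalog for the tag that matches the framework version you need:

# Pull a GPU-enabled TensorFlow 1.x container from the NGC registry
# (the tag is an example; pick the current one from the catalog)
docker pull nvcr.io/nvidia/tensorflow:21.02-tf1-py3

# Start the container interactively with GPU access
docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:21.02-tf1-py3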

[Figure: the NVIDIA NGC software stack]

2. BERT Model for Natural Language Processing

BERT, short for Bidirectional Encoder Representations from Transformers, is a machine learning (ML) model for natural language processing. It was developed in 2018 by researchers at Google AI Language and serves as a general-purpose solution for many common language tasks. Some of the most common tasks it can handle are listed below:

Search, question answering, sentiment analysis, text classification, part-of-speech (POS) tagging, named entity recognition (NER), text generation, summarization, and similarity matching.

The reasons BERT is so important are two-fold:

  1. Dataset Size: Language is messy, complex, and much harder for computers to learn than identifying images. They need much more data to get better at recognising patterns in language and identifying relationships between words and phrases. Models like the latest GPT-3 were trained on 45TB of data and contain 175 billion parameters. These are huge numbers, so very few people or even organizations have the resources to train these types of models. If everyone had to train their own BERT, we would see very little progress without researchers building on the power of these models. Progress would be slow and limited to a few big players.
  2. Fine-tuning: Pre-trained models have a dual benefit. They can be used “off-the-shelf”, i.e. without any changes a business can plug BERT into its pipeline and use it with a chatbot or some other application. They can also be fine-tuned for specific tasks without much data or model tweaking. For BERT, all you need is a few thousand examples to fine-tune it to your data (see the sketch after this list).

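As a rough illustration of what fine-tuning looks like in practice, the snippet below sketches a SQuAD-style fine-tuning run with the run_squad.py script used later in this article. The flag names follow the standard BERT run_squad.py script; the paths, file names, and hyperparameters are placeholders to adapt to your own data.

# Hypothetical fine-tuning run on your own question/answer data (SQuAD-format JSON).
# Paths and hyperparameters are placeholders; flags follow the standard run_squad.py script.
!python /workspace/bert/run_squad.py \
  --do_train=True \
  --train_file=/workspace/bert/data/my_domain_qa.json \
  --train_batch_size=8 \
  --learning_rate=3e-5 \
  --num_train_epochs=2 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --vocab_file=/workspace/bert/config.qa/vocab.txt \
  --bert_config_file=/workspace/bert/config.qa/bert_config.json \
  --init_checkpoint=/workspace/bert/data/finetuned_large_model_SQUAD2.0/model.ckpt \
  --output_dir=/workspace/bert/results/finetune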
Let’s understand more about BERT and why it is very successful.

The Transformer architecture:

This section gives a general understanding of the Transformer architecture and how it relates to models like BERT.

As we know, the original Transformer paper was called “Attention is all you need”. The name itself is important, since it points to how the approach deviates from what came before. Earlier NLP models like ELMo employed RNNs to process text sequentially, in a loop-like fashion, as shown below:

RNNs with sequence-to-sequence approaches processed text sequentially until they reached an end-of-sentence token (<eos>). In this example a request, “ABC”, is mapped to a reply, “WXYZ”. When the model receives the <eos> token, its hidden state stores the entire context of the preceding text sequence. Source: A Neural Conversation Model

Now think of a simple sentence, like “The cat ran away when the dog chased it down the street”. For a person, this is an easy sentence to comprehend, but there are actually a number of difficulties if you think about processing it sequentially. Once you get to the “it” part, how do you know what it refers to? You would have to store some state to identify that the key protagonist in this sentence is the “cat”. Then, you would have to find some way to relate the “it” to the “cat” as you continue to read the sentence.

Now imagine that the sentence could be any number of words in length, and try to think about how you would keep track of what’s being referred to as you process more and more text.

This is the problem that pre-BERT sequential models ran into.

They were limited. They could only prioritise the importance of words that were most recently processed. As they continued to move along the sentence, the importance or relevance of previous words started to diminish.

Think of it like adding information to a list as you process each new word. The more words you process, the more difficult it is to refer to words at the start of the list. Essentially, you need to move back, one element at a time, word by word until you get to the earlier words and then see if those entities are related.

Does the “it” refer to the “cat”? This is known as the “Vanishing Gradient” problem, and ELMo used special networks known as Long Short-Term Memory Networks (LSTMs) to alleviate the consequences of this phenomenon. LSTMs did address this issue, but they didn’t eliminate it.

Ultimately, they couldn’t create an efficient way to “focus” on the important word in each sentence. This is the problem the Transformer network addressed by using the mechanism we already know as “attention”.

This gif is from a great blog post about understanding attention in Transformers. The green vectors at the bottom represent the encoded inputs, i.e. the input text encoded into vectors. The dark green vector at the top represents the output for input 1. This process is repeated for each input to generate an output vector whose attention weights capture the “importance” of every word in the input relative to the word currently being processed. It does this via a series of multiplication operations between the Key, Value and Query matrices, which are derived from the inputs. Source: Illustrated Self-Attention.

The “Attention is all you need” paper used attention to improve the performance of machine translation. Its authors created a model with two main parts:

  1. Encoder: This part of the “Attention is all you need” model processes the input text, looks for important parts, and creates an embedding for each word based on relevance to other words in the sentence.
  2. Decoder: This takes the output of the encoder, which is an embedding, and then turns that embedding back into a text output, i.e. the translated version of the input text.

The key part of the paper is not, however, the encoder or the decoder, but the layers used to create them. Specifically, neither the encoder nor the decoder used any recurrence or looping, like traditional RNNs. Instead, they used layers of “attention” through which the information passes linearly. It didn’t loop over the input multiple times — instead, the Transformer passes the input through multiple attention layers.

You can think of each attention layer as “learning” more about the input, i.e. looking at different parts of the sentence and trying to discover more semantic or syntactic information. This is important in the context of the vanishing gradient problem we noted earlier.

As sentence length increases, it gets increasingly difficult for RNNs to process the text and learn more from it. Every new word means more data to store, and makes it harder to retrieve the earlier words needed to understand the context of the sentence.

The full multi-head attention diagram looks scary, and in truth it is a little overwhelming at first, so don’t worry about understanding it all right now. The main takeaway is that instead of looping, the Transformer applies scaled dot-product attention multiple times in parallel, i.e. it adds more attention mechanisms (“heads”) and processes the input through each of them in parallel. This plays a role similar to looping over a layer multiple times in an RNN. Source: Another great post on attention
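To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ / √d) V. The toy shapes and random weights are purely illustrative and are not part of the NVIDIA notebook used later in this article:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Scores: how much each query position attends to each key position
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 per query
    return weights @ V, weights          # weighted sum of the values

# Toy example: 4 tokens, embedding size 8 (random "encoded inputs")
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(attn.round(2))  # each row shows how strongly one token attends to the others

Each row of the printed matrix is one token's attention distribution over the whole sentence, which is exactly the "focus" mechanism described above.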

For comparison, the largest BERT model consists of 24 attention layers. GPT-2 has 12 attention layers and GPT-3 has 96 attention layers.

3. Example of BERT Question Answering using GCP

BERT, which stands for Bidirectional Encoder Representations from Transformers, is available as a pre-trained model that can be fine-tuned or used as-is for a variety of natural language processing tasks and use cases. In this case we will use the BERT question answering assets from the NVIDIA NGC catalog and run them on GCP Vertex AI.

This notebook demonstrates:

  • Inference on QA task with BERT Large model
  • The use/download of fine-tuned NVIDIA BERT models
  • Use of Mixed Precision for Inference

System Requirements:

  • NVIDIA NGC catalog pre-trained BERT model
  • NVIDIA Docker image for BERT
  • NVIDIA Docker image for the TensorFlow framework v1.15.5
  • NVIDIA Docker runtime
  • Google Vertex AI Workbench
  • 1 x NVIDIA Tesla V100 GPU
  • 250 GB SSD drive
  • Ubuntu 20.x

The simplest way to meet all the requirements above is to create and register an account on NVIDIA NGC: https://catalog.ngc.nvidia.com/.

Log in, go to the BERT for TensorFlow Jupyter Notebook, and click “Deploy to Vertex AI”. Note that you need TensorFlow v1.15.5 to run this model. The rest of the steps are described below and are very straightforward.

BERT Model Configurations:

  • Model: BERT-Large
  • Hidden layers: 24 encoder layers
  • Hidden unit size: 1024
  • Attention heads: 16
  • Feedforward filter size: 4 x 1024
  • Max sequence length: 512
  • Parameters: 330M

Among the many checkpoints available, we will download the BERT Large model fine-tuned on SQuAD 2.0 with mixed precision (bert_tf_ckpt_large_qa_squad2_amp_384), together with the BERT helper scripts. Run the code below:

use_mixed_precision_model = True

# Gathering the data and model directories

# bert_tf_ckpt_large_qa_squad2_amp_384
DATA_DIR_FT = '/workspace/bert/data/finetuned_large_model_SQUAD2.0'
!mkdir -p $DATA_DIR_FT

!wget --content-disposition -O $DATA_DIR_FT/bert_tf_ckpt_large_qa_squad2_amp_384_19.03.1.zip \
  https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_ckpt_large_qa_squad2_amp_384/versions/19.03.1/zip \
  && unzip -n -d $DATA_DIR_FT/ $DATA_DIR_FT/bert_tf_ckpt_large_qa_squad2_amp_384_19.03.1.zip \
  && rm -rf $DATA_DIR_FT/bert_tf_ckpt_large_qa_squad2_amp_384_19.03.1.zip

# Download BERT helper scripts
!wget -nc --show-progress -O bert_scripts.zip \
  https://api.ngc.nvidia.com/v2/recipes/nvidia/bert_for_tensorflow/versions/1/zip
!mkdir -p /workspace/bert
!unzip -n -d /workspace/bert bert_scripts.zip

# BERT config

# Download BERT vocab file
!mkdir -p /workspace/bert/config.qa
!wget -nc https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt \
  -O /workspace/bert/config.qa/vocab.txt

# Writing /workspace/bert/config.qa/bert_config.json

%%writefile /workspace/bert/config.qa/bert_config.json
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
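As an optional sanity check (assuming the paths written above), you can read the config back and confirm it matches the BERT Large configuration listed earlier:

import json

with open('/workspace/bert/config.qa/bert_config.json') as f:
    cfg = json.load(f)

# Should print 24 hidden layers, 16 attention heads, hidden size 1024
print(cfg['num_hidden_layers'], cfg['num_attention_heads'], cfg['hidden_size'])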

# Helper functions

# Create dynamic JSON files based on user inputs
def write_input_file(context, qinputs, predict_file):

    # Remove quotes and new lines from the text for valid JSON
    context = context.replace('"', '').replace('\n', '')

    # Create the JSON dict to write
    json_dict = {
        "data": [
            {
                "title": "BERT QA",
                "paragraphs": [
                    {
                        "context": context,
                        "qas": qinputs
                    }]}]}

    # Write the JSON to the input file
    with open(predict_file, 'w') as json_file:
        import json
        json.dump(json_dict, json_file, indent=2)


# Display inference results as an HTML table
def display_results(predict_file, output_prediction_file):
    import json
    from IPython.display import display, HTML

    # Here we show only the prediction results; n-best predictions are also available in the output directory
    results = ""
    with open(predict_file, 'r') as query_file:
        queries = json.load(query_file)
        input_data = queries["data"]
    with open(output_prediction_file, 'r') as result_file:
        data = json.load(result_file)
    for entry in input_data:
        for paragraph in entry["paragraphs"]:
            for qa in paragraph["qas"]:
                results += "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                    qa["id"], qa["question"], data[qa["id"]])

    display(HTML("<table><tr><th>Id</th><th>Question</th><th>Answer</th></tr>{}</table>".format(results)))

BERT Inference: Question Answering

We can run inference on a fine-tuned BERT model for tasks like Question Answering.

Here we use a BERT model fine-tuned on the SQuAD 2.0 dataset, which contains 100,000+ question-answer pairs on 500+ articles combined with over 50,000 new, unanswerable questions.

# Create the BERT input file with (1) the context and (2) the questions to be answered based on that context (very important)

predict_file = '/workspace/bert/config.qa/input.json'

%%writefile $predict_file
{"data":
 [
  {"title": "Littelfuse QA",
   "paragraphs": [
     {"context": "Littelfuse offers a variety of circuit protection devices to protect the circuits on passenger cars. Our automotive products, including our new resettable devices, are industry standards. The wide selection of our high quality offering allows you to search for the optimal solution that fits your specific needs. Browse our products below.",
      "qas": [
        {"question": "Does Littelfuse offer passenger cars products?",
         "id": "Q1"
        }
      ]}]}]}

Running Question/Answer Inference

To run QA inference we will launch the script run_squad.py with the following parameters:

import os

# This specifies the model architecture.
bert_config_file = '/workspace/bert/config.qa/bert_config.json'

# The vocabulary file that the BERT model was trained on.
vocab_file = '/workspace/bert/config.qa/vocab.txt'

# Initial checkpoint from the fine-tuned BERT Large model
init_checkpoint = os.path.join('/workspace/bert/data/finetuned_large_model_SQUAD2.0/model.ckpt')

# Create the output directory where all the results are saved.
output_dir = '/workspace/bert/results'
output_prediction_file = os.path.join(output_dir, 'predictions.json')

# Whether to lower case the input (True for uncased models, False for cased models).
do_lower_case = True

# Total batch size for predictions
predict_batch_size = 8

# Whether to run eval on the dev set.
do_predict = True

# When splitting up a long document into chunks, how much stride to take between chunks.
doc_stride = 128

# The maximum total input sequence length after WordPiece tokenization.
# Sequences longer than this will be truncated, and sequences shorter than this will be padded.
max_seq_length = 384

Run Inference

# Ask BERT questions
!python /workspace/bert/run_squad.py \
  --bert_config_file=$bert_config_file \
  --vocab_file=$vocab_file \
  --init_checkpoint=$init_checkpoint \
  --output_dir=$output_dir \
  --do_predict=$do_predict \
  --predict_file=$predict_file \
  --predict_batch_size=$predict_batch_size \
  --doc_stride=$doc_stride \
  --max_seq_length=$max_seq_length

Display Results:
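To render the predictions as a table inside the notebook, call the display_results helper defined earlier; the output file name matches the output_prediction_file variable set above:

# Render the questions and BERT's predicted answers as an HTML table
display_results(predict_file, output_prediction_file)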

Question: Does Littelfuse offer passenger cars products?

Answer: Littelfuse offers a variety of circuit protection devices to protect the circuits on passenger cars

You can now add any context and multiple questions of your own, write them into the input.json file, and re-run the inference command to get the answers, as sketched below. So far I have tried many question/answer pairs and it gives fast and accurate results. I will also be publishing the code on GitHub soon.
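For example, using the write_input_file helper defined earlier, you could generate the input file programmatically for your own context and questions. The context and question strings below are placeholders:

# Hypothetical example: build input.json from your own context and questions
context = "Your own product documentation or article text goes here."
questions = [
    {"question": "What does the product do?", "id": "Q1"},
    {"question": "Which devices are supported?", "id": "Q2"},
]
write_input_file(context, questions, predict_file)

# After re-running the run_squad.py command above, render the new answers:
# display_results(predict_file, output_prediction_file)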

