
Understanding Deep Learning Algorithms for Named Entity Recognition

What is Named Entity Recognition (NER)

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into predefined categories. Named Entity Recognition consists of extracting structured pieces of information such as name, location, company, university name, activity sector, etc. from chunks of unstructured text.


Before going into the implementation and algorithms for NER, let’s first understand the data used for NER.

Understanding the data

First, let’s discuss what sequence tagging is: the task of assigning a label to every token in a sequence. You may have come across it through specific tasks such as Named Entity Recognition or Part-of-Speech tagging. We’ll focus on Named Entity Recognition (NER) for the rest of this post.

One example is:

“Sony CE-7RM3 has Remarkable Eye AutoFocus”

Sony          B-PNAME
CE-7RM3       I-PNAME
has           O
Remarkable    O
Eye           B-PFEATURE
AutoFocus     I-PFEATURE

The tags are:

  • PNAME represents a product name.
  • PFEATURE represents a product feature.
  • O represents a word that is not part of any entity.

Some entities (like Sony CE-7RM3) span multiple words, so we use a tagging scheme to distinguish between the beginning of an entity (tag B-...) and the inside of an entity (tag I-...).
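To make the B-/I- scheme concrete, here is a minimal sketch that groups tagged tokens back into entities. The function name bio_to_spans is ours, and dropping an I- tag that does not continue an open entity is one possible convention, not the only one.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always closes any open entity and starts a new one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (assumption: such stray tokens are treated as outside).
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = "Sony CE-7RM3 has Remarkable Eye AutoFocus".split()
tags = ["B-PNAME", "I-PNAME", "O", "O", "B-PFEATURE", "I-PFEATURE"]
spans = bio_to_spans(tokens, tags)
print(spans)
# [('PNAME', 'Sony CE-7RM3'), ('PFEATURE', 'Eye AutoFocus')]
```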

For our implementation, we assume that the data is stored in two .txt files: one containing a single sentence per line, and the other containing the corresponding entity tags on the matching line, as in the following example.


“Sony CE-7RM3 has Remarkable Eye AutoFocus”
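As a rough sketch of how this two-file format could be read, the helper below pairs each sentence line with its tag line. It assumes whitespace tokenization and space-separated tags. The function name and the file names mentioned in the comments are hypothetical, and the tag line is the one implied by the tagging example earlier in this post.

```python
def read_tagged_corpus(sentence_lines, tag_lines):
    """Pair each sentence line with its tag line, token by token."""
    corpus = []
    pairs = zip(sentence_lines, tag_lines)
    for line_no, (sentence, tag_line) in enumerate(pairs, start=1):
        tokens = sentence.split()
        tags = tag_line.split()
        if len(tokens) != len(tags):
            raise ValueError(
                f"line {line_no}: {len(tokens)} tokens but {len(tags)} tags"
            )
        corpus.append(list(zip(tokens, tags)))
    return corpus


# In practice these would be open file handles, e.g. open("sentences.txt")
# and open("tags.txt"); both file names are hypothetical.
sentence_lines = ["Sony CE-7RM3 has Remarkable Eye AutoFocus\n"]
tag_lines = ["B-PNAME I-PNAME O O B-PFEATURE I-PFEATURE\n"]

corpus = read_tagged_corpus(sentence_lines, tag_lines)
print(corpus[0][:2])
# [('Sony', 'B-PNAME'), ('CE-7RM3', 'I-PNAME')]
```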





NOTE: The platform makes data labeling easy by providing an interface to tag and annotate text for sequence tagging (NER).

Feature Set

A feature set is a subset of your dataset that is used as the input to your machine learning algorithm. Creating one involves dividing the dataset into two sets:

  • training set - a subset to train a model.
  • test set - a subset to test the trained model.

Using the platform, a feature set for machine learning training can be created quickly. The platform provides two options for defining the training and test sets: a user can either split the dataset or explicitly extract from the dataset. It also provides a summary of the feature set you are creating, so you can check whether the feature set is properly balanced and free of biases or bad data. After all, your machine learning model is only as good as the data it is fed.
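Outside of a no-code platform, the "split the dataset" option could look roughly like the sketch below. The 80/20 ratio, the fixed seed, and the function name are our assumptions, not settings taken from the platform.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into train and test sets."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


examples = [f"sentence {i}" for i in range(10)]
train_set, test_set = train_test_split(examples)
print(len(train_set), len(test_set))  # 8 2
```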

NER Algorithms using Deep Learning

If we pause for a second and think about the problem abstractly, we just need a system that assigns a class (a number corresponding to a tag) to each word in a sentence.

“But wait, why is this a problem? Just keep a list of product names and features.”

What makes this problem non-trivial is that many entities, like product names or features, are made-up names for which we don’t have any prior knowledge. What we really need is something that extracts contextual information from the sentence, just as humans do.

Named Entity Recognition with Bidirectional LSTM and CRF

RNNs can handle different input and output combinations. For NER we work with a many-to-many architecture: the task is to output a tag (y) for each token (x) ingested at each time step.

Several layers are involved in implementing the bi-LSTM + CRF algorithm. Let's discuss what each of them does and how the CRF works.

Embedding Layer

We need a dense representation for each word. The first thing we can do is load some pre-trained word embeddings. The embedding layer transforms each token into a vector of n dimensions, where n depends on the embeddings used.

For example, if we use pre-trained ELMo word embeddings, each word is represented as a vector of 512 dimensions.
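The sketch below shows the shape of what an embedding layer produces, using a static lookup table filled with random numbers as a stand-in for pre-trained vectors. The vocabulary and table are made up for illustration. Contextual embeddings such as ELMo compute each vector from the whole sentence rather than from a fixed table, but the output still has one vector per token.

```python
import numpy as np

EMBEDDING_DIM = 512  # the dimension used in the example above

# Hypothetical toy vocabulary; index 0 is reserved for unknown words.
vocab = {"<unk>": 0, "sony": 1, "ce-7rm3": 2, "has": 3,
         "remarkable": 4, "eye": 5, "autofocus": 6}

rng = np.random.default_rng(0)
# Stand-in for a pre-trained table: one row per vocabulary entry.
embedding_table = rng.normal(size=(len(vocab), EMBEDDING_DIM))


def embed(tokens):
    """Map each token to its row in the embedding table."""
    ids = [vocab.get(token.lower(), vocab["<unk>"]) for token in tokens]
    return embedding_table[ids]


vectors = embed("Sony CE-7RM3 has Remarkable Eye AutoFocus".split())
print(vectors.shape)  # (6, 512)
```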


Once we have our word representations, we run an LSTM (or bi-LSTM) over the sequence of word vectors and obtain another sequence of vectors, which gives each word a meaningful representation in its context. We will use a bi-LSTM here, which is well suited to maintaining the context of words across long sentences.

Long short-term memory (LSTM) cells are a building block of recurrent neural networks (RNNs). While a plain (unidirectional) LSTM processes text from left to right, as humans do, a bi-LSTM also processes it in the opposite direction. This allows the model to uncover more patterns, because more input information is available at each position. In other words, the model considers not only the tokens before a token of interest, but also the tokens after it.
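To illustrate the bidirectional idea without a deep learning framework, the sketch below runs a recurrent cell over the sequence in both directions and concatenates the two outputs at each position. For brevity it uses a plain tanh RNN cell as a stand-in for an LSTM cell. The weights are random and all names and sizes are illustrative.

```python
import numpy as np


def run_rnn(inputs, W_x, W_h, b):
    """Run a tanh RNN over inputs of shape (seq_len, input_dim)."""
    hidden = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in inputs:
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        outputs.append(hidden)
    return np.stack(outputs)


def run_bidirectional(inputs, forward_params, backward_params):
    """Concatenate left-to-right and right-to-left states per position."""
    forward = run_rnn(inputs, *forward_params)
    # Process the reversed sequence, then flip back so index t lines up.
    backward = run_rnn(inputs[::-1], *backward_params)[::-1]
    return np.concatenate([forward, backward], axis=1)


rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 8, 4


def init_params():
    return (rng.normal(size=(hidden_dim, input_dim)) * 0.1,
            rng.normal(size=(hidden_dim, hidden_dim)) * 0.1,
            np.zeros(hidden_dim))


forward_params, backward_params = init_params(), init_params()
inputs = rng.normal(size=(seq_len, input_dim))
context = run_bidirectional(inputs, forward_params, backward_params)
print(context.shape)  # (6, 8)
```

The first half of each output vector summarizes the tokens up to that position, and the second half summarizes the tokens from that position to the end of the sentence.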

Time Distributed Layer

We are dealing with a many-to-many RNN architecture, where we expect an output for every input in the sequence, for example (a1 → b1, a2 → b2, …, an → bn), where a and b are the input and output at each time step. The TimeDistributedDense layer lets you apply the same Dense (fully connected) operation to the output at every time step. Without it, you would only get one final output.

Decoding: The ultimate step. Once we have a vector representing each word, we can use it to make a prediction.

Computing Tags Scores: At this stage, each word is associated with a vector that captures information from the meaning of the word and its context. Let’s use it to make a final prediction. We can use a fully connected neural network to get a vector where each entry corresponds to a score for each tag.
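As a small sketch of this scoring step, the snippet below applies one fully connected layer to every word vector, giving one score per tag for each word. The context vectors are random stand-ins for the bi-LSTM outputs, and the tag list and sizes are chosen for illustration.

```python
import numpy as np

tags = ["O", "B-PNAME", "I-PNAME", "B-PFEATURE", "I-PFEATURE"]

rng = np.random.default_rng(0)
seq_len, context_dim = 6, 8

# Stand-in for the bi-LSTM outputs: one context vector per word.
context = rng.normal(size=(seq_len, context_dim))

# One weight matrix and bias, shared by all time steps.
W = rng.normal(size=(context_dim, len(tags)))
b = np.zeros(len(tags))

# Row t of `scores` holds the score of each tag for word t.
scores = context @ W + b
print(scores.shape)  # (6, 5)
```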

Decoding the scores: We then have two options to make our final prediction.

  1. Softmax
  2. Linear-chain CRF

With both options, we want to compute the probability of a tag sequence and find the sequence with the highest probability.

Choosing softmax for the final prediction has a disadvantage: it makes local choices, picking each tag independently without taking the neighboring tags into account. We will use a CRF to find the best sequence of tags for a given input.
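Here is a tiny illustration of what "local choices" can lead to. The scores are made up so that picking the most probable tag for each token separately produces an I- tag directly after O, which is not a valid sequence under the B-/I- scheme.

```python
import math

tags = ["O", "B-PNAME", "I-PNAME"]


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores: one row per token, one column per tag.
token_scores = [
    [2.0, 1.5, 0.1],  # token 1: "O" scores highest
    [0.2, 0.3, 2.5],  # token 2: "I-PNAME" scores highest
]

greedy = []
for row in token_scores:
    probs = softmax(row)
    greedy.append(tags[probs.index(max(probs))])

print(greedy)  # ['O', 'I-PNAME']
```

A CRF avoids this by also scoring the transition between neighboring tags, so an unlikely pair such as O followed by I-PNAME can be penalized.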

Now we are in a position to understand the Deep Learning architecture used to do sequence tagging tasks.

How scoring works with linear chain CRF

The images below show how a sentence is scored with a linear-chain CRF.

The path PNAME-PNAME-PFEATURE-PFEATURE has a score of 1 + 15 + 2 + 8 + 2 + 11 + 1 + 5 = 45.

The path PNAME-O-PFEATURE-PFEATURE has a score of 1 + 15 + 4 + 12 + 2 + 11 + 1 + 5 = 51.

Between these two possible paths, the one with the best score is  PNAME-O-PFEATURE-PFEATURE.
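The two sums above can be reproduced with a short scoring function. Since the figure is not reproduced here, the assignment of each number to a start, emission, or transition score is our assumed reading of the sums (start score first, then alternating emission and transition scores). Only the entries needed for these two paths are filled in.

```python
def path_score(path, start, emissions, transitions):
    """Sum the start score, one emission score per word, and one
    transition score per pair of adjacent tags."""
    score = start[path[0]] + emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[(path[t - 1], path[t])]
        score += emissions[t][path[t]]
    return score


# Assumed reading of the numbers in the sums above.
start = {"PNAME": 1}
emissions = [
    {"PNAME": 15},           # word 1
    {"PNAME": 8, "O": 12},   # word 2
    {"PFEATURE": 11},        # word 3
    {"PFEATURE": 5},         # word 4
]
transitions = {
    ("PNAME", "PNAME"): 2,
    ("PNAME", "O"): 4,
    ("PNAME", "PFEATURE"): 2,
    ("O", "PFEATURE"): 2,
    ("PFEATURE", "PFEATURE"): 1,
}

path_a = ["PNAME", "PNAME", "PFEATURE", "PFEATURE"]
path_b = ["PNAME", "O", "PFEATURE", "PFEATURE"]
score_a = path_score(path_a, start, emissions, transitions)
score_b = path_score(path_b, start, emissions, transitions)
print(score_a, score_b)  # 45 51
```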

Now that we understand the scoring function of the CRF, we need to do 2 things:

  1. Find the sequence of tags with the best score.
  2. Compute a probability distribution over all the sequences of tags.
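Both steps can be done efficiently with dynamic programming instead of enumerating every possible sequence. The sketch below uses the Viterbi algorithm for the first step and the forward algorithm (in log space) for the normalization constant needed in the second step. The scores are made up, tags are referred to by index, and no end-of-sentence score is modeled.

```python
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def viterbi(start, emissions, transitions):
    """Return (best_score, best_path); scores are indexed by tag id."""
    n_tags = len(start)
    best = [start[j] + emissions[0][j] for j in range(n_tags)]
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(n_tags):
            prev = max(range(n_tags), key=lambda i: best[i] + transitions[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    last = max(range(n_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


def log_partition(start, emissions, transitions):
    """Log of the sum of exp(score) over all possible tag sequences."""
    n_tags = len(start)
    alpha = [start[j] + emissions[0][j] for j in range(n_tags)]
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j]
            + logsumexp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            for j in range(n_tags)
        ]
    return logsumexp(alpha)


tag_names = ["O", "PNAME", "PFEATURE"]
start = [0.5, 1.0, 0.2]
emissions = [[0.1, 2.0, 0.3], [1.2, 0.8, 0.1], [0.2, 0.1, 1.5]]
transitions = [[0.5, 0.2, 0.9], [1.0, 0.6, 0.4], [0.3, 0.1, 0.7]]

best_score, best_path = viterbi(start, emissions, transitions)
log_z = log_partition(start, emissions, transitions)
probability = math.exp(best_score - log_z)

print([tag_names[j] for j in best_path])  # ['PNAME', 'O', 'PFEATURE']
print(f"probability of best path: {probability:.3f}")
```

The probability of any sequence is exp(score) divided by the sum of exp(score) over all sequences, which is why the log of that sum is subtracted from the best score.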

There are a few other neural network architectures for NER, but the main idea remains the same as in the bi-LSTM + CRF algorithm discussed above:

  1. Bidirectional LSTM-CNNs (an implementation is available in Keras)
  2. Bidirectional LSTM-CNNs-CRF

Model Training

The platform used here is a no-code tool for training and deploying AI and machine learning models. In the remaining sections, we will build a model that extracts product names and product features from unstructured text.

Algorithm Selection

The platform provides relevant and optimized algorithms to train ML models depending on the ML template. Skyl provides these algorithms out of the box, so the user just has to select one from the drop-down. We will use the LSTM + CRF algorithm for our model.

Embedding Selection

We have already discussed the embedding layer and how it converts words into vectors. The platform provides different options for the embedding and for the dimension of the word vectors. We selected ELMo embeddings with a dimension of 512, i.e., each word is converted into a 512-dimensional vector.

Configure Hyperparameters for Training

The platform makes it easy to experiment with different batch sizes, numbers of epochs, LSTM units, and learning rates.

We run the training with the following hyperparameters:

  • Batch size: auto
  • Learning rate: 0.001
  • Number of epochs: 25
  • LSTM cell: 100

Optimizer and Loss Function

Error is calculated as the difference between the actual output and the predicted output. The function that is used to compute this error is known as Loss Function.

To minimize the prediction error, or loss, the model updates its parameters W as it sees the examples of the training set. The loss determines the cost (penalty) of the model's predictions, so minimizing the error is also called minimizing the cost function.

Picking the right optimizer and loss function with the right parameters can help you squeeze the last bit of accuracy out of your neural network model. The platform provides an easy selection of optimizers and loss functions. We will use the Adam optimizer with the CRF log-likelihood as the loss function.
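For reference, the CRF log-likelihood of a sentence x with gold tag sequence y is commonly written as follows. The symbols s, Z, and \mathcal{L} are notation we introduce here: s(x, y) is the score of a tag sequence (the sum of its emission and transition scores), Z(x) sums over all possible tag sequences y', and \mathcal{L} is the quantity the optimizer minimizes.

```latex
\log p(y \mid x) = s(x, y) - \log Z(x),
\qquad
Z(x) = \sum_{y'} \exp\bigl(s(x, y')\bigr),
\qquad
\mathcal{L} = -\log p(y \mid x)
```

Minimizing \mathcal{L} raises the score of the gold sequence relative to all other sequences.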

Performance Metrics

We need concrete metrics to assess the model's performance. We will use precision, recall, and F1-score to evaluate the model.
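As a minimal sketch of how these three metrics can be computed for NER, the snippet below compares sets of gold and predicted entities. Counting an entity as correct only when its position and type match exactly is a common convention that we assume here. The entity tuples (sentence index, start, end, type) are made up.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores from exact matches between two sets."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical entities: (sentence index, start token, end token, type).
gold = {(0, 0, 2, "PNAME"), (0, 4, 6, "PFEATURE"), (1, 1, 2, "PNAME")}
predicted = {(0, 0, 2, "PNAME"), (0, 3, 6, "PFEATURE")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.50 recall=0.33 f1=0.40
```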

Training Results

After training has completed, the platform generates reports on the test set based on the performance metrics we selected while configuring the training.

By observing these performance metrics, we can tell whether the model is well trained, underfit, or overfit.

If the evaluation metrics are not satisfactory, the platform provides several options for improving the model. Here are a few things we can do:

  • Use a different pre-trained embedding (ELMo, BERT, etc.) for word representation if the model overfits.
  • Reduce the number of weights. Overfitting occurs mainly because the model is too complex; we reduced the model's complexity by reducing the LSTM dimensions.
  • Fine-tune the model further (epochs, layer dimensions, learning rate, etc.).
  • Select a different algorithm.

Model Deployment

Now that we have fine-tuned the model and the performance metrics on the test set are satisfactory, we can deploy it. The platform provides one-click deployment of the model to production: go to the ‘Deployment’ tab and select the ‘Deploy’ button.

After deploying the model, an inference API becomes available that can be integrated into your application.

You can check out the above model in the featured models section of the website.

The model manages to extract many entities of different types and complexities. Performance on product mentions and features is very satisfactory for camera reviews, since the model was trained only on a camera reviews dataset.