Data is being generated at a massive rate across businesses. When data volumes are large, we want to look at subsets of the data (by applying filters) that might be of interest to us. Traditionally, these filters translate to database queries. A simple query for all profiles in IBM would look like:
SELECT * FROM profiles WHERE company = 'IBM'
Draup faces a very similar challenge when it comes to presenting its data to users in a meaningful manner. It has a large ecosystem of profiles which users can narrow down based on several filter parameters: company, location, skills, business function, and a few more.
Here is what the filter-based approach looks like.
When each category has only a few options, filtering is easy, but as the number of options per category grows, the complexity blows out of proportion.
An alternative approach for querying
At Draup we wanted to do away with this style of filtering and improve our user experience, so we worked on an alternative way of querying the platform. Users can now enter free-text queries and say goodbye to all the clutter created by filters. Here are a few examples:
- Show me data engineers skilled in python
- Someone who is located at san francisco with the qualification of a data scientist.
- Show me top executives in Amazon
- Adam from Microsoft, Redmond
So the first query should translate to the following SQL:
SELECT * FROM profiles WHERE JOB_TITLE = 'data engineer' AND SKILL = 'python'
We approached this problem from a machine learning viewpoint. In the ML world, it is known as Named Entity Recognition (NER).
Building a Named Entity Recognition System
Named entity recognition is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person and organization names, locations, dates, etc.
There are brilliant open-source models (1, 2, 3, 4) for NER, but they are very generic in nature. These models work well with general entity types like person names, locations, organizations, and dates, but at Draup we are concerned with much more. We have several other entities, like skills, sub-verticals, business functions, and level in the organization, which are not covered by these pre-trained models. So we concluded that we had to build our own NER system.
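To see the gap concretely, here is a minimal sketch using spaCy's small English pipeline (our choice for illustration; the models linked above are not named in this post):

```python
import spacy

# Generic pre-trained pipeline (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Show me data engineers skilled in python")
for ent in doc.ents:
    print(ent.text, ent.label_)

# A generic model may find standard entities like PERSON or ORG in other
# queries, but it has no SKILL or JOB_TITLE label for "data engineers"
# or "python", which is exactly what we need here.
```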
Building the dataset
This is usually the most crucial part of any machine learning (ML) project. ML follows a simple rule: "garbage in, garbage out", meaning an ML model is only as good as the data it is trained on. Keeping this in mind, we worked on generating as many examples as possible and came up with about 200 queries, which is a relatively small dataset for training a model. A careful study of user querying patterns gave us ideas for generating more data through augmentation: people tend not to capitalize important words in free-text queries, and many users pay no attention to punctuation but still expect the model to work. These insights helped us build a quick data-augmentation pipeline (sketched below) to create more training examples. All these efforts resulted in a total of 1,000 training examples.
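As a rough sketch of that idea (the helper below is hypothetical; a real pipeline would also carry the entity tags along with each variant), each seed query can be expanded into lowercased and punctuation-stripped variants:

```python
import string

def augment(query):
    """Expand one seed query into noisy variants that mimic how real
    users drop capitalization and punctuation."""
    variants = {query, query.lower()}
    no_punct = query.translate(str.maketrans("", "", string.punctuation))
    variants.update({no_punct, no_punct.lower()})
    return variants

print(augment("Show me top executives in Amazon."))
# {'Show me top executives in Amazon.', 'show me top executives in amazon.',
#  'Show me top executives in Amazon',  'show me top executives in amazon'}
```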
Choice of Modelling Technique
There are two major approaches to building an NER system:
- Traditional algorithms like conditional random fields (CRF)
- Deep learning based approaches
Deep-learning-based approaches work really well on text data when you have a large dataset; about 1,000 examples doesn't cut it. More recently, general language models like Google's BERT or OpenAI's GPT-2 have shown promising results on smaller datasets. However, these models are huge, and we felt they were overkill for our task. Another important drawback of deep learning compared to traditional approaches is that it is difficult to interpret and explain model behavior.
Conditional random fields, on the other hand, work quite well on NER tasks even when the data is limited.
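As an illustration, here is a minimal training sketch using the sklearn-crfsuite library (one popular Python CRF implementation; the library choice, the tag names TITLE/SKILL/COMPANY/OTHER, and the feature set are our own assumptions for the example):

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """Hand-crafted features for the token at position i."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Two toy training sentences with their gold tag sequences
sentences = [
    ("Show me data engineers skilled in python".split(),
     ["OTHER", "OTHER", "TITLE", "TITLE", "OTHER", "OTHER", "SKILL"]),
    ("Show me top executives in Amazon".split(),
     ["OTHER", "OTHER", "TITLE", "TITLE", "OTHER", "COMPANY"]),
]
X = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in sentences]
y = [tags for _, tags in sentences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)
print(crf.predict(X[:1]))  # tag sequence predicted for the first sentence
```

Each token is described by a small dictionary of hand-crafted features, and the model learns a weight for every feature/tag combination; inspecting those weights is part of what makes CRFs comparatively easy to explain.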
Conditional Random Fields Model
This section describes CRFs in some intuitive detail but involves some mathematics. You can choose to skip it.
A conditional random field model is used when we are working with sequences. In our case, the input is a sequence of words and the output is a sequence of entity tags.
Let's call the word sequence x̄ and the tag sequence ȳ.
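Concretely, for the first example query the two sequences could look like this (the tag names are illustrative):

```python
# x̄: the input word sequence
x = ["Show", "me", "data", "engineers", "skilled", "in", "python"]
# ȳ: the output entity-tag sequence, one tag per word
y = ["OTHER", "OTHER", "TITLE", "TITLE", "OTHER", "OTHER", "SKILL"]
```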
Also, let's define what we call feature functions: f(yᵢ₋₁, yᵢ, x̄, i)
Here the feature function takes 4 parameters:
1: yᵢ₋₁, the output tag at the previous position, i−1
2: yᵢ, the output tag at the current position, i
3: x̄, the entire input sequence
4: i, the current index in the sequence
To make things clearer, let's define a sample feature function.
f(yᵢ₋₁, yᵢ, x̄, i) = { 1 if both yᵢ₋₁ and yᵢ are TITLE and the current word is 'Engineers', else 0 }
As you can see, this is quite a descriptive feature function, and if we define many such feature functions, we can extract a lot of information from our text data. Here is another one.
f(yᵢ₋₁, yᵢ, x̄, i) = { 1 if yᵢ₋₁ is OTHER, yᵢ is TITLE, and the current word is capitalized, else 0 }
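These two indicator functions translate almost word for word into Python (a sketch; in practice CRF libraries generate thousands of such features automatically from per-token feature dictionaries like the one shown earlier):

```python
def f1(y_prev, y_curr, x, i):
    """1 if both the previous and current tags are TITLE and the
    current word is 'Engineers', else 0."""
    return int(y_prev == "TITLE" and y_curr == "TITLE" and x[i] == "Engineers")

def f2(y_prev, y_curr, x, i):
    """1 if the previous tag is OTHER, the current tag is TITLE and
    the current word is capitalized, else 0."""
    return int(y_prev == "OTHER" and y_curr == "TITLE" and x[i].istitle())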
After collecting a bunch of feature functions, we want to find a probability distribution: one that tells us the probability of every possible ȳ given x̄. The equation below defines this probability.
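In its standard linear-chain form (the textbook definition, with one learned weight λⱼ per feature function fⱼ and the denominator normalizing over all possible tag sequences ȳ′), this distribution is:

```latex
p(\bar{y} \mid \bar{x}) =
  \frac{\exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, \bar{x}, i) \right)}
       {\sum_{\bar{y}'} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y'_{i-1}, y'_i, \bar{x}, i) \right)}
```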