Calculate Candidate Compatibility Percentage for a Job Position (NLP) part 2

Patrizia Castagno
9 min read · Jan 14, 2024

This section is the continuation of: Calculate Candidate Compatibility Percentage for a Job Position (NLP)

It’s crucial to keep in mind that machines have no inherent understanding of characters, words, or sentences; they can only process numerical data. Text must therefore be encoded into a numerical form before a machine can take it as input or produce it as output.

In this scenario, Feature Encoding is essential: it transforms meaningful text into a numerical (vector) representation in which a machine can detect patterns. How much of the context and of the relationships between words and sentences survives depends on the encoding method chosen. Several Feature Encoding methods exist; for this particular case, we have opted for the Bag of Words (BoW) method.

What is Bag of Words (BoW)?

Bag of Words, commonly abbreviated as BoW, is a simplified text representation widely used in natural language processing (NLP) and information retrieval. In the BoW model, a document is represented as an unordered collection (a multiset, or “bag”) of its words: grammar and word order are disregarded, but the number of times each word occurs is kept.

Essentially, BoW cares only about which words occur in a document (and, in its counting form, how often), not about the order in which they appear.
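To make the point about word order concrete, here is a minimal sketch using Python’s standard `collections.Counter`. The helper name `bag_of_words` and the two example sentences are illustrative assumptions, not part of the original pipeline, and the tokenization is a deliberately naive whitespace split.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping, ignoring case and word order."""
    # Naive tokenization: lowercase, then split on whitespace.
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

# Same words with the same frequencies, so the two bags are identical
# even though the sentences mean different things.
print(a == b)      # True
print(a["the"])    # 2
```

The two sentences describe opposite events, yet they map to the same bag, which is exactly the information BoW gives up in exchange for simplicity.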

How does BoW operate?

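In its simplest, binary form, BoW builds a vector with one entry per vocabulary word: 1 if the word is present in the document and 0 if it is absent. In its counting form, each entry instead holds the number of times the word occurs in the document. Because a realistic vocabulary is large and any single document uses only a small part of it, the resulting encodings are high-dimensional and highly sparse. For example, suppose you have the following two short sentences: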

1. “I love programming.”
2. “Programming is fun.”

Now, let’s create a basic Bag of Words representation:

  1. Create a vocabulary by listing all unique words from the sentences, ignoring case and punctuation (so “programming” and “Programming” count as the same word): [“I”, “love”, “programming”, “is”, “fun”].
  2. Represent each sentence as a vector, indicating the presence or absence of each word:
  • Sentence 1: [1, 1, 1, 0, 0] (I: 1, love: 1, programming: 1, is: 0, fun: 0)
  • Sentence 2: [0, 0, 1, 1, 1] (I: 0, love: 0, programming: 1, is: 1, fun: 1)
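The two steps above can be sketched in plain Python. This is a minimal illustration rather than a definitive implementation: the function names (`tokenize`, `build_vocabulary`, `bow_vector`) and the `binary` flag are assumptions introduced here, and the tokenizer simply lowercases the text and keeps runs of ASCII letters, so the vocabulary comes out in lowercase.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Collect unique tokens in order of first appearance."""
    vocabulary = []
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

def bow_vector(text, vocabulary, binary=True):
    """Encode text against the vocabulary.

    binary=True  -> 1 if the word is present, 0 otherwise
    binary=False -> number of occurrences of each word
    """
    tokens = tokenize(text)
    if binary:
        return [1 if word in tokens else 0 for word in vocabulary]
    return [tokens.count(word) for word in vocabulary]

sentences = ["I love programming.", "Programming is fun."]
vocabulary = build_vocabulary(sentences)
vectors = [bow_vector(s, vocabulary) for s in sentences]

print(vocabulary)  # ['i', 'love', 'programming', 'is', 'fun']
for vector in vectors:
    print(vector)
# [1, 1, 1, 0, 0]
# [0, 0, 1, 1, 1]
```

Passing `binary=False` switches to the counting form, so a text such as "fun fun fun" would be encoded as `[0, 0, 0, 0, 3]` against the same vocabulary.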


