ANLP(3) - Word Vectorization and Similarity Metrics
An Algorithmic Approach to NLP
Vectorization
Vector semantics is a standard way to represent word meaning in NLP. The core idea is that two words that occur in very similar distributions (i.e., with similar surrounding words) tend to have similar meanings. For example, the sentences “Oscar barks a lot” and “Oscar wags his tail whenever he sees a tennis ball” probably made you think that Oscar is a dog; that inference came from the words surrounding “Oscar”, and capturing this kind of context is what vectorization aims to achieve. Vectorization is essentially the conversion of strings into numerical values for further processing.
Bag of Words
Bag of Words is the simplest way to vectorize text in NLP. The name “bag of words” arises because the order of words in a document does not matter in the vectorization process. Consider the small collection of documents shown below:
“There used to be Stone Age”
“There used to be Bronze Age”
“There used to be Iron Age”
“There was Age of Revolution”
“Now it is Digital Age”
Now if we were to make a simple list of the unique words in the collection, it would be: [“There”, ”was”, ”to”, ”be”, ”used”, ”Stone”, ”Bronze”, ”Iron”, ”Revolution”, ”Digital”, ”Age”, ”of”, ”Now”, ”it”, ”is”]
Now, we can represent each sentence as a list of counts of these words. For example:
“There used to be Bronze Age” = [1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
This is essentially a vector, and every part of the document collection can be represented in the same n-dimensional space (where n is the number of unique words in the vocabulary).
Here's a simple example of the bag-of-words vectorization approach:
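The following is a minimal sketch in plain Python (no libraries), using the five sentences above. It assumes whitespace tokenization and lowercasing; the vocabulary is sorted alphabetically here, so the exact positions in the vectors differ from the hand-written list above, but the idea is the same.

```python
# Build a bag-of-words representation for the example sentences above.
documents = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]

# Vocabulary: every unique (lowercased) word in the collection.
vocabulary = sorted({word.lower() for doc in documents for word in doc.split()})

def bag_of_words(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

for doc in documents:
    print(doc, "->", bag_of_words(doc))
```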
TF-IDF Algorithm
TF-IDF stands for Term Frequency-Inverse Document Frequency. To understand this, we need to first understand the terms “Term Frequency” and “Inverse Document Frequency”.
Term Frequency (TF)
Term frequency is simply a measure of how frequent a word is in a document. It can be defined as:

TF(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total number of terms in } d}
But in documents with huge amounts of text, it is better to scale the TF down to make calculations easier and more stable, so we use the logarithm of the raw count. However, since we cannot take the log of 0, we add 1 to the count. The new TF becomes:

TF(t, d) = \log\left(1 + \text{count of } t \text{ in } d\right)
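A rough sketch of both definitions, assuming whitespace tokenization and log base 10 (any base works as long as it is used consistently):

```python
import math

def term_frequency(term, document):
    """Raw TF: occurrences of the term divided by the total number of words."""
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

def log_term_frequency(term, document):
    """Log-scaled TF: log(1 + raw count), so a count of 0 stays at 0."""
    words = document.lower().split()
    return math.log10(1 + words.count(term.lower()))

print(term_frequency("age", "There used to be Stone Age"))      # 1/6 ≈ 0.167
print(log_term_frequency("age", "There used to be Stone Age"))  # log10(2) ≈ 0.301
```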
Inverse Document Frequency (IDF)
Another idea for determining the relevance of a word is inverse document frequency. It is based on the notion that words that appear in fewer documents are more informative and significant. IDF is expressed by the following formula:

IDF(t) = \log\left(\frac{N}{\text{number of documents containing } t}\right)

where N is the total number of documents in the collection.
Usually, words like “the” or “is” are the most frequent across documents, yet we know they are the least significant. IDF helps identify such words by giving them a lower score, while words that appear in only a few documents receive a higher score.
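A minimal sketch along the same lines. This version assumes the term occurs in at least one document; many implementations add 1 to the denominator (or smooth in other ways) to avoid division by zero.

```python
import math

def inverse_document_frequency(term, documents):
    """IDF: log of (total documents / documents that contain the term)."""
    containing = sum(1 for doc in documents if term.lower() in doc.lower().split())
    return math.log10(len(documents) / containing)

documents = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]

print(inverse_document_frequency("age", documents))      # 0.0: "Age" appears in every document
print(inverse_document_frequency("digital", documents))  # log10(5/1) ≈ 0.699: rare, so informative
```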
TF-IDF
The TF-IDF score is simply the product of TF and IDF; it lowers the scores of common words that appear in all documents (such as “the”).
(Example: TF, IDF, and TF-IDF values for the sentence “The TFIDF Vectorization Process is Beautiful Concept” across different documents.)
This vectorization algorithm essentially assigns higher scores to the important words in a document.
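In practice, libraries such as scikit-learn bundle all of this into a single vectorizer. A small sketch follows; note that TfidfVectorizer applies smoothing and L2 normalization by default, so the raw numbers differ from the plain formulas above, but the relative ranking of words is comparable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)  # sparse matrix: one row per document

# Show each word with its TF-IDF weight in the last document ("Now it is Digital Age").
for word, weight in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[-1]):
    if weight > 0:
        print(f"{word}: {weight:.3f}")  # "age" scores lowest; the rarer words score higher
```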
Similarity Metrics
Through vectorization, words and sentences are converted into high-dimensional vectors, structured so that a vector's geometric position carries meaning. For a well-known example, take the vector for King, subtract the vector for Man, and add the vector for Woman: the closest matching vector to the result is Queen.
We may apply the same approach to larger sequences, such as phrases or paragraphs, and discover that proximity/orientation between those vectors corresponds to similar meaning.
So, similarity is vital, and we'll go over the three most commonly used metrics for calculating it.
Euclidean Distance
Euclidean distance is the simplest of the similarity metrics. Let us take three vectors: a = (0.01, 0.08, 0.11), b = (0.01, 0.07, 0.1), c = (0.91, 0.57, 0.6).
Clearly, a and b are close to each other while c is distant from both. We can quantify this with the Euclidean distance formula, where a greater distance implies less similarity:

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
Applying this to the example above gives d(a, b) ≈ 0.0141, d(a, c) ≈ 1.136, d(c, b) ≈ 1.145.
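A quick sketch of the same calculation in plain Python:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

a = (0.01, 0.08, 0.11)
b = (0.01, 0.07, 0.10)
c = (0.91, 0.57, 0.60)

print(round(euclidean_distance(a, b), 4))  # 0.0141 -> very similar
print(round(euclidean_distance(a, c), 4))  # 1.1359 -> dissimilar
print(round(euclidean_distance(c, b), 4))  # 1.1446 -> dissimilar
```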
Dot Product
One disadvantage of Euclidean distance is that it does not take orientation into account in its computation; it is focused purely on magnitude. This is where our other two measures come in handy. The dot product is the first of these.
The dot product takes into account vector magnitude as well as direction (orientation). We care about orientation because two vectors pointing in the same direction can represent the same meaning even when their magnitudes differ (as we'll see).
For example, we might discover that the magnitude of a vector correlates with how often the term it represents appears in our dataset. The word hi has the same meaning as hello, but this may not be reflected by magnitude if our training data contains 1,000 instances of hi and just two instances of hello. As a result, the orientation of vectors is often regarded as being just as important as, if not more important than, their magnitude.
The dot product considers the angle between the vectors:

a \cdot b = \sum_{i} a_i b_i = \lVert a \rVert \, \lVert b \rVert \cos\theta

When the angle θ is close to 0, the cos θ component of the expression is close to 1. It is close to 0 when the angle is near 90° (orthogonal/perpendicular vectors), and close to -1 when the angle is near 180°. So the cos θ component boosts the result when the angle between the two vectors is small, and a higher dot product corresponds to vectors that are more closely aligned in orientation (a short sketch follows the list below).
Essentially:
- Two vectors that point in a similar direction return a positive dot-product.
- Two perpendicular vectors return a dot-product of zero.
- Vectors that point in opposing directions return a negative dot-product.
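A short sketch of these three cases, using the example vectors from the Euclidean distance section; the perpendicular and opposing vectors are made up just to illustrate the zero and negative cases.

```python
def dot_product(p, q):
    """Sum of element-wise products; grows with magnitude and with alignment."""
    return sum(pi * qi for pi, qi in zip(p, q))

a = (0.01, 0.08, 0.11)
b = (0.01, 0.07, 0.10)
c = (0.91, 0.57, 0.60)

print(round(dot_product(a, b), 4))   # 0.0167  -> positive: similar direction, small magnitudes
print(round(dot_product(a, c), 4))   # 0.1207  -> positive, inflated by c's larger magnitude
print(dot_product((1, 0, 0), (0, 1, 0)))                # 0       -> perpendicular vectors
print(round(dot_product(a, (-0.01, -0.07, -0.10)), 4))  # -0.0167 -> opposing directions
```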
Cosine Similarity
In cosine similarity, only the orientations of the vectors are considered, independent of their magnitudes. We look only at the angle between two vectors and ignore magnitude entirely; it is simply a normalized dot product:

\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
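A minimal sketch, reusing the example vectors from above:

```python
import math

def cosine_similarity(p, q):
    """Dot product normalized by both magnitudes; depends only on orientation."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi ** 2 for pi in p))
    norm_q = math.sqrt(sum(qi ** 2 for qi in q))
    return dot / (norm_p * norm_q)

a = (0.01, 0.08, 0.11)
b = (0.01, 0.07, 0.10)
c = (0.91, 0.57, 0.60)

print(round(cosine_similarity(a, b), 3))  # ≈ 1.0  -> nearly identical orientation
print(round(cosine_similarity(a, c), 3))  # ≈ 0.72 -> different orientation, despite c's large magnitude
```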
The three metrics discussed each have their own pros and cons, and any of them can be used depending on the task being performed.