TL;DR: Human language is hard and ambiguous, so machines can’t understand it directly. They can work with numbers, though, so we just need to translate the text into numbers. Today we use semantic word embeddings, which let machines see relations between words.
This article was inspired by my last conversation with Olga - a friend studying at the London School of Economics, hence the title. Although Olga is not an engineer, we had a nice talk about machine learning basics and how these systems work. As I want to have more conversations like this one, here is a quick overview of what is going on in the NLP world, so next time you have a beer with a nerdy high school friend, you can impress them by knowing phrases like “word embeddings”.
What is NLP
So what is all the hype about? Natural Language Processing is a field whose purpose is to translate a language spoken by humans into one understood by machines, and vice versa. Thanks to advancements in the field we have Google Translate, voice assistants, and phone autocorrect that adjusts to our texting styles. Among many other use cases, NLP is a pretty integral part of our lives by now.
Here comes the tricky part. Human language is extremely ambiguous. Sometimes it is hard for us to understand somebody else without the context and it is even worse for a computer. Let’s talk about grammar.
In linguistics, grammar (from Greek: γραμματική) is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language. The term refers also to the study of such rules, and this field includes phonology, morphology, and syntax, often complemented by phonetics, semantics, and pragmatics. — Wikipedia
Universal grammar and hierarchy
For a long time, the ultimate goal of computational linguists was discovering a Universal grammar, a term coined by Noam Chomsky. The existence of UG would mean that there are grammar rules innate to humans, independent of our sensory experience. Chomsky also worked on a theory that would take the concept further. His hierarchy, which classifies formal grammars by how expressive they are, was meant to provide a pattern, a mathematical formalism that would fit all the correct expressions of a language and assign them roles and types. The existence of such a tool would allow us to parse the language. Parsing means going word by word, checking the text against a set of rules, and determining what it means. It would work great given how current computers operate.
Long story short, the Chomsky hierarchy concept did not work out for natural languages. That is a pity, because it means that we need to find another way to make computers understand our languages. Here come the statisticians.
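The word-by-word rule checking described above can be sketched in a few lines of Python. This is only a toy illustration: the `LEXICON`, the `RULES`, and the `parse` function are hypothetical names made up for this example, and a real grammar would need vastly more rules.

```python
# Hypothetical toy lexicon: maps each known word to a grammatical role.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN",
    "sees": "VERB", "chases": "VERB",
}

# Hypothetical rule set: the only sentence shapes this toy grammar accepts.
RULES = {
    ("DET", "NOUN", "VERB"),
    ("DET", "NOUN", "VERB", "DET", "NOUN"),
}

def parse(sentence):
    """Go word by word, look up each role, then check the roles against the rules."""
    roles = []
    for word in sentence.lower().split():
        if word not in LEXICON:
            return None  # unknown word: the rule-based approach gives up
        roles.append(LEXICON[word])
    return roles if tuple(roles) in RULES else None

print(parse("The dog chases a cat"))  # roles are assigned, sentence accepted
print(parse("Dog the chases"))        # known words, but no rule fits
print(parse("The dog barks"))         # "barks" is missing from the lexicon
```

Every new word or sentence shape needs another hand-written entry, which hints at why this approach struggles with real, messy language.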
Vectors and word embeddings
First, one has to understand that the text itself is meaningless for a computer. The machine only performs numerical operations, so we need to vectorize the text first. The operation of vectorizing takes a chunk of text and translates it to a set of numbers. Here is an example from Kanye West’s song Graduation Day:
I'm no longer confused but don't tell anybody
I'm about to break the rules but don't tell anybody
I got something better than school but don't tell anybody
My momma would kill me but don't anybody
Let’s assume that every line is a separate phrase in a document. How do we transform these lyrics into vectors? There are multiple methods. I will point out three: two basic ones, and one considered state of the art in AI research.
Bag of words
Bag of words is the most basic vectorization algorithm. It just counts occurrences of every word in the phrase and then represents the phrase as a vector of those occurrences. Let’s see the result of computing it in Python. You can totally ignore this part if you’re not familiar with it.
First let’s do imports and define the text that we’ll be working on.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

text = [
    "I'm no longer confused but don't tell anybody",
    "I'm about to break the rules but don't tell anybody",
    "I got something better than school but don't tell anybody",
    "My momma would kill me but don't anybody",
]
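Now let’s use the vectorizer to see the results of the computation:
bag_vectorizer = CountVectorizer()
# Learn the vocabulary and count the words in every phrase
bag_data = bag_vectorizer.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(bag_data.toarray(), columns=bag_vectorizer.get_feature_names_out())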
4 rows × 22 columns
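To see what the counting amounts to, here is a minimal pure-Python sketch of the same idea. It is not how scikit-learn is implemented; the `tokenize` helper is a hypothetical name, and its regular expression is an assumption meant to mimic `CountVectorizer`'s default behaviour of lowercasing and dropping one-letter tokens.

```python
import re
from collections import Counter

text = [
    "I'm no longer confused but don't tell anybody",
    "I'm about to break the rules but don't tell anybody",
    "I got something better than school but don't tell anybody",
    "My momma would kill me but don't anybody",
]

def tokenize(phrase):
    # Lowercase, then keep runs of two or more word characters
    # (so "I'm" disappears and "don't" becomes "don").
    return re.findall(r"\b\w\w+\b", phrase.lower())

# The vocabulary is every distinct token, sorted alphabetically.
vocabulary = sorted({token for phrase in text for token in tokenize(phrase)})

# Each phrase becomes a list of counts, one slot per vocabulary word.
vectors = []
for phrase in text:
    counts = Counter(tokenize(phrase))
    vectors.append([counts[word] for word in vocabulary])

print(len(vectors), "rows x", len(vocabulary), "columns")
print(dict(zip(vocabulary, vectors[0])))
```

With this tokenization the sketch also ends up with 4 rows and 22 columns, one column per distinct word in the lyrics.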
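TF-IDF
TF-IDF stands for “term frequency–inverse document frequency”. It weighs each word by how often it appears in a phrase against how many phrases in the whole document contain it. The goal here is to distinguish words that are popular but not too popular: frequent in one phrase, yet not present in every other phrase. Skipping the imports now, here is an example written in Python:
tfidf_vectorizer = TfidfVectorizer()
# Learn the vocabulary and compute the TF-IDF weight of every word in every phrase
tfidf_data = tfidf_vectorizer.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(tfidf_data.toarray(), columns=tfidf_vectorizer.get_feature_names_out())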
4 rows × 22 columns
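The weighting itself can be sketched by hand. The snippet below assumes the smoothed formula that `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each phrase vector to unit length. The helper names (`tokenize`, `idf`, `tfidf_vector`) are hypothetical and exist only for this illustration.

```python
import math
import re
from collections import Counter

text = [
    "I'm no longer confused but don't tell anybody",
    "I'm about to break the rules but don't tell anybody",
    "I got something better than school but don't tell anybody",
    "My momma would kill me but don't anybody",
]

def tokenize(phrase):
    # Assumed to mimic the vectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", phrase.lower())

docs = [Counter(tokenize(phrase)) for phrase in text]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

def idf(word):
    # df = number of phrases containing the word; rarer words weigh more.
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(doc):
    # Term frequency times idf, then scale the vector to unit length.
    raw = [doc[word] * idf(word) for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

weights = dict(zip(vocabulary, tfidf_vector(docs[0])))
for word in ("confused", "tell", "anybody"):
    print(word, round(weights[word], 3))
```

In the first phrase, “confused” (found in one phrase only) gets a higher weight than “tell” (three phrases), which in turn outweighs “anybody” (all four phrases).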
Having this representation, a computer will be able to apply algorithms to reason about the text. We will cover that in the second part.
Word2vec and semantic word embeddings
As the two previous algorithms were mostly just examples of what vectorization is all about, they wouldn’t convey that much useful information about the phrase by themselves. Our voice assistants wouldn’t be so communicative with just them. Here comes the state-of-the-art algorithm family: semantic word embeddings. The pioneering algorithm in this approach was word2vec. It was originally developed by a team led by the Czech computer scientist Tomas Mikolov. We won’t cover it in depth, as it is much, much, much more complicated than the previous two. However, you can think about it as a way of translating words into vectors together with their similarities. For example, the vector for the word “moon” (let’s say [1,0.7,0,0,0.6]) would be much closer to the vector for the word “sun” than a vector for “grass” would be.
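One common way to measure how close two word vectors are is cosine similarity. The sketch below uses it on three toy vectors. Only the “moon” vector comes from the text; the “sun” and “grass” vectors are invented for illustration and are not real word2vec output.

```python
import math

# Only "moon" is taken from the article; "sun" and "grass" are
# made-up values chosen purely to illustrate the idea.
vectors = {
    "sun":   [1.0, 0.8, 0.0, 0.0, 0.5],
    "moon":  [1.0, 0.7, 0.0, 0.0, 0.6],
    "grass": [0.0, 0.0, 1.0, 0.9, 0.1],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths:
    # close to 1 means "pointing the same way", close to 0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sun_moon = cosine_similarity(vectors["sun"], vectors["moon"])
sun_grass = cosine_similarity(vectors["sun"], vectors["grass"])
print("sun vs moon: ", round(sun_moon, 3))
print("sun vs grass:", round(sun_grass, 3))
```

With these toy numbers, “sun” and “moon” score close to 1 while “sun” and “grass” score close to 0.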
By their premise, semantic word embeddings allow for performing operations on the vectors that give results similar to our human understanding. One example:
Paris - France + Poland = Warsaw
Word2vec optimizes for preserving those relationships, which allows for more efficient computations later on. For a more detailed explanation of the algorithm, I highly recommend the video by Python programmer.
In the first part we covered word embeddings: why we need them and what they look like. In the second part, I will cover the most popular use cases of machine learning in NLP and try to explain why they work.
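The Paris - France + Poland arithmetic can be tried on a handful of hand-made vectors. Everything below is hypothetical: the four dimensions (is it a capital, how French, how Polish, how German) and the `analogy` helper are assumptions for illustration. Real word2vec vectors have hundreds of dimensions with no such readable meaning.

```python
import math

# Hypothetical hand-made embeddings: [is_capital, french, polish, german].
embeddings = {
    "France":  [0.0, 1.0, 0.0, 0.0],
    "Paris":   [1.0, 1.0, 0.0, 0.0],
    "Poland":  [0.0, 0.0, 1.0, 0.0],
    "Warsaw":  [1.0, 0.0, 1.0, 0.0],
    "Germany": [0.0, 0.0, 0.0, 1.0],
    "Berlin":  [1.0, 0.0, 0.0, 1.0],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    # Skip the three input words so the answer is a new word.
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates,
               key=lambda word: cosine_similarity(embeddings[word], target))

print(analogy("Paris", "France", "Poland"))
```

Subtracting France from Paris leaves the “capital” part of the vector, and adding Poland moves it to the Polish capital, so the closest remaining word is Warsaw.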
Have a great day,
Huge thanks to Stanisław for correcting me on some mistakes. (If you spotted one, please let me know!)