

Machine learning and NLP basics for economics majors — 1

TL;DR: Spoken language is hard and ambiguous, so machines can’t understand it. They can read numbers, though, so we just need to translate the text into numbers. Today we use semantic word embeddings, so machines can see relations between words.

Photo of a notepad and pencil by Kelly Sikkema on Unsplash

This article was inspired by my last conversation with Olga, a friend studying at the London School of Economics, hence the title. Although Olga is not an engineer, we had a nice talk about machine learning basics and how these systems work. As I want to have more conversations like that one, here is a quick overview of what is going on in the NLP world, so next time you have a beer with a nerdy friend from high school, you can impress them by knowing phrases like “word embeddings”.

What is NLP?

So what is all the hype about? Natural Language Processing is the field concerned with translating language spoken by humans into a form machines can understand, and vice versa. Thanks to advancements in the field we have Google Translate, voice assistants, and phone autocorrect that adapts to our texting style. With these and many other use cases, NLP is a pretty integral part of our lives by now.

Here comes the tricky part. Human language is extremely ambiguous. Sometimes it is hard even for us to understand somebody else without context, and it is much worse for a computer. Let’s talk about grammar.

In linguistics, grammar (from Greek: γραμματική) is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language. The term refers also to the study of such rules, and this field includes phonology, morphology, and syntax, often complemented by phonetics, semantics, and pragmatics. — Wikipedia

Universal grammar and hierarchy

For a long time, the ultimate goal of computational linguists was discovering a Universal Grammar, a term coined by Noam Chomsky. The existence of UG would mean that some grammar rules are innate to humans, independent of our sensory experience. Chomsky also worked on a theory that takes the concept further. His hierarchy was meant to be a pattern, a mathematical formalism that would fit all the correct expressions of a language and assign them roles and types. The existence of such a tool would allow us to parse the language. Parsing means going word by word, checking the text against a set of rules and determining what it means. It would work great given how current computers operate.
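To make parsing a bit less abstract, here is a tiny sketch with the NLTK library and a toy grammar I made up for this post (nothing close to what real English would need):

import nltk

# A toy grammar: a handful of rules saying which word sequences are valid
# and what role each word plays. Real languages refuse to fit into rules this neat.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'ball'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the ball".split()):
    print(tree)  # prints something like (S (NP (Det the) (N dog)) (VP (V chased) ...))

Every word gets a role and the sentence gets a structure. The dream was that something like this, scaled up enormously, could handle anything we say.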

Long story short, this approach failed for natural languages: they are too messy to squeeze into such neat rules. Which is a pity, because it means we need to find another way to make computers understand our languages. Here come the statisticians.

Vectors and word embeddings

First, one has to understand that text by itself is meaningless to a computer. The machine only performs numerical operations, so we need to vectorize the text first. Vectorization takes a chunk of text and translates it into a set of numbers. Here is an example from Kanye West’s song Graduation Day:

I'm no longer confused but don't tell anybody
I'm about to break the rules but don't tell anybody
I got something better than school but don't tell anybody
My momma would kill me but don't anybody

Let’s assume that every line is a separate phrase in a document. How do we transform these lyrics into vectors? There are multiple methods. I will point out three: two basic ones, and one considered state of the art in AI research.

Bag of words

Bag of words is the most basic vectorization algorithm. It simply counts occurrences of every word in a phrase and then represents the phrase as a vector of those counts. Let’s see the result of computing it in Python. You can totally skip this part if you’re not familiar with the language.

First, let’s do the imports and define the text we’ll be working on.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
text = ["I'm no longer confused but don't tell anybody",
        "I'm about to break the rules but don't tell anybody",
        "I got something better than school but don't tell anybody",
        "My momma would kill me but don't anybody"]

Now let’s use the vectorizer and see the result of the computation.

# Count how many times each word appears in every phrase
bag_vectorizer = CountVectorizer()
bag_data = bag_vectorizer.fit_transform(text)
# One row per phrase, one column per word in the vocabulary
pd.DataFrame(bag_data.toarray(), columns=bag_vectorizer.get_feature_names_out())
   about  anybody  better  break  ...  than  the  to  would
0      0        1       0      0  ...     0    0   0      0
1      1        1       0      1  ...     0    1   1      0
2      0        1       1      0  ...     1    0   0      0
3      0        1       0      0  ...     0    0   0      1

[4 rows x 22 columns]

TF-IDF

TF-IDF stands for “term frequency–inverse document frequency”. It weighs each word by how frequent it is in a given phrase compared to how frequent it is across all the phrases. The goal is to play down words that appear everywhere (like “but” or “anybody” here) and highlight the ones that make a phrase distinctive. Skipping the imports now, here is an example written in Python:

# Compute a TF-IDF weight for every word in every phrase
tfidf_vectorizer = TfidfVectorizer()
tfidf_data = tfidf_vectorizer.fit_transform(text)
pd.DataFrame(tfidf_data.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
      about   anybody    better  ...       the        to     would
0  0.000000  0.253897  0.000000  ...  0.000000  0.000000  0.000000
1  0.400823  0.209166  0.000000  ...  0.400823  0.400823  0.000000
2  0.000000  0.209166  0.400823  ...  0.000000  0.000000  0.000000
3  0.000000  0.216367  0.000000  ...  0.000000  0.000000  0.414622

[4 rows x 22 columns]
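To get a feel for where those numbers come from, here is a rough, hand-rolled TF-IDF for a single word, reusing the text list defined above (a simplified sketch: scikit-learn additionally smooths the IDF and normalizes each row, so its exact values differ):

import math

def tf_idf(word, phrase, phrases):
    tf = phrase.split().count(word) / len(phrase.split())  # how often the word appears in this phrase
    df = sum(1 for p in phrases if word in p.split())       # how many phrases contain the word at all
    idf = math.log(len(phrases) / df)                        # rare across phrases -> high idf
    return tf * idf

lyrics = [line.lower() for line in text]
print(tf_idf("anybody", lyrics[0], lyrics))  # 0.0: "anybody" is in every phrase, so it carries no signal
print(tf_idf("school", lyrics[2], lyrics))   # about 0.14: "school" only appears in the third phrase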

Having this numerical representation, a computer can run algorithms on the text and start reasoning about it. We will cover the most interesting use cases in the second part.
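As a tiny taste of what that means (a side note from me, feel free to skip), one line of scikit-learn applied to the tfidf_data computed above tells us how similar the phrases are to each other:

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise similarity between the four phrase vectors:
# 1.0 on the diagonal (a phrase compared with itself), lower values elsewhere
print(pd.DataFrame(cosine_similarity(tfidf_data)).round(2))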

Word2vec and semantic word embeddings

The two previous algorithms were mostly just examples of what vectorization is all about; by themselves they don’t convey that much useful information about a phrase. Our voice assistants wouldn’t be very communicative with just them. Here comes the state-of-the-art family of algorithms: semantic word embeddings. The pioneering algorithm in this approach was word2vec, originally developed by a team led by the Czech computer scientist Tomas Mikolov.

We won’t cover it in depth, as it is much, much more complicated than the previous two. However, you can think of it as a way of translating words into vectors together with their similarities. For example, if the vector for the word “sun” is [1, 0.8, 0, 0, 0.5], the vector for the word “moon” would be much closer to it (let’s say [1, 0.7, 0, 0, 0.6]) than a vector for “grass” (let’s say [0, 1, 1, 0.6, 0.4]). By their premise, semantic word embeddings allow us to perform operations on the vectors that give results similar to our human understanding. One example is:

Paris - France + Poland = Warsaw
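What “closer” means here is usually measured with cosine similarity. Here is a quick check using the made-up vectors from the paragraph above (toy numbers for illustration, not real word2vec output):

import numpy as np

def cos_sim(a, b):
    # Cosine of the angle between two vectors: 1.0 means they point the same way
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sun = np.array([1, 0.8, 0, 0, 0.5])
moon = np.array([1, 0.7, 0, 0, 0.6])
grass = np.array([0, 1, 1, 0.6, 0.4])

print(cos_sim(sun, moon))   # ~0.99, very similar
print(cos_sim(sun, grass))  # ~0.46, not so much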

Word2Vec optimizes for preserving those relationships, which makes later computations on the vectors more useful and efficient. For a more detailed explanation of the algorithm, I highly recommend the video by Python Programmer.
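If you would like to play with such analogies yourself, here is a rough sketch using the gensim library and a set of pretrained GloVe vectors (an assumption on my side: it downloads roughly 130 MB on first use, and vectors trained with word2vec work the same way):

import gensim.downloader as api

# Load pretrained word vectors (downloaded on first use)
model = api.load("glove-wiki-gigaword-100")

# "paris" - "france" + "poland" should land near "warsaw"
print(model.most_similar(positive=["paris", "poland"], negative=["france"], topn=3))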

Going further

In this first part we covered word embeddings: why we need them and what they look like. In the second part, I will cover the most popular use cases of machine learning in NLP and try to explain why they work.

For now, I hope you enjoyed the article and learned something new. If you want to stay in touch or give some feedback, I would greatly appreciate a message on Twitter or Instagram.

Have a great day,

Wojtek

Huge thanks to Stanisław for correcting me on some mistakes. (If you spotted one, please let me know!)