Mathematicians see the world differently. I built a One Hot Encoder using an identity matrix.
Updated: May 24, 2021
The notebook can be found here:
Recently I was asked to do some text cleanup for a Natural Language Processing assignment. We were asked to build the functions ourselves rather than rely on libraries like NLTK.
Eventually we were asked to build a One Hot Encoder to vectorize the words in our text. So I decided to start with an identity matrix.
[Image: an identity matrix, courtesy of https://www.onlinemathlearning.com/]
I chose a simple sentence that could prove that my function was working as desired.
sentence = "Apple baNana Pear orange, pear!"
I chose this nonsense sentence to prove a few things: repeated words and different capitalizations should not produce extra entries, and punctuation should not end up attached to the words in the sentence.
Here is my logic flow for the problem.
Turn the sentence into tokens.
Make the tokens lowercase.
Find the unique set of words in my sentence.
Find the number of words in my set. (Used to build my identity matrix)
Build the identity matrix.
Save it to a DataFrame with the index set to the unique words in my set.
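The steps above can be sketched in Python like this. This is a minimal sketch, not the notebook's exact code: the tokenizer, the sorted word order, and the variable names are my assumptions.

```python
import string
import numpy as np
import pandas as pd

sentence = "Apple baNana Pear orange, pear!"

# Steps 1-3: tokenize, strip punctuation, lowercase, then take the unique set
tokens = [t.strip(string.punctuation).lower() for t in sentence.split()]
unique_words = sorted(set(tokens))  # sorting gives a stable, reproducible order

# Steps 4-5: the size of the set is the dimension of the identity matrix
n = len(unique_words)
identity = np.eye(n, dtype=int)

# Step 6: store it in a DataFrame indexed by the unique words
encoder = pd.DataFrame(identity, index=unique_words)
print(encoder)
```

Each row of the DataFrame is now the one-hot vector for the word in its index, so encoding becomes a simple row lookup.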
This gives me the following:

But you may notice that the first column is missing. The reason is that banana does not need its own column in the dataset: it can still be uniquely represented as the vector [0, 0, 0].
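Dropping that first column might look like the following sketch. Note that in my run above, set iteration happened to put "banana" first; here I hard-code that order purely for illustration.

```python
import numpy as np
import pandas as pd

unique_words = ['banana', 'apple', 'orange', 'pear']  # hypothetical word order
encoder = pd.DataFrame(np.eye(len(unique_words), dtype=int), index=unique_words)

# Drop the first column: the first indexed word no longer needs its own
# column, since it is the only word represented by the all-zero vector.
reduced = encoder.iloc[:, 1:]
print(reduced.loc['banana'].tolist())
```

This is the same trick used to avoid the dummy-variable trap in regression: n categories only need n - 1 indicator columns.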
“What I have essentially built is a key that I can quickly reference with the index”
Then I simply take my sentence, go token by token, match each token to the index, grab the corresponding row, and flatten it into a list.
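That lookup step could be sketched like this. Again, a self-contained illustration under my own assumptions (sorted word order, full identity matrix, my variable names), not the notebook's exact code.

```python
import string
import numpy as np
import pandas as pd

sentence = "Apple baNana Pear orange, pear!"
tokens = [t.strip(string.punctuation).lower() for t in sentence.split()]
unique_words = sorted(set(tokens))
encoder = pd.DataFrame(np.eye(len(unique_words), dtype=int), index=unique_words)

# Token by token: look each token up in the index, grab its row,
# and flatten it into a plain Python list
vectors = [encoder.loc[token].tolist() for token in tokens]
for token, vec in zip(tokens, vectors):
    print(token, vec)
```

Because the index is a key into the matrix, repeated words like "pear" map to the same vector without any extra bookkeeping.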
Now I have turned my words into one hot encoded vectors as required.