
Mathematicians see the world differently. I built a One Hot Encoder using an identity matrix.

Jason Ismail

Updated: May 24, 2021

The notebook can be found here:


Recently I was asked to do some text cleanup for a Natural Language Processing assignment. We were asked to build the functions ourselves and not rely on libraries like NLTK.


Eventually we were asked to build a One Hot Encoder to vectorize the words in our text. So I decided to start with an identity matrix.



I chose a simple sentence that could prove that my function was working as desired.


sentence = "Apple baNana Pear orange, pear!"

I chose this nonsense sentence to prove a few things. First, I should not get extra words when a word appears more than once or with different capitalization ("Pear" and "pear" must count as one word). Second, no punctuation should end up attached to the words ("orange," and "pear!" must become "orange" and "pear").
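A minimal sketch of a tokenizer that meets these requirements without NLTK, assuming a regular expression is an acceptable tool for the job. The function name `tokenize` is my own for illustration and is not taken from the notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    # [a-z]+ keeps runs of letters only, so commas and
    # exclamation marks never end up attached to a word.
    return re.findall(r"[a-z]+", text.lower())

sentence = "Apple baNana Pear orange, pear!"
tokens = tokenize(sentence)

print(tokens)               # ['apple', 'banana', 'pear', 'orange', 'pear']
print(sorted(set(tokens)))  # ['apple', 'banana', 'orange', 'pear']
```

The five tokens collapse to four unique words, which is the size the identity matrix needs.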


Here is my logic flow for the problem.

  • Turn the sentence into tokens.

  • Make the tokens lowercase.

  • Find the unique set of words in my sentence.

  • Find the number of words in my set. (Used to build my identity matrix)

  • Build the identity matrix.

  • Save it to a DataFrame with the index set as the unique words in my set.

This gives me the following:

But you may notice that I did not include the first column. The reason for this is that banana does not need its own column in the dataset, since it can be represented as the all-zero vector [0, 0, 0]. With four unique words, three columns are enough to tell every word apart.
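The steps above can be sketched roughly as follows, assuming NumPy for the identity matrix and a pandas DataFrame as the lookup table. Python does not keep set order stable across runs, so this sketch sorts the vocabulary and drops the banana column by name rather than by position. Those two choices are mine, made so the result is reproducible.

```python
import re

import numpy as np
import pandas as pd

sentence = "Apple baNana Pear orange, pear!"

# Tokenize and lowercase; the regex is an assumption, not the notebook's code.
tokens = re.findall(r"[a-z]+", sentence.lower())

# Unique words, sorted so the row order is the same on every run.
vocab = sorted(set(tokens))
n = len(vocab)  # 4 unique words

# n x n identity matrix: row i has a single 1 in column i.
identity = np.eye(n, dtype=int)

# Index the rows (and label the columns) by word.
key = pd.DataFrame(identity, index=vocab, columns=vocab)

# Drop one column; that word is then encoded as all zeros.
key = key.drop(columns="banana")

print(key.loc["banana"].tolist())  # [0, 0, 0]
print(key.loc["apple"].tolist())   # [1, 0, 0]
print(key.shape)                   # (4, 3)
```

Every row is still distinct after the drop, which is why the missing column loses no information.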


What I have essentially built is a key that I can quickly look up by word, using the index.

Then I simply go through my sentence token by token: I match each token to the index, grab the matching row, and flatten it into a list.
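A sketch of that lookup, under the same assumptions as before (regex tokenizer, sorted vocabulary, pandas DataFrame as the key, banana as the dropped column). The helper name `encode` is hypothetical.

```python
import re

import numpy as np
import pandas as pd

sentence = "Apple baNana Pear orange, pear!"
tokens = re.findall(r"[a-z]+", sentence.lower())

# Rebuild the key: identity matrix indexed by word, one column dropped.
vocab = sorted(set(tokens))
key = pd.DataFrame(np.eye(len(vocab), dtype=int), index=vocab, columns=vocab)
key = key.drop(columns="banana")

def encode(tokens, key):
    """Return one flat list per token, taken from the token's row in the key."""
    return [key.loc[token].to_numpy().flatten().tolist() for token in tokens]

vectors = encode(tokens, key)

for token, vector in zip(tokens, vectors):
    print(token, vector)
# apple [1, 0, 0]
# banana [0, 0, 0]
# pear [0, 0, 1]
# orange [0, 1, 0]
# pear [0, 0, 1]
```

Both occurrences of "pear" map to the same vector, which is the behavior the test sentence was chosen to check.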


Now I have turned my words into one hot encoded vectors as required.




