Exploratory Data Analysis With NLP Project

“I hope this article can help someone who is interested in natural language processing (NLP).”

Feb 28, 2021

I use data provided by the Home Depot Product Search Relevance competition on Kaggle.

1. Import libraries and load data

# basic libraries
import pandas as pd
import numpy as np

# lets data frames be shown side by side
from IPython.display import display, HTML

# statistics libraries
import seaborn as sns
from scipy import stats

# plotting
import matplotlib.pyplot as plt

# progress bar for loops
from tqdm import tqdm

# load data
df_train = pd.read_csv('../Data/train.csv', encoding="ISO-8859-1")
df_test = pd.read_csv('../Data/test.csv', encoding="ISO-8859-1")
df_attributes = pd.read_csv('../Data/attributes.csv')
df_product_descriptions = pd.read_csv('../Data/product_descriptions.csv')

2. Exploratory Data Analysis

To-Do List:

Generate a table for each data frame. This shows more specifically what the data looks like. The code is shown below:

grid_df_display(
    list_df = [df_train, df_attributes, df_test,
               df_product_descriptions],
    list_df_name = ['Training Data', 'Attributes', 'Test Data',
                    'Product Descriptions'],
    list_number_of_data = [5, 28, 6, 5],
    row = 2, col = 2, fill = 'col'
)
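The article does not show how grid_df_display is defined, so the call above will not run on its own. As a rough idea of what such a helper might do, here is a minimal sketch of a hypothetical stand-in, side_by_side_html, which only builds the HTML for the side-by-side tables. The function name, the toy frames, and their values are my assumptions, and the row, col, and fill arguments of the original helper are not reproduced.

```python
import pandas as pd


def side_by_side_html(list_df, list_df_name, list_number_of_data):
    """Hypothetical helper: build one HTML string with the head of each frame side by side."""
    blocks = []
    for df, name, n in zip(list_df, list_df_name, list_number_of_data):
        # DataFrame.to_html renders the first n rows as an HTML <table>
        table_html = df.head(n).to_html()
        blocks.append(
            '<div style="display:inline-block; vertical-align:top; margin-right:20px">'
            f"<h4>{name}</h4>{table_html}</div>"
        )
    return "".join(blocks)


# Toy frames standing in for the real data (values are made up)
toy_train = pd.DataFrame({"id": [1, 2, 3], "relevance": [3.0, 2.5, 1.0]})
toy_attr = pd.DataFrame({"product_uid": [100001, 100002],
                         "name": ["Material", "Color Family"]})

html = side_by_side_html([toy_train, toy_attr],
                         ["Training Data", "Attributes"],
                         [2, 2])
```

In a notebook, the string could then be rendered with display(HTML(html)), using the IPython imports from step 1.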

This data set contains a number of products and real customer search terms from Home Depot’s website.

Check the shape of the data. This tells you more precisely how big the data is.
Check for missing values in each column and decide whether they are redundant. Sometimes a missing value carries critical information. For example, if some products have a waterproof attribute set to ‘yes’ while for others it is `null` or absent altogether, the missing value probably means those products have no water-resistant ability.
Check the column types and convert each feature to the right type.

In this case, we must convert the id and **product_uid** columns to the object type, because they are identifiers rather than quantities.
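The three checks above (shape, missing values, column types) can be sketched as follows. This is only an illustration on a small made-up frame: the waterproof column and all values are assumptions, not taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "product_uid": [100001, 100001, 100002, 100003],
    "waterproof": ["yes", np.nan, "yes", np.nan],
})

# 1. Shape: number of rows and columns
n_rows, n_cols = toy.shape

# 2. Missing values per column
missing_per_column = toy.isnull().sum()

# Keep a flag instead of dropping rows, since "missing" may itself be informative
toy["waterproof_missing"] = toy["waterproof"].isnull()

# 3. Identifiers are labels, not quantities, so store them as object
toy = toy.astype({"id": object, "product_uid": object})
print(toy.dtypes)
```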

Check for duplicate rows and decide whether they are redundant.

Note: duplicates are not always redundant. In some cases, we even add duplicate rows on purpose through resampling.
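A minimal sketch of the duplicate check, again on a made-up frame (the search_term column name and the values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with one fully duplicated row (values are made up)
toy = pd.DataFrame({
    "product_uid": [100001, 100001, 100002, 100002],
    "search_term": ["angle bracket", "angle bracket",
                    "deck over", "rain shower head"],
})

# Mark every row that repeats an earlier row across all columns
dup_mask = toy.duplicated(keep="first")
n_dup = int(dup_mask.sum())

# Inspect the duplicates first, then drop them only if they really are redundant
deduped = toy.drop_duplicates()
```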

Use descriptive statistics to understand the basic patterns of the data.

The image displays the descriptive statistics of the training data.
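In pandas, a summary like this usually comes from describe(). Here is a small sketch on made-up values; the column name relevance is an assumption about the target column.

```python
import pandas as pd

# Toy target column (values are made up)
toy = pd.DataFrame({"relevance": [1.0, 2.0, 2.33, 3.0, 3.0]})

# count, mean, std, min, quartiles and max in one call
summary = toy["relevance"].describe()
print(summary)
```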

Check the distribution of the target and of the important features. We usually use the normal distribution as a benchmark for comparison. If the data is normally distributed, the points in the probability plot fall closely along the straight reference line.

The image shows the distribution of the target variable.
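To make the probability-plot idea concrete, here is a sketch using scipy.stats.probplot on two synthetic samples, one normal and one skewed. The samples are generated for illustration only and are not the competition data. The returned r value measures how closely the points follow the fitted straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples for illustration: one normal, one clearly skewed
normal_sample = rng.normal(loc=2.4, scale=0.5, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

# probplot returns the plotted points and the least-squares fit (slope, intercept, r)
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# r close to 1 means the points hug the reference line
print(f"normal sample r = {r_normal:.3f}, skewed sample r = {r_skewed:.3f}")
```

Passing plot=plt (with matplotlib.pyplot imported as plt) to probplot draws the points and the reference line instead of only returning the numbers.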

Find the product attributes that appear most frequently in the attribute data set.

Looking at the table on the left, we can see that product attributes such as ‘MFG Brand Name’, ‘Color Family’, ‘Material’, ‘Color/Finish’, and ‘Certifications and Listings’ appear very frequently. I think these attributes are meaningful to website users.
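A frequency table like this can be produced with value_counts(). The sketch below assumes the attribute table stores the attribute label in a column called name; the rows are made up.

```python
import pandas as pd

# Toy attribute table in long format: one row per (product, attribute)
toy_attr = pd.DataFrame({
    "product_uid": [100001, 100001, 100002, 100002, 100003, 100003],
    "name": ["MFG Brand Name", "Material", "MFG Brand Name",
             "Color Family", "MFG Brand Name", "Material"],
    "value": ["Brand A", "Steel", "Brand B", "Brown", "Brand C", "Wood"],
})

# Count how often each attribute name occurs, most frequent first
name_counts = toy_attr["name"].value_counts()
top_attributes = name_counts.head(2)
print(top_attributes)
```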

This gave us the idea of extracting key information from each product. It should help us predict the relevance between a search term and the products on the website.
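One possible way to act on this idea, shown only as a sketch under assumptions, is to pull a frequent attribute such as ‘MFG Brand Name’ out of the attribute table and attach it to each training row. The column names name, value, and search_term, the new column brand, and all values are assumptions for illustration, not the author's actual pipeline.

```python
import pandas as pd

# Toy frames (values are made up)
toy_train = pd.DataFrame({
    "product_uid": [100001, 100002, 100003],
    "search_term": ["angle bracket", "deck over", "shower head"],
})
toy_attr = pd.DataFrame({
    "product_uid": [100001, 100002, 100002],
    "name": ["MFG Brand Name", "MFG Brand Name", "Material"],
    "value": ["Brand A", "Brand B", "Wood"],
})

# Keep only the brand rows and rename the value column
brand = (
    toy_attr.loc[toy_attr["name"] == "MFG Brand Name", ["product_uid", "value"]]
    .rename(columns={"value": "brand"})
)

# Left join keeps every training row; products without a brand get NaN
merged = toy_train.merge(brand, on="product_uid", how="left")
print(merged)
```

A left join is used so that products without the attribute stay in the table with a missing brand, which ties back to the earlier point that missing values can carry information.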

Conclusions

This article is my first attempt. I hope you can find something useful in it. I will do my best to improve my writing and my English. The whole analysis will be put on my GitHub. Next time I will dig deeper into NLP-related skills. Thanks a lot.