Exploratory Data Analysis With NLP Project
“I hope this article helps anyone who is interested in natural language processing (NLP).”
I use data provided by the Home Depot Product Search Relevance competition on Kaggle.
1. Import libraries and load data
# basic libraries
import pandas as pd
import numpy as np
# lets data frames be shown side by side
from IPython.display import display, HTML
# statistics libraries
import seaborn as sns
from scipy import stats
# plotting
import matplotlib.pyplot as plt
# progress bar for loops
from tqdm import tqdm
# load data
df_train = pd.read_csv('../Data/train.csv', encoding="ISO-8859-1")
df_test = pd.read_csv('../Data/test.csv', encoding="ISO-8859-1")
df_attributes = pd.read_csv('../Data/attributes.csv')
df_product_descriptions = pd.read_csv('../Data/product_descriptions.csv')
2. Exploratory Data Analysis
- Display each data frame as a table to see concretely what the data looks like. The code is shown below:
list_df = [df_train, df_attributes, df_test,
list_df_name = ['Training Data', 'Attributes', 'Test Data',
list_number_of_data = [5, 28, 6, 5],
row = 2, col = 2, fill = 'col'
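A minimal sketch of one way to show several data frames next to each other. This is not the helper used above: the function name `frames_side_by_side`, its parameters, and the toy frames are all assumptions made for illustration, and it lays the tables out in a single row rather than a grid.

```python
import pandas as pd


def frames_side_by_side(frames, titles, n_rows):
    """Return one HTML string that places the head of each frame in a row.

    Hypothetical helper for illustration only.
    """
    cells = []
    for frame, title, n in zip(frames, titles, n_rows):
        table_html = frame.head(n).to_html()
        cells.append(
            '<div style="display:inline-block; vertical-align:top; margin-right:20px">'
            f"<h4>{title}</h4>{table_html}</div>"
        )
    return "".join(cells)


# Toy frames standing in for the real CSV files (assumed columns).
toy_train = pd.DataFrame({"id": [1, 2, 3], "relevance": [3.0, 2.5, 1.0]})
toy_attributes = pd.DataFrame(
    {"product_uid": [100, 100], "name": ["Material", "Color Family"]}
)

html = frames_side_by_side(
    [toy_train, toy_attributes], ["Training Data", "Attributes"], [2, 2]
)
```

In a notebook, `display(HTML(html))` from `IPython.display` renders the result.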
- Check the shape of the data to see exactly how large each data set is.
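As a small sketch, the shape check could look like this. The toy frames are assumptions standing in for the real data; with the real files you would loop over `df_train`, `df_test`, and the others instead.

```python
import pandas as pd

# Toy frames standing in for the real data sets (assumed columns).
toy_frames = {
    "train": pd.DataFrame({"id": [1, 2, 3], "relevance": [3.0, 2.5, 1.0]}),
    "test": pd.DataFrame({"id": [4, 5]}),
}

# DataFrame.shape is a (rows, columns) tuple.
shapes = {name: frame.shape for name, frame in toy_frames.items()}
for name, (n_rows, n_cols) in shapes.items():
    print(f"{name}: {n_rows} rows x {n_cols} columns")
```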
- Check for missing values in each column and decide whether they are redundant. Sometimes missing values carry critical information. For example, if some products have a waterproof attribute of ‘yes’ while others have `null` or lack the attribute entirely, the latter products probably are not water-resistant.
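One possible way to count missing values per column, sketched on a toy attribute table. The column names `product_uid`, `name`, and `value` and all the rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy attribute table (assumed layout: one row per product attribute).
toy_attributes = pd.DataFrame(
    {
        "product_uid": [100, 100, 101, np.nan],
        "name": ["Waterproof", "Material", "Material", None],
        "value": ["Yes", "Steel", None, None],
    }
)

# Number and share of missing entries in each column.
missing_count = toy_attributes.isna().sum()
missing_share = toy_attributes.isna().mean().round(2)
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))

# Rows where every column is missing carry no information at all.
fully_empty_rows = toy_attributes.isna().all(axis=1).sum()
```

Rows that are empty in every column are usually safe to drop, while a missing `value` for a named attribute may deserve a closer look before you remove it.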
- Check the column types and convert each feature to the right type.
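A short sketch of inspecting and converting column types. The toy frame and its badly typed values are assumptions; the point is only to show `dtypes`, `astype`, and `pd.to_numeric`.

```python
import pandas as pd

# Toy frame where numbers were read in as text (assumed example).
toy = pd.DataFrame(
    {"product_uid": ["100", "101", "x"], "relevance": ["3.0", "2.33", "1.0"]}
)
print(toy.dtypes)

# astype raises an error on bad values, so use it when the column is clean.
toy["relevance"] = toy["relevance"].astype(float)

# to_numeric with errors="coerce" turns unparseable entries into NaN instead.
toy["product_uid"] = pd.to_numeric(toy["product_uid"], errors="coerce")
print(toy.dtypes)
```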
- Check for duplicate rows and decide whether they are redundant.
Note: duplicates are not always redundant. In some cases we even add duplicate data deliberately through resampling.
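A minimal sketch of the duplicate check, using a toy frame with assumed columns. It counts the duplicates first and leaves the decision to drop them as a separate step.

```python
import pandas as pd

# Toy search log with one repeated row (assumed columns and values).
toy = pd.DataFrame(
    {
        "product_uid": [100, 100, 101],
        "search_term": ["angle bracket", "angle bracket", "deck over"],
    }
)

# duplicated() marks every repeat after the first occurrence.
n_duplicates = toy.duplicated().sum()

# keep=False marks all copies, which helps when inspecting them by eye.
duplicate_rows = toy[toy.duplicated(keep=False)]

# Drop repeats only once you have decided they are redundant.
deduplicated = toy.drop_duplicates()
```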
- Use descriptive statistics to understand the basic patterns in the data.
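For illustration, descriptive statistics can be pulled with `describe()`. The toy frame below is an assumption; on a numeric column it reports count, mean, standard deviation, and quartiles, and on a text column it reports the number of unique values and the most frequent one.

```python
import pandas as pd

# Toy training data (assumed columns and values).
toy_train = pd.DataFrame(
    {
        "relevance": [1.0, 2.0, 3.0, 3.0],
        "search_term": ["drill", "drill", "paint", "hose"],
    }
)

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = toy_train.describe()
print(numeric_summary)

# Text column: count, unique, top (most frequent value), freq.
text_summary = toy_train["search_term"].describe()
print(text_summary)
```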
- Check the distribution of the target and the important features. We usually use the normal distribution as a benchmark for comparison. If the data are normally distributed, the points in the probability plot lie close to the straight reference line.
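A sketch of the distribution check with a histogram and a normal probability plot. The sample here is synthetic (drawn from a normal distribution with made-up parameters), so it is only a stand-in for a real column such as the target.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Synthetic stand-in for a real feature (assumed parameters).
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.4, scale=0.5, size=500)

fig, (ax_hist, ax_prob) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(sample, bins=30)
ax_hist.set_title("Histogram")

# probplot draws the ordered data against normal quantiles plus a fitted line.
# r is the correlation between the two; values near 1 suggest normality.
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm", plot=ax_prob)
print(f"probability plot correlation: {r:.3f}")
```

The further the points bend away from the line, especially in the tails, the less the data resemble a normal distribution.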
- Find the product attributes that occur most frequently in the attribute data set.
In the table, we can see that product attributes such as ‘MFG Brand Name’, ‘Color Family’, ‘Material’, ‘Color/Finish’, and ‘Certifications and Listings’ occur very frequently. I think these attributes are meaningful to website users.
This gives us the idea of extracting this critical information for each product, which should help us predict the relevance between a search term and the products on the website.
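The attribute frequency count and the extraction idea can be sketched as follows. The toy table, its column names `product_uid`, `name`, and `value`, and the placeholder brand values are assumptions for illustration.

```python
import pandas as pd

# Toy attribute table (assumed layout: one row per product attribute).
toy_attributes = pd.DataFrame(
    {
        "product_uid": [100, 100, 101, 101, 102],
        "name": [
            "MFG Brand Name",
            "Material",
            "MFG Brand Name",
            "Color Family",
            "MFG Brand Name",
        ],
        "value": ["Brand A", "Steel", "Brand B", "Brown", "Brand C"],
    }
)

# How often each attribute name appears, most frequent first.
attribute_counts = toy_attributes["name"].value_counts()
print(attribute_counts.head(10))

# Pull one frequent attribute out as a per-product feature.
is_brand = toy_attributes["name"] == "MFG Brand Name"
brand = toy_attributes.loc[is_brand, ["product_uid", "value"]].rename(
    columns={"value": "brand"}
)
```

A frame like `brand` could then be merged onto the training data by `product_uid` so the attribute sits next to each search term.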
This article is my first attempt. I hope you can find something you need in it, and I will do my best to improve my writing and my English. The whole analysis will be available on my GitHub. Next time I will dig deeper into NLP-related skills. Thanks a lot.