Have you ever found yourself reading reviews on a product that you’re interested in trying? It can be time-consuming when there are thousands of reviews to read. The natural order of things is to go by the average user rating. Even going by average review rating is hard because all of us have a different idea of what constitutes a 1-5 star review. Seeing this problem, I wanted to create a simple word cloud from the text of every review to determine what the most common words used were to get an idea of what reviewers talked about most when referencing their favorite or least favorite product.

I decided to use Paulas Choice 2% BHA Liquid Exfoliant as my example product. Paulas Choice is one of the premier skincare brands in the world today by providing high-quality skincare products and delivering to its consumers the clinical research behind their products. This consumer transparency is not new for the company and Paula Begoun, the creator of Paula’s Choice. Paula has been providing the research behind her choices for more than 35 years.

Gathering the reviews was relatively easy using Selenium and Beautiful Soup to grab the text of each review and the corresponding star rating. From there, I used a library called Word Cloud to produce a visualization that showed the most commonly used words after removing stop words.

What Are Stop Words?

Much of what makes up a sentence are words that don’t provide any real meaning when processing text. Words like “I”, “we”, “are”, etc are words that don’t describe anything about the product. The removal of these words make it easier to narrow in on the descriptive words found in the targeted text. After only using the default stop words provided by the Word Cloud dictionary the following visualization was produced

wc = WordCloud(background_color="white", 


After the first iteration, it was clear the there were problems. The same words were appearing in the word cloud multiple times, and some of the words were domain-specific stop words. Words like “product”, “Paulas Choice”, and “skin” were all used as stop words since there redundant descriptiveness served no purpose when talking about a Paula’s Choice product used for the skin.

End Result

The final iteration included the use of SpaCy’s stop words and tokenization along with the addition of the domain-specific stop words. To eliminate the repeating words, the text was tokenized, and each word was reduced to its root word using lemmatization. Words like “starting” and “started” would be reduced to the word “start” to deal with similar words repeatedly being shown in the word cloud.

Creating the shape of the word cloud was done by using a simple image mask using paint, and the font was downloaded online. The bottle was cut and pasted using the built-in editing tools provided by Apple Preview.

char_mask = np.array(Image.open("blob.png"))    
wc = WordCloud(background_color="peachpuff",
               font_path="Summer Font Regular.otf",
               color_func=lambda *args, **kwargs: (255,255,255)).generate(wc_chunk_string)

Creating Word Clouds is Fun!

They can help consumers determine what the most commonly used words are when describing a particular product. From here, I would continue adding more stop words until I created a word cloud that I was happy with. I would also perform topic modeling (using Gensim) and running the reviews through a neural network (using TensorFlow) to generate text based on the reviews for fun. With the corresponding star reviews, I could also perform sentiment analysis on the text too.