How to use natural language processing to examine hotel reviews
Studies have shown that TripAdvisor has become extremely important in
a traveler's decision-making process. However, understanding the
nuances of TripAdvisor bubble scores versus thousands of TripAdvisor
review texts can be challenging. In an effort to understand more
thoroughly whether hotel guest reviews affect a hotel's performance over
time, we extracted all the English reviews from TripAdvisor for one
hotel: Hilton Hawaiian Village. We won't go into the details of web
scraping; the Python code for that process is available here.
Loading the Libraries
The Data
There were 13,701 English reviews on TripAdvisor for Hilton Hawaiian
Village, with review dates ranging from 2002-03-21 to 2018-08-02.
The highest number of weekly reviews came at the end of 2014, when the
hotel received more than 70 reviews in a single week.
Text Mining of the Review Text
We can certainly do a better job by combining “stay” and “stayed”, as
well as “pool” and “pools”. Stemming is the process of reducing
inflected or derived words to their word stem or root form.
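To make the idea concrete, here is a minimal sketch of stemming in Python. The `naive_stem` function and its three-suffix list are our own illustrative assumptions, not the stemmer used for this analysis; a real stemmer such as NLTK's `PorterStemmer` applies many more rules.

```python
from collections import Counter

# Hypothetical suffix list for illustration only; a real stemmer (for
# example the Porter algorithm) applies many more rules and conditions.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = ["stay", "stayed", "staying", "pool", "pools"]

# Count tokens after stemming so that inflected forms are merged.
stem_counts = Counter(naive_stem(token) for token in tokens)
print(stem_counts.most_common())
```

With this sketch, “stay”, “stayed”, and “staying” are counted together, as are “pool” and “pools”. It is deliberately crude: it would also turn “class” into “clas”, which is why a proper stemming library is the better choice on real reviews.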
Bigrams
We often want to understand the relationship between words in a review.
Which sequences of words are common across review texts? Given a
sequence of words, which word is most likely to follow? Which words
have the strongest relationship with each other? Many interesting text
analyses are based on these relationships. When we examine pairs of two
consecutive words, we call them “bigrams”.
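Counting bigrams can be sketched in a few lines of Python. The `tokenize` and `bigrams` helpers and the two sample reviews are made up for illustration; they are not the code or the data behind the results in this article.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bigrams(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


# Invented sample reviews, for illustration only.
reviews = [
    "We stayed in the Rainbow Tower with an ocean view.",
    "The Rainbow Tower has a great ocean view.",
]

counts = Counter()
for review in reviews:
    # Counting per review means a pair never spans two different reviews.
    counts.update(bigrams(tokenize(review)))

print(counts.most_common(3))
```

Tokenizing each review separately is a small design choice that matters: joining all reviews into one long string would create false bigrams across review boundaries.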
So, what are the most common bigrams in the TripAdvisor reviews of
Hilton Hawaiian Village?
The most common bigrams are “rainbow tower” and “hawaiian village”.
We can visualize bigrams in a word network:
The visualization shows the common bigrams in the TripAdvisor reviews,
keeping those that occurred at least 1000 times and where neither of
the words was a stop word.
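The filtering step behind such a network could look like the sketch below. The stop-word list, the bigram counts, and the lowered threshold are all invented for the toy example (the article's threshold is 1000), and `filter_bigrams` and `to_network` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Tiny hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "was"}
# The article keeps bigrams seen at least 1000 times; lowered for toy data.
MIN_COUNT = 2

# Invented counts, for illustration only.
bigram_counts = Counter({
    ("hawaiian", "village"): 5,
    ("ocean", "view"): 4,
    ("rainbow", "tower"): 3,
    ("of", "the"): 9,
    ("the", "pool"): 6,
    ("village", "ocean"): 1,
})


def filter_bigrams(counts, stop_words, min_count):
    """Keep frequent bigrams in which neither word is a stop word."""
    return {
        pair: n
        for pair, n in counts.items()
        if n >= min_count
        and pair[0] not in stop_words
        and pair[1] not in stop_words
    }


def to_network(edges):
    """Turn bigram counts into a directed graph: first -> {second: weight}."""
    graph = defaultdict(dict)
    for (first, second), n in edges.items():
        graph[first][second] = n
    return dict(graph)


edges = filter_bigrams(bigram_counts, STOP_WORDS, MIN_COUNT)
network = to_network(edges)
print(network)
```

The resulting dictionary is an adjacency list whose edge weights are the bigram counts, which is the shape most graph-plotting libraries expect as input.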
The network graph shows strong connections among the top words
(“hawaiian”, “village”, “ocean”, and “view”). However, we do not see
a clear clustering structure in the network.
Trigrams
Sometimes bigrams are not enough. Let's see which trigrams are the most
common in the TripAdvisor reviews of Hilton Hawaiian Village.
The most common trigrams are “hilton hawaiian village” and
“diamond head tower”.
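Moving from pairs to triples only changes the window size. A minimal sketch of a general n-gram helper follows; the `ngrams` function and the sample sentence are our own illustration, not the code used for the results above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented sample sentence, for illustration only.
tokens = (
    "we love the hilton hawaiian village and the diamond head tower"
).split()

trigram_counts = Counter(ngrams(tokens, 3))
print(trigram_counts.most_common(3))
```

Calling `ngrams(tokens, 2)` gives bigrams with the same function, and a token list shorter than `n` simply yields an empty list.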
Trending Words in Reviews
Which words and topics have become more or less frequent over time?
These could give us a sense of how the hotel's ecosystem is changing in
areas such as service, renovation, and problem solving, and help us
predict which topics will grow in importance.
We want to ask questions like: which words are increasing in frequency
in the TripAdvisor reviews?
We can see quite a lot of discussion about “friday fireworks” and
“lagoon” before 2010. Words like “resort fee” and “busy” grew very
quickly before 2005.
Which words have been declining in frequency in the reviews?
This shows a few topics in which interest has died out since 2010,
including “hhv” (short for Hilton Hawaiian Village), “upgraded”,
“prices”, “breakfast”, and “free”.
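One simple way to separate rising words from declining ones is to compute each word's share of all tokens per year and fit a straight line through those shares. The sketch below does that with an ordinary least-squares slope. This is an assumption on our part, not necessarily the model behind the results above (a regression on word counts would be more rigorous), and the three sample reviews are invented.

```python
from collections import Counter


def yearly_share(reviews, word):
    """Map year -> share of that year's tokens that equal `word`."""
    hits, totals = Counter(), Counter()
    for year, text in reviews:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    return {year: hits[year] / totals[year] for year in totals if totals[year]}


def slope(points):
    """Ordinary least-squares slope of y against x for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den if den else 0.0


# Invented (year, review text) pairs, for illustration only.
reviews = [
    (2003, "service service food pool"),
    (2004, "service food busy pool"),
    (2005, "busy busy food pool"),
]

trend = {
    word: slope(sorted(yearly_share(reviews, word).items()))
    for word in ("service", "busy", "food")
}
print(trend)
```

A positive slope marks a word whose share is growing, a negative slope one that is fading, and a slope near zero a stable topic. Using shares rather than raw counts keeps years with many reviews from dominating the trend.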
Let's compare a few selected words.
Service and food were both top topics before 2010. The discussion
about service and food peaked at the beginning of the data in 2003 and
has been on a downward trend since 2005, with occasional peaks.
Sentiment Analysis
Sentiment analysis is widely applied to voice-of-the-customer materials
such as reviews and survey responses, and to online and social media,
for applications that range from marketing to customer service to
clinical medicine.
Here, we want to determine the attitude of the reviewer (i.e. the hotel
guest) with respect to their past experience or emotional reaction
towards the hotel. The attitude may be a judgment or an evaluation.
We start with the most common positive and negative words in these
reviews.
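Tallying those words amounts to looking each token up in a sentiment lexicon and counting by label. The sketch below uses a six-word lexicon and three reviews that we made up for illustration; real lexicons contain thousands of entries, and `sentiment_word_counts` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical mini-lexicon; real lexicons hold thousands of entries.
LEXICON = {
    "great": "positive",
    "clean": "positive",
    "friendly": "positive",
    "crowded": "negative",
    "expensive": "negative",
    "dirty": "negative",
}


def sentiment_word_counts(reviews, lexicon):
    """Count how often each lexicon word occurs, grouped by its label."""
    counts = {"positive": Counter(), "negative": Counter()}
    for text in reviews:
        for raw in text.lower().split():
            token = raw.strip(".,!?")  # drop surrounding punctuation
            label = lexicon.get(token)
            if label is not None:
                counts[label][token] += 1
    return counts


# Invented sample reviews, for illustration only.
reviews = [
    "Great view and friendly staff, but expensive.",
    "The pool was crowded and expensive!",
    "Clean room, great location.",
]

word_counts = sentiment_word_counts(reviews, LEXICON)
print(word_counts["positive"].most_common(3))
print(word_counts["negative"].most_common(3))
```

Because every word is looked up on its own, this approach is blind to context, which becomes important once negation enters the picture.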
Let's try another sentiment lexicon and see whether the results are
similar.
Interestingly, “diamond” (as in Diamond Head) was categorized as a
positive sentiment.
There is a potential problem here: “clean”, for instance, has a
negative sentiment if it is preceded by the word “not”. In fact,
unigrams will have this problem with negation in most cases. This
brings us to the next topic:
Using Bigrams to Provide Context in Sentiment Analysis
We want to see how often words are preceded by a word like “not”.
The word “a” was preceded by “not” 850 times, and the word “the” was
preceded by “not” 698 times. However, this information is not
meaningful on its own.
Looking only at words that carry a sentiment score, the most common
sentiment-associated word to follow “not” is “worth”, and another
common one is “recommend”, which would normally have a positive score
of 2.
So, which words in the data contributed the most in the wrong
direction?
The bigrams “not great”, “not worth”, “not like”, “not recommend”, and
“not good” were the largest causes of misidentification, making the
text seem more positive than it is.
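The contribution of each negated word can be estimated as its sentiment score multiplied by the number of times it follows “not”. In the sketch below, only the score of 2 for “recommend” comes from the text above; the other scores, the token sequence, and the helper names `words_after` and `wrong_direction` are assumptions for illustration.

```python
from collections import Counter

# Integer sentiment scores. Only "recommend": 2 is stated in the article;
# the remaining values are assumed for the sake of the example.
SCORES = {"worth": 2, "recommend": 2, "good": 3, "great": 3, "like": 2, "bad": -3}


def words_after(tokens, negator="not"):
    """Count the words that directly follow the negator."""
    return Counter(b for a, b in zip(tokens, tokens[1:]) if a == negator)


def wrong_direction(after_counts, scores):
    """Score times occurrences for each scored word following the negator."""
    return {w: scores[w] * n for w, n in after_counts.items() if w in scores}


# Invented token sequence, for illustration only.
tokens = (
    "not worth the price not worth it would not recommend "
    "but not bad and not a problem"
).split()

after_not = words_after(tokens)
contribution = wrong_direction(after_not, SCORES)

# Largest absolute contribution first.
for word, value in sorted(contribution.items(), key=lambda kv: -abs(kv[1])):
    print(f"not {word}: {value:+d}")
```

Words such as “a” follow “not” often but have no score, so they drop out. A large positive value means unigram scoring overstated how positive the text is; a negative value (as for “not bad”) means it overstated how negative it is.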
Besides “not”, there are other words that negate the term that follows,
such as “no”, “never”, and “without”. Let's have a look at them.
It looks like the largest sources of misidentifying a word as positive
are “not” followed by “great”, “worth”, “recommend”, or “good”, and the
largest sources of incorrectly classified negative sentiment are
“no problem” and “not bad”.
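One way to act on this finding is to flip the sign of a word's score whenever the previous word is a negator. The following is a minimal sketch under that assumption. The scores are hypothetical, and the rule only looks one word back, so it misses phrases like “not very good”.

```python
NEGATORS = {"not", "no", "never", "without"}

# Hypothetical integer scores, assumed for illustration.
SCORES = {"good": 3, "great": 3, "worth": 2, "recommend": 2,
          "bad": -3, "problem": -2}


def naive_score(tokens, scores):
    """Sum word scores while ignoring context (plain unigram scoring)."""
    return sum(scores.get(token, 0) for token in tokens)


def negation_aware_score(tokens, scores, negators=NEGATORS):
    """Sum word scores, reversing a score when the previous word negates it."""
    total = 0
    previous = None
    for token in tokens:
        value = scores.get(token, 0)
        if previous in negators:
            value = -value
        total += value
        previous = token
    return total


# Invented sentence, for illustration only.
tokens = "the room was not good but no problem with the staff".split()

print(naive_score(tokens, SCORES))           # "good" and "problem" as-is
print(negation_aware_score(tokens, SCORES))  # both scores reversed
```

On this sentence the two methods disagree in sign: unigram scoring reads “good” as praise and “problem” as a complaint, while the negation-aware version reverses both.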
Finally, let's find the most positive and the most negative reviews.
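Ranking reviews comes down to summing the word scores within each review and taking the maximum and minimum. The sketch below assumes a small hypothetical score table and three invented reviews with made-up IDs; it does not reproduce the scoring that produced the IDs reported in this article.

```python
# Hypothetical integer scores, assumed for illustration.
SCORES = {"great": 3, "nice": 3, "clean": 2,
          "crowded": -2, "dirty": -2, "expensive": -2}


def review_score(text, scores):
    """Sum the scores of all lexicon words found in one review."""
    return sum(
        scores.get(token.strip(".,!?"), 0) for token in text.lower().split()
    )


# Invented reviews keyed by made-up IDs, for illustration only.
reviews = {
    1: "Great view, great staff, clean room.",
    2: "Crowded pool and dirty room.",
    3: "Nice lagoon but expensive.",
}

totals = {rid: review_score(text, SCORES) for rid, text in reviews.items()}
most_positive = max(totals, key=totals.get)
most_negative = min(totals, key=totals.get)

print("most positive review ID:", most_positive)
print("most negative review ID:", most_negative)
```

Summed scores favor long reviews, so dividing by the number of scored words may be worth considering when review lengths vary a lot.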
The ID of the most positive review is 2363:
The ID of the most negative review is 3748:
And that’s it!
If you want to know more about scraping TripAdvisor, Sentiment
Analysis, and Text Mining Data for Hotel Reviews, contact Actowiz
Solutions now!
You can also contact us for all your mobile app scraping and web scraping services requirements!