Big data has become one of the most discussed topics of the 21st century. Big data is an extremely large volume of data, either structured or unstructured, flowing at high speed from various locations. Such data is highly inconsistent, with periodic peaks in its flow. Natural language processing (NLP) is the analysis and synthesis of natural languages, and it mainly focuses on data mining, computational linguistics and text analytics. Big data and NLP are commonly merged in two forms: text and audio. According to a recent survey by a company called Upwork, natural language processing topped the list of the fastest-growing skills in the world. This merging of technologies is about how efficiently data can be mined and delivered in response to a command given in a natural language.
At present, ratings and reviews make heavy use of big data and natural language processing. Ratings and reviews fall under three major sectors: movies and entertainment, hotels and restaurants, and online retail. “97% of shoppers say reviews influence buying decisions.” (Fan and Fuel, 2016). This statement shows the importance of ratings and reviews in day-to-day life.
Even though there are many types of NLP tools, most of their functions can be classified under a few general topics. They are:
1. Full text extraction – This function itself consists of several phases:
a. Extracting entities – such as companies, people, projects and numerical amounts.
b. Categorizing content – Analyzing the text and categorizing it as a positive or negative response using sentiment analysis.
c. Clustering content – Distinguishing the main topic and subtopics.
d. Fact extraction – Filling databases with structured information, such as charts, which is helpful for analysis.
e. Relationship extraction – Exploring real-world relationships described in the text.
2. Identifying and marking phrases and sentences – In our use case, this function is significant because keyword identification plays an important role in review sites, making it easy to group similar reviews together.
3. Language identification – A product might have a huge number of reviews written in several languages. It is essential for the system to group all reviews and present them to the user. In this case spaCy can be considered as a solution, as this tool handles 28 languages.
4. Stemming – Reduces word variations to a simpler form so that the coverage of NLP utilities can be increased.
5. Acronym normalization and tagging – This function identifies the acronyms found in sentences and distinguishes them.
6. Phrase extraction – This is the extraction of phrases that have a strong independent meaning. For example, the phrase “Big data” should be mined as a unit, because mining its words separately will not show the real meaning. Techniques like part-of-speech tagging and statistical phrase extraction are used in this phase.
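Stemming and phrase extraction, as described in the list above, can be illustrated with a small sketch. This is only an assumption-laden example and not the method of any tool named in this paper: the suffix list, the phrase table and the function names naive_stem and tokenize_with_phrases are hypothetical, and a production stemmer applies far more rules than this one.

```python
import re
from typing import List

# Hypothetical suffix list for illustration only; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical phrase table: word pairs that should be mined as one unit.
PHRASES = {("big", "data"): "big_data"}


def naive_stem(word: str) -> str:
    """Strip the first matching suffix if at least three characters remain."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


def tokenize_with_phrases(text: str) -> List[str]:
    """Split text into words, keep known phrases together, stem everything else."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            tokens.append(PHRASES[pair])  # phrase kept as a single token
            i += 2
        else:
            tokens.append(naive_stem(words[i]))
            i += 1
    return tokens


print(tokenize_with_phrases("Big data tools helped reviewers"))
```

With this sketch, the variations "reviews", "reviewed" and "reviewing" all reduce to "review", while "Big data" stays together as one token instead of being mined as two unrelated words.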
Natural Language Tools that can be used to combine NLP and Big data
Many NLP tools are used in different sectors for different purposes. However, only a limited number of NLP tools can be used in the ratings and reviews segment. GATE, spaCy, Microsoft Spellcheck API and IBM Watson Tone Analyzer are some of the tools used at present to analyze big data more efficiently. NLP can be applied in two forms, text and speech. Although there are many text analysis applications, my research found no significant innovation in speech-related applications in the ratings and reviews segment. Each NLP tool and its purpose is discussed below.
1. GATE – GATE is a popular open-source NLP toolkit for building search applications, developed over the last 20 years. GATE comprises components for language processing such as parsers, machine learning tools, tools for visualizing and manipulating text, various information extraction tools, and evaluation tools. ANNIE is a part of GATE that contains a tokenizer, sentence splitter, part-of-speech (POS) tagger, named entity taggers, etc. GATE performs functions like entity extraction, part-of-speech tagging, sentence extraction and text tokenization.
2. spaCy – spaCy is a Python NLP library that acts as a bridge for users to take their work from paper to production. The best part of this library is that spaCy handles 28 languages efficiently. According to my research on www.g2crowd.com, the clients who reviewed spaCy there reported 100% satisfaction with its solutions. TechDynasty is an organization that uses spaCy to deal with its client reviews.
3. Microsoft Spellcheck API – This tool allows users to correct spelling errors and to recognize the difference between brand names and slang. This allows reviewers to submit their feedback with more accuracy, so that others can gain a clear idea of their review. This product has two modes for handling user comments. ‘Spell’ is more aggressive in order to return better search results, while ‘Proof’ is less aggressive and adds capitalization and basic punctuation to make the user input more meaningful.
4. IBM Watson Tone Analyzer – This tool analyzes three kinds of tone through linguistic analysis: emotions, social tendencies and language styles. It can identify emotions such as anger, fear, joy, sadness and disgust. It can detect social tendencies such as openness, conscientiousness, extroversion, agreeableness and emotional range. Moreover, it can detect language styles such as confident, analytical and tentative. This tool can assist sentiment analysis of customer reviews, which can give clients a better understanding of customer emotions.
At the beginning, we divided the main topic into several avenues so that we could get a better idea of how this technology is implemented in real life. Initially, our plan was to search the web and gather the information needed. However, while researching I found that there were no significant articles relating to my subtopic. Therefore, I held short interviews with people who are working on projects related to NLP. From their answers I gained much practical information and a clear scope for my subtopic, which contributed a lot to my research paper. The questions I used in my interviews are as follows:
1. What are the technical drawbacks of review sites?
2. In what ways can NLP tools be used to solve these problems?
3. Why do you think sites related to ratings and reviews are not giving enough importance to reviews in audio form?
RESULTS AND DISCUSSION
According to my research, NLP is being used in sites related to ratings and reviews; however, this technology is not being used efficiently at present. Therefore, I researched how big data and NLP can be used in rating and review sites under two major subdomains. They are:
1. Movie Rating Systems
2. Product and Place Reviews
Movie Rating Systems
A huge amount of data is used in this subdomain. The following are my suggestions on the areas in which big data and NLP can be combined to make movie rating systems more efficient.
1. Prediction of the success of the film through the reviews and comments the film gets in the first week of its release.
• Predicting the box office collection of the film from past records of films that showed a similar trend. Through this, movie makers can predict their profits or losses and prepare better for the future.
2. Relating the movie ratings to viewers’ comments.
• Most people check movie ratings before they watch a movie. Therefore, the accuracy of the ratings needs to be maintained. This could be done efficiently by relating user comments to the ratings of the film. Even then, there will be some problems regarding the genuineness of the comments; a solution for this is discussed in the product review part of my report.
3. Clustering all reviews under one roof.
• Not everyone uses the same pages to record their comments. The new trend is to review movies through social media. Through NLP and data mining, it is possible to group all reviews of one film together.
4. Sentiment analysis
• Using sentiment analysis, it is possible to extract the emotions of the viewers and provide a better picture of the film’s success or failure. We can also combine this with similar data retrieved from the past to predict the collection of a particular film.
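Relating ratings to viewer comments through sentiment analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the approach of any particular rating site: the word lists POSITIVE and NEGATIVE are a hypothetical mini-lexicon, the functions sentiment_score and rating_matches_comment are names introduced here, and a 1 to 5 star scale is assumed.

```python
import re

# Hypothetical mini-lexicon for illustration; real systems use far larger resources.
POSITIVE = {"great", "excellent", "good", "loved", "amazing", "brilliant"}
NEGATIVE = {"bad", "boring", "terrible", "awful", "poor", "hated"}


def sentiment_score(comment: str) -> float:
    """Return (positive - negative) / matched words, in [-1, 1]; 0.0 if nothing matches."""
    words = re.findall(r"[a-z']+", comment.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def rating_matches_comment(stars: int, comment: str, max_stars: int = 5) -> bool:
    """Check whether the star rating and the comment sentiment point the same way."""
    score = sentiment_score(comment)
    midpoint = (max_stars + 1) / 2  # 3 on an assumed 1-5 scale
    if score == 0.0 or stars == midpoint:
        return True  # neutral rating or neutral text: nothing to contradict
    return (score > 0) == (stars > midpoint)


print(rating_matches_comment(5, "Loved it, amazing film"))
print(rating_matches_comment(5, "Boring and terrible"))
```

A five-star rating attached to a clearly negative comment is flagged as inconsistent, which is one simple way to question the accuracy of a rating. A word-list approach like this cannot detect sarcasm or fake reviews, so it should be read only as a starting point.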
It is also possible to extract information on each film and provide statistical data on each actor in the field. Efficient data mining through NLP will be adequate to do this job. Even fan base prediction will be possible with this fusion of techniques.
Product and Place Reviews
This segment actively involves data from all around the world. However, there are a few drawbacks in retrieving data from reviews. I have listed some of them below:
1. As the data available in this segment is unstructured, the language used will have unusual punctuation and capitalization, which can cause problems in tokenization.
2. NLP alone cannot handle sarcasm in tweets and comments.
3. Hashtags, tags and other symbols cause problems in tokenization.
4. Lack of context leads to ambiguity.
5. Spelling mistakes might cause trouble for some NLP tools.
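The tokenization problems listed above (unusual punctuation, capitalization, hashtags and tags) can be reduced with a tokenizer written for review text. The sketch below is one possible approach using Python's standard re module. The pattern and the function name tokenize_review are assumptions made for illustration, and the sketch does not address sarcasm, missing context or spelling mistakes.

```python
import re
from typing import List

# Alternatives are tried in order, so hashtags and mentions win over plain words.
TOKEN_PATTERN = re.compile(
    r"""
    [#@]\w+            # hashtags and mentions kept as single tokens
    | \w+(?:'\w+)?     # words, optionally with an apostrophe part
    | [!?.]+           # runs of sentence punctuation
    """,
    re.VERBOSE,
)


def tokenize_review(text: str) -> List[str]:
    """Tokenize informal review text and normalize case and repeated punctuation."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        tok = match.group()
        if tok[0] in "!?.":
            tokens.append(tok[0])        # collapse a run such as "!!!" to "!"
        else:
            tokens.append(tok.lower())   # lowercase words, hashtags and mentions
    return tokens


print(tokenize_review("LOVED this phone!!! #BestBuy @store don't miss it..."))
```

Keeping "#BestBuy" as one token instead of splitting it into a symbol and a word preserves the tag for later grouping of similar reviews, and lowercasing removes the effect of unusual capitalization.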