Last updated on 16.05.2020
1. The focus on a case study
1.1. Informatics sub-disciplines
This case study focuses on the comparative analysis of two competing theories within the computational informatics sub-discipline. It examines how automobile manufacturers can leverage consumer opinions fetched from Twitter data to drive market growth and gain a competitive edge in their domain. According to research, one of the most effective approaches to growing a business is to use social media to analyze and predict consumer interests, which saves cost, time, and other resources. I will compare two models for classifying consumer opinion into three classes, positive, negative, and neutral, so that industries can gain a good understanding of how their products are perceived. Based on the outcome of the comparative analysis, one of the models will be selected for implementing the software system.
1.2. Candidate theories and techniques
Sentiment analysis is one of the most important areas of research for predicting consumer opinions about products, and it helps industries gain a competitive edge in an ever-growing market. Sentiment analysis uses techniques such as text mining, natural language processing, and other computational methods to predict user emotions from reviews of a particular product. The automobile industry holds one of the largest shares of the global economy, with more than 1 billion vehicles on the road and counting. This makes the industry very competitive, so manufacturers need to account for what consumers say about their vehicles; carefully analyzing consumer opinions provides a competitive edge. Many of these opinions can be obtained from social media platforms such as Twitter. Most of the data obtained from Twitter is unstructured, with no pre-defined model. Using methods such as machine learning, lexical analysis, and hybrid analysis, we can analyze this data and produce the required results. One sentiment analysis method that can be used to analyze user opinions is the machine learning approach, where the precision of a classifier depends on the selection of suitable features in the given data. Two models are considered for this case study: Support Vector Machines and Naïve Bayes. These learning methods can have a large impact on the performance of the final software, given the differences in their approaches. A detailed comparison of the models is provided below.
1.3. The phase of the development life cycle
Since the case study concerns algorithm strategies for analyzing and mining user opinions from Twitter, most of the work falls in the later phases of the development life cycle: design, implementation, and testing. The design phase is the best place to elicit and compare the algorithm strategies. Once the design phase has completed a thorough analysis and chosen the best model, the development team will have a clear understanding of what to implement. The design phase also helps specify the system requirements and the architecture needed to develop the software system. Software requirements are discussed in detail later in the case study.
2. Specification of the application scenario
2.1. Problem Identification
Consumers often rely on the opinions of others before deciding to buy a product. Twitter is one of the largest microblogging sites, allowing users to tweet their opinions within a limit of 280 characters. This makes it one of the largest resources for gathering user data: mining consumer opinions about vehicles helps the automotive industry analyze and derive useful insights about its products and competitors. Although these opinions are intended to be useful, their sheer volume and unstructured nature make it difficult for organizations to profit from them. Since most of the data gathered from Twitter is unstructured, it needs to be pre-processed before being fed to a model. To address this, different methodologies have been developed for analyzing consumer opinion data from Twitter, classifying it into three clusters: positive, negative, and neutral. The figure below shows the number of tweets posted in different countries, gathered from Twitter data about Volkswagen, Toyota, Mercedes, BMW, GM, and Tesla as an example.
(Insights from Twitter data about car makers, IBM Watson, 2015)
The challenges in building a sentiment analysis model include gathering the right amount of data to train the models (too little or too much can lead to underfitting or overfitting, lowering accuracy and performance) and interpreting sentences, for example detecting sarcasm, unforeseen constructions, poor grammar, and other kinds of informal language. Support Vector Machines and Naïve Bayes analyze the data in different ways and deliver different performance depending on how they learn, which can help in resolving these challenges.
The following problems will be addressed using the techniques mentioned above:
· Providing the right amount of data to train the model without underfitting or overfitting.
· Classifying sentences into three clusters: positive, negative, and neutral.
· Distinguishing the language before analysis (English, Arabic, Swedish, etc.).
· Evaluating the distribution of tweets across the positive and negative clusters.
2.2. Requirement elicitation
Requirements can be classified into functional and non-function requirements.
The first functional requirement is data gathering: the Twitter API, a third-party interface, is used to find and retrieve tweets based on the names of automakers. Python is used to retrieve the data because of its capability to handle and process large amounts of data. The next requirement is a database service to store the large volume of tweets. The system needs to be scalable, so that new modules can be integrated without affecting performance. A standard structure must be defined for the data; this can be achieved with JSON formatting, where objects are stored as attribute-value pairs.
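As a sketch of that JSON structure, the snippet below stores one tweet as attribute-value pairs and parses it back. The field names (id, user, text, created_at, lang) are illustrative assumptions, not the exact Twitter API schema.

```python
import json

# Hypothetical record layout for one stored tweet; the field names are
# illustrative assumptions, not the exact Twitter API schema.
tweet = {
    "id": "1261234567890",
    "user": "some_consumer",
    "text": "@Volkswagen loving the new Golf, smooth ride!",
    "created_at": "2020-05-16T10:30:00Z",
    "lang": "en",
}

record = json.dumps(tweet)     # serialize the attribute-value pairs for storage
restored = json.loads(record)  # parse them back when the pipeline needs the text
print(restored["text"])
```

A real deployment would store such records in the warehouse database and index them by automaker name.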
Non-functional requirements concern the qualities of the sentiment analysis system: the time required to implement the algorithms, and system requirements such as the operating system, availability, robustness, scalability, efficiency, and a user interface to display the desired results.
2.3. Design process
In the proposed case study, the SVM and Naïve Bayes algorithms are used to mine users' opinions about automakers' products, and their accuracy and performance are compared to decide which to use in the software system. The figure below describes the steps involved in the machine learning approach.
In this case study, we consider tweets collected from the Twitter platform, where users voice their opinions about a product by mentioning its manufacturer with '@' (e.g. @Volkswagen, @BMW). Using the Twitter API integrated with Python code, we can generate a search query to extract data based on an automaker's name. The fetched results are stored in a data warehouse for further text processing; since most of the data obtained is unstructured, it needs pre-processing before it can be fed to the classifiers.
Data pre-processing begins by distinguishing the language; for this case study we consider only English, so non-English tweets are filtered out. The remaining tweets are then cleansed of unwanted content such as usernames, hashtags, links, stop words, and punctuation. For this we use pre-processing techniques such as tokenization, which removes tabs, unnecessary whitespace, and punctuation, and filtering, which removes stop words and repetitive words. This is necessary because unwanted data can hamper the accuracy and performance of the Support Vector Machines and Naïve Bayes models.
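The cleaning steps above can be sketched in a few lines of Python. The regular expressions and the stop-word list are illustrative assumptions; a production pipeline would use a fuller stop-word list and more careful rules.

```python
import re

# A small, illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "it", "you", "can", "and", "to"}

def preprocess(tweet):
    """Clean a raw tweet and return its remaining tokens."""
    text = tweet.lower()
    text = re.sub(r"http\S+", " ", text)   # strip links
    text = re.sub(r"[@#]\w+", " ", text)   # strip usernames and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation and digits
    tokens = text.split()                  # tokenization; also collapses tabs/whitespace
    return [t for t in tokens if t not in STOP_WORDS]  # filter stop words

print(preprocess("#BMW Nice car, you can try it? https://example.com"))
# → ['nice', 'car', 'try']
```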
To avoid overfitting and underfitting, the dataset is split into three parts: training data, about 60% of the dataset, which is fed to the models; test data, 20% of the dataset, used to check that the algorithm is neither overfitting nor underfitting; and validation data, the remaining 20%, used to evaluate the overall performance of the algorithm.
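A minimal sketch of this 60/20/20 split, assuming the cleaned tweets arrive as a Python sequence (the fixed seed is an illustrative choice for reproducibility):

```python
import random

def split_dataset(tweets, seed=42):
    """Shuffle and split the data into 60% train, 20% test, 20% validation."""
    data = list(tweets)
    random.Random(seed).shuffle(data)  # fixed seed for a reproducible split
    n = len(data)
    train = data[: n * 6 // 10]
    test = data[n * 6 // 10 : n * 8 // 10]
    validation = data[n * 8 // 10 :]
    return train, test, validation

train, test, validation = split_dataset(range(100))
print(len(train), len(test), len(validation))  # → 60 20 20
```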
As discussed earlier, we consider two algorithms and analyze how well each performs and which provides the better outcomes. The main goal of the models is to identify which of the three polarities a tweet belongs to: positive, negative, or neutral.
The Naïve Bayes algorithm uses Bayes' rule of probability to detect the polarity of a tweet: P(c | d) = P(d | c) P(c) / P(d), where c is a class and d is the tweet.
Support Vector Machines, on the other hand, classify the polarities by finding the optimal hyperplane, the one with the maximum margin.
The results can be represented using graphs, bar charts, histograms, etc. Performance tuning will be done before the release of the algorithm; once it is complete, the algorithm can be deployed to the real-time scenario of extracting tweets for sentiment analysis.
3. Survey of possible modeling methodologies
Sentiment analysis is the process of extracting and analyzing consumer feedback from authentic sources such as Twitter, Facebook, and other microblogging sites. As the name suggests, it detects sentiments in text, such as joy or anger, which are classified into polarities: positive, negative, and neutral. With this, companies can find the reasons behind fluctuations in the sales of their products and correct them in future products. Sentiment analysis follows four steps: data extraction, data refinement, classification, and accuracy evaluation. Classification is the step where the predictive analysis is done: given a text, the system should be able to decide which category it belongs to, positive, negative, or neutral. There is a wide range of classification techniques; some widely used algorithms are:
· Naïve Bayes Classifier
· Random Forest Classifier
· Support Vector Machines
· Max Entropy Classifier
· Boosted trees Classifier
Naïve Bayes Classifier:
As discussed in the design phase, this classifier uses a probabilistic model to predict the polarity of a text. Bayes' rule of probability is used to calculate the probability that a text in the document belongs to each class. Text extracted from Twitter is tokenized into statements so that the classifier can assign each one to the appropriate class. The classifier uses the maximum a posteriori probability rule to classify the tokens.
Random Forest Classifier:
“Random forests are an ensemble learning method for classification that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees”. In simple terms, it is like making a prediction based on different sources: each decision tree in the forest is built from a random subset of the features and a random subset of the training data, and the forest provides more robust predictions by aggregating the decisions of all the trees.
Max Entropy Classifier:
This is another probabilistic classifier, but it is not the same as Naïve Bayes. Max Entropy makes no prior assumptions, unlike Naïve Bayes, which assumes the features are independent; it has no prior starting knowledge. It uses the principle of maximum entropy: given the testable data, it selects the probability distribution that maximizes information entropy (the average amount of information a data set contains), subject to the constraints of the data. It can be combined with Naïve Bayes for better efficiency and more accurate results. It requires more training data than the other algorithms and therefore takes more time to train.
Boosted Tree Classifier:
The Boosted Trees classifier is similar to a random forest in that it uses decision trees, but it adds a learning technique called boosting. Boosting refers to converting weak learners into strong ones, and it is used to improve the predictions of a given machine learning algorithm. Boosted trees use adaptive boosting, which combines many weak learners, each represented as a decision tree, into a single strong learner. Each weak learner carries a weight that is updated on every round: the weight of an example increases if it is incorrectly classified, so subsequent learners focus on correcting previous errors.
Support Vector Machines:
Support Vector Machines, the second candidate model of this case study, are described in detail in the comparative analysis below.
4. Comparative analysis
The main focus of the case study is to build a model and implement it in a software system that lets automakers analyze consumer opinions about their products. There are currently many algorithms and methods for predicting consumer sentiment, which makes it hard to choose the best one without prior knowledge of how each algorithm works and how accurate and performant it is. For this case study I have chosen the Support Vector Machine and Naïve Bayes classifiers to classify the polarity of a statement as positive, negative, or neutral; both algorithms can be applied to binary as well as multi-class classification, which makes them good candidates. The similarities end there: the two algorithms take different approaches to the problem, which I describe in detail below.
Naïve Bayes Classifier:
The Naïve Bayes classifier is a machine learning classification algorithm that can handle both binary and multiclass problems. Its prior assumption is that the features are independent given the class. Once the data has been extracted and cleansed, each text must be classified into one of the three polarities, positive, negative, or neutral; to do so, we consider the words in the text.
This is achieved using the “Maximum A Posteriori” (MAP) decision rule:

c_MAP = argmax_c P(c) ∏_k P(t_k | c)

In this equation, t_k represents the tokens (words) in the text, c represents the classes (the polarities positive, negative, and neutral), P(c | d) is the conditional probability of class c given text d, P(c) is the prior probability of class c, and P(t_k | c) is the conditional probability of token t_k given class c. To decide which class a text should be assigned to, we estimate the product of the probability of each word in the text given a particular class (the likelihood) multiplied by the probability of that class (the prior). After calculating this for all classes, we select the most likely one.
The main objective is to train the classifier so that it can label a text by its polarity: positive, negative, or neutral.
As discussed earlier, 60% of the dataset will be used to train the classifier, with the remaining 40% split equally between testing and validation.
Consider the example above, where three statements with distinct polarities, fetched from Twitter data, are used to explain how the algorithm works. Before calculating the prior, the following assumptions are made:
1) Position of the word in the document does not matter.
2) All words (features) are independent of one another.
Neither assumption holds well in real-world scenarios. In a sentence like “@CooperMiniLtd Mini fixed these issues over time, but the interiors on the first new Minis were pretty damn annoying”, the word ‘pretty’ looks positive, but the following word ‘annoying’ is negative, making the words dependent on each other. We therefore estimate the probabilities from word frequencies in the data using the MAP formula.
Before training the classifier, we first split the text into words, phrases, or characters; this process is called tokenization. Here, let us consider words as tokens.
E.g. “#BMW Nice car, you can try it?” is converted to an array of words such as [‘BMW’, ‘Nice’, ‘car’, ‘you’, ‘can’, ‘try’, ‘it’].
Given the array of words in the text, we can now predict the polarity using probabilities. For example, to estimate the probability that the tweet “#BMW Nice car, you can try it?” is positive, we find the frequency of the word ‘Nice’ in the positive reviews and divide it by the number of words in the positive reviews; the same is done for each word, the calculation is repeated for the negative class, and the tweet is assigned to the class with the highest probability.
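The frequency-counting procedure above can be sketched as a small Naïve Bayes classifier. The training tweets and the two-class setup are invented for illustration; the code also applies Laplace (add-one) smoothing so that a word never seen in training does not zero out the product.

```python
import math
from collections import Counter

# Tiny invented training corpus; a real system would use thousands of
# labelled tweets drawn from the 60% training split.
train_data = [
    ("nice car smooth ride", "positive"),
    ("love the new design", "positive"),
    ("great engine nice interior", "positive"),
    ("annoying rattle terrible service", "negative"),
    ("engine failure terrible car", "negative"),
    ("paint looks bad annoying dealer", "negative"),
]

classes = {"positive", "negative"}
priors = {c: sum(1 for _, l in train_data if l == c) / len(train_data) for c in classes}
counts = {c: Counter(w for t, l in train_data if l == c for w in t.split()) for c in classes}
vocab = {w for t, _ in train_data for w in t.split()}

def classify(tweet):
    """MAP rule: pick the class maximizing log P(c) + sum of log P(w | c)."""
    best_class, best_score = None, float("-inf")
    for c in classes:
        total = sum(counts[c].values())
        score = math.log(priors[c])
        for w in tweet.split():
            # Laplace (add-one) smoothing for words unseen in class c.
            score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify("nice smooth car"))        # → positive
print(classify("terrible annoying car"))  # → negative
```

Working in log space avoids numeric underflow when multiplying many small probabilities.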
Pros of using this method:
1) The algorithm is very simple to implement.
2) It needs less training data to train the model, because it learns a conditional probability for each feature of the dataset.
3) Naïve Bayes is highly scalable: the overall runtime scales linearly with the number of features (dimensions) and classes, which makes it efficient for text classification.
4) It can be used for both binary and multiclass classification.
Cons of using this method:
1) The prior assumption made by the algorithm does not hold: it assumes the features of the dataset are independent, but in real-world scenarios features depend on one another, which can lead to incorrect outputs.
2) If a feature value in the test data was never observed in the training dataset, Naïve Bayes assigns it a probability of 0 and is therefore unable to predict. Smoothing techniques such as Laplace estimation are used to solve this.
Support Vector Machines:
SVM is a non-probabilistic, supervised machine learning algorithm that can separate data both linearly and non-linearly, representing features as points in space. It has a defined input and output format: the input is a vector and, in the binary case, the output is a label such as 0 (positive) or 1 (negative). The main goal of the SVM is to separate positive and negative sentiment by finding a hyperplane.
Text straight from the Twitter API is not suitable for training; it needs to be transformed into the format the SVM works with, so the text documents are pre-processed to match the input format. The basic classification is binary, but in general text documents are not classified precisely; some texts may belong to both categories and fit neither well. Essentially, SVM projects the data via a kernel and builds a linear separator in that kernel space. SVM supports many kernel functions, such as linear, polynomial, and RBF kernels, and by choosing an appropriate kernel it can perform linear classification in higher dimensions on a non-linear dataset.
Choosing the right hyperplane
Consider three hyperplanes A, B, and C; the goal is to identify the right hyperplane so that the data can be classified. This is done by selecting the hyperplane that separates the two classes best, i.e. the one with the maximum distance between the hyperplane and the closest data points; this distance is called the margin. Here the margin for A is greater than for B and C.
The function which defines the classification is represented as

f(x) = sign(wᵀx + b)

where
w – weight vector
b – bias
x – feature vector (positive and negative words)
Together, w and b define the hyperplane that performs the linear classification.
To select the best hyperplane, the SVM optimization can be written in its dual form:

maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ), subject to αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0

where αᵢ represents the Lagrange multipliers and (xᵢ, yᵢ) are the data points and labels in the dataset.
For the SVM we assign a label to each sentiment: 0 for positive, 1 for negative, and 2 for neutral. The feature vectors and labels are given to the SVM algorithm in order to classify the tweets.
The algorithm will be implemented in Python, since ready-to-use SVM libraries are already available for it.
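As a minimal sketch, assuming scikit-learn is available, the snippet below trains a linear SVM on a handful of invented tweets labelled 0 (positive), 1 (negative), and 2 (neutral). The tweets and the TF-IDF feature extraction are illustrative choices, not the system's final design.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy tweets; labels: 0 = positive, 1 = negative, 2 = neutral.
tweets = [
    "nice car smooth ride love it",
    "great engine fantastic interior",
    "terrible service annoying rattle",
    "engine failure awful experience",
    "picked up the car from the dealer today",
    "the car is parked outside",
]
labels = [0, 0, 1, 1, 2, 2]

# TF-IDF turns each tweet into a feature vector; LinearSVC finds the
# separating hyperplanes (one-vs-rest across the three classes).
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(tweets, labels)

# Classify new tweets; with this toy data these map to 0 (positive)
# and 1 (negative) respectively.
print(model.predict(["love smooth ride", "terrible annoying service"]))
```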
Cons of using this method:
1) Performance drops on larger datasets, since the SVM takes longer to train.
2) Performance decreases if the data is noisy: more noise causes the classes in the dataset to overlap.
3) Choosing the correct kernel function is not easy and takes time.
Pros of using this method:
1) SVM provides better performance than other classifiers when the dataset is comparatively small.
2) SVM provides a unique solution.
3) SVM has a complexity of O(N² · K), where K is the number of support vectors.
4) A text classification dataset can easily have more than 1000 features, all of which must be handled. SVM has overfitting protection: it can handle high-dimensional data with little risk, since its complexity does not depend directly on the number of features in the data.
Naïve Bayes Classifier:
Time complexity – O(Np), where N is the number of training examples and p is the number of features in the dataset.
Space complexity – O(pqr), where p is the number of features, q the number of values per feature, and r the number of class values.
Support Vector Machines:
Time complexity – O(N²): an increase in dataset size increases the complexity quadratically.
Space complexity – O(K), where K is the number of support vectors.
Comparison of Twitter analysis using SVM and Naïve Bayes
From the above comparative analysis, both models provide the required solutions for sentiment analysis. For our application scenario, however, qualities such as accuracy, speed, memory use, and efficiency are better guaranteed by Support Vector Machines: unlike Naïve Bayes, no prior independence assumption is required, and in real-world scenarios features are not independent but depend on one another. The proposed Support Vector Machine model is therefore the best fit for providing insights into customer sentiment from Twitter data for the automobile industry.
5. Application of selected approaches
This section presents the strategy for developing the proposed sentiment analysis software system, describing in detail each phase of the software development life cycle.
1. Initiation: the primary stage of the SDLC, where a proposal is created based on the identified problem and its opportunities. Planning of the techniques and models to be used in building the sentiment analysis software begins here.
2. Requirement Analysis: analyze user needs and develop user requirements. Complete functional and non-functional requirements are listed in detail, gathered through documents, client interviews, etc. Existing problems in the current system are identified and potential risks are addressed at this early stage, along with the expected solutions and a time constraint.
3. Design Phase: this stage provides a detailed functional description of the system and specifies both the functional and non-functional requirements. The proposed models are compared against the required solution by gathering datasets, producing outputs, and comparing the results.
4. Development Phase: the model proposed in the design phase is finalized, and the team of developers starts building the software system for the selected platform and programming language.
5. Testing Phase: verify that the developed software system complies with the specified functional requirements. The accuracy of the developed model is checked against the required solution; if it falls short, the system is examined for defects or bugs. The complexity of the model is also measured so that performance can be improved if required, and a detailed report of the data, using graphs and plots, verifies that the client requirements have been satisfied.
6. Implementation: the final phase of the SDLC, in which the developed software system is deployed to the production environment and checked against the requirements; if it does not comply, the testing process is repeated.
Once the sentiment analysis system for automobile-industry Twitter data is complete, manufacturers can gain better insight into how consumers react to their vehicles, as well as useful insights about their products and their competitors, enabling better product strategies and better profits.
6. Guidelines for deployment
The automobile industry holds one of the largest shares of the global economy, with more than 1 billion vehicles on the road and counting. This makes the industry very competitive, so manufacturers need to account for what consumers say about their vehicles: they can find the reasons behind fluctuations in the sales of their products and correct them in future products. Sentiment analysis is a machine learning approach to analyzing consumer sentiment from Twitter data. In this case study, we used two models, the Naïve Bayes classifier and Support Vector Machines, to describe our approach to analyzing consumer data. Comparing both models led to the conclusion that SVM is better in terms of complexity and in handling high-dimensional datasets, providing better accuracy than Naïve Bayes. The design phase provides a detailed process for gathering, transforming, and analyzing the data, which must be followed once the system has been developed.
In this case study, we compared two models, the Naïve Bayes classifier and the Support Vector Machine, and gathered useful insights into their strengths and weaknesses for analyzing consumer tweets. The comparative analysis revealed that SVM performs better in terms of accuracy, has better time and space complexity than Naïve Bayes, and is protected against overfitting. Since huge amounts of data are generated from Twitter, memory management should be considered to achieve good performance; given SVM's drawbacks on large datasets, the data may need to be chunked into smaller sets for better efficiency. Time is also a constraint when training the model, since achieving better accuracy requires more time for analysis. For future development, hybrid models that combine the accuracy of SVM with the feature handling of Naïve Bayes could yield even better performance and accuracy.