Last updated on 15.07.2020
Internet and social media
The Internet and social media have made access to news much easier and more convenient. Mass media has a huge influence on society, and, as often happens, there is someone who wants to take advantage of this fact. Publishing news online is cheap, and social media makes it much faster and easier to disseminate a large volume of fake news. To achieve certain goals, mass media may manipulate information in different ways.
This leads to the production of news articles that are not completely true or are even false. This paper presents a simple approach to fake news detection using multiple classification algorithms. Using datasets obtained from BuzzFeed and Kaggle, we apply TF-IDF (term frequency-inverse document frequency) and N-gram features to more than 20,000 articles. We test the datasets on multiple classification algorithms: Support Vector Machine, Decision Tree, Logistic Regression, Naive Bayes, and Perceptron. We achieved 98.02% accuracy on the Kaggle dataset using Logistic Regression.
In recent years, online content has played a significant role in shaping user perception. A large number of users access news articles from social media rather than from traditional news organizations, and they trust that content without knowing whether it is real or fake. Many online sources produce fake articles to mislead users, for example open-source websites, Facebook, Twitter, and other social media platforms.
Fake news and the resulting lack of trust in media are an increasing menace to our society. There exists a sizeable body of research on fake news detection, and companies such as Facebook and WhatsApp are also trying to verify whether content going viral on the Internet is valid.
There have been cases of mob lynching, violence, and other crimes, and most of these occurrences were linked to WhatsApp. In response, WhatsApp launched a nationwide campaign called "Share Joy, Not Rumours". The campaign was broadcast in multiple formats, including television, online, and radio, to help prevent the spread of rumours and fake news. As a messaging app, WhatsApp cannot access users' chats, so it started banning accounts suspected of spreading such content.
Facebook, as a major channel for rumours and fake news, is also taking steps to stop it. Facebook tries to locate fake news across its departments by applying machine learning to detect fraud and by adding such accounts to its spam account list. Various other communities are also working on the problem by running competitions such as the "Fake News Challenge", where people from various backgrounds propose solutions. The spread of fake news is a major concern and an important problem.
In this paper, we compare the performance of models using TF-IDF (term frequency-inverse document frequency) and an N-gram model for feature extraction on a Kaggle dataset of 20,800 articles and on BuzzFeed data. The aim of the research is to examine how different algorithms perform on these datasets. We applied TF-IDF to the datasets and then TF-IDF over N-grams. We did not find much difference between the results of TF-IDF over N-grams and plain TF-IDF.
Fake news and spam detection is a widely studied problem; spam detection in particular is an area of great interest to researchers. Arushi Gupta et al. proposed a framework to detect spam in an online social network (OSN). The main task of their work was to detect spam on one social network in a way that could be applied across all online social networks. They used three algorithms: Naive Bayes classifiers, clustering, and a Decision Tree classifier. Shlok Gilda, in "Evaluating Machine Learning Algorithms for Fake News Detection", worked on the Signal Media dataset and used TF-IDF, TF-IDF over bigrams, and probabilistic context-free grammars (PCFG) for feature extraction.
They achieved 77.2% accuracy on the test set using stochastic gradient descent for classification, with PCFG and TF-IDF over bigrams for feature generation. Mykhailo Granik, in "Fake News Detection Using Naive Bayes Classifier", used BuzzFeed data to train a Naive Bayes classifier, tested it on Facebook posts, and achieved 74% accuracy. Kai Shu, in "Fake News Detection on Social Media: A Data Mining Perspective", explored the fake news problem by reviewing the existing literature in two ways: characterization and detection.
In the characterization phase, they described the principles of fake news in both traditional and social media. In the detection phase, they explored existing fake news detection approaches from a data mining perspective, including feature extraction and model selection, and they also discussed datasets and evaluation metrics. Shivam B. Parikh, in "Media-Rich Fake News Detection", described the characteristics and types of fake news and linguistic-feature-based methods, and gave an overview of text-based classification. They also surveyed popular datasets (BuzzFeed News, LIAR, PHEME, CREDBANK) on which classification and other techniques can be performed. Samir Bajaj, in "Fake News Detection Using Deep Learning", used a Kaggle dataset to implement classifiers that predict based only on the news content. Using NLP (natural language processing), they trained several models: Logistic Regression, a two-layer feed-forward neural network, a recurrent neural network, long short-term memory networks, gated recurrent units, a bidirectional RNN, and a CNN with max pooling.
The basic steps for detecting misleading or false news are the following:
- collecting the dataset;
- pre-processing the news content;
- information extraction;
- model construction;
- evaluation metrics.
We used the Kaggle and BuzzFeed datasets for detecting fraudulent news. The Kaggle dataset consists of four attributes: the title of the news item, its content, its author, and a label, where 1 indicates the news is fake and 0 indicates it is true. The BuzzFeed data consists of two folders, named FakeNewsContent and RealNewsContent, each containing 91 JSON files. Each JSON file contains the author, title, text, and source of the news item. For the Kaggle dataset we used 10% of the data for testing and the rest for training, whereas for the BuzzFeed dataset we used 20% for testing and 80% for training.
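As a rough sketch (not the authors' exact code), the split described above could be produced with pandas and scikit-learn. The tiny inline DataFrame below stands in for the real Kaggle file, which would normally be loaded from CSV; the column names follow the attribute description above.

```python
# A minimal sketch of the train/test split described above. The inline
# DataFrame is a placeholder for the real Kaggle data, which would be
# loaded with pd.read_csv(...); columns follow the description above
# (title, text, label; 1 = fake, 0 = true).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "title": [f"headline {i}" for i in range(10)],
    "text": ["real story"] * 5 + ["fake story"] * 5,
    "label": [0] * 5 + [1] * 5,
})

# 10% of the Kaggle data is held out for testing (20% for BuzzFeed).
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.10, random_state=42
)
print(len(X_train), len(X_test))
```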
Pre-processing of News Content:
Pre-processing involves preparing a text document for analysis. This stage involves the following steps:
- sentence segmentation;
- tokenization;
- stop word removal;
- stemming.
In a linguistic analysis of a natural language text, it is necessary to clearly define what constitutes a word and a sentence. Segmentation of a text into sentences is thus a necessary prerequisite for many NLP tasks: it is the process of determining longer processing units consisting of one or more words, and is frequently referred to as sentence boundary disambiguation or sentence boundary recognition.
Tokenization is a critical activity in any information retrieval model. It is the process of breaking the sequence of characters in a text into units, called tokens, by locating the word boundaries: the points where one word ends and another begins. In other words, it segregates all the words, numbers, and other characters. Along with token generation, this process also computes the frequency of each token in the input documents.
Stop Word Removal
Many of the most frequently used words in English carry no useful information for many natural language processing tasks; these words are called stop words.
Stemming
Stemming is a technique for finding morphological variants of search terms. Syntactically similar words, such as plurals and verbal variations, are treated as similar; the purpose of this procedure is to obtain the stem, or root, of each word, which emphasizes its semantics. Porter's algorithm is the most commonly used stemmer for English words.
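The four pre-processing stages above can be sketched in a few lines of plain Python. Note that the stop-word list and the suffix-stripping rule below are deliberately tiny stand-ins for illustration; a real pipeline would use a full stop-word list and Porter's stemming algorithm (e.g. NLTK's PorterStemmer).

```python
# Toy sketch of the pre-processing stages: sentence segmentation,
# tokenization, stop-word removal, and stemming. The stop-word set and
# the suffix rule are minimal stand-ins, not Porter's real algorithm.
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}

def preprocess(text):
    # Sentence segmentation: split on sentence-final punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = []
    for sentence in sentences:
        # Tokenization: lowercase alphabetic tokens, punctuation dropped.
        tokens.extend(re.findall(r"[a-z]+", sentence.lower()))
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude stemming: strip a few common suffixes (stand-in for Porter).
    return [re.sub(r"(ing|ly|es|s)$", "", t) for t in tokens]

print(preprocess("Fake stories are spreading quickly. Readers rarely check sources."))
```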
Information Extraction Module
TFIDF (Term Frequency Inverse Document Frequency):
TF-IDF is a numerical statistic used to measure the importance of a word in a corpus. Term frequency (TF) is the number of occurrences of a word in a given document relative to the total number of terms in that document. Inverse document frequency (IDF) relates the total number of documents to the number of documents that contain a specific word.
- TF = (number of times term t appears in a document) / (total number of terms in the document);
- IDF_j = log(1 + n / df_j);
- df_j = number of documents that contain term j;
- n = total number of documents;
- multiplying the two scores gives the TF-IDF score of a term in a document.
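A small worked example of these formulas, with three invented mini-documents:

```python
# Worked example of the TF, IDF, and TF-IDF formulas above.
# The three "documents" are invented for illustration only.
import math

docs = [
    ["fake", "news", "spreads", "fast"],
    ["real", "news", "matters"],
    ["fake", "accounts", "spread", "fake", "stories"],
]
n = len(docs)                          # total number of documents

def tf(term, doc):
    return doc.count(term) / len(doc)  # occurrences / terms in document

def idf(term):
    df = sum(term in doc for doc in docs)   # documents containing the term
    return math.log(1 + n / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)   # the product is the TF-IDF score

# "fake" appears twice in the 5-term third document and in 2 of 3 docs.
print(round(tf_idf("fake", docs[2]), 4))
```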
TFIDF Using N-Gram:
We used N-grams to capture sequences of words in the text corpus. We set the N-gram size to two, i.e. we used bigrams to extract content from the text, and then applied TF-IDF to the extracted bigrams.
In this work, we applied multiple algorithms to the BuzzFeed and Kaggle datasets and obtained different accuracies on each. On the BuzzFeed dataset (TABLE I), we obtained a maximum accuracy of 91.89% with both Decision Tree and Support Vector Machine using TF-IDF, whereas TF-IDF over N-grams gave 91.89% with Decision Tree and 86.48% with Support Vector Machine. Similarly, on the Kaggle dataset, Support Vector Machine (SVM) gave a maximum accuracy of 96.92% with a precision of 97.25% using TF-IDF, whereas TF-IDF over N-grams gave 98.02% accuracy with Logistic Regression at a precision of 97.84%.
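A minimal end-to-end sketch of the kind of pipeline compared above (TF-IDF features feeding Logistic Regression) follows; the six labelled texts are invented placeholders for the real articles, so this toy run says nothing about the reported accuracies.

```python
# Sketch of a TF-IDF + Logistic Regression pipeline like the one
# compared above. The labelled texts are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "official report confirms the new policy",
    "government publishes verified statistics",
    "scientists confirm the study results",
    "shocking secret they do not want you to know",
    "miracle cure doctors hate revealed",
    "you will not believe this one weird trick",
]
labels = [0, 0, 0, 1, 1, 1]            # 1 = fake, 0 = true

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["shocking miracle trick revealed"]))
```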
To calculate accuracy, we used a confusion matrix. A confusion matrix is a useful tool for analysing how well a classifier recognizes tuples of different classes. True positives (TP) and true negatives (TN) tell us when the classifier gets things right; false positives (FP) and false negatives (FN) tell us where it gets things wrong.
Accuracy is the proportion of tuples that are correctly classified by the classifier.
The error rate, or misclassification rate, is the proportion of tuples that are wrongly classified by the classifier; it is simply 1 − accuracy.
Precision is a good measure when the cost of a false positive is high.
Recall measures how many of the actual positives the model captures by labelling them as positive.
The F1 score may be a better measure when we need to balance precision and recall and the class distribution is uneven.
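The metrics above follow directly from the confusion-matrix counts; in this short illustration the TP/FP/FN/TN values are invented.

```python
# The evaluation metrics computed from confusion-matrix counts.
# The TP/FP/FN/TN values here are invented for illustration.
TP, FP, FN, TN = 90, 5, 10, 95

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # correctly classified share
error     = 1 - accuracy                      # misclassification rate
precision = TP / (TP + FP)                    # cost of false positives
recall    = TP / (TP + FN)                    # captured actual positives
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, round(precision, 4), recall, round(f1, 4))
```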
We worked on news content to classify news as fake or real, but a major source of misleading news is social media content that is difficult to process: messages, videos, and images shared by users are hard to verify as real or fake, and quotes or messages containing unauthenticated information shared in chats mislead other users. Identifying misleading information forwarded as text, images, or video, and validating web articles in real time as fake or real, can therefore be added to the future scope.
Social media has an increasing daily influence on people and is now one of the most common channels for disseminating fake or inappropriate news. The spread of rumours and misleading news can create havoc, and this ambiguity has a strong negative impact on the masses and on society. In this paper, we explored this problem by reviewing previously published literature. We reviewed existing detection approaches, including feature extraction and model construction, discussed the datasets used in our approach, estimated evaluation metrics, and outlined future directions in fake news detection for digital content. Our proposed approach addresses this ambiguity using multiple models and achieves an accuracy of 98.02%.