This analysis investigates online lead conversion for a life insurance organization, with the aim of identifying the features that drive conversion and a suitable algorithm for predicting it. Through a company survey, information about each respondent, such as income, age, home ownership and marital status, has been collected. Some of the survey takers went on to become customers, and the organization holds the historical data on which leads converted. The idea is to use the features in this data to find the most influential ones, together with an accurate algorithm for estimating how likely a lead is to become a customer, based on the attributes asked about in the survey.
Feature selection, which is part of data pre-processing, plays a very important role in classification algorithms. A classification system built by hand usually contains a large number of if-else conditions, and this approach turns out to be tedious and time-consuming. In large data sets, features that are not essential for prediction can be removed or reduced. However, separating out the unnecessary features is very likely to cause some loss of information. To limit this, the level of dependence between each feature and the target label is assessed, so that the information lost has little or no impact on model accuracy.
The next section reviews the models used for classification. Each model is suited to certain situations, and we will determine which models perform best on this data set. Marketing companies try to find the best way to market and to capture the customer's attention with the help of predictions, which indicate what a customer might be attracted to.
In 1959, Arthur Samuel of IBM was the first person to use the term machine learning and to work on it. More than five decades later, research is still being done to find suitable algorithms for analysing historical data. In data mining, it is well established that an algorithm suitable for one particular dataset may not be suitable for another.
A crucial step in analysing the dataset is to clean it of unnecessary information. Whenever a new feature is introduced, the analyst has to determine whether it is needed for the rest of the process or should be removed. Several effective feature selection methods have been developed for different situations and demands. With the help of feature selection, we can determine the best possible combination of features for the prediction. Feature selection can improve the performance of the model while reducing its time complexity. The proposed dataset contains features with categorical values, and the Chi-Square test is used to determine which of these features are more influential and which can be ignored. With this test, the independence of any two categorical features can be determined easily. To find the dependence between any two continuous features, their correlation can be computed. The correlation coefficient ranges from -1 to 1: values close to 0 indicate little linear relationship between the two features, while values close to -1 or 1 indicate strong linear dependence.
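The two measures described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the contingency table (home ownership against conversion) and the age and income values are invented for the example and are not taken from the study's data, and the function names are hypothetical.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two features were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical counts: rows = home owner (yes, no), columns = converted (yes, no)
ownership_vs_conversion = [[30, 20],
                           [10, 40]]
statistic, dof = chi_square_statistic(ownership_vs_conversion)
# For 1 degree of freedom, the critical value at the 5% level is 3.841;
# a statistic above it suggests the two features are not independent.
print(f"chi-square = {statistic:.2f}, dof = {dof}")

# Hypothetical continuous features: age and income (in thousands)
ages = [25, 32, 41, 48, 55]
incomes = [30, 42, 50, 61, 70]
print(f"correlation = {pearson_correlation(ages, incomes):.3f}")
```

In practice a statistics library would normally be used for the p-value; the hand-written version is kept here only to make the calculation explicit.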
According to Kotsiantis, the critical step in the process of data analysis is choosing the correct prediction algorithm. There are many factors to consider when choosing a learning algorithm. Here the predicted outcome has to be categorical, so the algorithms used for prediction must be classifiers. Five algorithms will be tested on the dataset to determine which one performs better than the others.
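The comparison described above can be sketched as a small hold-out evaluation. The code below is a minimal illustration, not the study's procedure: the two toy classifiers (a majority-class baseline and a single-feature threshold rule) are stand-ins for the five algorithms, and the data, labels and function names are invented for the example.

```python
from collections import Counter

def accuracy(predict, features, labels):
    """Fraction of examples for which predict(x) equals the true label."""
    correct = sum(1 for x, y in zip(features, labels) if predict(x) == y)
    return correct / len(labels)

def fit_majority(train_x, train_y):
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common

def fit_threshold(train_x, train_y):
    """Toy rule on feature 0: cut halfway between the two class means."""
    positives = [x[0] for x, y in zip(train_x, train_y) if y == 1]
    negatives = [x[0] for x, y in zip(train_x, train_y) if y == 0]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    cut = (mean_pos + mean_neg) / 2
    positive_above = mean_pos > mean_neg
    return lambda x: int((x[0] > cut) == positive_above)

# Hypothetical data: one feature (age), label 1 = converted, 0 = not converted
train_x = [(20,), (25,), (30,), (45,), (50,), (55,)]
train_y = [0, 0, 0, 0, 1, 1]
test_x = [(22,), (35,), (48,), (60,)]
test_y = [0, 0, 1, 1]

candidates = {"majority": fit_majority, "threshold": fit_threshold}
scores = {
    name: accuracy(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Every candidate is fitted on the same training split and scored on the same held-out split, so the accuracies are directly comparable. With a real dataset, cross-validation would give a more stable estimate than a single split.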
In 1992, an algorithm was introduced that splits the pool of data with the help of hyperplanes; it is known as the support vector machine (SVM). In 1994, an active learning approach was introduced by Lewis and Gale. Some other classification algorithms can perform the same task but cannot match the accuracy of the SVM. However, the SVM requires a very large amount of computational resources, so less complex algorithms such as random forest, the decision tree classifier and k-nearest neighbour come into play.
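To give a sense of how simple one of these lighter alternatives is, the following is a minimal sketch of a k-nearest neighbour classifier in plain Python. The lead records (age, income in thousands) and their labels are hypothetical, and `knn_predict` is an illustrative name rather than a function from any library.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points,
    using Euclidean distance."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical leads: (age, income in thousands)
train_points = [(25, 30), (30, 35), (28, 32), (50, 90), (55, 95), (48, 85)]
train_labels = ["not_converted"] * 3 + ["converted"] * 3

print(knn_predict(train_points, train_labels, (52, 88)))
print(knn_predict(train_points, train_labels, (27, 31)))
```

There is no training phase: the cost is paid at prediction time, when distances to all stored points are computed. Because the vote depends on raw distances, features on very different scales would normally be standardized first.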