Credit Risk Valuation Using an Efficient Machine Learning Algorithm
1Kovvuri Ramya Sri, 2Ch Ramesh
1M. Tech Student, Dept. of IT, G. Narayanamma Institute of Technology and Science, Shaikpet, Hyderabad, Telangana, India.
2Assistant Professor, Dept. of IT, G. Narayanamma Institute of Technology and Science, Shaikpet, Hyderabad, Telangana, India
Abstract. Automating credit assessment improves the efficiency of the detection process and may also raise detection accuracy by removing subjective human factors. If machine learning can automatically identify bad customers, it will provide considerable benefits to the banking and financial system. The goal of this work is to calculate a credit score and categorize customers as good or bad. Algorithms from a machine learning library are used to classify data sets from the finance sector. Banks generate a large volume of multi-structured customer data, and when this data is incomplete, the accuracy of the analysis is reduced. In the proposed system, we apply machine learning algorithms for effective prediction of customer default. We evaluate several prediction models on real-life bank data. Compared with several typical prediction algorithms, the prediction accuracy of the proposed algorithm is high.
Keywords: Machine Learning, Credit Scoring, Logistic Regression, Random Forest, CRISP-DM Framework.
Hundreds of banks in the United States alone suffer from non-payment or late payment of loans. Identifying such customers early enables preventive interventions, which in turn can lead to substantial cost savings and improved outcomes. We develop algorithms for predicting customer behavior by drawing on ideas and techniques from machine learning. We explore standard classification methods such as logistic regression and random forests, as well as more sophisticated sequence models, including recurrent neural networks. We focus especially on the use of banking code data for customer behavior prediction and explore different ways of representing such data in our prediction algorithms. A data collection mechanism is designed, and a correlation analysis of the collected data is performed. A stochastic prediction model is designed to foresee the future condition of the most correlated customers based on their current account status. In the banking and finance communities, a large volume of multi-structured customer data is generated from transactions, account statements and online purchases.
Imagine a system where banks can quickly go through millions of anonymized customer records to find people with good credit scores and bank experiences. Through this massive, searchable database, banks could determine whom to offer a loan, based on what has worked effectively for others with similar behavior and characteristics.
1.1 Precision Banking
What makes precision banking unique is that it goes beyond prediction for existing customers and conditions to predicting and preventing bad debts from new customers before they arise. It stands at the convergence of finance, technology and big data, offering new ways to keep banks profitable. Precision banking translates data into information that can prevent losses for banks in ways that were not possible before. It promises a new level of precision in managing banks.
In sifting through this data, researchers can better predict individual credit scores and develop approaches to early detection and prevention, with information that helps them make real-time decisions about the best way to offer loans to customers.
Large-scale data analysis is also enabling researchers to develop more targeted and cost-effective methods for predicting credit scores before transactions are made.
Publicly available banking customer data is used to identify specific attributes associated with default, laying the groundwork for a simple test for defaulters.
1.2 Data Access
Banking has long been a data-rich field. With so many moving parts, banking providers and financiers have no shortage of variables to measure. The captured data have many important uses. They keep tabs on credits and debits. They track the activity of transactions and savings. Crucially, the data record the state of individuals at both a microscopic and a macroscopic level. It is difficult to exaggerate the importance of data when it comes to improving banking systems. Although this work focuses on using banking data for credit score prediction, many other aspects of the banking sector can be enhanced, and even revolutionized, through intelligent use of data. Obtaining access to banking data is often a fraught endeavor, which can halt the progress of researchers unaffiliated with finance companies or banking systems. Despite these difficulties, the potential rewards of better understanding and using banking data to improve the banking sector far outweigh the frustrations of data access.
1.3 Data Capture
There are various systems in place for capturing banking data. Modern banking systems use tools for systematically and digitally storing a wide range of data, including customer demographics, account history, purchases, transactions, deposits, and more. These systems also facilitate data access and visualization, allowing bankers and customers to better inform themselves. Although these records capture only the activities that occur within a particular facility or set of facilities, they provide a vivid account of an individual’s state. Insurance claims form another rich repository of banking data. Claims data center on individuals enrolled in an insurance policy and their interactions with the banking system. These records typically include basic demographic information about customers, along with purchases, transactions, deposits, savings and associated taxes. Because insurance claims data are so customer-centric and can capture the banking activity of customers across a variety of banks and financial organizations, they paint a rather comprehensive picture of an individual’s account history and current account state.
2 Data Understanding
2.1 Sample Variables
Much of banking data consists of simple numerical and categorical variables. These include demographic variables such as age, sex, and ethnicity. Employment variables such as salary, job type, designation, work experience, and many others are also straightforward. These types of simple data are suited for standard analytical and statistical methods (such as linear or logistic regression). To stop with just the simple variables, however, would be to miss out on potentially valuable insight provided by more complex sources of data.
Two data sets are required for the analysis: Demographic data and Credit Bureau data.
Demographic data: contains simple variables describing the customer.
Credit Bureau data: contains variables derived from the customer's previous credit history.
Both datasets are provided by the bank.
Table 1. Demographic Data.
Variable | Description
Application ID | Unique ID of the customer
Age | Age of the customer
Gender | Gender of the customer
Marital Status | Marital status of the customer (at the time of application)
No of dependents | Number of children of the customer
Income | Income of the customer
Education | Education of the customer
Profession | Profession of the customer
Type of residence | Type of residence of the customer
No of months in current residence | Number of months the customer has lived in the current residence
No of months in current company | Number of months the customer has worked in the current company
Performance Tag | Status of customer performance (“1” represents “Default”)
Table 2. Credit Bureau Data.
Variable | Description
Application ID | Customer application ID
No of times 90 DPD or worse in last 6 months | Number of times the customer has not paid dues for 90 days or more in the last 6 months
No of times 60 DPD or worse in last 6 months | Number of times the customer has not paid dues for 60 days or more in the last 6 months
No of times 30 DPD or worse in last 6 months | Number of times the customer has not paid dues for 30 days or more in the last 6 months
No of times 90 DPD or worse in last 12 months | Number of times the customer has not paid dues for 90 days or more in the last 12 months
No of times 60 DPD or worse in last 12 months | Number of times the customer has not paid dues for 60 days or more in the last 12 months
No of times 30 DPD or worse in last 12 months | Number of times the customer has not paid dues for 30 days or more in the last 12 months
Average CC Utilization in last 12 months | Average credit card utilization by the customer
No of trades opened in last 6 months | Number of trades the customer has opened in the last 6 months
No of trades opened in last 12 months | Number of trades the customer has opened in the last 12 months
No of PL trades opened in last 6 months | Number of PL trades the customer has opened in the last 6 months
No of PL trades opened in last 12 months | Number of PL trades the customer has opened in the last 12 months
No of Inquiries in last 6 months (excluding home and auto loans) | Number of inquiries made by the customer in the last 6 months
No of Inquiries in last 12 months (excluding home and auto loans) | Number of inquiries made by the customer in the last 12 months
Presence of open home loan | Whether the customer has a home loan (“1” represents “Yes”)
Outstanding Balance | Outstanding balance of the customer
Total No of Trades | Total number of trades made by the customer
Presence of open auto loan | Whether the customer has an auto loan (“1” represents “Yes”)
Performance Tag | Status of customer performance (“1” represents “Default”)
The data contain a variable, Performance Tag, which indicates whether the applicant defaulted after receiving a credit card. Some records have no performance tag; these records are treated as rejected applications. After setting the rejected records aside, 69,867 records remain, of which 4% are defaults. The company does not know whether the rejected applicants also include good customers.
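The split between labelled and rejected records described above can be sketched as follows. This is a minimal illustration on toy records, not the authors' code; the field names `application_id` and `performance_tag` are assumed stand-ins for the bank's actual column names.

```python
# Toy records; field names are illustrative, not the bank's actual schema.
records = [
    {"application_id": 1, "performance_tag": 0},
    {"application_id": 2, "performance_tag": 1},
    {"application_id": 3, "performance_tag": None},  # no tag: treated as rejected
    {"application_id": 4, "performance_tag": 0},
    {"application_id": 5, "performance_tag": 0},
]


def split_rejected(rows):
    """Separate rows with a missing performance tag from labelled rows."""
    labelled = [r for r in rows if r["performance_tag"] is not None]
    rejected = [r for r in rows if r["performance_tag"] is None]
    return labelled, rejected


def default_rate(rows):
    """Share of labelled rows whose performance tag is 1 (default)."""
    if not rows:
        return 0.0
    return sum(r["performance_tag"] for r in rows) / len(rows)


labelled, rejected = split_rejected(records)
rate = default_rate(labelled)
print(f"{len(labelled)} labelled, {len(rejected)} rejected, default rate {rate:.1%}")
```

On the real data the same computation would be expected to give a default rate of about 4% over the 69,867 labelled records.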
Fig. 1. Percentage of non-default and default customers.
3 Data Cleaning and Exploratory Data Analysis
Preliminary checks, such as inspecting the structure and summary of the data, were carried out. The data were checked for duplicates, and 3 duplicate records sharing the same Application ID were removed. The Demographic and Credit Bureau data were then merged. Missing values are handled by the WOE analysis described later. Outlier treatment was applied to variables such as Age, Income and No.of.months.in.current.company. The charts below are an example:
Fig. 2. Data cleaning graph.
Fig. 3. Data cleaning Graphs.
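The cleaning steps above (removing duplicates, merging the two data sets, treating outliers) might look like the following sketch. The paper does not state how outliers were treated, so capping values to a fixed range is an assumption here, as are the field names and the bounds of 18 and 65 for age.

```python
def drop_duplicates(rows, key="application_id"):
    """Keep only the first row seen for each key value."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def inner_merge(left, right, key="application_id"):
    """Join two lists of dicts on a shared key, keeping keys present in both."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left if l[key] in right_by_key]


def cap(value, lower, upper):
    """Clamp a value into [lower, upper]; one simple form of outlier treatment."""
    return max(lower, min(upper, value))


# Toy data with one duplicated application and one implausible age.
demographic = [
    {"application_id": 1, "age": 35},
    {"application_id": 2, "age": 150},
    {"application_id": 2, "age": 150},
    {"application_id": 3, "age": 41},
]
bureau = [
    {"application_id": 1, "outstanding_balance": 1000},
    {"application_id": 2, "outstanding_balance": 250},
    {"application_id": 3, "outstanding_balance": 0},
]

demo_unique = drop_duplicates(demographic)
merged = inner_merge(demo_unique, bureau)
for row in merged:
    row["age"] = cap(row["age"], 18, 65)  # assumed bounds, for illustration only
print(merged)
```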
EDA was performed on all variables using a derived variable called Default Rate.
Fig. 4. Plots with all variables of credit bureau data.
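As an illustration of the derived Default Rate variable, the sketch below computes the share of defaulters within each level of a categorical variable. The exact definition used in the study is not given, so this per-level ratio is an assumption, and the `education` values are invented toy data.

```python
from collections import defaultdict


def default_rate_by(rows, variable, target="performance_tag"):
    """Return {level: share of defaults} for each level of `variable`."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for row in rows:
        totals[row[variable]] += 1
        defaults[row[variable]] += row[target]
    return {level: defaults[level] / totals[level] for level in totals}


# Toy data: 4 "Bachelor" customers (1 default) and 2 "Masters" customers (0 defaults).
rows = [
    {"education": "Bachelor", "performance_tag": 0},
    {"education": "Bachelor", "performance_tag": 0},
    {"education": "Bachelor", "performance_tag": 1},
    {"education": "Bachelor", "performance_tag": 0},
    {"education": "Masters", "performance_tag": 0},
    {"education": "Masters", "performance_tag": 0},
]

rates = default_rate_by(rows, "education")
print(rates)
```

Plotting such rates per level, or per bin for numeric variables, gives charts of the kind referred to in the figures.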
4 Weight of Evidence Analysis
WOE analysis was performed on the data, and the demographic and credit bureau values were replaced with their WOE values. A sample plot for the demographic data follows; the credit bureau data were treated in the same way.
Fig. 5. Plots with all variables of credit bureau data.
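A common formulation of Weight of Evidence for a bin i is WOE_i = ln(share of goods in bin i / share of bads in bin i), with an Information Value contribution of (share of goods - share of bads) * WOE_i. The sketch below follows that formulation under several assumptions not stated in the paper: "good" means Performance Tag 0, missing values form their own bin (one way in which WOE can absorb missing-value treatment), and a small smoothing constant guards against empty cells.

```python
import math
from collections import Counter


def woe_table(values, targets, smoothing=0.5):
    """Compute WOE and IV contribution per bin; None values go to a 'missing' bin."""
    good, bad = Counter(), Counter()
    for value, target in zip(values, targets):
        label = "missing" if value is None else value
        if target == 1:
            bad[label] += 1
        else:
            good[label] += 1
    bins = sorted(set(good) | set(bad), key=str)
    total_good = sum(good.values()) + smoothing * len(bins)
    total_bad = sum(bad.values()) + smoothing * len(bins)
    table = {}
    for label in bins:
        pct_good = (good[label] + smoothing) / total_good
        pct_bad = (bad[label] + smoothing) / total_bad
        woe = math.log(pct_good / pct_bad)
        table[label] = {"woe": woe, "iv": (pct_good - pct_bad) * woe}
    return table


# Toy data: "rented" has 3 goods and 1 bad, "owned" has 1 good and 3 bads.
residence = ["rented"] * 4 + ["owned"] * 4
tags = [0, 0, 0, 1, 0, 1, 1, 1]

table = woe_table(residence, tags)
encoded = [table[v]["woe"] for v in residence]  # values replaced by their WOE
total_iv = sum(entry["iv"] for entry in table.values())
print(table, total_iv)
```

With this sign convention, a positive WOE marks a bin with relatively more good customers, and a larger total IV suggests a more predictive variable.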
5 Model Building
Based on the analysis of the data, we built models using Logistic Regression, Decision Trees and Random Forests and picked the model that best fits this data.
Table 3. Result of analysis using different models on data.
Model | Data on which model was built | Accuracy (%) | Sensitivity (%) | Specificity (%)
Logistic Regression | Demographic Data | 53.54 | 60.4 | 53.24
Decision Trees | Demographic Data - Overbalancing | 52.6 | 60 | 59.7
Decision Trees | Demographic Data - Underbalancing | 56.6 | 55.6 | 55.7
Decision Trees | Demographic Data - Both | 52.6 | 60 | 59.7
Decision Trees | Demographic Data - Balancing with ROSE | 61.42 | 49.97 | 50.46
Random Forests | Demographic Data - Overbalancing | 51.4 | 56.22 | 51.18
Random Forests | Demographic Data - Underbalancing | 52.8 | 53.1 | 52.8
Random Forests | Demographic Data - Both | 52 | 54.4 | 51.8
Random Forests | Demographic Data - Balancing with ROSE | 55 | 53.5 | 55.06
Logistic Regression | Whole Data | 67.49 | 58.71 | 67.87
Logistic Regression | Whole Data - Balanced | 63.5 | 63.8 | 63.5
Decision Trees | Whole Data - Overbalancing | 50.79 | 76.01 | 49.67
Decision Trees | Whole Data - Underbalancing | 59.9 | 67.3 | 59.57
Decision Trees | Whole Data - Both | 50.79 | 76.01 | 49.67
Decision Trees | Whole Data - Balancing with ROSE | 73.92 | 47.96 | 75.06
Random Forests | Whole Data - Without Balancing | 64.5 | 57.35 | 64.82
Random Forests | Whole Data - Overbalancing | 55.22 | 62.33 | 54.9
Random Forests | Whole Data - Underbalancing | 61.74 | 61.99 | 61.72
Random Forests | Whole Data - Both | 62.2 | 57.8 | 62.39
Random Forests | Whole Data - Balancing with ROSE | 63.4 | 64.06 | 63.41
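The balancing variants named in the table can be illustrated with plain random resampling. The sketch below is an assumption about what "overbalancing" and "underbalancing" mean (random over- and under-sampling of the training set); it does not reproduce ROSE, which generates synthetic examples rather than resampling existing rows. Function and field names are hypothetical.

```python
import random


def rebalance(rows, target="performance_tag", method="over", seed=0):
    """Equalize class sizes by random over- or under-sampling (defaults = class 1)."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[target] == 1]
    majority = [r for r in rows if r[target] == 0]
    if method == "over":
        # Add randomly repeated defaults until both classes are the same size.
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        minority = minority + extra
    elif method == "under":
        # Keep only as many non-defaults as there are defaults.
        majority = rng.sample(majority, len(minority))
    else:
        raise ValueError("method must be 'over' or 'under'")
    balanced = minority + majority
    rng.shuffle(balanced)
    return balanced


# Toy training set with a 4% default rate, as in the data described earlier.
train = [{"performance_tag": 1} for _ in range(4)] + [{"performance_tag": 0} for _ in range(96)]

over = rebalance(train, method="over")
under = rebalance(train, method="under")
print(len(over), len(under))
```

Resampling of this kind should be applied to the training split only, so that the evaluation metrics reflect the original class distribution.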
6 Model Evaluation
From the metrics above, one model must be chosen that is consistent across all three metrics: Accuracy, Sensitivity and Specificity. Although the Decision Tree built on whole data balanced with ROSE reaches an accuracy of 73.92, its Sensitivity is only 47.96. This leaves the Logistic Regression and Random Forest models built on balanced whole data, which have nearly equal values for all three metrics. We chose Random Forests for two reasons: its Sensitivity is slightly higher than that of Logistic Regression (64.06 versus 63.8 in Table 3), and Random Forests generally perform well on unseen data. Random Forest on balanced data is therefore our final model.
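The three metrics can be computed from a confusion matrix as in the sketch below, which treats "default" (tag 1) as the positive class; that choice is an assumption consistent with Sensitivity measuring how many defaulters are caught. The toy predictions show how a classifier can look reasonable on Accuracy while missing half of the defaulters.

```python
def classification_metrics(actual, predicted):
    """Accuracy, sensitivity and specificity with class 1 (default) as positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def metric_spread(metrics):
    """Gap between the best and worst metric; small values mean a consistent model."""
    return max(metrics.values()) - min(metrics.values())


# Toy example: 2 defaulters among 10 customers.
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]

metrics = classification_metrics(actual, predicted)
print(metrics, metric_spread(metrics))
```

Comparing models by the spread between their metrics is one simple way to express the consistency criterion used for model selection.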
6.1 Important variables from the model:
For the model built only on demographic data, the important variables are:
No.of.months.in.current.residence, Income, and No.of.months.in.current.company.
For the final model, i.e., Random Forest, the important variables can be derived from the plot below.
Fig. 6. Variables plot.
7 Application Score Card
7.1 Implications of using the model
From the model and the scorecard built on it, the cut-off score is set to 355.2808. Applying this cut-off to the 1,425 rejected candidates shows that 256 of them have a score above the cut-off. Using the model therefore reduces the rejected population, which increases revenue for the company. The model also lets the company avoid a manual process for approving credit cards.
7.2 Financial Benefit
To assess the financial benefit of the model, assume the loss per default is $1,000. Under this assumption, the overall loss without the model comes to about $884,000, whereas with the model the loss would have been $323,000.
Total Loss Avoided = 884,000 - 323,000 = 561,000
So, by using the model, the company gains a benefit of about $561,000 (roughly 5.6 lakh).
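The arithmetic above can be checked with a few lines of code. The loss figures and the $1,000 loss per default come from the text; the default counts are not reported directly and are only what those figures imply.

```python
LOSS_PER_DEFAULT = 1_000       # assumed loss per default, as stated in the text
loss_without_model = 884_000   # overall loss without the model
loss_with_model = 323_000      # overall loss with the model

loss_avoided = loss_without_model - loss_with_model
reduction_pct = 100 * loss_avoided / loss_without_model

# Default counts implied by the loss figures and the per-default loss.
defaults_without_model = loss_without_model // LOSS_PER_DEFAULT
defaults_with_model = loss_with_model // LOSS_PER_DEFAULT

print(f"Loss avoided: ${loss_avoided:,} ({reduction_pct:.1f}% reduction)")
print(f"Implied defaults: {defaults_without_model} without model, {defaults_with_model} with model")
```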
7.3 Recommendations to business
Implementing this model in place of the manual process saves the company the time and effort spent on acquiring the right customers. It avoids unnecessary rejection of legitimate customers. It also predicts the probability of default by finding hidden patterns in historical data.
An application scorecard has been built from the fitted model. The relationship between score and log odds is summarized below:
Fig. 7. Scorecard vs. log odds.
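Scorecards are commonly built as a linear function of the log odds: score = offset + factor * ln(odds of being good), where factor = PDO / ln 2 and PDO is the number of points that doubles the odds. The sketch below follows that convention. The scaling constants (400 points at odds of 10:1, PDO of 20) are hypothetical and are not taken from the paper; only the cut-off of 355.2808 comes from the text, so the scores here do not reproduce the authors' scorecard.

```python
import math

# Hypothetical scaling constants, chosen only for illustration.
PDO = 20.0          # points needed to double the odds
BASE_SCORE = 400.0  # score assigned at the base odds
BASE_ODDS = 10.0    # good:bad odds at the base score

CUTOFF = 355.2808   # cut-off score reported in the text

FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)


def score_from_probability(p_default):
    """Map a predicted default probability (0 < p < 1) to a scorecard score."""
    odds_good = (1.0 - p_default) / p_default
    return OFFSET + FACTOR * math.log(odds_good)


def is_approved(p_default, cutoff=CUTOFF):
    """Approve an applicant whose score is at or above the cut-off."""
    return score_from_probability(p_default) >= cutoff


for p in (0.5, 1 / 11, 1 / 21):
    print(f"p_default={p:.3f} score={score_from_probability(p):.1f} approved={is_approved(p)}")
```

Because the score is linear in the log odds, a plot of score against log odds is a straight line with slope FACTOR.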
This work presents a random-forest-based credit score prediction algorithm that uses structured data from a banking data set. The factors associated with default were analyzed using banking sector data. The problem of missing values can be addressed using machine learning techniques. Depending on the data available, customer default can be predicted from the type, region and risk level of the customer’s account status.
Acknowledgments. I would like to express my special thanks to my guide, Ch Ramesh, and to the Head of the Department of Information Technology, Dr I Ravi Prakash Reddy, who gave me the opportunity to do this project. The project also led me to do a great deal of research and to learn many new things, for which I am truly thankful to them.
I would also like to thank my parents and friends, who helped me a great deal in finalizing this paper within the limited time frame.