Credit Risk Valuation Using an Efficient Machine Learning Algorithm

Kovvuri Ramya Sri¹, Ch Ramesh²

¹ M. Tech Student, Dept. of IT, G. Narayanamma Institute of Technology and Science, Shaikpet, Hyderabad, Telangana, India.

² Assistant Professor, Dept. of IT, G. Narayanamma Institute of Technology and Science, Shaikpet, Hyderabad, Telangana, India.

Abstract. Automation improves the efficiency of the detection process, and it may also provide higher detection accuracy by removing subjective human factors from the process. If machine learning can automatically identify bad customers, it will provide considerable benefits to the banking and financial system. The goal is to calculate a credit score and categorize customers as good or bad. Algorithms from a machine learning library are used to classify financial-sector data sets. A large volume of multi-structured customer data is generated. When this data is incomplete, the accuracy of the analysis is reduced. In the proposed system, we apply machine learning algorithms for effective prediction of customer default. We evaluate the modified prediction models on real-life bank data. Compared with several typical prediction algorithms, the prediction accuracy of the proposed algorithm is high.

Keywords: Machine Learning, Credit Scoring, Logistic Regression, Random Forest, CRISP-DM Framework.

1 Introduction

Hundreds of banks in the United States alone suffer from non-payment or late payment of loans. Identifying such customers early enables preventive banking interventions, which in turn can lead to substantial cost savings and improved outcomes. Algorithms for predicting customer behavior are developed by drawing on ideas and techniques from machine learning. Standard classification methods such as logistic regression and random forest are explored, as well as more sophisticated sequence models, including recurrent neural networks. We focus especially on the use of banking code data for customer behavior prediction and explore different ways of representing such data in our prediction algorithms. A data collection mechanism is designed, and a correlation analysis of the collected data is performed. A stochastic prediction model is designed to foresee the future condition of the most correlated customers based on their current account status. In the banking and finance communities, a large volume of multi-structured customer data is generated from transactions, account statements, and online purchases.

Imagine a system where banks can quickly go through millions of anonymized customer records to find people with good credit scores and bank experiences. Through this massive, searchable database, banks could determine whom to offer a loan, based on what has worked effectively for others with similar behavior and characteristics.

1.1 Precision Banking

What makes precision banking unique is that it goes beyond prediction for existing customers and conditions to predicting and preventing debts from new customers before they arise. It stands at the convergence of finance, technology, and big data, offering new ways to keep banks profitable. Precision banking translates data into information that can help banks prevent losses in ways that were not possible before. We are poised to reach a whole new level of precision in managing banks.

In sifting through this data, researchers can better predict individual credit scores, develop approaches to early detection and prevention, and obtain information that helps them make real-time decisions about how best to offer loans to customers.

Large-scale data analysis is also enabling researchers to develop more targeted and cost-effective methods for predicting credit scores early, before transactions are made.

Publicly available banking customer data is used to identify specific attributes associated with default, laying the groundwork for a simple test for defaulters.

1.2 Data Access

Banking has long been a data-rich field. With so many moving parts, banking providers and financiers have no shortage of variables to measure. The captured data have many important uses. They keep tabs on credits and debits. They track transactions and savings. Crucially, the data record the state of people at both a microscopic and a macroscopic level. It is difficult to exaggerate the importance of data when it comes to improving banking systems. Although this work focuses on using banking data for credit score prediction, many other aspects of the banking sector can be enhanced and even revolutionized through intelligent use of data. Obtaining access to banking data is often a fraught endeavor. Unfortunately, this can halt the progress of researchers who are not affiliated with finance companies or banking systems. Despite these difficulties, the potential rewards of better understanding and using banking data to improve the banking sector far outweigh the frustrations of data access.

1.3 Data Capture

There are various systems in place for capturing banking data. Modern banking systems use tools for systematically and digitally storing a wide range of data, including customer demographics, account history, purchases, transactions, deposits, and more. These systems also facilitate data access and visualization, allowing bankers and customers to better inform themselves. Although these records only capture the activities that occur within a particular facility or set of facilities, they provide a vivid account of an individual's state. Insurance claims form another rich repository of banking data. Claims data center on individuals enrolled in the insurance policy and their interactions with the banking system. These records typically include basic demographic information about customers, along with purchases, transactions, deposits, savings, and associated taxes. Because insurance claims data are so customer-centric and can capture the banking activity of customers across a variety of banks and financial organizations, they paint a rather comprehensive picture of an individual's account history and current account state.

2 Data Understanding

2.1 Sample Variables

Much of banking data consists of simple numerical and categorical variables. These include demographic variables such as age, sex, and ethnicity. Employment variables such as salary, job type, designation, work experience, and many others are also straightforward. These types of simple data are suited for standard analytical and statistical methods (such as linear or logistic regression). To stop with just the simple variables, however, would be to miss out on potentially valuable insight provided by more complex sources of data.

Two data sets are required for the analysis: demographic data and credit bureau data.

Demographic data: contains simple variables describing the customer.

Credit bureau data: contains variables derived from the customer's credit history.

Both data sets are provided by the bank.

Table 1. Demographic Data.

| Variable | Description |
|---|---|
| Application ID | Unique ID of the customer |
| Age | Age of the customer |
| Gender | Gender of the customer |
| Marital Status | Marital status of the customer (at the time of application) |
| No of dependents | Number of children of the customer |
| Income | Income of the customer |
| Education | Education of the customer |
| Profession | Profession of the customer |
| Type of residence | Type of residence of the customer |
| No of months in current residence | Number of months the customer has lived in the current residence |
| No of months in current company | Number of months the customer has worked at the current company |
| Performance Tag | Status of customer performance (“1” represents “Default”) |

Table 2. Credit Bureau Data.

| Variable | Description |
|---|---|
| Application ID | Customer application ID |
| No of times 90 DPD or worse in last 6 months | Number of times the customer was 90 or more days past due in the last 6 months |
| No of times 60 DPD or worse in last 6 months | Number of times the customer was 60 or more days past due in the last 6 months |
| No of times 30 DPD or worse in last 6 months | Number of times the customer was 30 or more days past due in the last 6 months |
| No of times 90 DPD or worse in last 12 months | Number of times the customer was 90 or more days past due in the last 12 months |
| No of times 60 DPD or worse in last 12 months | Number of times the customer was 60 or more days past due in the last 12 months |
| No of times 30 DPD or worse in last 12 months | Number of times the customer was 30 or more days past due in the last 12 months |
| Average CC Utilization in last 12 months | Average credit card utilization of the customer |
| No of trades opened in last 6 months | Number of trades the customer opened in the last 6 months |
| No of trades opened in last 12 months | Number of trades the customer opened in the last 12 months |
| No of PL trades opened in last 6 months | Number of PL trades the customer opened in the last 6 months |
| No of PL trades opened in last 12 months | Number of PL trades the customer opened in the last 12 months |
| No of Inquiries in last 6 months (excluding home and auto loans) | Number of inquiries made by the customer in the last 6 months |
| No of Inquiries in last 12 months (excluding home and auto loans) | Number of inquiries made by the customer in the last 12 months |
| Presence of open home loan | Whether the customer has a home loan (“1” represents “Yes”) |
| Outstanding Balance | Outstanding balance of the customer |
| Total No of Trades | Total number of trades made by the customer |
| Presence of open auto loan | Whether the customer has an auto loan (“1” represents “Yes”) |
| Performance Tag | Status of customer performance (“1” represents “Default”) |

The data contain a variable, Performance Tag, which indicates whether the applicant defaulted after receiving a credit card. Some records have no performance tag; these records are treated as rejected applications. After setting aside the rejected records, 69,867 records remain, of which 4% are defaults. The company does not know whether the rejected applicants also include creditworthy customers.

Fig. 1. Percentage of non-default and default customers.
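The split between labelled and rejected records, and the resulting default rate, can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the toy records and the function names `split_by_tag` and `default_rate` are assumptions introduced here.

```python
from typing import Optional

# Hypothetical toy records: (application_id, performance_tag).
# Tag 1 means "default", 0 means "non-default", None means the tag is missing.
records: list[tuple[int, Optional[int]]] = [
    (101, 0), (102, 1), (103, None), (104, 0), (105, 0), (106, None),
]


def split_by_tag(rows):
    """Return (labelled, rejected): rows with a tag and rows without one."""
    labelled = [r for r in rows if r[1] is not None]
    rejected = [r for r in rows if r[1] is None]
    return labelled, rejected


def default_rate(labelled):
    """Share of labelled rows whose performance tag equals 1."""
    if not labelled:
        return 0.0
    return sum(tag for _, tag in labelled) / len(labelled)


labelled, rejected = split_by_tag(records)
rate = default_rate(labelled)
print(f"labelled={len(labelled)} rejected={len(rejected)} default_rate={rate:.2%}")
```

On the real data, the same computation would give a default rate of about 4% over the 69,867 labelled records, which is why class balancing matters in the modelling step.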

3 Data Cleaning and Exploratory Data Analysis

Preliminary checks, such as inspecting the structure and summary of the data, were carried out. The data were checked for duplicates, and 3 duplicate records with the same application ID were removed. The demographic and credit bureau data were then merged. Missing values are handled by the WOE analysis described later. Outlier treatment was applied to variables such as Age and Income. The charts below are examples:

Fig. 2. Data cleaning graph.

Fig. 3. Data cleaning Graphs.
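The cleaning steps described above (removing duplicate application IDs, merging the two data sets, and treating outliers) might look like the sketch below. The field names, the keep-first rule for duplicates, and the capping bounds are assumptions made for illustration; the paper does not state them.

```python
def drop_duplicate_ids(rows, key="application_id"):
    """Keep only the first row seen for each application ID (assumed rule)."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def merge_on_id(left, right, key="application_id"):
    """Inner-join two lists of dicts on the application ID."""
    right_by_id = {row[key]: row for row in right}
    return [{**row, **right_by_id[row[key]]} for row in left if row[key] in right_by_id]


def cap(value, lower, upper):
    """Clamp a value into [lower, upper], a simple form of outlier treatment."""
    return max(lower, min(upper, value))


# Hypothetical toy data with one duplicate ID and one implausible age.
demographic = [
    {"application_id": 1, "age": 34, "income": 40},
    {"application_id": 1, "age": 34, "income": 40},
    {"application_id": 2, "age": -3, "income": 25},
]
bureau = [
    {"application_id": 1, "outstanding_balance": 1200},
    {"application_id": 2, "outstanding_balance": 300},
]

merged = merge_on_id(drop_duplicate_ids(demographic), drop_duplicate_ids(bureau))
for row in merged:
    row["age"] = cap(row["age"], 18, 65)  # bounds are assumed, not from the paper
print(merged)
```

Capping keeps every record in the data set, which is useful when defaults are already rare; dropping outlier rows instead would be an equally valid design choice.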

Exploratory data analysis (EDA) was performed on all variables by deriving a variable called Default Rate, the share of defaulting customers within each group of a variable.

Fig. 4. Plots with all variables of credit bureau data.
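A default rate per category, as used in the EDA, can be computed with a simple group-and-count. The sketch below assumes each record is a dict with a `performance_tag` field (1 for default); the category labels are hypothetical.

```python
from collections import defaultdict


def default_rate_by(rows, feature, target="performance_tag"):
    """Return {category: share of rows in that category with target == 1}."""
    totals, defaults = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[feature]] += 1
        defaults[row[feature]] += row[target]
    return {category: defaults[category] / totals[category] for category in totals}


# Hypothetical toy rows; the category labels are illustrative only.
rows = [
    {"profession": "salaried", "performance_tag": 0},
    {"profession": "salaried", "performance_tag": 1},
    {"profession": "self_employed", "performance_tag": 0},
    {"profession": "self_employed", "performance_tag": 0},
]

rates = default_rate_by(rows, "profession")
print(rates)
```

Plotting these rates per variable shows which categories carry a noticeably higher or lower risk than the overall default rate.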

4 Weight of Evidence Analysis

Weight of Evidence (WOE) analysis was performed on the data, and the demographic and credit bureau variables were replaced with their WOE values. A sample plot for the demographic data is shown below. The credit bureau data were treated in the same way.

Fig. 5. Plots with all variables of credit bureau data.
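For readers unfamiliar with WOE, the sketch below shows one common way to compute it for a categorical (or pre-binned) variable. It assumes the convention WOE = ln(share of goods / share of bads), treats missing values as their own bin, and adds a small smoothing constant to avoid log(0). These choices are assumptions; the paper does not specify its convention.

```python
import math
from collections import Counter


def woe_table(values, targets, smoothing=0.5):
    """Return {bin: WOE}, with WOE = ln(share of goods / share of bads).

    targets uses 1 for default (bad) and 0 for non-default (good).
    Missing values (None) are kept as their own "missing" bin.
    smoothing is added to every count so empty cells do not give log(0).
    """
    goods, bads = Counter(), Counter()
    for value, target in zip(values, targets):
        bin_label = "missing" if value is None else value
        (bads if target == 1 else goods)[bin_label] += 1
    bins = set(goods) | set(bads)
    total_good = sum(goods.values()) + smoothing * len(bins)
    total_bad = sum(bads.values()) + smoothing * len(bins)
    table = {}
    for b in bins:
        good_share = (goods[b] + smoothing) / total_good
        bad_share = (bads[b] + smoothing) / total_bad
        table[b] = math.log(good_share / bad_share)
    return table


def encode(values, table):
    """Replace each raw value with the WOE of its bin."""
    return [table["missing" if v is None else v] for v in values]


# Hypothetical toy variable: type of residence, with two missing entries.
residence = ["own", "own", "rent", "rent", None, None]
tags = [0, 0, 1, 0, 1, 1]

woe = woe_table(residence, tags)
print(encode(residence, woe))
```

Because missing values receive their own WOE value, no separate imputation step is needed, which matches the way the paper folds missing-value treatment into the WOE analysis.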

5 Model Building

Based on the analysis of the data, we built models using Logistic Regression, Decision Trees, and Random Forests, and picked the model that performs best on this data.
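Table 3 lists several balancing variants for each model. The sketch below illustrates the two simplest ideas, random oversampling of the minority class and random undersampling of the majority class, in plain Python. It is an illustration only: it does not reproduce the ROSE method (which generates synthetic samples), and the function names and the fixed seed are assumptions.

```python
import random


def _split(rows, target):
    """Return (minority, majority) lists; assumes both classes are present."""
    pos = [r for r in rows if r[target] == 1]
    neg = [r for r in rows if r[target] == 0]
    return (pos, neg) if len(pos) < len(neg) else (neg, pos)


def oversample(rows, target="performance_tag", seed=0):
    """Duplicate random minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _split(rows, target)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def undersample(rows, target="performance_tag", seed=0):
    """Keep a random subset of majority rows equal in size to the minority."""
    rng = random.Random(seed)
    minority, majority = _split(rows, target)
    return rng.sample(majority, len(minority)) + minority


# Hypothetical data with a 4% default rate, mirroring the imbalance in the paper.
data = [{"performance_tag": 1}] * 4 + [{"performance_tag": 0}] * 96

over = oversample(data)
under = undersample(data)
print(len(over), len(under))
```

Balancing should be applied to the training split only, so that the test metrics still reflect the real class proportions.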

Table 3. Results of the analysis using different models on the data.

| Model | Data on which model was built | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| Logistic Regression | Demographic Data | 53.54 | 60.4 | 53.24 |
| Decision Trees | Demographic Data – Overbalancing | 52.6 | 60 | 59.7 |
| Decision Trees | Demographic Data – Underbalancing | 56.6 | 55.6 | 55.7 |
| Decision Trees | Demographic Data – Both | 52.6 | 60 | 59.7 |
| Decision Trees | Demographic Data – Balancing with ROSE | 61.42 | 49.97 | 50.46 |
| Random Forests | Demographic Data – Overbalancing | 51.4 | 56.22 | 51.18 |
| Random Forests | Demographic Data – Underbalancing | 52.8 | 53.1 | 52.8 |
| Random Forests | Demographic Data – Both | 52 | 54.4 | 51.8 |
| Random Forests | Demographic Data – Balancing with ROSE | 55 | 53.5 | 55.06 |
| Logistic Regression | Whole Data | 67.49 | 58.71 | 67.87 |
| Logistic Regression | Whole Data – Balanced | 63.5 | 63.8 | 63.5 |
| Decision Trees | Whole Data – Overbalancing | 50.79 | 76.01 | 49.67 |
| Decision Trees | Whole Data – Underbalancing | 59.9 | 67.3 | 59.57 |
| Decision Trees | Whole Data – Both | 50.79 | 76.01 | 49.67 |
| Decision Trees | Whole Data – Balancing with ROSE | 73.92 | 47.96 | 75.06 |
| Random Forests | Whole Data – Without Balancing | 64.5 | 57.35 | 64.82 |
| Random Forests | Whole Data – Overbalancing | 55.22 | 62.33 | 54.9 |
| Random Forests | Whole Data – Underbalancing | 61.74 | 61.99 | 61.72 |
| Random Forests | Whole Data – Both | 62.2 | 57.8 | 62.39 |
| Random Forests | Whole Data – Balancing with ROSE | 63.4 | 64.06 | 63.41 |
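The three metrics reported in Table 3 follow from the confusion matrix. The sketch below computes them in plain Python, assuming that the positive class is "default" (tag 1); the paper does not state which class it treats as positive, so this is an assumption.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaults missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # good customers accepted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good customers flagged
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical labels and predictions for eight customers.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With only about 4% defaults, a model that predicts "non-default" for everyone would reach roughly 96% accuracy with zero sensitivity, which is why all three metrics are reported together.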

6 Model Evaluation

From the metrics above, one model must be chosen that is consistent across all three metrics: accuracy, sensitivity, and specificity. Although some models achieved an accuracy above 70%, they performed poorly on sensitivity. This left the Logistic Regression and Random Forest models built on balanced data, which have nearly equal values for all three metrics. We chose Random Forests for two reasons: its sensitivity is slightly higher than that of logistic regression, and random forests generally perform well on unseen data. Random Forest with balanced data is therefore our final model.
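The selection reasoning can be made explicit with a small rule: keep the models whose three metrics lie close together, then prefer the one with the highest sensitivity. The sketch below applies this rule to the whole-data rows of Table 3. The one-point tolerance `max_spread` is an assumption introduced here, not a threshold given in the paper.

```python
# (model, data variant, accuracy, sensitivity, specificity): whole-data rows of Table 3.
RESULTS = [
    ("Logistic Regression", "Whole Data", 67.49, 58.71, 67.87),
    ("Logistic Regression", "Balanced", 63.5, 63.8, 63.5),
    ("Decision Trees", "Overbalancing", 50.79, 76.01, 49.67),
    ("Decision Trees", "Underbalancing", 59.9, 67.3, 59.57),
    ("Decision Trees", "Both", 50.79, 76.01, 49.67),
    ("Decision Trees", "ROSE", 73.92, 47.96, 75.06),
    ("Random Forests", "Without Balancing", 64.5, 57.35, 64.82),
    ("Random Forests", "Overbalancing", 55.22, 62.33, 54.9),
    ("Random Forests", "Underbalancing", 61.74, 61.99, 61.72),
    ("Random Forests", "Both", 62.2, 57.8, 62.39),
    ("Random Forests", "ROSE", 63.4, 64.06, 63.41),
]


def spread(row):
    """Gap between the best and worst of the three metrics."""
    metrics = row[2:]
    return max(metrics) - min(metrics)


def pick_model(results, max_spread=1.0):
    """Among consistent models (spread <= max_spread), take the highest sensitivity."""
    consistent = [r for r in results if spread(r) <= max_spread]
    return max(consistent, key=lambda r: r[3])


best = pick_model(RESULTS)
print(best[0], "/", best[1])
```

Under this assumed rule, the ROSE-balanced Random Forest comes out ahead of the balanced Logistic Regression, which is consistent with the choice described above.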

6.1 Important variables from the model

In the model built only on demographic data, Income is among the important variables.

For the final model, i.e., Random Forest, the important variables can be derived from the plot below.

Fig. 6. Variables plot.

7 Application Score Card

7.1 Implications of using the model

From the model and the scorecard built from it, the cut-off score is set to 355.2808. Applying this cut-off to the rejected candidates shows that 256 of the 1,425 rejected candidates have a high score. Using the model therefore reduces the rejected population, which increases revenue for the company. The model also lets the company avoid the manual process of approving credit cards.
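The paper does not give the scaling used to turn model output into a score. A common scorecard convention, shown here only as an assumption, maps the log odds of being a good customer to points with score = offset + factor × ln(odds), where factor = PDO / ln 2. The base score, base odds, and PDO (points to double the odds) below are hypothetical; only the cut-off comes from the paper.

```python
import math

CUTOFF = 355.2808  # cut-off score reported in the paper

# Hypothetical scaling parameters (not from the paper).
BASE_SCORE = 400.0  # score assigned at the base odds
BASE_ODDS = 10.0    # good:bad odds at the base score
PDO = 20.0          # points needed to double the odds

FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)


def score_from_default_probability(p_default):
    """Map a predicted default probability in (0, 1) to a scorecard score."""
    good_odds = (1.0 - p_default) / p_default
    return OFFSET + FACTOR * math.log(good_odds)


def is_approved(score, cutoff=CUTOFF):
    """Approve an application when its score reaches the cut-off."""
    return score >= cutoff


for p in (0.05, 0.20, 0.50):
    s = score_from_default_probability(p)
    print(f"p_default={p:.2f} score={s:.1f} approved={is_approved(s)}")
```

The same decision function can be run over previously rejected applicants to find those whose score is above the cut-off.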

7.2 Financial Benefit

To assess the financial benefit of the model, assume that the loss per default is $1,000. Under this assumption, the overall loss without the model comes to about $884,000. Had the company used the model, the loss would have been $323,000.

Total Loss Avoided = $884,000 - $323,000 = $561,000

So, by using the model, the company benefits by about $561,000 (roughly 5.6 lakh).
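The arithmetic above can be checked directly. The sketch below uses only the figures stated in the text; the implied default counts follow from dividing each loss by the assumed $1,000 loss per default.

```python
LOSS_PER_DEFAULT = 1_000       # assumed loss per default, in dollars
LOSS_WITHOUT_MODEL = 884_000   # overall loss without the model
LOSS_WITH_MODEL = 323_000      # overall loss with the model

loss_avoided = LOSS_WITHOUT_MODEL - LOSS_WITH_MODEL
defaults_avoided = loss_avoided // LOSS_PER_DEFAULT
relative_reduction = loss_avoided / LOSS_WITHOUT_MODEL

print(f"loss avoided: ${loss_avoided:,} ({loss_avoided / 100_000:.2f} lakh)")
print(f"defaults avoided: {defaults_avoided}")
print(f"relative reduction: {relative_reduction:.1%}")
```

The result scales linearly with the assumed loss per default, so a different assumption changes the dollar figure but not the relative reduction.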

7.3 Recommendations to business

Replacing the manual process with this model saves the company the time and effort spent on acquiring the right customers. The model avoids unnecessary rejection of legitimate customers. It can also predict the probability of default by finding hidden patterns in historical data.

An application scorecard has been built from the predictive model. The relationship between the scorecard score and the log odds is summarized below:

Fig. 7. Scorecard vs. log odds.

8 Conclusion

This paper presented a random-forest-based credit score prediction algorithm that uses structured data from a banking data set. The factors affecting default were analyzed using banking-sector data. The problem of missing values can be resolved using machine learning algorithms. Default prediction can be performed from the type, region, and risk level of the customer's account status, subject to the availability of the data.

Acknowledgments. I would like to express my special thanks and gratitude to my guide, Ch Ramesh, and to the head of the Department of Information Technology, Dr I Ravi Prakash Reddy, who gave me the golden opportunity to do this wonderful project. The project also led me to do a great deal of research and to learn many new things, and I am really thankful to them.

I would also like to thank my parents and friends, who helped me a lot in finalizing this paper within the limited time frame.

