Business Analytics and Big Data (ACC73002)
Assessment 2 – Calculations and short written responses (30%)
Perform the required calculations, present your findings, and prepare short written responses to the following questions.
Question 1. Predicting Boston Housing Prices using multiple linear regression analysis (9 marks)
The file BostonHousing.xls contains information collected by the U.S. Bureau of the Census concerning housing in the area of Boston, Massachusetts. The dataset includes information on 506 census housing tracts in the Boston area. The goal is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms. The dataset contains 12 predictors, and the response is the median house price (MEDV).
a. Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for? (3 marks)
b. Run the R code and fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. Write the equation for predicting the median house price from the predictors in the model. Interpret the estimated coefficients (CRIM, CHAS, and RM). (4 marks)
c. Using the estimated regression model, what median house price is predicted for a tract in the Boston area that does not bound the Charles River, has a crime rate of 0.1, and where the average number of rooms per house is 6? (2 marks)
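As an illustration of the workflow behind parts a to c, here is a minimal Python sketch (not the R code supplied with the assessment). It randomly partitions the 506 tract indices into training and validation sets, then shows the general form of the prediction equation. The 60/40 split, the seed, and the names `partition_indices`, `predict_medv`, `b0`, `b_crim`, `b_chas`, and `b_rm` are assumptions made for this example; the coefficient values must come from your own fitted model.

```python
import random


def partition_indices(n_rows, train_fraction=0.6, seed=1):
    """Randomly split row indices 0..n_rows-1 into training and validation lists.

    The 60/40 split and the seed are assumptions for illustration only.
    """
    indices = list(range(n_rows))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def predict_medv(b0, b_crim, b_chas, b_rm, crim, chas, rm):
    """Evaluate MEDV-hat = b0 + b_crim*CRIM + b_chas*CHAS + b_rm*RM.

    The coefficients are placeholders: substitute the estimates
    from the model fitted on the training set.
    """
    return b0 + b_crim * crim + b_chas * chas + b_rm * rm


# The dataset has 506 census tracts.
train_idx, valid_idx = partition_indices(506)
print(len(train_idx), len(valid_idx))
```

For the tract in part c, the call would take `crim=0.1`, `chas=0` (the tract does not bound the Charles River), and `rm=6`, together with the estimated coefficients.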
Question 2. Calculating Distance with Categorical Predictors (9 marks)
This exercise with a tiny dataset illustrates the calculation of Euclidean distance and the creation of binary dummies. The online education company Statistics.com segments its customers and prospects into three main categories: IT professionals (IT), statisticians (Stat), and other (Other). It also tracks, for each customer, the number of years since first contact (years). Consider the following customers. Information about whether they have taken a course or not (the outcome to be predicted) is included:
Customer 1: Stat, 1 year, did not take course
Customer 2: Other, 1.1 years, took course
a. Consider now the following new prospect: Prospect 1: IT, 1 year
Using the above information on the two customers and one prospect, create one dataset for all three with the categorical predictor variable transformed into 2 binaries, and a similar dataset with the categorical predictor variable transformed into 3 binaries. (3 marks)
b. For each derived dataset, calculate the Euclidean distance between the prospect and each of the other two customers. (Note: while it is typical to normalize data for k-NN, this is not an ironclad rule and you may proceed here without normalization.) (3 marks)
c. Using k-NN with k = 1, classify the prospect as taking or not taking a course using each of the two derived datasets. Does it make a difference whether you use 2 or 3 dummies? (3 marks)
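The mechanics of parts a and b can be sketched in Python as follows. This is only an illustrative sketch: the helper names `encode`, `euclidean`, and `distances` are made up for this example, and the 2-dummy version assumes that "Other" is the reference level that is dropped. Choosing a different reference level changes the distances, so state your choice in your answer.

```python
import math

# Records from the exercise: (category, years since first contact)
customers = {
    "Customer 1": ("Stat", 1.0),   # did not take course
    "Customer 2": ("Other", 1.1),  # took course
}
prospect = ("IT", 1.0)


def encode(record, levels):
    """Turn (category, years) into one 0/1 dummy per listed level, plus years."""
    category, years = record
    return [1.0 if category == level else 0.0 for level in levels] + [years]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distances(levels):
    """Distance from the prospect to each customer under a given dummy coding."""
    p = encode(prospect, levels)
    return {name: euclidean(p, encode(rec, levels))
            for name, rec in customers.items()}


# Two dummies: "Other" is the assumed reference level (all dummies zero).
two_dummies = distances(["IT", "Stat"])
# Three dummies: one column per category.
three_dummies = distances(["IT", "Stat", "Other"])

for label, result in [("2 dummies", two_dummies), ("3 dummies", three_dummies)]:
    nearest = min(result, key=result.get)  # k-NN with k = 1
    print(label, {k: round(v, 4) for k, v in result.items()}, "nearest:", nearest)
```

The data are not normalized here, in line with the note in part b.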
Question 3. Financial Condition of Banks - Logistic regression analysis (12 marks)
The file Banks.xls includes data on a sample of 20 banks. The “Financial Condition” column records the judgment of an expert on the financial condition of each bank. This dependent variable takes one of two possible values, weak or strong, according to the financial condition of the bank. The predictors are two ratios used in the financial analysis of banks: TotLns&Lses/Assets is the ratio of total loans and leases to total assets, and TotExp/Assets is the ratio of total expenses to total assets. The goal is to use the two ratios to classify the financial condition of a new bank.
Run a logistic regression model (on the entire dataset) that models the status of a bank as a function of the two financial measures provided. Specify the success class as weak (this is similar to creating a dummy that is 1 for financially weak banks and 0 otherwise), and use the default cut-off value of 0.5.
a. Write the estimated equation that associates the financial condition of a bank with its two predictors in three formats:
i. The logit as a function of the predictors (2 marks)
ii. The odds as a function of the predictors (2 marks)
iii. The probability as a function of the predictors (2 marks)
b. Consider a new bank whose total loans and leases/assets ratio = 0.6 and total expenses/assets ratio = 0.11. From your logistic regression model, estimate the following four quantities for this bank (use Excel to do all the intermediate calculations; show your final answers to four decimal places): the logit, the odds, the probability of being financially weak, and the classification of the bank (use cut-off = 0.5). (3 marks)
c. The cut-off value of 0.5 is used in conjunction with the probability of being financially weak. Compute the threshold that should be used if we want to make a classification based on the odds of being financially weak, and the threshold for the corresponding logit. (3 marks)
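The relationships among the three formats in part a can be sketched in Python as follows. The sketch relies only on the standard identities odds = exp(logit) and p = odds / (1 + odds). The function names and the coefficient arguments `b0`, `b1`, and `b2` are placeholders chosen for this example and must be replaced by the estimates from your own fitted model. Treating a probability exactly equal to the cut-off as "weak" is also an assumption.

```python
import math


def logit_value(b0, b1, b2, loans_ratio, expenses_ratio):
    """Logit = b0 + b1*(TotLns&Lses/Assets) + b2*(TotExp/Assets).

    b0, b1, b2 are placeholders for the estimated coefficients.
    """
    return b0 + b1 * loans_ratio + b2 * expenses_ratio


def odds_from_logit(logit):
    """Odds of the success class (weak) given the logit."""
    return math.exp(logit)


def prob_from_odds(odds):
    """Probability of the success class (weak) given the odds."""
    return odds / (1.0 + odds)


def odds_from_prob(p):
    """Inverse of prob_from_odds; requires 0 <= p < 1."""
    return p / (1.0 - p)


def logit_from_prob(p):
    """Logit corresponding to a probability; requires 0 < p < 1."""
    return math.log(odds_from_prob(p))


def classify(p_weak, cutoff=0.5):
    """Label a bank 'weak' when P(weak) reaches the cut-off (tie rule assumed)."""
    return "weak" if p_weak >= cutoff else "strong"


# A probability cut-off can be translated to the other two scales:
cutoff = 0.5
print(odds_from_prob(cutoff), logit_from_prob(cutoff))
```

The same conversion functions apply to any probability cut-off, not only 0.5, which makes it easy to check the thresholds asked for in part c.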