Business Analytics and Big Data (ACC73002)

Assessment 2 – Calculations and short written responses (30%)

INSTRUCTIONS

Perform the required calculations, present your findings, and prepare short written responses to the following questions.

Question 1. Predicting Boston Housing Prices using multiple linear regression analysis (9 marks)

The file BostonHousing.xls contains information collected by the U.S. Bureau of the Census concerning housing in the area of Boston, Massachusetts. The dataset includes information on 506 census housing tracts in the Boston area. The goal is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms. The dataset contains 12 predictors, and the response is the median house price (MEDV).

a. Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for? (3 marks)

b. Run the R code and fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. Write the equation for predicting the median house price from the predictors in the model. Interpret the estimated coefficients (CRIM, CHAS, and RM). (4 marks)

c. Using the estimated regression model, what median house price is predicted for a tract in the Boston area that does not bound the Charles River, has a crime rate of 0.1, and where the average number of rooms per house is 6? (2 marks)
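The two mechanical steps in Question 1 can be sketched in Python as follows. This is an illustration only, not the course's R script, which is not reproduced here. The 60/40 split and the fixed seed are assumptions the assignment does not specify, and `partition_indices` and `predict_medv` are hypothetical helper names. The coefficients are passed in as arguments because their values come from your own regression output.

```python
import random

N_TRACTS = 506  # number of census tracts stated in the assignment


def partition_indices(n_rows, train_fraction=0.6, seed=1):
    """Randomly split row indices 0..n_rows-1 into training and validation sets.

    train_fraction and seed are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def predict_medv(intercept, b_crim, b_chas, b_rm, crim, chas, rm):
    """Evaluate MEDV = intercept + b_crim*CRIM + b_chas*CHAS + b_rm*RM."""
    return intercept + b_crim * crim + b_chas * chas + b_rm * rm


# Every tract lands in exactly one of the two partitions.
train_idx, valid_idx = partition_indices(N_TRACTS)
print(len(train_idx), len(valid_idx))
```

For part c, the call would be `predict_medv(..., crim=0.1, chas=0, rm=6)` with your estimated coefficients. CHAS is 0 because the tract does not bound the Charles River.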

Question 2. Calculating Distance with Categorical Predictors (9 marks)

This exercise with a tiny dataset illustrates the calculation of Euclidean distance and the creation of binary dummies. The online education company Statistics.com segments its customers and prospects into three main categories: IT professionals (IT), statisticians (Stat) and other (Other). It also tracks, for each customer, the number of years since first contact (years). Consider the following customers. Information about whether they have taken a course or not (the outcome to be predicted) is included:

Customer 1: Stat, 1 year, did not take course

Customer 2: Other, 1.1 years, took course

a. Consider now the following new prospect: Prospect 1: IT, 1 year

Using the above information on the two customers and one prospect, create one dataset for all three with the categorical predictor variable transformed into 2 binaries, and a similar dataset with the categorical predictor variable transformed into 3 binaries. (3 marks)

b. For each derived dataset, calculate the Euclidean distance between the prospect and each of the other two customers. (Note: while it is typical to normalize data for k-NN, this is not an ironclad rule and you may proceed here without normalization.) (3 marks)

c. Using k-NN with k = 1, classify the prospect as taking or not taking a course using each of the two derived datasets. Does it make a difference whether you use 2 or 3 dummies? (3 marks)
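The dummy coding, distance calculation and 1-nearest-neighbour lookup described in Question 2 could be sketched as below. The customer and prospect values are taken from the exercise. Dropping "Other" as the reference level in the 2-dummy version is an assumption, and the helper names (`encode`, `euclidean`, `distances`, `nearest_neighbor`) are hypothetical. No normalization is applied, as the note in part b permits.

```python
import math

CATEGORIES_3 = ("IT", "Stat", "Other")  # full set of three dummies
CATEGORIES_2 = ("IT", "Stat")           # "Other" dropped as reference level (an assumption)

# (segment, years since first contact, took course?) as given in the exercise
customers = {
    "Customer 1": ("Stat", 1.0, False),
    "Customer 2": ("Other", 1.1, True),
}
prospect = ("IT", 1.0)


def encode(segment, years, categories):
    """Return [dummy_1, ..., dummy_k, years] for the chosen dummy categories."""
    return [1.0 if segment == c else 0.0 for c in categories] + [years]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distances(categories):
    """Distance from the prospect to each customer under a given dummy coding."""
    p = encode(prospect[0], prospect[1], categories)
    return {name: euclidean(p, encode(seg, yrs, categories))
            for name, (seg, yrs, _) in customers.items()}


def nearest_neighbor(categories):
    """Name of the single closest customer (k = 1)."""
    d = distances(categories)
    return min(d, key=d.get)


for label, cats in (("2 dummies", CATEGORIES_2), ("3 dummies", CATEGORIES_3)):
    nn = nearest_neighbor(cats)
    rounded = {k: round(v, 4) for k, v in distances(cats).items()}
    print(label, rounded, "nearest:", nn, "took course:", customers[nn][2])
```

Which category is dropped changes the 2-dummy distances, so it is worth stating that choice explicitly in a written answer.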

Question 3. Financial Condition of Banks - Logistic regression analysis (12 marks)

The file Banks.xls includes data on a sample of 20 banks. The “Financial Condition” column records the judgment of an expert on the financial condition of each bank. This dependent variable takes one of two possible values, weak or strong, according to the financial condition of the bank. The predictors are two ratios used in the financial analysis of banks: TotLns&Lses/Assets is the ratio of total loans and leases to total assets, and TotExp/Assets is the ratio of total expenses to total assets. The goal is to use the two ratios to classify the financial condition of a new bank.

Run a logistic regression model (on the entire dataset) that models the status of a bank as a function of the two financial measures provided. Specify the success class as weak (this is similar to creating a dummy that is 1 for financially weak banks and 0 otherwise), and use the default cut-off value of 0.5.

a. Write the estimated equation that associates the financial condition of a bank with its two predictors in three formats:

i. The logit as a function of the predictors (2 marks)

ii. The odds as a function of the predictors (2 marks)

iii. The probability as a function of the predictors (2 marks)

b. Consider a new bank whose total loans and leases/assets ratio = 0.6 and total expenses/assets ratio = 0.11. From your logistic regression model, estimate the following four quantities for this bank (use Excel to do all the intermediate calculations; show your final answers to four decimal places): the logit, the odds, the probability of being financially weak, and the classification of the bank (use cut-off=0.5). (3 marks)

c. The cut-off value of 0.5 is used in conjunction with the probability of being financially weak. Compute the threshold that should be used if we want to make a classification based on the odds of being financially weak, and the threshold for the corresponding logit. (3 marks)
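The three formats in Question 3 are linked by standard identities: odds = exp(logit) and probability = odds / (1 + odds). A minimal Python sketch of those conversions follows, offered as an illustration rather than the Excel workflow the assignment asks for. All function names are hypothetical, the coefficients are arguments to be filled from your fitted model, and treating a probability exactly equal to the cut-off as "weak" is an assumption.

```python
import math


def linear_logit(b0, b1, b2, x1, x2):
    """Logit as a linear function of the two ratios; coefficients come from your model."""
    return b0 + b1 * x1 + b2 * x2


def logit_to_odds(logit):
    return math.exp(logit)


def odds_to_probability(odds):
    return odds / (1.0 + odds)


def probability_to_odds(p):
    return p / (1.0 - p)


def odds_to_logit(odds):
    return math.log(odds)


def classify(p_weak, cutoff=0.5):
    """Label a bank from its probability of being weak (tie goes to 'weak': an assumption)."""
    return "weak" if p_weak >= cutoff else "strong"


# Translate a probability cut-off into the equivalent odds and logit thresholds.
cutoff_p = 0.5
cutoff_odds = probability_to_odds(cutoff_p)
cutoff_logit = odds_to_logit(cutoff_odds)
print(cutoff_odds, cutoff_logit)
```

For part b, `linear_logit` would be evaluated at x1 = 0.6 and x2 = 0.11 with your estimated coefficients, and the result passed through `logit_to_odds` and `odds_to_probability` before calling `classify`.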
