Business Analytics and Big Data (ACC73002)

Assessment 2 – Calculations and short written responses (30%)

INSTRUCTIONS

Perform the required calculations, present your findings, and prepare short written responses to the following questions.

Question 1. Predicting Boston Housing Prices using multiple linear regression analysis (9 marks)

The file BostonHousing.xls contains information collected by the U.S. Bureau of the Census concerning housing in the area of Boston, Massachusetts. The dataset includes information on 506 census housing tracts in the Boston area. The goal is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms. The dataset contains 12 predictors, and the response is the median house price (MEDV).

a. Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for? (3 marks)
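As a rough illustration of what partitioning means in practice, the sketch below splits row indices into two disjoint sets. It is written in Python rather than the course's R code, and the 60/40 split, the seed, and the function name `partition` are assumptions chosen for illustration, not values given in this assessment.

```python
import random

def partition(n_rows, train_fraction=0.6, seed=1):
    """Return (train_idx, valid_idx): disjoint row indices covering all rows.

    train_fraction and seed are illustrative assumptions.
    """
    indices = list(range(n_rows))
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(indices)        # randomise row order before cutting
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# 506 census tracts, as stated in the question
train_idx, valid_idx = partition(506)
```

The model would be fitted only on the rows in `train_idx`; the rows in `valid_idx` are held back to check how well the fitted model predicts data it has not seen.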

b. Run the R code and fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. Write the equation for predicting the median house price from the predictors in the model. Interpret the estimated coefficients (CRIM, CHAS, and RM). (4 marks)

c. Using the estimated regression model, what median house price is predicted for a tract in the Boston area that does not bound the Charles River, has a crime rate of 0.1, and where the average number of rooms per house is 6? (2 marks)
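The prediction in part c is a plug-in calculation: the intercept plus each coefficient times its predictor value. A minimal Python sketch follows; the coefficient values are placeholders invented purely to make the code runnable and are NOT the estimates from the fitted model, and the function name `predict_medv` is hypothetical.

```python
def predict_medv(coef, crim, chas, rm):
    """Linear prediction: intercept + b_CRIM*CRIM + b_CHAS*CHAS + b_RM*RM."""
    return (coef["intercept"]
            + coef["CRIM"] * crim
            + coef["CHAS"] * chas
            + coef["RM"] * rm)

# Placeholder coefficients (assumption): replace with the regression output.
placeholder_coef = {"intercept": 0.0, "CRIM": -1.0, "CHAS": 1.0, "RM": 1.0}

# Tract from part c: not on the Charles River (CHAS = 0), CRIM = 0.1, RM = 6
example = predict_medv(placeholder_coef, crim=0.1, chas=0, rm=6)
```

With the real estimates substituted into `placeholder_coef`, the same call gives the predicted median house price in the units of MEDV.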

Question 2. Calculating Distance with Categorical Predictors (9 marks)

This exercise uses a tiny dataset to illustrate the calculation of Euclidean distance and the creation of binary dummies. The online education company Statistics.com segments its customers and prospects into three main categories: IT professionals (IT), statisticians (Stat), and other (Other). It also tracks, for each customer, the number of years since first contact (years). Consider the following customers; information about whether they have taken a course (the outcome to be predicted) is included:

Customer 1: Stat, 1 year, did not take course

Customer 2: Other, 1.1 years, took course

a. Consider now the following new prospect: Prospect 1: IT, 1 year

Using the above information on the two customers and one prospect, create one dataset for all three with the categorical predictor variable transformed into 2 binaries, and a similar dataset with the categorical predictor variable transformed into 3 binaries. (3 marks)

b. For each derived dataset, calculate the Euclidean distance between the prospect and each of the other two customers. (Note: while it is typical to normalize data for k-NN, this is not an ironclad rule and you may proceed here without normalization.) (3 marks)

c. Using k-NN with k = 1, classify the prospect as taking or not taking a course using each of the two derived datasets. Does it make a difference whether you use 2 or 3 dummies? (3 marks)
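The steps in parts a to c can be sketched in Python as dummy coding followed by a distance calculation and a nearest-neighbour lookup. The function names and the column order (IT, Stat, Other, years) are assumptions for illustration. For the 2-dummy version the sketch drops the Other category, which is also an assumption, since the question does not say which category to omit.

```python
import math

CATEGORIES = ["IT", "Stat", "Other"]

def encode(category, years, drop=None):
    """One 0/1 dummy per category (skipping `drop`, if given), then years."""
    dummies = [1.0 if category == c else 0.0 for c in CATEGORIES if c != drop]
    return dummies + [years]

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Records taken from the question text
customers = {"Customer 1": ("Stat", 1.0), "Customer 2": ("Other", 1.1)}
prospect = ("IT", 1.0)

def distances(drop=None):
    """Distance from the prospect to each customer under the chosen coding."""
    p = encode(*prospect, drop=drop)
    return {name: euclidean(p, encode(cat, yrs, drop=drop))
            for name, (cat, yrs) in customers.items()}

def nearest(drop=None):
    """1-NN: the customer with the smallest distance to the prospect."""
    d = distances(drop)
    return min(d, key=d.get)

three_dummies = distances()             # IT, Stat, Other, years
two_dummies = distances(drop="Other")   # IT, Stat, years (assumed reference: Other)
```

Comparing `nearest()` with `nearest(drop="Other")` shows whether the coding changes the 1-NN classification. Dropping a different category changes the 2-dummy distances, so it is worth stating the reference category explicitly in any written answer.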

Question 3. Financial Condition of Banks - Logistic regression analysis (12 marks)

The file Banks.xls includes data on a sample of 20 banks. The “Financial Condition” column records the judgment of an expert on the financial condition of each bank. This dependent variable takes one of two possible values, weak or strong, according to the financial condition of the bank. The predictors are two ratios used in the financial analysis of banks: TotLns&Lses/Assets is the ratio of total loans and leases to total assets, and TotExp/Assets is the ratio of total expenses to total assets. The goal is to use the two ratios to classify the financial condition of a new bank.

Run a logistic regression model (on the entire dataset) that models the status of a bank as a function of the two financial measures provided. Specify the success class as weak (this is similar to creating a dummy that is 1 for financially weak banks and 0 otherwise), and use the default cut-off value of 0.5.

a. Write the estimated equation that associates the financial condition of a bank with its two predictors in three formats:

i. The logit as a function of the predictors (2 marks)

ii. The odds as a function of the predictors (2 marks)

iii. The probability as a function of the predictors (2 marks)

b. Consider a new bank whose total loans and leases/assets ratio = 0.6 and total expenses/assets ratio = 0.11. From your logistic regression model, estimate the following four quantities for this bank (use Excel to do all the intermediate calculations; show your final answers to four decimal places): the logit, the odds, the probability of being financially weak, and the classification of the bank (use cut-off=0.5). (3 marks)

c. The cut-off value of 0.5 is used in conjunction with the probability of being financially weak. Compute the threshold that should be used if we want to make a classification based on the odds of being financially weak, and the threshold for the corresponding logit. (3 marks)
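The three formats in part a, and the thresholds in part c, are linked by standard identities: odds = exp(logit) and probability = odds / (1 + odds). The Python sketch below shows only these conversions; it contains no estimated coefficients, the function names are hypothetical, and the choice to label a probability strictly above the cut-off as weak is an assumption about how ties are handled.

```python
import math

def linear_logit(b0, b1, b2, loans_ratio, expenses_ratio):
    """Logit as a linear function of the two ratios (coefficients supplied by the caller)."""
    return b0 + b1 * loans_ratio + b2 * expenses_ratio

def logit_to_odds(logit):
    return math.exp(logit)

def odds_to_prob(odds):
    return odds / (1.0 + odds)

def prob_to_odds(p):
    return p / (1.0 - p)

def prob_to_logit(p):
    return math.log(prob_to_odds(p))

def classify(p_weak, cutoff=0.5):
    """Label a bank from its probability of being weak (tie rule is an assumption)."""
    return "weak" if p_weak > cutoff else "strong"

# Equivalent cut-offs on the other two scales for a probability cut-off of 0.5
odds_cutoff = prob_to_odds(0.5)
logit_cutoff = prob_to_logit(0.5)
```

Because each conversion is monotonic, classifying on the probability, the odds, or the logit gives the same label as long as the cut-off is converted to the matching scale.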
