MAT10251 STATISTICAL ANALYSIS
PROJECT – PART B
Due: Week 11 Sunday 19 May 2019
Objectives: 1 to 5
Topics: 5 to 9
Purpose: To apply your knowledge of statistical inference and regression to answer questions about used cars for sale by analysing the data and communicating the results.
Part B Submission
You should submit a single word document consisting of:
• Part B coversheets
• Written answer as components of a report. This should follow the format given on pages 4 and 5 of Part B coversheets
• Appendices for Part B which contain full statistical working for the required statistical tasks. This should follow the format given on pages 5 and 6 of Part B coversheets
Part B Preparation
The graphs, plots and interpretations in Part A may be required in the statistical and written answers in Part B. Therefore, check these and make any required corrections.
While the submission date for Part B is Sunday 19 May 2019, you should be working on Part B during Weeks 6 to 11.
It is recommended that you follow the following timetable:
• Question 1, covering Topic 5, should be completed in Week 6
• Question 2, covering Topic 6, should be completed in Week 8
• Question 3 covering Topic 7 should be attempted in Week 9
• Question 4 covering Topic 8 should be attempted in Week 10
• Question 5 covering Topic 9 should be attempted in Week 11
Task 1 Part B - Appendices Statistical Inference and Regression and Correlation Tasks (38 marks)
The following statistical tasks should appear as appendices to your written answers. These should include all necessary steps and appropriate Excel output.
These appendices should come after your written answer within your single Word document for Part B.
Choose a level of significance for any hypothesis tests and a level of confidence for any confidence intervals. Enter these values on page 2 of the Part B coversheets along with the sample number from Part A.
For used cars of the make and model for sale in the state specified by your sample answer the following questions using appropriate statistical inference and regression techniques.
Question 1 – Topic 5 (5.5 marks)
Since many buyers wish to purchase a two or three year old used car Oz-Price-Watch has asked you to provide information on the average price of 2016 and 2017 cars of the make and model for sale in the state specified by your sample.
To enable you to answer this use Price (7th column of your data) for 2016 and 2017 cars only, your output from Part A and an appropriate statistical inference technique to:
Estimate the population mean price of two and three year old used cars of the make and model for sale in the state specified by your sample.
Note: The required data for 2016 and 2017 cars is in the first rows of your sample.
Question 2 – Topic 6 (7.5 marks)
Many buyers believe that white cars are safer since they are more visible. Therefore, they wish to purchase a white car. Oz-Price-Watch has asked you to explore if restricting a purchase to white cars will limit a buyer’s choice. Past research by Oz-Price-Watch has shown that if a search is restricted to a feature, for example colour or transmission, which at most 30% of cars for sale have then buyer choice is limited.
To provide a justified answer to the question use White (6th column of data, where Yes = car for sale is white and No = car for sale is not white) for ALL 115 cars in your sample and an appropriate statistical inference technique to answer the following question
Are more than 30% of used cars of the make and model for sale in the state specified by your sample white?
Hint: Sort data on White to enable you to easily count the number of white cars in your sample.
Question 3 Topic 7 (8 marks)
Oz-Price-Watch wishes to know if there is a difference in price between cars for sale privately and those for sale by a used car dealer.
To provide a justified answer to this question use Price (7th column of data) and Seller (5th column of data) for all 115 cars in your sample, your output from Part A and an appropriate statistical inference technique to answer the following question
Is there a difference in the average price of cars, of the specified make and model for sale in the specified state, for sale privately and by a used car dealer?
Hint: Sort data on Seller to easily obtain two samples – Prices for private sellers and for used car dealers.
Questions 4 and 5 Simple and Multiple Linear Regression (17 marks)
Oz-Price-Watch asks you how the value of a used car, of the specified make and model, depreciates.
To answer this you develop a simple linear regression model to predict price from age or odometer reading and a multiple linear regression model to predict price from age, odometer reading and transmission type. Then, to provide a justified answer to Oz-Price-Watch, choose and interpret the linear model that best fits your data.
Question 4 Simple Linear Regression Model Topic 8
From your results in Part A choose either Age or Odometer as an independent variable, to predict Price.
To explore the relationship between the age or odometer reading of a used car and its price, use your output from Part A and Age or Odometer (2nd or 3rd column of data) as an independent variable with Price (7th column of data) as the dependent variable, for all 115 cars in your sample, to develop and then explore a simple linear relationship between the two variables by:
• Calculating the least squares regression line, correlation coefficient and coefficient of determination.
• Interpreting the gradient and vertical intercept of the simple linear regression equation.
• Interpreting the correlation coefficient and coefficient of determination. Are these values consistent with your scatter plot?
Note: You can choose either Age or Odometer as the independent variable in this model.
Question 5 Multiple Linear Regression Model Topic 9
To explore what other factors may have an influence on the value of a used car use your output from Part A and Age, Odometer and Transmission (2nd, 3rd and 4th columns of data) as three independent variables with Price (7th column of data) as the dependent variable for all 115 cars in your sample, to develop and then explore the relationship between these four variables by:
• Calculating the multiple regression equation, multiple correlation coefficient, and coefficient of multiple determination.
• Interpreting the values of the multiple regression coefficients.
• Interpreting the values of the multiple correlation coefficient and coefficient of multiple determination. Compare these values with the corresponding values for the simple linear regression model.
Then determine the best model to predict the price of a used car by:
• Using appropriate tests to determine which independent variables make a significant contribution to the regression model.
• Give or calculate the simple or multiple regression equation which best fits the data.
• You may need to transform or manipulate the given data, before using Excel for the corresponding statistical calculations.
• Use Excel for all statistical calculations. You do not need to repeat any Excel calculations by hand. However, make sure that you define your random variables and include any steps not given by Excel. For example, in a hypothesis test include the null and alternative hypotheses, along with the decision to reject or not reject the null hypothesis.
• Mention any assumptions you need to make, where appropriate justify these from Part A output.
• In Question 4 fit a linear model even if from your scatter plot you decide that a non-linear relationship better fits the data or that no apparent relationship exists. However, mention this in your written answer and/or corresponding appendix.
• Comment on why a test or confidence interval has been chosen. Where appropriate include and refer to Part A output.
• Make sure you interpret confidence intervals and write conclusions to hypothesis tests.
Task 2 - Written Answer – Components of a report (12 marks)
For Questions 1, 2, 3 and Questions 4 and 5 combined present the results of your calculations, with your interpretation and conclusions as components of a longer report on used car prices.
Use the instructions given on pages 4 and 5 of the Part B coversheets.
This should be 500 to 1100 words and three to seven pages.
It should be submitted as a Word file with Excel output included.
Make sure you:
• Introduce each question and put it in context
• Answer each question in non-statistical language.
• Present the result of your calculations and tests without unnecessary statistical jargon
• Include a conclusion which answers the given question.
In particular, for Questions 4 and 5
• Mention or explain your choice of independent and dependent variables
• Include and justify the best model.
• Discuss and interpret the values of the regression and correlation coefficients of the best model.
Marking Criteria – Part B
Read these marking criteria carefully and consider them when preparing Part B.
See the marking and feedback sheet, page 3 Part B coversheets, for allocation of marks.
• For statistical inference calculations (Questions 1, 2, 3 and 5) marks will be given for:
• Choice of appropriate statistical technique/s.
• Random variable/s defined.
• Correct hypotheses for tests.
• Correct Excel output.
• Correct interpretation of results.
• For regression and correlation coefficients (Questions 4 and 5) use either:
• The Regression command in Data Analysis and copy resultant tables.
• Or the simple/multiple regression command in PhStat and copy the resultant tables.
• Or the Simple Linear and Multiple Regression workbooks and copy the resultant tables.
• For regression and correlation coefficients (Questions 4 and 5) marks will be deducted if Excel is not used and also for incorrect equations or coefficients, so check:
• Your independent and dependent variables.
• Your sample size.
Written Answer – Components of a Report
• 500 to 1100 words and four to nine pages - marks will be deducted if this is greatly exceeded.
• To obtain full marks must:
• Be well structured and analysed
• Clearly communicate the results of the Excel output in language appropriate for your audience
• Include an introduction to each question and a conclusion
• Include appropriate Excel output
• Answer the questions in non-statistical language.
• Marks will be deducted if:
• There is little or no comment on, or interpretation of, the Excel output
• Unnecessary statistical jargon and equations appear
• It is confusing or not readable
• For each major spelling and/or grammatical error half a mark will be deducted, up to a maximum of two marks
• Also up to two marks may be deducted for poor structure and presentation.
• For Questions 1 to 3, and Questions 4 and 5 combined in (), the following rubric will be used
Poor 0 (0) Question not introduced and/or results not presented.
Confused response. Incorrect and/or inconsistent comments and conclusions.
Unnecessary statistical jargon, especially symbols, equations and definitions (copied from the textbook)
Acceptable 1 (2)
Question introduced and results presented.
Minimal interpretation and/or conclusions on how to use the information and/or only minimally relates information obtained to car colour or price.
Only minor errors and inconsistencies in comments and conclusions.
More than acceptable 2 (4) Results presented and questions introduced and answered, clearly and concisely.
Includes interpretation and/or conclusions on how to use the information and/or relates information obtained to car colour or price.
No errors and inconsistencies in comments and conclusions.
Questions answered, and answers justified