| Description | Possible Marks / Wtg (%) | Word Count | Due Date |
|---|---|---|---|
| Assignment 3 Written Practical Report | 100 marks, 40% weighting | 4000 | 08/10/18 |
Modules 4–11 are particularly relevant for this assignment. Assignment 3 relates to the specific course learning objectives 1, 2, 3 and 4:
1. apply knowledge of people, markets, finances, technology and management in a global context of business intelligence practice (data warehousing and big data architecture, data mining process, data visualisation and performance management) and resulting organisational change and understand how these apply to the implementation of business intelligence in organisation systems and business processes
2. identify and solve complex organisational problems creatively and practically through the use of business intelligence and critically reflect on how evidence based decision making and sustainable business performance management can effectively address real-world problems
3. comprehend and address complex ethical dilemmas that arise from evidence based decision making and business performance management
4. communicate effectively in a clear and concise manner in written report style for senior management with the correct and appropriate acknowledgment of the main ideas presented and discussed.
Note you must use RapidMiner Studio for Task 2 and Tableau Desktop for Task 3 in this Assignment 3. Failure to do so may result in Task 2 and/or 3 not being marked and zero marks awarded. Your Assignment 3 submission is automatically submitted to and checked in Turnitin for academic integrity when you submit your Assignment 3 via the course study Assignment 3 submission link. Note carefully University policy on Academic Misconduct such as plagiarism, collusion and cheating. If any of these occur they will be found and dealt with by the USQ Academic Integrity Procedures. If proven, Academic Misconduct may result in failure of an individual assessment, the entire course or exclusion from a University program or programs.
Assignment 3 consists of three main tasks and a number of sub tasks.
Task 1 (Worth 30 marks)
Task 1 Critically review and discuss the My Health Record system (https://www.myhealthrecord.gov.au/about/privacy-policy) and recent changes to the My Health Records Act, which will be brought into line with the existing Australian Digital Health Agency policy.
Your review and discussion of the My Health Record system and its privacy provisions for patients should be guided by the following:
(1) Australian Privacy Principles (APPs) in the Privacy Act
(2) Requirements of the My Health Records Act
(3) Healthcare Identifiers Act (https://www.legislation.gov.au/Details/C2017C00239)
(about 1500 words).
Task 2 (Worth 30 Marks)
The goal of Task 2 is to predict the likelihood of a customer becoming delinquent and defaulting on a loan for ACME Bank (see Table 1 Data Dictionary for loan-delinq.csv data set below). It is important that you understand this data set in order to complete Task 2 and its four sub tasks.
Task 2.1 Conduct an exploratory data analysis of the training data set loan-delinq.csv using RapidMiner Studio data mining tool.
Provide the following for Task 2.1:
(i) A screen capture of your final EDA process and briefly describe your final EDA process
(ii) Summarise the key results of your exploratory data analysis in a table named Table 2.1 Results of Exploratory Data Analysis for loan-delinq.csv
(iii) Discuss the key results of your exploratory data analysis and provide a rationale for selecting your top 5 variables for predicting loan delinquency as the outcome based on the results of your exploratory data analysis and a review of the relevant literature on key factors contributing to a loan delinquency
Note: Table 2.1 should include the key characteristics of each variable in the loandelinq-train.csv data set such as maximum, minimum values, average, standard deviation, most frequent values (mode), missing values and invalid values etc
(About 500 words).
Hint: The Statistics Tab and the Chart Tab in RapidMiner provide a lot of descriptive statistical information and the ability to create useful charts such as bar charts and scatter plots for the EDA analysis. You might also like to run some correlations or chi-square tests, whichever is appropriate for the loan-delinq.csv data set, to indicate which variables are the top 5 key variables and contribute most to predicting loan delinquency as an outcome.
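RapidMiner's Statistics Tab reports these figures directly, and RapidMiner Studio remains the required tool for Task 2. Purely to illustrate what each column of Table 2.1 means, the following minimal sketch computes the same descriptive statistics for a single numeric variable in Python. The function name `summarise` and the sample values are hypothetical and are not taken from the data set.

```python
import statistics


def summarise(values):
    """Summarise one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
        "mode": statistics.mode(present),    # most frequent value
        "missing": len(values) - len(present),
    }


# Hypothetical sample values for illustration only (not from loan-delinq.csv).
sample_age = [45, 40, 38, 30, 49, 40, None, 57]

summary = summarise(sample_age)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Invalid values (for example, an age of 0) are not detected by this sketch; they would need a rule per variable based on the data dictionary.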
Task 2.2 Build a Decision Tree model for predicting loan delinquency based on the data set loan-delinq.csv using RapidMiner and an appropriate set of data mining operators and a reduced set of variables from loan-delinq.csv determined by your exploratory data analysis in Task 2.1.
Provide the following for Task 2.2:
(i) (1) Final Decision Tree Model process, (2) Final Decision Tree diagram, and (3) Decision tree rules.
(ii) Briefly explain your final Decision Tree Model Process, and discuss the results of the Final Decision Tree Model drawing on the key outputs (Decision Tree Diagram, Decision Tree Rules) for predicting loan delinquency. This discussion should be based on the contribution of each of the top five variables to the Final Decision Tree Model and relevant supporting literature on the interpretation of decision trees
(About 250 words).
Table 1 Data dictionary: loan-delinq.csv data set
| Variable Name | Description | Type |
|---|---|---|
| SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by sum of credit limits | |
| age | Age of borrower in years | integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower 30-59 days past due but no worse in last 2 years. | integer |
| DebtRatio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage |
| MonthlyIncome | Monthly income | real |
| NumberOfOpenCreditLinesAndLoans | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) | |
| NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due. | integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in last 2 years. | |
| NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer |
Task 2.3 Build a Logistic Regression model for predicting loan delinquency based on the loan-delinq.csv data set using RapidMiner and an appropriate set of data mining operators and a reduced set of variables determined by your exploratory data analysis in Task 2.1.
Provide the following for Task 2.3:
(i) (1) Final Logistic Regression Model process, (2) Coefficients, and (3) Odds Ratios. Hint: you can use the RapidMiner Studio Logistic Regression operator, or you can install the Weka Extension in RapidMiner Studio and use its Logistic Regression operator for Task 2.3.
(ii) Briefly explain your final Logistic Regression Model Process and discuss the results of the Final Logistic Regression Model drawing on the key outputs (Coefficients, Odds Ratios) for predicting loan delinquency. This discussion should be based on the contribution of each of the top five variables to the Final Logistic Regression Model and relevant supporting literature on the interpretation of logistic regression models
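As a reminder of how the two key outputs relate, the odds ratio for a variable is the exponential of its logistic regression coefficient. An odds ratio above 1 means the odds of the outcome rise with each one-unit increase in that variable, holding the others constant, and a value below 1 means they fall. The minimal Python sketch below shows the conversion; the function name `odds_ratios` and the coefficient values are hypothetical and are not results from loan-delinq.csv.

```python
import math


def odds_ratios(coefficients):
    """Convert logistic regression coefficients to odds ratios (exp of each)."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


# Hypothetical coefficients for illustration only (not real model output).
coefs = {"age": -0.03, "NumberOfTimes90DaysLate": 0.5}

ratios = odds_ratios(coefs)
for name, ratio in ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio = {ratio:.3f} ({direction} the odds of delinquency)")
```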
(About 250 words).
Task 2.4 Conduct a comparative performance evaluation of your Final Decision Tree Model and your Final Logistic Regression Model for predicting loan delinquency. Note you will need to use the Cross Validation operator, the Apply Model operator and the Performance (Binominal Classification) operator in your final data mining process models (Decision Tree, Logistic Regression) to generate the model performance metrics (Accuracy, Misclassification Rate, True Positive Rate, False Positive Rate, Area Under the ROC Curve (AUC), Precision, Recall, Lift, Sensitivity, F Measure) required for Task 2.4.
Provide the following for Task 2.4:
(i) A screen snapshot of the Confusion Matrix and AUC for each Final Model (Decision Tree, Logistic Regression)
(ii) A table named Table 2.2 Results of Model Performance Evaluation (Decision Tree, Logistic Regression) that compares the key results of the performance evaluation for the Final Decision Tree Model and Final Logistic Regression Model in terms of Model Accuracy, Misclassification Rate, True Positive Rate, False Positive Rate, Precision, Recall, Lift, Sensitivity, F Measure.
(iii) Discuss and compare the key results of your performance evaluation of the two final models (Decision Tree, Logistic Regression) presented in parts (i) and (ii) of Task 2.4, indicate which model is better and explain why
(About 500 words).
The important outputs from data mining analyses conducted using RapidMiner for Task 2 should be included in your Assignment 3 report to provide support for conclusions reached regarding each analysis conducted for Task 2.1, Task 2.2, Task 2.3 and Task 2.4. Note: export the important outputs from RapidMiner as jpg image files and include these screenshots in the relevant Task 2 sections and/or appendices of your Assignment 3 Report.
Note you will find the Sharda et al. 2018 and North textbooks useful references for the data mining process activities conducted in Task 2 in relation to the exploratory data analysis, decision tree analysis, logistic regression analysis and evaluation of the comparative performance of the Final Decision Tree model and the Final Logistic Regression model.
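Most of the performance metrics named in Task 2.4 can be derived from the four counts of a confusion matrix, which may help when checking the figures RapidMiner reports. The sketch below is illustrative only: the function name `classification_metrics` and the counts are hypothetical, and lift is computed using one common definition (precision divided by the proportion of actual positives), which is an assumption about how the tool defines it. AUC cannot be derived from a single confusion matrix and is omitted.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also the true positive rate and sensitivity
    base_rate = (tp + fn) / total  # proportion of actual positives
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        "true_positive_rate": recall,
        "false_positive_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "lift": precision / base_rate,  # assumed definition, see lead-in
    }


# Hypothetical counts for illustration only (not results from loan-delinq.csv).
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that Recall, Sensitivity and True Positive Rate are the same quantity, so they will always agree in Table 2.2.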
Task 3 (Worth 30 marks)
The Australian Weather data set (see Data Dictionary Table 3.1) contains over 145,000 daily observations of rainfall and evaporation recorded from January 2008 through to June 2017 at 49 Australian weather station locations. Note for some weather station locations such as Uluru, the data set is incomplete. The daily observations are available from the Bureau of Meteorology (http://www.bom.gov.au/climate/data). Variable definitions adapted from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml.
Table 3.1 Data dictionary for Australian Weather Data set variables
| Variable Name | Data Type | Description |
|---|---|---|
| Date | Date | Date of weather observation |
| Location | Text | Common name of location of weather station. |
| Rainfall | Real | Amount of rainfall recorded for day in mm. |
| Evaporation | Real | So-called Class A pan evaporation (mm) in 24 hours to 9am. |
Task 3 requires you to build a Tableau Dashboard, Australian Weather by Location (AWL), which includes four different views of the weatherAus.csv data set as specified in sub Tasks 3.1, 3.2, 3.3 and 3.4. An additional data set, weatherAus-locations.csv, is provided, which will need to be joined with weatherAus.csv on the common variable/field Location in order to provide location-specific data views in the AWL dashboard.
See the first record of the weatherAUS-locations.csv data set:
| stnID | Location | stnNum | latitude | longitude | postcode | state |
|---|---|---|---|---|---|---|
| 2002 | Albury | 72160 | -36.069 | 146.9509 | 2640 | nsw |
It’s a simple operation in Tableau to join two different files on a common variable/field name; for Assignment 3 it is the Location variable/field.
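The join itself must be done in Tableau Desktop, but the logic of an inner join on Location can be sketched in a few lines of Python to show what the joined data looks like. In the sketch below, the locations record is the first record shown above; the weather rows, the comma-separated layout and the helper name `inner_join` are assumptions made for illustration and are not taken from weatherAus.csv.

```python
import csv
import io

# First record of the locations file, as shown in the brief.
locations_csv = """stnID,Location,stnNum,latitude,longitude,postcode,state
2002,Albury,72160,-36.069,146.9509,2640,nsw
"""

# Hypothetical weather rows for illustration only (values are invented).
weather_csv = """Date,Location,Rainfall,Evaporation
2008-12-01,Albury,0.6,
2008-12-02,Albury,0.0,
"""


def inner_join(left_rows, right_rows, key):
    """Attach the matching right-hand row to each left-hand row sharing `key`."""
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key])
        if match is not None:  # rows without a match are dropped (inner join)
            joined.append({**match, **row})
    return joined


weather = list(csv.DictReader(io.StringIO(weather_csv)))
locations = list(csv.DictReader(io.StringIO(locations_csv)))

joined = inner_join(weather, locations, "Location")
for row in joined:
    print(row["Date"], row["Location"], row["state"], row["Rainfall"])
```

After the join, each daily observation carries the state, latitude and longitude of its weather station, which is what the state-level views and the map view in Task 3 rely on.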
Task 3.1 Create a Tableau View of rainfall by day for each location and a specific state and related locations. Provide a screen capture of and describe the Tableau view you have created, and comment on the rainfall over one month across state locations and whether this differs much between states (About 125 words).
Task 3.2 Create a Tableau View of total rainfall by year for each location and a specific state. Provide a screen capture of and describe the Tableau view you have created and comment on variation of total rain across locations for a specific state (About 125 words).
Task 3.3 Create a Tableau View that compares locations by total evaporation for a specific state over months. Provide a screen capture of and comment on the levels of evaporation for different locations and states (About 125 words).
Task 3.4 Create a Tableau GeoMap View of all Australian weather stations that displays the latitude and longitude provided and the total rainfall for a selected year. Provide a screen capture of and describe the Tableau GeoMap view you have created and comment on one selected location for a state (About 125 words).
Note: you need to copy the four Text Table / Graph views and the dashboard you have created in Tableau using the Worksheet Menu Copy or Export Image option and include them in the Task 3 section where relevant or in Appendix 3 of the Assignment 3 report.
Task 3.5 Provide a screen snapshot of your AWL Dashboard and an accompanying rationale (drawing on the relevant literature for good dashboard design) for the graphic design and functionality that is provided by your AWL Dashboard for the four specified Tableau views for sub Tasks 3.1, 3.2, 3.3 and 3.4 (About 500 words).
Note Stephen Few is considered to be the guru of good dashboard design and has written a number of books on this topic. It is worth having a look at his website https://www.perceptualedge.com/about.php and in particular his examples of poorly designed dashboard views and his suggestions for better dashboard views.
Report presentation, writing style and referencing (worth 10 marks)
Presentation: Cover page, table of contents, page numbers, headings, sub headings, tables and diagrams, use of formatting, spacing, paragraphs
Writing style: Use of English (Correct use of language and grammar. Also, is there evidence of spell-checking and proofreading?)
Quality of research evident by appropriate referencing: Appropriate level of referencing in text where required for a sub task, reference list provided, used Harvard Referencing Style correctly
Assignment 3 Report should be structured as follows:
Assignment 3 Cover page
Table of Contents
Task 1 Main Heading
Task 1 Sub Tasks – Sub headings for Tasks 1.1 and 1.2
Task 2 Sub Tasks – Sub headings for Task 2.1, 2.2, 2.3 and 2.4
Task 3 Sub Tasks – Sub headings for Task 3.1, 3.2, 3.3, 3.4 and 3.5
List of References
List of Appendices
You must submit two files for Assignment 3:
1. Assignment 3 Report for Tasks 1, 2 and 3 in Word document format with extension .docx
2. Tableau packaged workbook with the extension .twbx which must contain required four Text Table / Graph views and a dashboard which consolidates these four Tableau views for Task 3
You must use the following file naming convention:
You must use Harvard referencing style. Harvard referencing resources:
Install a bibliographic referencing tool such as EndNote, which integrates with your word processor: http://www.usq.edu.au/library/referencing/endnote-bibliographic-software
Alternatively, use an online citation tool such as Zotero or Cite This For Me.
USQ Library guide on how to reference correctly using the Harvard referencing system: https://www.usq.edu.au/library/referencing/harvard-agps-referencing-guide