SESSION 2 FORMAL EXAMINATIONS – NOVEMBER 2020
Unit Code: COMP2200/COMP6200
Unit Name: Data Science
Duration of exam: 3 hours in a 6 hour window
Total number of questions: 8
Total number of pages: 5 (incl. this cover sheet)
Total number of marks: 100
Answer ALL questions in a single word processor file and upload your answers to the provided Turnitin submission page by the due time. You can upload a Word or PDF file.
Collaboration with others in completing this exam is not allowed. The work you submit should be your own. Any evidence of copying or collusion will be referred to the Faculty Discipline Committee. Note that your submissions will be passed through Turnitin to identify copying from the Internet or from other students.
1. (10 marks) You are working as a Data Scientist in a big retail store, say Woolworths, and your task is to optimise various retail processes such as inventory management, product placement, and customised offers. Using the CRISP-DM model, can you explain what you will do in each stage of the data science project life cycle, what your input will be, and what you will deliver at each stage? (Write no more than 500 words in total)
2. The following graph shows the relationship between the US spending on science and the number of suicides (by hanging, strangulation, and suffocation). Based on this graph, answer the following questions.
(a) (5 marks) What does the correlation mean in this context? What does the R² value mean? (Write no more than 200 words in total)
(b) (5 marks) One of your friends, Mr. Citizen, thinks that this correlation is caused by the increasing pressure on researchers to continuously produce output. How would you evaluate this explanation? Looking at the numbers in the data displayed, can you determine whether this explanation could account for the effect shown? (Write no more than 200 words in total)
3. (a) (5 marks) For each of the following data scenarios, which chart should you use to visualise the data? Justify your answers. (Write no more than 200 words in total)
(1) Bureau of Meteorology data having average monthly rainfall in Sydney from 2016 to 2020.
(2) Hospital data having systolic pressure and weight of 2000 patients.
(3) Australian Bureau of Statistics data having yearly household expenses (grocery, transport, education, rent/mortgage, and entertainment) for the Australian population.
(4) Australian Bureau of Statistics providing Census data showing population density for each suburb across New South Wales.
(5) Bureau of Meteorology weather data having multiple weather conditions in Sydney with features including date, precipitation, max temperature, min temperature, wind speed, and weather (drizzle, rain, sunny, snow, and fog).
(b) (5 marks) You are working on a project that analyses the census data provided by the Australian Bureau of Statistics. Table 1 shows a sample dataset. What data cleaning and normalisation techniques should you apply to this data so that you can apply unsupervised learning methods? (Write no more than 200 words in total)
Table 1: Sample Census dataset from Australian Bureau of Statistics
Census Code   Suburb           State             Area (sq km)
CED101        Berowra          NSW               78644.32
CED101        wentworthville   New South Wales   89232.53645
CED101        north sydney     nsw               10324.45
CED101        mt. druitt                         10583.12
CED105        st. Kilda        Vic.              8524.96762
CED105        South melb.      vic               45321.87
CED105        gelong           Victoria          24534.2534
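As an illustration of the kind of cleaning and normalisation the question refers to, the sketch below standardises the State column, tidies suburb capitalisation, and min-max scales the area values from Table 1. The lookup table `STATE_CODES` and the helper names are assumptions made for this example, and the sketch deliberately leaves misspellings and abbreviations (such as "gelong" or "melb.") and the missing State value unresolved, since those would need a reference list of suburbs.

```python
# Rows copied from Table 1; None marks the missing State value.
rows = [
    ("CED101", "Berowra", "NSW", 78644.32),
    ("CED101", "wentworthville", "New South Wales", 89232.53645),
    ("CED101", "north sydney", "nsw", 10324.45),
    ("CED101", "mt. druitt", None, 10583.12),
    ("CED105", "st. Kilda", "Vic.", 8524.96762),
    ("CED105", "South melb.", "vic", 45321.87),
    ("CED105", "gelong", "Victoria", 24534.2534),
]

# Hypothetical lookup from lower-cased spellings to one canonical code.
STATE_CODES = {
    "nsw": "NSW",
    "new south wales": "NSW",
    "vic": "VIC",
    "victoria": "VIC",
}


def clean_state(raw):
    """Map a raw State string to a canonical code; keep missing values as None."""
    if raw is None:
        return None
    key = raw.strip().rstrip(".").lower()
    return STATE_CODES.get(key)


def min_max(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


states = [clean_state(r[2]) for r in rows]
suburbs = [r[1].strip().title() for r in rows]  # consistent capitalisation only
areas_scaled = min_max([r[3] for r in rows])
```

Min-max scaling is only one option here; z-score standardisation would be an equally reasonable choice before a distance-based method.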
4. (a) (5 marks) I have data on different laptops from different brands with features for weight (grams), size (cm), RAM (GB), Hard Drive (GB), Processor (Intel Core i5, Intel Core i7, Intel Core i3, AMD Ryzen, AMD Athlon, etc.), and price (Australian Dollars). I want to cluster similar laptops based on their specifications. Discuss your approach to applying a clustering algorithm to this data. What transformations would be needed before you could work with this data and why? (Write no more than 200 words in total)
(b) (5 marks) You built a regression model to predict baby length based on mother’s height and mother’s age. After fitting the model on the training data, the model coefficients for mother’s height and mother’s age are [0.2539, -0.0075] and the intercept is 4.7623. What is your interpretation of these coefficient and intercept values? Can you work out how changes in the variables affect the baby’s length? (Write no more than 200 words in total)
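The fitted model described above can be written out as a small function to see how each coefficient acts. This is a minimal sketch: the coefficients and intercept come from the question, while the input values (160 and 30) are hypothetical and the units are not stated in the question.

```python
# Coefficients and intercept as stated in the question.
COEF_HEIGHT = 0.2539
COEF_AGE = -0.0075
INTERCEPT = 4.7623


def predict_length(height, age):
    """Linear model: intercept plus a weighted sum of the two predictors."""
    return INTERCEPT + COEF_HEIGHT * height + COEF_AGE * age


# Hypothetical inputs, chosen only to show the effect of a one-unit change.
base = predict_length(160.0, 30.0)
height_effect = predict_length(161.0, 30.0) - base  # age held fixed
age_effect = predict_length(160.0, 31.0) - base     # height held fixed
```

Because the model is linear, each difference equals the corresponding coefficient regardless of the starting values chosen.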
5. You plan to build a machine learning model to predict whether a patient in a hospital is “healthy” or “not healthy” based on the patient’s medical measurements. The dataset is highly imbalanced: “not healthy” individuals outnumber “healthy” individuals.
(a) (5 marks) To evaluate the performance of a trained model, you can create a confusion matrix comparing the predicted results with the class labels of the testing data. From the confusion matrix, you calculated the accuracy score. Explain why reporting the accuracy score on such a dataset is not indicative of the model’s true performance. What measures should you take to mitigate any inflated results? What other metrics can you derive from the confusion matrix that are truly indicative of the model’s robustness? (Write no more than 200 words in total)
(b) (5 marks) If the training dataset is very big (e.g., 1 billion data instances) and the testing dataset has 1000 instances, which model would you prefer to use, a KNN (k-Nearest Neighbors) classifier or a Naïve Bayes classifier? Justify your answer. (Write no more than 200 words in total)
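To make the accuracy issue concrete, the sketch below computes several confusion-matrix metrics for a trivial model that labels every patient “not healthy”. The counts (950 “not healthy”, 50 “healthy”) are hypothetical and are not taken from the exam; “healthy” is treated as the positive class purely for illustration.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Hypothetical test set: 50 healthy (positive), 950 not healthy (negative).
# The model predicts "not healthy" for everyone.
tp, fn = 0, 50    # healthy patients: none detected, all missed
fp, tn = 0, 950   # not healthy patients: all labelled correctly

accuracy = safe_div(tp + tn, tp + tn + fp + fn)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)
specificity = safe_div(tn, tn + fp)
balanced_accuracy = (recall + specificity) / 2
```

Under these assumed counts the accuracy is 0.95 even though the model never identifies a healthy patient, while recall, F1, and balanced accuracy expose the failure.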
6. There is a robot in an animal shelter which needs to learn to discriminate between dogs and cats based on the fur and colour features. You are required to train the robot with classification models on the following dataset (Table 2) and make a prediction on a testing data instance. The feature Fur takes one of two possible values (Coarse and Fine), and Colour also takes one of two possible values (Brown and Black). For notational convenience, you can use X1 and X2 to represent the two features respectively, and Y to represent the prediction target during inference.
Table 2: Animal Data
Index Fur Colour class
#1 Coarse Brown Dog
#2 Fine Black Cat
#3 Coarse Black Cat
#4 Coarse Black Dog
#5 Fine Brown Cat
(a) (5 marks) You are required to build a KNN (k-Nearest Neighbors) classification model and predict the class label for the following data instance (#6 in Table 3). You may choose any value of k from its possible range when considering the k nearest neighbors. The distance between two data instances is calculated as the number of features having different values. For example, the distance between the 1st and the 2nd data instances is 2 because they differ from each other on both features ‘Fur’ and ‘Colour’. Specify the value of k you will use, and show the details of learning and prediction.
Table 3: Testing Dataset
Index Fur Colour class
#6 Fine Brown
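The distance defined in part (a), the number of features on which two instances differ, can be sketched as follows together with a simple majority-vote prediction. The data are copied from Tables 2 and 3; the choice k = 3 and the function names are assumptions for this example, and ties in distance are broken by table order, which is only one possible convention.

```python
from collections import Counter

# Training data from Table 2: (Fur, Colour, class).
train = [
    ("Coarse", "Brown", "Dog"),
    ("Fine", "Black", "Cat"),
    ("Coarse", "Black", "Cat"),
    ("Coarse", "Black", "Dog"),
    ("Fine", "Brown", "Cat"),
]
query = ("Fine", "Brown")  # instance #6 from Table 3


def distance(a, b):
    """Count the features (Fur, Colour) on which two instances differ."""
    return sum(1 for x, y in zip(a[:2], b[:2]) if x != y)


def knn_predict(query, train, k):
    """Majority vote among the k nearest training instances."""
    # sorted() is stable, so equal distances keep their table order.
    nearest = sorted(train, key=lambda row: distance(query, row))[:k]
    votes = Counter(row[2] for row in nearest)
    return votes.most_common(1)[0][0]


distances = [distance(query, row) for row in train]
prediction = knn_predict(query, train, k=3)  # k = 3 is an assumed choice
```

An odd k avoids tied votes in a two-class problem, which is one reason k = 3 is used in this sketch.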
(b) (10 marks) You are required to build a Naïve Bayes classifier from the dataset and predict the class label for the data instance #6, using the Laplacian correction technique if the zero-probability issue occurs. Show the details of learning and prediction.
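The zero-probability issue and the Laplacian correction mentioned in part (b) can be illustrated on one conditional probability from Table 2. This is a sketch of add-one smoothing applied to the conditional estimates only; whether the class priors are also smoothed is a convention the question does not specify, so it is left out here. The function names are assumptions.

```python
# Training data from Table 2: (Fur, Colour, class).
train = [
    ("Coarse", "Brown", "Dog"),
    ("Fine", "Black", "Cat"),
    ("Coarse", "Black", "Cat"),
    ("Coarse", "Black", "Dog"),
    ("Fine", "Brown", "Cat"),
]


def raw_conditional(feature_index, value, label):
    """Unsmoothed estimate of P(feature = value | class = label)."""
    in_class = [row for row in train if row[2] == label]
    matches = sum(1 for row in in_class if row[feature_index] == value)
    return matches / len(in_class)


def laplace_conditional(feature_index, value, label, n_values=2):
    """Add-one estimate: (count + 1) / (class size + number of feature values)."""
    in_class = [row for row in train if row[2] == label]
    matches = sum(1 for row in in_class if row[feature_index] == value)
    return (matches + 1) / (len(in_class) + n_values)


# No Dog in Table 2 has Fine fur, so the raw estimate is zero.
p_raw = raw_conditional(0, "Fine", "Dog")
p_smoothed = laplace_conditional(0, "Fine", "Dog")
```

With two Dogs and two possible Fur values, the smoothed estimate is (0 + 1) / (2 + 2), so the product of conditionals for the Dog class no longer collapses to zero.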
7. (a) (5 marks) The linear regression model can be regarded as a simple type of artificial neural network. From the perspective of artificial neural networks, what activation function corresponds to the linear regression model? Specify the mathematical form of the activation function. Is it a good idea to build multi-layer neural network models with this activation function? Justify your answer. (Write no more than 200 words in total)
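One way to explore part (a) numerically is to stack two single-input layers that use the identity function f(z) = z as activation and compare the result with one layer. This is a minimal scalar sketch; the weights and biases are hypothetical values chosen for the example.

```python
def identity(z):
    """Identity (linear) activation: f(z) = z."""
    return z


def two_layer(x, w1, b1, w2, b2):
    """Two stacked layers, each followed by the identity activation."""
    hidden = identity(w1 * x + b1)
    return identity(w2 * hidden + b2)


def single_layer(x, w, b):
    """One layer with the identity activation, i.e. a linear regression."""
    return identity(w * x + b)


# Hypothetical parameters for the two-layer network.
w1, b1, w2, b2 = 2.0, 1.0, 3.0, -4.0

# Expanding w2 * (w1 * x + b1) + b2 gives one equivalent weight and bias.
w_eq = w2 * w1
b_eq = w2 * b1 + b2

inputs = [-1.0, 0.0, 2.5]
outputs_two = [two_layer(x, w1, b1, w2, b2) for x in inputs]
outputs_one = [single_layer(x, w_eq, b_eq) for x in inputs]
```

For these inputs the two lists agree, which reflects the algebraic expansion shown in the comment.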
(b) (10 marks) As the gradient descent method can be used to learn model parameters in neural network models, you can use it to estimate the parameters in a linear regression model. You are required to perform the initial steps of gradient descent on the following dataset (Table 4) to estimate the parameters w0 and w1 for the linear regression model y = w0 + w1x. The sum of squared errors is used for the loss function. Concretely, you need to formulate the loss function L(w0, w1) and derive its gradient ∇L(w0, w1). Then, pick a pair of values randomly to initialize w0 and w1, and evaluate the gradient with the w0 and w1 values. Show the key steps of inference and calculation.
Table 4: 2-Dimensional Data
Index X Y
#1 1 1
#2 2 3
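The loss and gradient asked for in part (b) can be checked numerically with a short sketch. It assumes the sum-of-squared-errors loss L(w0, w1) = Σ (y − w0 − w1·x)² without a 1/2 factor, and an initial point of (0, 0); the question leaves the initial values to the reader, so that starting point is purely an assumption.

```python
# Data from Table 4: (x, y) pairs.
data = [(1.0, 1.0), (2.0, 3.0)]


def loss(w0, w1):
    """Sum of squared errors for the model y = w0 + w1 * x."""
    return sum((y - w0 - w1 * x) ** 2 for x, y in data)


def gradient(w0, w1):
    """Partial derivatives of the loss with respect to w0 and w1."""
    residuals = [(x, y - w0 - w1 * x) for x, y in data]
    d_w0 = -2.0 * sum(r for _, r in residuals)
    d_w1 = -2.0 * sum(x * r for x, r in residuals)
    return d_w0, d_w1


# Assumed initial values; any other pair could be used instead.
w0_init, w1_init = 0.0, 0.0
loss_at_init = loss(w0_init, w1_init)
grad_at_init = gradient(w0_init, w1_init)
```

If the loss is defined with a leading 1/2, as some textbooks do, every gradient component is halved.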
(c) (5 marks) Based on the gradient obtained in the above step, update the estimate for w0 and w1. Assume that the learning rate is 0.5. Show the key steps of inference and calculation.
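The update in part (c) follows the usual rule of moving each parameter against its gradient component, scaled by the learning rate. The sketch below applies one such step with the learning rate of 0.5 from the question; the starting point (0, 0) and the sum-of-squared-errors loss without a 1/2 factor are assumptions.

```python
# Data from Table 4: (x, y) pairs.
data = [(1.0, 1.0), (2.0, 3.0)]
LEARNING_RATE = 0.5  # value given in the question


def loss(w0, w1):
    """Sum of squared errors for the model y = w0 + w1 * x."""
    return sum((y - w0 - w1 * x) ** 2 for x, y in data)


def gradient(w0, w1):
    """Gradient of the sum-of-squared-errors loss."""
    d_w0 = -2.0 * sum(y - w0 - w1 * x for x, y in data)
    d_w1 = -2.0 * sum(x * (y - w0 - w1 * x) for x, y in data)
    return d_w0, d_w1


def step(w0, w1, learning_rate):
    """One gradient descent update: w <- w - learning_rate * gradient."""
    d_w0, d_w1 = gradient(w0, w1)
    return w0 - learning_rate * d_w0, w1 - learning_rate * d_w1


# Assumed starting point (0, 0).
w0_new, w1_new = step(0.0, 0.0, LEARNING_RATE)
loss_before = loss(0.0, 0.0)
loss_after = loss(w0_new, w1_new)
```

From this assumed starting point the loss rises from 10 to 325 after one step, so comparing the loss before and after an update is a useful check on whether the step size suits the data.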
8. The following dataset (Table 5) describes COVID-19 testing records for 5 people. You want to build a decision tree classification model from the dataset to predict whether a person has COVID-19 according to the two symptoms Cough and Fever. Both features, Cough and Fever, take one of two possible values: yes (having the symptom) and no (not having the symptom). The target attribute COVID-19 also takes one of two possible values: yes (infected) and no (normal). For notational convenience, you can use X1 and X2 to represent the two features respectively, and Y to represent the prediction target.
Table 5: COVID-19 Data
Index Cough Fever COVID-19
#1 no no no
#2 yes yes yes
#3 no yes yes
#4 no yes no
#5 yes no no
(a) (10 marks) You are required to build a decision tree with the Gini impurity heuristic. Show the key steps of inference and calculation.
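The Gini impurity heuristic named in part (a) can be sketched for the root split as follows. The data are copied from Table 5; the sketch uses the common definition Gini(S) = 1 − Σ p_k² and scores a candidate feature by the size-weighted impurity of its child nodes. The function names are assumptions, and only the first split is evaluated, not the full tree.

```python
from collections import Counter

# Data from Table 5: (Cough, Fever, COVID-19).
rows = [
    ("no", "no", "no"),
    ("yes", "yes", "yes"),
    ("no", "yes", "yes"),
    ("no", "yes", "no"),
    ("yes", "no", "no"),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, feature_index):
    """Size-weighted Gini impurity of the children after splitting on a feature."""
    n = len(rows)
    total = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [row[-1] for row in rows if row[feature_index] == value]
        total += len(subset) / n * gini(subset)
    return total


root_gini = gini([row[-1] for row in rows])
gini_cough = weighted_gini(rows, 0)
gini_fever = weighted_gini(rows, 1)
```

The feature with the lower weighted impurity is the one a greedy tree builder would choose at the root; the same calculation is then repeated inside each child node.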
(b) (5 marks) Which issue might the decision tree model built above suffer from, overfitting or underfitting? Propose two different strategies to mitigate the possible issue with justification. (Write no more than 200 words in total)