UNIVERSITY OF GHANA
(All rights reserved)
BSc. COMPUTER SCIENCE, FIRST SEMESTER EXAMINATIONS: 2020/2021
DEPARTMENT OF COMPUTER SCIENCE
CSCD 409: DATA MINING AND WAREHOUSING (3 CREDITS)
INSTRUCTIONS:
There are five (5) Questions in this Exam with each Question worth a total of 25 Marks.
Read the Questions carefully and attempt Question 1 & Question 2, and any other Two (2) Questions.
In all you are answering four Questions out of the five provided.
You are expected to type your answers in MS Word, and save in pdf with your StudentID as the file name.
Submit your saved pdf together with all relevant files by uploading them to SAKAI LMS via the Exam thread created.
Remember Questions 1 & 2 are mandatory to answer.
The following toolkits are required for answering Questions 1 & 2.
-WEKA
-RapidMiner
-SPSS
TIME ALLOWED:
TWENTY FOUR (24) HOURS
Question 1. [25 Marks]
This Question requires the use of WEKA and SPSS toolkits. Consider the Absenteeism dataset saved as a .txt file with the name absenteeism.txt as found on SAKAI LMS with the attribute information provided at Appendix A at the end of Question 5. Alternatively, I have provided the dataset in Appendix B for your consideration. The dataset was created with records of absenteeism at work from July 2007 to July 2010 at a courier company in a given country.
You are expected to perform the following tasks:
a) Create both .arff and .sav files from the given absenteeism.txt file. Save your file with the names absenteeism.arff and absenteeism.sav
Make sure you include all the details required when saving your file as .arff, i.e., @relation, @attribute and @data, before loading it in WEKA. Likewise, provide the necessary variable names and details for the .sav file before loading it in SPSS. You need to submit both .arff and .sav files. [4 marks]
b) Load your .arff file in WEKA and apply a relevant feature selection method to select your features, bearing in mind the label or target feature as shown in Appendix A. Consider using the attribute evaluator and search method functions. Report on your selected features as well as the feature selection algorithm used. That is, provide the total number of features selected and their respective names. [3 marks]
c) At the preprocess tab in WEKA, select the features reported in (b) using the invert and remove buttons. Report a data visualization of the selected features together with the target feature using the visualize all button. [1 mark]
d) Based on the label feature, identify whether the given dataset can be used for a classification or clustering problem. [2 marks]
e) With regard to your response in (d), use any suitable classification or clustering technique to train and validate the dataset in WEKA. Report and explain your result with respect to significant information. [5 marks]
f) Load the dataset in SPSS and provide descriptive statistics for all features. A single table will do for this part. Report on relevant statistics (mean, mode, median, min, max, range, standard deviation) based on the features. [3 marks]
g) Out of the selected features, you are to discretize all the continuous features or variables. You can consider using the recode into different variables function in SPSS. You are to save and send the updated version of the .sav dataset bearing the recoded variable names. [2 marks]
h) For each of your discretized variables, provide either a bar chart or pie chart. [2 marks]
i) Using the discretized variables, make any comparison between the target variable and any of your discretized variables. You can consider using the cross tabulation functionality in SPSS. Explain your result. [3 marks]
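As a rough illustration of the .arff layout asked for in part (a), the three sections (@relation, @attribute, @data) can be assembled as sketched below. This is only a sketch under assumptions: the `to_arff` helper, the three attribute names, and the two data rows are hypothetical placeholders, not values from absenteeism.txt. The real file has the 21 attributes listed in Appendix A, and its delimiter should be checked before parsing.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and data rows."""
    lines = [f"@relation {relation}", ""]
    for name, arff_type in attributes:
        lines.append(f"@attribute {name} {arff_type}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical subset of the Appendix A attributes (names are assumptions).
attributes = [
    ("age", "numeric"),
    ("social_drinker", "{0,1}"),            # nominal attribute with two values
    ("absenteeism_time_in_hours", "numeric"),
]
# Placeholder rows for illustration only; not taken from the real dataset.
rows = [(33, 1, 4), (50, 1, 0)]

arff_text = to_arff("absenteeism", attributes, rows)
print(arff_text)
```

Nominal attributes are declared with their allowed values in braces, while continuous ones use `numeric`; the target attribute is conventionally listed last.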
Question 2. [25 Marks]
This Question requires the use of the RapidMiner toolkit. Consider the dataset in Table 1 to be used to train a decision tree. The dataset comprises the following attributes, namely age, income, student, credit_rating and buys_computer. The buys_computer attribute is considered the dependent variable and the remaining attributes are considered the independent variables. Imagine you are asked to set up a decision tree for training the dataset; briefly explain how you will address the following issues:
a) How many features are in the dataset presented in the table below? [1 mark]
b) How many tuples are in the dataset? [1 mark]
c) Which feature of the dataset makes it suitable for considering a supervised learning algorithm such as decision tree? [1 mark]
d) Aside from the decision tree, list any two supervised learning algorithms that can also be used for training the dataset. [1 mark]
e) Out of the categorical variables, list two (2) dichotomous variables. [1 mark]
f) Compute the information gain for each of the independent attributes. [5 marks]
g) With reference to the information gains computed in (f), determine which attribute can be considered as the root node for the decision tree. [1 mark]
h) Complete the construction of the decision tree showing how you arrived at the tree. [5 marks]
i) Assume there were missing values in the dataset, discuss two ways of handling them. [2 marks]
j) Assume there were outliers in the dataset, discuss two ways of handling them. [2 marks]
k) Call the Purchase Computer dataset (Table 1) in RapidMiner and construct the tree using the various operators. Report on the step by step procedure you considered in constructing the tree based on your selected operators in RapidMiner. [5 marks]
Table 1. Purchase Computer Dataset
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31-40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31-40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31-40   medium   no        excellent       yes
31-40   high     yes       fair            yes
>40     medium   no        excellent       no
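The information gain computation in part (f) can be sketched in Python as follows. This is a minimal sketch, not a prescribed solution: it uses the standard entropy-based definition, the function names are illustrative, and the age bands are assumed to be `<=30`, `31-40` and `>40`.

```python
from collections import Counter
from math import log2

COLUMNS = ["age", "income", "student", "credit_rating", "buys_computer"]
# Rows of Table 1; age bands "<=30" and ">40" are an assumed reading.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index, target_index=-1):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target_index] for r in rows])
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[target_index] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.3f}")
```

Under these assumptions the attribute with the largest gain is the natural candidate for the root node in part (g).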
Question 3. [25 Marks]
a) Consider a set of 100,000 medical drug products emerging from two (2) different pharmaceutical companies, namely Company A and Company B, to be provided to patients at a hospital in Tema. Imagine that out of the total products, only a sample of 300 products was tested by the Food and Drugs Authority (FDA). The FDA found at least two defective/fake products, resulting in the total set of products being discarded. After prior testing done by the two companies on each of the 100,000 drugs, it was found that 0.5% of the products emerging from Company A were defective and none was defective from the perspective of Company B. As a research student abreast of Data Mining techniques, you are presented with the dataset from these two companies on a spreadsheet and need to make analytical deductions and predictions from it. Assume there are 5 input features located in cells A2:E100001 and one target feature located in cells F2:F100001 of the spreadsheet.
i Per the information given above about the medical products, which type of classification algorithm will you use to perform your mining - supervised or unsupervised classification? Provide a reason. [2 marks]
ii Mention any four algorithms you can use per your recommended type of classification in (i) above. [2 marks]
iii Explain any three major tasks you will undertake during preprocessing of the data. [3 marks]
iv Explain how you will normalize your dataset with any suitable normalization technique. [2 marks]
v Explain how you will separate the dataset into the right percentages or partitions before subjecting it to your chosen algorithm. [3 marks]
vi Do you think prediction or forecasting can be made from your chosen model implemented from the algorithm used? If yes, how can prediction be made for new input values. [3 marks]
vii Give with valid evidence the type of probability model used to subject the 300 sampled products to test. [2 marks]
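For parts (iv) and (v), one possible approach is min-max normalization followed by a random train/test partition. The sketch below is illustrative only: min-max scaling and a 70/30 split are assumed choices (other techniques and ratios are equally valid), and the helper names and sample values are hypothetical.

```python
import random


def min_max_normalize(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into train and test partitions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


feature = [2.0, 4.0, 6.0, 10.0]       # illustrative values, not real data
scaled = min_max_normalize(feature)
train, test = train_test_split(list(range(10)))
print(scaled, len(train), len(test))
```

In practice the minimum and maximum should be taken from the training partition only and then reused on the test partition, so that no information leaks from the test data.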
b) Consider the following set of frequent 3-itemsets: {1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}
i List all candidate 4-itemsets obtained using the candidate generation step of the Apriori algorithm. [4 marks]
ii List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm before support counting. [4 marks]
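The two Apriori steps in part (b) can be sketched as below. The sketch assumes the common prefix-join formulation (two frequent (k-1)-itemsets are merged when their first k-2 items agree), followed by pruning of any candidate that has an infrequent (k-1)-subset. The function names are illustrative.

```python
from itertools import combinations

FREQUENT_3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]


def generate_candidates(frequent_itemsets):
    """Join step: merge sorted itemsets that share all but their last item."""
    ordered = sorted(tuple(sorted(s)) for s in frequent_itemsets)
    candidates = []
    for a, b in combinations(ordered, 2):
        if a[:-1] == b[:-1]:
            candidates.append(a + (b[-1],))
    return candidates


def prune_candidates(candidates, frequent_itemsets):
    """Prune step: keep candidates whose (k-1)-subsets are all frequent."""
    frequent = {tuple(sorted(s)) for s in frequent_itemsets}
    return [
        c for c in candidates
        if all(sub in frequent for sub in combinations(c, len(c) - 1))
    ]


candidates = generate_candidates(FREQUENT_3)
survivors = prune_candidates(candidates, FREQUENT_3)
print("candidates:", candidates)
print("after pruning:", survivors)
```

A candidate is dropped as soon as one of its 3-item subsets is missing from the frequent list, which is why pruning happens before any support counting.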
Question 4. [25 Marks]
a) Consider a dataset, namely weather with four input features – outlook, temperature, humidity and windy. The target for the given dataset is play which is a dichotomous variable with labels yes and no. The dataset has 14 instances with the target variable having 9 instances in the yes class and 5 instances in the no class.
In the attempt of setting up a classification model for the given dataset, two main classification algorithms, namely Naïve Bayes and Logistic Regression were set up in WEKA and their outputs are given below in Fig. 1 and Fig. 2 respectively.
i Comparing the two outputs from LHS and RHS above, which model will you recommend as optimal for classification of the given dataset. [2 marks]
ii Justify your answer for the best model in (i) above with valid reasons based on the outputs presented. [4 marks]
iii Explain the Confusion Matrix for your model selected in (i). [5 marks]
iv Imagine the yes class has 4 instances instead of 9 and the no class has 10 instances instead of 5, which technique can be considered to increase the success (yes) instances while maintaining the failure (no) instances? [2 marks]
Fig. 1 (WEKA output, Naïve Bayes)    Fig. 2 (WEKA output, Logistic Regression)
b) Construct a neural network for solving the Exclusive-OR problem, showing the values of all the weights and biases of the network. You may assume that a threshold function is used for all the neurons. [6 marks]
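One possible network for part (b), sketched in Python, uses two hidden threshold neurons (an OR unit and an AND unit) feeding a single output neuron. The weights and biases shown are just one valid choice among many, and the strict "greater than zero" threshold is an assumption.

```python
def step(x):
    """Threshold activation: fires (1) when the net input is positive."""
    return 1 if x > 0 else 0


def xor_network(a, b):
    # Hidden neuron 1 behaves like OR: weights (1, 1), bias -0.5.
    h_or = step(1 * a + 1 * b - 0.5)
    # Hidden neuron 2 behaves like AND: weights (1, 1), bias -1.5.
    h_and = step(1 * a + 1 * b - 1.5)
    # Output fires when OR is on and AND is off: weights (1, -1), bias -0.5.
    return step(1 * h_or - 1 * h_and - 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

The hidden layer is needed because no single threshold neuron can draw one line separating the XOR outputs.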
c) For each of the Boolean functions below, determine whether the problem is linearly separable.
i (NOT A) AND B [3 marks]
ii (A XOR B) AND (A OR B) [3 marks]
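A quick way to explore part (c) is to search for a single threshold unit that reproduces each truth table. The sketch below tries a small grid of weights and biases; the grid range is an assumption. Finding a separator shows the function is linearly separable, whereas finding none on the grid is supporting evidence rather than a formal proof.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=2))


def is_linearly_separable(func, grid=(-2, -1, 0, 1, 2)):
    """Search for w1, w2, bias with (w1*a + w2*b + bias > 0) == func(a, b)."""
    biases = [b + 0.5 for b in range(-3, 3)]   # -2.5, -1.5, ..., 2.5
    for w1, w2 in product(grid, repeat=2):
        for bias in biases:
            if all((w1 * a + w2 * b + bias > 0) == bool(func(a, b))
                   for a, b in INPUTS):
                return True
    return False


def not_a_and_b(a, b):
    return (1 - a) & b


def xor_and_or(a, b):
    return (a ^ b) & (a | b)


print("(NOT A) AND B:", is_linearly_separable(not_a_and_b))
print("(A XOR B) AND (A OR B):", is_linearly_separable(xor_and_or))
```

Writing out the truth table of the second function shows it equals A XOR B, since A OR B is true whenever A XOR B is true.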
Question 5. [25 Marks]
A. Assume that the support vector machine (SVM) classifier is applied on a given dataset and the output from the classifier benchmarked against the actual labels of the dataset is depicted in the following table:
Actual Label Y Y Y N N N N Y N N N
SVM Output Y Y N Y N Y N Y Y Y Y
a) Provide a general overview of the confusion matrix in a tabular form showing the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). [2 marks]
b) Create a confusion matrix in a tabular form for the output from the classifier and the actual dataset labels. [6 marks]
c) From the confusion matrix in (b), compute the following performance measures by showing the step-by-step procedure involved in arriving at your results.
i. Accuracy [2 marks]
ii. Precision [2 marks]
iii. Recall [2 marks]
iv. F-measure [2 marks]
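The counts and measures in parts (b) and (c) can be checked with a short Python sketch. It assumes Y is the positive class and uses the standard definitions of the four measures; the variable names are illustrative.

```python
actual = "Y Y Y N N N N Y N N N".split()
predicted = "Y Y N Y N Y N Y Y Y Y".split()

# Count each cell of the confusion matrix, treating "Y" as positive.
pairs = list(zip(actual, predicted))
tp = sum(a == "Y" and p == "Y" for a, p in pairs)
tn = sum(a == "N" and p == "N" for a, p in pairs)
fp = sum(a == "N" and p == "Y" for a, p in pairs)
fn = sum(a == "Y" and p == "N" for a, p in pairs)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F={f_measure:.3f}")
```

If N were taken as the positive class instead, the TP/TN and FP/FN counts would swap, so precision, recall and F-measure would change while accuracy would stay the same.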
B. Consider the following set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}.
a) For each of the following sets of initial centroids
i. {18, 45}
ii. {15, 40},
create two clusters by assigning each point to the nearest centroid, and then calculate the sum squared error for each set of two clusters after updating the centroids. [6 marks]
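The assign, update, and sum-squared-error steps described in B(a) can be sketched in Python for one-dimensional points. This is a minimal sketch with illustrative function names; it assumes no cluster ends up empty, which holds for both sets of starting centroids here.

```python
POINTS = [6, 12, 18, 24, 30, 42, 48]


def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return clusters


def update(clusters):
    """Recompute each centroid as the mean of its cluster (assumed non-empty)."""
    return [sum(c) / len(c) for c in clusters]


def sse(clusters, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum((p - m) ** 2 for c, m in zip(clusters, centroids) for p in c)


def one_iteration(points, centroids):
    clusters = assign(points, centroids)
    new_centroids = update(clusters)
    return clusters, new_centroids, sse(clusters, new_centroids)


for start in ([18, 45], [15, 40]):
    clusters, new_centroids, error = one_iteration(POINTS, start)
    print(start, "->", clusters, new_centroids, "SSE =", error)
```

Comparing `new_centroids` with the starting centroids indicates whether a further iteration would move any point, which is the stability check.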
b) Do both sets of centroids in B(a) represent stable solutions, i.e., if the K-means algorithm is applied to this set of points using the given centroids as the starting centroids, would there be any change in the clusters generated? [3 marks]
Appendix A: Attribute Information of Absenteeism Dataset
1. Individual identification (ID)
2. Reason for absence (ICD).
Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows:
I Certain infectious and parasitic diseases
II Neoplasms
III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism
IV Endocrine, nutritional and metabolic diseases
V Mental and behavioural disorders
VI Diseases of the nervous system
VII Diseases of the eye and adnexa
VIII Diseases of the ear and mastoid process
IX Diseases of the circulatory system
X Diseases of the respiratory system
XI Diseases of the digestive system
XII Diseases of the skin and subcutaneous tissue
XIII Diseases of the musculoskeletal system and connective tissue
XIV Diseases of the genitourinary system
XV Pregnancy, childbirth and the puerperium
XVI Certain conditions originating in the perinatal period
XVII Congenital malformations, deformations and chromosomal abnormalities
XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
XIX Injury, poisoning and certain other consequences of external causes
XX External causes of morbidity and mortality
XXI Factors influencing health status and contact with health services.
And 7 categories without an ICD code: patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Work load Average/day
11. Hit target
12. Disciplinary failure (yes=1; no=0)
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
14. Son (number of children)
15. Social drinker (yes=1; no=0)
16. Social smoker (yes=1; no=0)
17. Pet (number of pets)
18. Weight
19. Height
20. Body mass index
21. Absenteeism time in hours (target)
