Intelligent Web News Mining
Project Description – In this project, students will design intelligent web news mining software. Students will be provided with initial web news data taken from an online newspaper, The Age, but the software should later be able to handle new data. The software should have two operation modes: training and testing. In training mode, a model is built and can be updated when new data are collected. In testing mode, the model is fixed and only classifies web news into the given categories. The software should be equipped with basic visualization tools that present both training and testing results. It is recommended that you develop your program in MATLAB, but you are welcome to use other software packages. Detailed descriptions of the software are outlined as follows:
Machine Learning Algorithms – In this project, your software will be built upon three popular machine learning algorithms: linear regression (LR), neural network (NN), and adaptive-network-based fuzzy inference system (ANFIS). All algorithms accept the same input variables and output the same target variable; they differ in the training process used to generate a predictive model. All three models should be trained on the given data during the training process, so that during the testing process you have three models ready to predict the news category. The software should reproduce features of the corresponding MATLAB toolboxes: curve fitting, neural network, and neuro-fuzzy. In the training process, your program can accept new data and retrain the models using the up-to-date dataset. Training can be carried out using both cross-validation and periodic hold-out procedures.
1) LR: your program should offer the features of the curve fitting toolbox in MATLAB. The weights of the LR model are determined using the least squares (LS) estimation method.
2) NN: your program should offer the features of the NN fitting tool in MATLAB. In the training process, the user can set the network architecture and the training algorithm. The available training algorithms are the same as those in the MATLAB NN fitting tool: Levenberg-Marquardt, Conjugate Gradient, and Bayesian Regularization. The user can also set the data proportions.
3) ANFIS: as with the previous two algorithms, your program should offer the features of the MATLAB ANFIS toolbox. The user can opt to use either the grid partitioning approach or the subtractive clustering approach. Furthermore, the user can choose either the hybrid method or the backpropagation method for parameter optimization.
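As an illustration of the LS estimation and cross-validation mentioned above, the following is a minimal Python sketch rather than the MATLAB toolbox behaviour. It assumes that classification is treated as regression onto one-hot class targets, that the weights come from the normal equations with a tiny ridge term for numerical stability, and that folds are formed by interleaving sample indices. The function names, the two-dimensional toy features, and the two class labels are all hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    `a` is n-by-n and `b` is n-by-m (one column per target), as lists of lists.
    """
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_least_squares(features, labels, classes, ridge=1e-8):
    """Return W minimising ||X W - Y||^2, with a bias column in X and one-hot Y."""
    x = [[1.0] + list(row) for row in features]
    y = [[1.0 if label == c else 0.0 for c in classes] for label in labels]
    d = len(x[0])
    # Normal equations: (X^T X + ridge * I) W = X^T Y
    xtx = [[sum(r[i] * r[j] for r in x) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [[sum(r[i] * t[k] for r, t in zip(x, y)) for k in range(len(classes))]
           for i in range(d)]
    return solve_linear_system(xtx, xty)


def predict(weights, features, classes):
    """Assign each sample to the class with the largest linear score."""
    predicted = []
    for row in features:
        x = [1.0] + list(row)
        scores = [sum(xi * w[k] for xi, w in zip(x, weights))
                  for k in range(len(classes))]
        predicted.append(classes[scores.index(max(scores))])
    return predicted


def k_fold_accuracy(features, labels, classes, k):
    """Estimate accuracy with k-fold cross-validation (interleaved folds)."""
    n = len(features)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [features[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        weights = fit_least_squares(train_x, train_y, classes)
        for i in sorted(test_idx):
            if predict(weights, [features[i]], classes)[0] == labels[i]:
                correct += 1
    return correct / n


# Hypothetical toy data: two numeric features per document, two classes.
CLASSES = ["business", "technology"]
FEATURES = [[3.0, 0.0], [2.0, 0.0], [4.0, 1.0], [0.0, 3.0], [0.0, 2.0], [1.0, 4.0]]
LABELS = ["business", "business", "business", "technology", "technology", "technology"]

weights = fit_least_squares(FEATURES, LABELS, CLASSES)
print(predict(weights, [[5.0, 0.0], [0.0, 5.0]], CLASSES))
print(k_fold_accuracy(FEATURES, LABELS, CLASSES, k=3))
```

Retraining on an up-to-date dataset then amounts to calling `fit_least_squares` again on the enlarged data, while testing mode only calls `predict` with the stored weights.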
Data description – students will be given text data collected from a real online newspaper, The Age. Because text data cannot be fed directly into learning algorithms, your software should include a pre-processing step, namely a term extraction method. The data consist of 6 classes, including business, entertainment, lifestyle, politics, and technology. Your program accepts input from a text box or from a text file on your computer.
Data pre-processing – because the input of your program is text data, your software should include a data pre-processing step that converts text data into numeric data that can be processed by machine learning algorithms. This process is called term extraction and consists of two steps: term generation and term filtering. Term generation encompasses stopword removal, tokenization, and stemming, and outputs a term-document matrix. Term filtering then reduces the term-document matrix to the numeric features used for learning. To this end, two phases are undertaken: TF-IDF calculation and term selection. TF-IDF weights each term's frequency within a document by how rare the term is across all documents, and these weights are used as the basis for term selection. You will be given the MATLAB code for term extraction. Your task is to integrate it with your main software.
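To make the term extraction steps above concrete, here is a minimal Python sketch. It is not the MATLAB code that will be supplied. The stopword list, the crude suffix-stripping stemmer, the particular TF-IDF variant (relative term frequency times the natural log of inverse document frequency), and the "keep the k highest-weighted terms" selection rule are all assumptions chosen for illustration. The matrix is stored with one row per document.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "on", "for"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def generate_terms(text):
    """Term generation: tokenize, drop stopwords, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def term_document_matrix(documents):
    """Return (vocabulary, counts) with one row of counts per document."""
    term_lists = [generate_terms(d) for d in documents]
    vocabulary = sorted({t for terms in term_lists for t in terms})
    counters = [Counter(terms) for terms in term_lists]
    matrix = [[c[t] for t in vocabulary] for c in counters]
    return vocabulary, matrix


def tf_idf(matrix):
    """Weight each count by relative frequency times log inverse document frequency."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row) or 1  # guard against empty documents
        weighted.append([(row[j] / total) * math.log(n_docs / doc_freq[j])
                         for j in range(n_terms)])
    return weighted


def select_terms(vocabulary, weighted, k):
    """Term selection: keep the k terms with the highest TF-IDF weight in any document."""
    scores = [max(row[j] for row in weighted) for j in range(len(vocabulary))]
    ranked = sorted(range(len(vocabulary)), key=lambda j: (-scores[j], vocabulary[j]))
    keep = sorted(ranked[:k])
    return [vocabulary[j] for j in keep], [[row[j] for j in keep] for row in weighted]


# Hypothetical three-document corpus.
documents = ["shares rise", "shares fall", "film shares"]
vocabulary, counts = term_document_matrix(documents)
weights = tf_idf(counts)
selected_terms, selected_features = select_terms(vocabulary, weights, k=3)
print(vocabulary)
print(selected_terms)
```

In this toy corpus the stem of "shares" occurs in every document, so its inverse document frequency is log(1) = 0 and term selection discards it. That is the behaviour the filtering step is meant to provide: terms that do not help separate documents carry no weight.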
Content – your program should have a GUI featuring training and testing modes. Your program includes three learning algorithms: LR, NN, and ANFIS. The user can choose either to use one of them or to run them concurrently. For NN and ANFIS, the user can adjust the network size and the training technique used to construct the model. Your program should cover basic visualization tools: error, prediction, network architecture, etc. The program should be able to show the fuzzy rules and membership functions of the ANFIS. Your program should also compare the predictive results of the three algorithms.
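The comparison of predictive results could be sketched as follows in Python. This is one possible approach, assuming accuracy and a confusion count are the quantities of interest. The function names and the label and prediction lists are hypothetical placeholders, not results from any trained model.

```python
from collections import Counter


def accuracy(true_labels, predicted):
    """Fraction of documents whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted):
    """Count (true class, predicted class) pairs, e.g. for a confusion-matrix plot."""
    return Counter(zip(true_labels, predicted))


def compare_models(true_labels, predictions_by_model):
    """Return (model name, accuracy) pairs sorted from best to worst."""
    rows = [(name, accuracy(true_labels, preds))
            for name, preds in predictions_by_model.items()]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Placeholder labels and predictions, for illustration only.
true_labels = ["business", "technology", "politics", "business"]
predictions = {
    "LR": ["business", "technology", "business", "business"],
    "NN": ["business", "technology", "politics", "business"],
    "ANFIS": ["business", "politics", "politics", "technology"],
}

ranking = compare_models(true_labels, predictions)
for name, acc in ranking:
    print(f"{name:<6} {acc:.2f}")
```

The same `(name, accuracy)` pairs could feed a bar chart in the GUI, and `confusion_counts` shows which categories each model tends to confuse.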
Selection Criteria –
1) Experience with MATLAB will be an advantage.
2) Good programming skills.
3) Team player who can also work independently.
Note that these selection criteria are not exhaustive. If you have no experience in machine learning, do not worry. You will be given adequate supervision and guidance for developing machine learning projects.