
ITECH1103 Big Data and Analytics Assignment Sample

IT - Report

You will use an analytical tool (i.e. WEKA) to explore, analyse and visualise a dataset of your choosing. An important part of this work is preparing a good-quality report that details your choices, content, and analysis, and that is written in an appropriate style.

The dataset should be chosen from the following repository:
UC Irvine Machine Learning Repository https://archive.ics.uci.edu/ml/index.php

The aim is to use the data set allocated to provide interesting insights, trends and patterns amongst the data. Your intended audience is the CEO and middle management of the Company for whom you are employed, and who have tasked you with this analysis.


Task 1 – Data choice. Choose any dataset from the repository that has at least five attributes, and for which the default task is classification. Transform this dataset into the ARFF format required by WEKA.

Task 2 – Background information. Write a description of the dataset and project, and its importance for the organisation. Provide an overview of what the dataset is about, including from where and how it has been gathered, and for what purpose. Discuss the main benefits of using data mining to explore datasets such as this. This discussion should be suitable for a general audience. Information must come from at least two appropriate sources and be appropriately referenced.

Task 3 – Data description. Describe how many instances the dataset contains, how many attributes there are, their names, and which is the class attribute. Include in your description details of any missing values and any other relevant characteristics. For at least five attributes, describe the range of possible values, and visualise these in a graphical format.

Task 4 – Data preprocessing. Preprocess the dataset attributes using WEKA's filters. Useful techniques include removing certain attributes, exploring different ways of discretising continuous attributes, and replacing missing values. Discretising is the conversion of numeric attributes into "nominal" ones by binning numeric values into intervals. Missing values in ARFF files are represented with the character "?". If you replaced missing values, explain what strategy you used to select the replacements. Use and describe at least three different preprocessing techniques.
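For the missing-value part of this task, one common strategy (and what WEKA's ReplaceMissingValues filter does for nominal attributes) is to substitute the most frequent value. A minimal pure-Python sketch of that idea, using a made-up yes/no column:

```python
from collections import Counter

def replace_missing(values, missing="?"):
    """Replace '?' entries with the modal value of the column,
    mirroring WEKA's ReplaceMissingValues for nominal attributes."""
    known = [v for v in values if v != missing]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

print(replace_missing(["yes", "?", "no", "yes", "?"]))
# → ['yes', 'yes', 'no', 'yes', 'yes']
```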

Task 5 – Data mining. Compare and contrast at least three different data mining algorithms on your data, for instance: k-nearest neighbour, Apriori association rules, or decision tree induction. For each experiment you ran, describe the data you used, that is, whether you used the entire dataset or just a subset of it. You must include screenshots and results from the techniques you employ.

Task 6 – Discussion of findings. Explain your results and include the usefulness of the approaches for the purpose of the analysis. Include any assumptions that you may have made about the analysis. In this discussion you should explain what each algorithm provides to the overall analysis task. Summarize your main findings.

Task 7 – Report writing. Present your work in the form of an analytics report.


Data choice

In order to perform the analysis on the selected dataset, it is important to make the data suitable for the WEKA tool. WEKA supports the ARFF data format, and its ARFF-Viewer can be used to inspect a dataset and convert it into the required form (Bharati, Rahman & Podder, 2018). For this task, a dataset has been downloaded that relates to a phone-call marketing campaign run by a Portuguese banking institution. The dataset was originally in CSV format, which was imported into the analytical platform. The figure below shows the CSV file as it was first loaded into the tool. To meet the tool's requirements, this file needed to be transformed into the ARFF format.


Figure 1: Original csv data

After opening the CSV file in the ARFF-Viewer, the file was saved in the ARFF format to complete the conversion.


Figure 2: Transformed ARFF data

The above figure shows that the dataset has been transformed into the ARFF format and that all of the original attributes are present. The conversion was completed successfully, so further analysis can be carried out on the selected dataset.
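The ARFF format that WEKA requires is essentially the CSV data prefixed by a @relation line and one @attribute declaration per column. As a rough illustration of what the ARFF-Viewer conversion produces (the column names and value sets here are illustrative, not taken from the real campaign file), a minimal converter might look like:

```python
import csv, io

def csv_to_arff(csv_text, relation, nominal=None):
    """Convert CSV text to a minimal ARFF string. Attributes listed in
    `nominal` get an explicit value set; everything else is declared
    numeric. A simplified stand-in for the 'save as ARFF' step."""
    nominal = nominal or {}
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for col in header:
        if col in nominal:
            lines.append("@attribute %s {%s}" % (col, ",".join(nominal[col])))
        else:
            lines.append(f"@attribute {col} numeric")
    lines += ["", "@data"] + [",".join(r) for r in data]
    return "\n".join(lines)

sample = "age,loan\n35,yes\n42,no"
print(csv_to_arff(sample, "bank", {"loan": ["yes", "no"]}))
```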
Background information

The dataset chosen for this project contains bank customer details. The data were generated from information gathered during phone calls with clients, with the goal of predicting whether a client is interested in investing in a term deposit. The data frame contains several attributes describing each client, including data types and other client-related information. The dataset can be analyzed in the WEKA data mining tool, which provides a range of analytical algorithms that can be used to classify the data with respect to a chosen class attribute. The dataset was downloaded from the UCI Machine Learning Repository, which hosts a large number of datasets across a variety of topics and categories. This particular dataset is mainly about client details, and the attributes it contains support all of the necessary analysis.

To extract proper insights and findings from the analysis, clients can be classified into two categories: those who are interested in investing in a term deposit and those who are not. This project carries out a bank marketing analysis to build knowledge of the clients' investment patterns, focusing on data analysis using the WEKA analytical tool. From the given data and statistics, several essential insights can be extracted using the analytical platform. The Portuguese banking institution will then be able to make crucial decisions on client engagement and investment, and attractive offers and terms can be directed at potential investors on the basis of the client data analysis. Analysts can also obtain complete statistics on the previous campaign and its outcomes. The analysis is therefore intended to help the organization raise its subscription rate, and all of the major findings are documented in this report.

To fulfil the project aim and objectives, the WEKA data mining tool will be used, as it provides major features for preprocessing and analyzing datasets. Data mining offers several benefits to its users, and depending on the data types and project objectives, data mining tools can serve multiple purposes. Business managers can obtain information from a variety of reputable sources using data mining tools and methodologies, and industry professionals can gain a number of important observations from studying a large dataset. With analytics solutions such as WEKA, a significant volume of data and information can be readily handled and controlled. Furthermore, policy makers can make a variety of critical judgments after evaluating information with data mining methods, which can lead to positive outcomes for business expansion.

Data description

A bank campaign dataset has been selected for this project, and it will be analyzed to obtain vital information about the clients. It is important to understand the dataset properly in order to carry out a sound analysis on the analytical platform. The dataset contains all of the major attributes relating to the clients, including age, marital status, loan details, campaign details, campaign outcome, and several economic indexes. The attributes can be grouped into client data, campaign data, and social and economic indicators. All of the attributes can be preprocessed for the analysis once the class attribute and categories have been considered. The last attribute of the data frame, desired_target, is treated as the target class; whether clients are interested in investing in a term deposit with the bank is the main focus of the entire analysis. Some data preprocessing will also be carried out to make the data frame suitable for the analysis, using the filtering features provided by the analytical platform.

There are 19 attributes in the data frame, and all of the relevant attributes will be considered in this analysis. Five major attributes have been evaluated here on the basis of their importance:

• Job: This attribute gives an overview of the client's field of work. The client's income level can also be inferred from the job profile, which can play a vital role in the client's investment strategy. From a business perspective, the job profile can help the bank customize offers and policies for the client.

• Marital status: A client's investment tendency can be inferred from their marital status. Consumers in different relationship situations sometimes make different financial investments, and clients' expenses also vary with relationship status.

• Loan: A client's previous loans and financial transactions should also be considered by the bank's business analysts. This attribute records whether or not the client has taken out a loan before.

• Poutcome: This feature records the outcome of the previous marketing campaign as a success or failure, and is another vital input when predicting whether the client is likely to invest. It plays an important role in this analysis.

• Cons.conf.idx: The consumer confidence index is another essential aspect that must be analyzed when predicting whether a client is interested in investing.

These five attributes are the most essential aspects to analyze in order to gain insights into the campaign data and its possibilities. The campaign strategy can then be adjusted on the basis of the previous results and outcomes.

Data pre-processing

Data pre-processing is the first stage that analysts must perform to make data suitable for analysis. Several issues that would otherwise affect the analysis can be mitigated using different pre-processing techniques, such as data cleaning and transformation. In this task, a number of pre-processing steps have been followed to make the data frame suitable for the analysis.
Removing attributes from the data frame


Figure 3: Unnecessary attributes removed

In the above figure, two attributes, euribor3m and nr.employed, have been removed from the data frame. These two attributes do not provide any vital insight into the campaign data.
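Conceptually, WEKA's Remove filter simply drops the selected columns from every instance. A small pure-Python sketch of the same operation (the attribute values shown are placeholders, not figures from the real dataset):

```python
def remove_attributes(rows, drop):
    """Drop the named attributes from each record, mirroring the
    effect of WEKA's unsupervised attribute filter Remove."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

rows = [{"age": 35, "euribor3m": 4.86, "nr.employed": 5191.0}]
print(remove_attributes(rows, {"euribor3m", "nr.employed"}))
# → [{'age': 35}]
```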

Discretizing attributes


Figure 4: Discretizing attributes

In the above figure, four attributes have been selected and transformed from the numeric to the nominal data type. This makes the analysis easier when selecting the class attribute. In addition, several of the techniques used in the selected analytical tool work better with nominal values, and the tool gives clearer visualizations for them.

Removing duplicated values

Duplicated values distort the analytical results obtained from the data frame, so it is important to remove them. In the figure below, all of the attributes have been selected and a filter has been applied to remove duplicated instances from the data frame.


Figure 5: Removing duplicated values

After removing the duplicated values, the counts in each column and category are reduced, and only distinct instances remain in the data frame.
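The duplicate-removal step keeps only the first occurrence of each identical instance. The same effect can be sketched in a few lines of Python (the sample rows are illustrative):

```python
def remove_duplicates(rows):
    """Keep only the first occurrence of each identical instance,
    mirroring the effect of WEKA's duplicate-removal filtering."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

data = [["35", "yes"], ["42", "no"], ["35", "yes"]]
print(remove_duplicates(data))
# → [['35', 'yes'], ['42', 'no']]
```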

These three data pre-processing steps were applied in this project to make the dataset appropriate for the analysis. With the data prepared, all of the necessary analysis and insights could then be produced.

Data mining

There are several data mining techniques that can be applied to the dataset to obtain proper insights into, and visualizations of, current business operations and activities. Based on the business requirements, classification algorithms can be applied to the data frame, and once the algorithms have been executed successfully, the given problem can be analyzed in detail. In this task, three different algorithms have been selected and executed on the data frame.

Random Forest algorithm

Random Forest is an ensemble classification algorithm that makes decisions by combining many decision trees. Each tree is built on a random sample of the data, and the forest's prediction is obtained by aggregating the votes of the individual trees. In this project, a Random Forest has been applied to the data frame to classify clients as potential subscribers or non-subscribers. Averaging over many sub-sampled trees in this way improves the accuracy of the model.

Figure 6: RandomForest algorithm

In the above figure, the Random Forest algorithm has been executed on the campaign dataset in order to classify the clients. All of the attributes were included in this run, and 10-fold cross-validation was used for testing.


Figure 7: Output of RandomForest model

After running the classification algorithm, complete statistics on the model's performance are shown in the above figure, including all of the necessary parameters. The model achieved about 85% accuracy and was built in 5.66 seconds. A confusion matrix has also been produced, showing how the instances were divided between the two classes.
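For reference, the accuracy figure reported by WEKA is simply the sum of the confusion-matrix diagonal (correctly classified instances) divided by the total number of instances. The counts below are hypothetical, chosen only to show how a roughly 85% figure arises:

```python
def accuracy(confusion):
    """Accuracy = correctly classified / total, read off the
    confusion-matrix diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical 2x2 matrix: rows = actual (no, yes), cols = predicted.
cm = [[800, 50], [100, 50]]
print(f"{accuracy(cm):.0%}")  # → 85%
```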

Naive Bayes algorithm

Naïve Bayes is a simple and effective supervised algorithm for classification and for predicting a particular feature. It is termed "naïve" because it assumes that the appearance of one feature is unrelated to the appearance of the others; as a result, each feature contributes to the classification without relying on the rest. To classify the identified features of the data frame, the Naïve Bayes classifier combines these per-feature contributions with predefined class probabilities (Hawari & Sinaga, 2019). The algorithm therefore treats the variables in the data frame as independent, and predictions are made under these probabilistic assumptions.
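To make the independence assumption concrete, here is a toy categorical Naïve Bayes in pure Python: class priors are multiplied by per-feature value frequencies, with Laplace smoothing so unseen values do not zero out the product. The marital-status/loan rows are invented for illustration, not taken from the campaign data:

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """Count class priors and per-class, per-feature value frequencies."""
    priors = Counter(y)
    cond = defaultdict(Counter)
    for row, label in zip(X, y):
        for i, v in enumerate(row):
            cond[(label, i)][v] += 1
    return priors, cond

def predict_nb(priors, cond, row):
    """Pick the class maximising prior * product of smoothed likelihoods."""
    best, best_p = None, -1.0
    total = sum(priors.values())
    for label, count in priors.items():
        p = count / total
        for i, v in enumerate(row):
            # Laplace smoothing: +1 in the numerator, +2 for a binary value set
            p *= (cond[(label, i)][v] + 1) / (count + 2)
        if p > best_p:
            best, best_p = label, p
    return best

X = [["married", "yes"], ["single", "no"], ["married", "yes"], ["single", "yes"]]
y = ["yes", "no", "yes", "no"]
print(predict_nb(*train_nb(X, y), ["married", "yes"]))  # → yes
```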


Figure 8: Naïve Bayes classifier

In the above figure, the Naïve Bayes algorithm has been executed on the given data frame. The data frame is classified into two categories, yes and no, and 10-fold cross-validation was selected as the testing option. Based on the selected features, the model classifies each instance into one of the class values.


Figure 9: Output of Naive Bayes model

After the Naïve Bayes algorithm was run, the statistics above were obtained, showing all of the essential parameters of the model. The model is able to classify the data frame with more than 83% accuracy. The confusion matrix is also shown, giving an overview of the model's classification capability.

K-nearest neighbor

In this model, new data are classified by checking their similarity to previously seen data. The k-nearest neighbor (kNN) algorithm can easily assign new instances to categories based on the existing data records, and it can be used for both regression and classification. The algorithm makes no assumptions about the underlying data distribution. However, because the work is deferred from training to prediction time, classifying the test set is not quick.
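A minimal sketch of the kNN idea, with invented (age, loan-flag) points: the query is assigned the majority label among its k closest training instances.

```python
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among the k nearest training
    points, using squared Euclidean distance."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), lbl)
        for row, lbl in zip(train, labels)
    )
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]

# Invented training points: (age, has-loan flag) with subscribe labels.
train = [(25, 0), (30, 1), (55, 0), (60, 1), (35, 1)]
labels = ["no", "yes", "no", "no", "yes"]
print(knn_predict(train, labels, (32, 1)))  # → yes
```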


Figure 10: K-nearest neighbor

In the above figure, a lazy classifier has been executed on the data frame. The nearest neighbors are identified using a predefined distance measure, and 10-fold cross-validation has again been used.


Figure 11: Output of K-nearest neighbor

In the above figure, several performance parameters are shown as output of the model, along with the model's confusion matrix.

Discussion of findings

After analyzing the given dataset, a number of vital insights have been obtained; these are discussed in this section with supporting evidence.


Figure 12: Age variable vs. desired_target

The above figure shows that clients aged between 33 and 41 are the most interested in investing in term deposits. The rate of investment decreases as the age of the clients increases.


Figure 13: Job variable vs. desired_target

On the other hand, clients with administrative job profiles account for the largest numbers of both subscribers and non-subscribers.


Figure 14: Marital status variable vs. desired_target

Here, the analysis has been broken down by the marital status of the clients. It shows that married clients are the most likely to make investments.

Figure 15: Loan variable vs. desired_target

Clients who have already taken out a loan show interest in making investments. On the other hand, the percentage of non-subscribers is lower among clients who have not taken out a loan.


Figure 16: Poutcome variable vs. desired_target

However, the outcome of the previous phone campaign shows that, for most clients, no particular result was recorded.

Figure 17: Cons.conf.idx variable vs. desired_target

The consumer confidence index is another essential aspect, and it too has been analyzed in the above figure.

Figure 18: Cons.price.idx variable vs. desired_target

The consumer price index is illustrated in the above figure. Categorizing this feature against desired_target reveals some further insights.

