COMP1702 Big Data Assignment Sample
Coursework Learning Outcomes:
1 Explain the concept of Big Data and its importance in a modern economy
2 Explain the core architecture and algorithms underpinning big data processing
3 Analyze and visualize large data sets using a range of statistical and big data technologies
4 Critically evaluate, select and employ appropriate tools and technologies for the development of big data applications
All material copied or amended from any source (e.g. internet, books) must be referenced correctly according to the reference style you are using.
Your work will be submitted for plagiarism checking. Any attempt to bypass our plagiarism detection systems will be treated as a severe Assessment Offence.
Coursework Submission Requirements
• An electronic copy of your work for this coursework must be fully uploaded on the Deadline Date of 29th Mar 2021 using the link on the coursework Moodle page for COMP1702.
• For this coursework submit all in PDF format. In general, any text in the document must not be an image (i.e. must not be scanned) and would normally be generated from other documents (e.g. MS Office using "Save As ... PDF"). An exception to this is handwritten mathematical notation, but when scanning do ensure the file size is not excessive.
• There are limits on the file size (see the relevant course Moodle page).
• Make sure that any files you upload are virus-free, not password-protected and not corrupted; otherwise they will be treated as null submissions.
• Your work will not be printed in colour. Please ensure that any pages with colour are acceptable when printed in Black and White.
• You must NOT submit a paper copy of this coursework.
• All coursework must be submitted as above. Under no circumstances can it be accepted by academic staff.
The University website has details of the current Coursework Regulations, including details of penalties for late submission, procedures for Extenuating Circumstances, and penalties for Assessment Offences. See http://www2.gre.ac.uk/current-students/regs
You are expected to work individually and complete a coursework that addresses the following tasks. Note: You need to cite all sources you rely on with in-text style. References should be in Harvard format. You may include material discussed in the lectures or labs, but additional credit will be given for independent research.
PART A: MapReduce Programming [300 words ±10% excluding Java code] (30 marks)
There is a text file (“papers.txt” is uploaded in Moodle) about computer science bibliography. Each line of this file describes the details of one paper in the following format: Authors|Title|Conference|Year. The different fields are separated by the “|” character, and the list of authors are separated by commas (“,”). An example line is given below: D Zhang, J Wang, D Cai, J Lu|Self-Taught Hashing for Fast Similarity Search|SIGIR|2010
You can assume that there are no duplicate records, and that each distinct author or conference has a different name.
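For illustration only, a record in this format could be parsed as follows (a minimal Python sketch; the coursework itself asks for a Java MapReduce solution, and nothing here is part of the required answer):

```python
def parse_paper(line):
    # Fields are separated by '|'; the author list is comma-separated.
    authors, title, conference, year = line.strip().split("|")
    return {
        "authors": [a.strip() for a in authors.split(",")],
        "title": title,
        "conference": conference,
        "year": int(year),
    }

record = parse_paper(
    "D Zhang, J Wang, D Cai, J Lu|Self-Taught Hashing for Fast Similarity Search|SIGIR|2010"
)
print(record["conference"], record["year"], len(record["authors"]))
```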
PART B: Big Data Project Analysis [2000 words ±10% excluding references] (70 marks)
Precision agriculture (PA) is the science of improving crop yields and assisting management decisions using high technology sensor and analysis tools. The AgrBIG company is a leading provider of agronomy services, technology and strategic advice. They plan to develop a big data system. The users can be farmers, research laboratories, policy makers, public administration, consulting or logistic companies, etc. The sources of data will be from various Sensors, Satellites, Drones, Social media, Market data, Online news feed, Logistic Corporate data, etc.
You need to design a big data project by solving the following tasks for the AgrBIG company:
Task 1 (25 marks): Produce a Big Data Architecture for the AgrBIG company, covering the following components in detail:
- Data sources,
- Data extraction and cleaning,
- Data storage,
- Batch processing,
- Real time message ingestion,
- Analytical Data store
For each of the above, discuss various options and produce your recommendation which best meets the business requirement.
Task 2 (10 marks): The AgrBIG company needs to store a large collection of plants, crops, diseases, symptoms, pests and their relationships. They also want to facilitate queries such as: "find all corn diseases which are directly or indirectly caused by Zinc deficiency". Please recommend a data store for that purpose and justify your choice.
Task 3 (10 marks): MapReduce has become the standard for performing batch processing on big data analysis tasks. However, data analysts and researchers in the AgrBIG company have found that MapReduce coding can be quite challenging for their data analysis tasks. Please recommend an alternative way for those people, who are more familiar with the SQL language, to carry out data analysis or business intelligence tasks on big data, and justify your recommendation.
Task 4 (15 marks): The AgrBIG company needs near real-time performance for some services, such as the soil moisture prediction service. It has been suggested that the parallel distributed processing on a cluster should use MapReduce to meet this requirement. Provide a detailed assessment of whether MapReduce is optimal for this requirement and, if not, what the best approach would be.
Task 5 (10 marks): Design a detailed hosting strategy for this Big Data project and explain how it will meet the scalability and high-availability requirements of this global business.
For a distinction (mark 70-79) the following is required:
1. An excellent/very good implementation of the coding task, all components are working and provide a very good result.
2. An excellent/very good research demonstrating a very good/ excellent understanding of big data concepts and techniques.
Note: In order to be eligible for a very high mark (80 and over) you will need to have:
The MapReduce framework in Hadoop runs on a distributed server environment, which enables parallel execution of different processes and handles the communication among the different machines.
The model is a split-apply-combine approach to information processing. The mapping of the data is performed by the Mapper class and the reduce task is performed by the Reducer class; MapReduce thus consists of two phases, Map and Reduce.
As the name MapReduce suggests, the reduce phase runs only after the map phase has completed. First comes the map function, where a large amount of data is read and processed to produce key-value pairs as intermediate results.
The output of the Mapper (these intermediate key-value pairs) is the input to the Reducer, which receives key-value pairs from the different map jobs.
The reducer then aggregates those intermediate tuples (intermediate key-value pairs) into a smaller set of tuples, or key-value pairs, which is the final output. MapReduce builds on the observation that many data-processing tasks share the same basic structure: a computation is applied over a large number of records (e.g., web pages) to produce partial results, which are then aggregated in some fashion. Typically, the per-record computation and the aggregation function vary according to the task, but the fundamental structure stays fixed. Taking inspiration from higher-order functions in functional programming, MapReduce provides an abstraction at exactly these two operations.
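The two phases can be illustrated with a minimal, framework-free word-count sketch (Python here purely for brevity; a real Hadoop job would use the Java API, and the shuffle step shown explicitly below is performed by the framework itself):

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit an intermediate (key, value) pair for every word in the line.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values into one final output pair.
    return (key, sum(values))

lines = ["big data big ideas", "big clusters"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())
print(result)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```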
The following is the algorithm used for the program:
Input: ((word, datafilename), (N, n, m)), where, in the standard TF-IDF formulation, n is the number of occurrences of the word in the file, N is the total number of terms in the file, and m is the number of files containing the word.
D, the total number of files, is assumed to be known.
Output: ((word, datafilename), TF*IDF).
The mapper here is simply the identity function.
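A minimal sketch of this final step (illustrative Python rather than the Java the coursework requires; the formula TF*IDF = (n/N) * log(D/m) is the standard definition, not something specified in the brief, and the example values are invented):

```python
import math

def tfidf_map(key, value, D):
    # key: (word, datafilename); value: (N, n, m) as in the algorithm above.
    word, datafilename = key
    N, n, m = value
    tf = n / N             # term frequency of the word within this file
    idf = math.log(D / m)  # inverse document frequency across D files
    return (key, tf * idf)

# Invented example: word occurs 5 times in a 100-term file, in 2 of 10 files.
key, score = tfidf_map(("hashing", "papers.txt"), (100, 5, 2), D=10)
print(key, score)
```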
A Big Data architecture is used by the AgrBIG company so that the business, and the data related to it, can be managed and analysed effectively. The architecture framework comes with different infrastructure solutions for data storage, giving a business the opportunity to hold a large amount of information and data (Ponsard et al., 2017). Data analytical tools are then used to fetch the required results from a big data set through the various components of the architecture.
Data sources
To ensure the effective running of the business and to acquire the desired outputs and profits, the company has to ensure that all the data to be operated on or processed is available at the time it is required. The company therefore uses several different data sources. Application data stores hold the relational databases with information about vendors, clients and employees (Lai and Leu, 2017). Static files are produced by the company's applications, such as server files and files associated with website handling. Real-time data sources are also used, such as devices based on Internet of Things technology.
Data extraction and Cleaning
As Tokuç, Uran and Tekin (2019) discuss, the system has to include data-cleaning features to clear junk data from the system, so that the data tools used by the company's management run effectively. The company has to consider different approaches to data cleaning, since junk data has no place in the database. The data can then be extracted from the system using a query language. Here, the company relies on a NoSQL database for extracting information to meet the different requirements of the business and its stakeholders.
Data storage
The data stored by the AgrBIG company contains a huge amount of information held as distributed files: a large volume of files and data stored in different formats. This volume of data can be kept in a storage area called a data lake. The data lake is the storage area the company uses to hold its vital information so that it can be put to use at any time (Grover et al., 2018). The company uses Azure Data Lake Store for the storage of its big data.
Batch processing
As Kim (2019) observes, since the company's data comes in huge volumes, a big data processing solution is required to filter the data and extract it at the time it is needed. Typically, long-running batch jobs are used to process the data files held in the big data store. The data can then be prepared for different analysis jobs, which may include sourcing files, processing them, and writing new outputs to files. U-SQL jobs are used in combination with Azure Data Lake Analytics to extract the required information. The company has also adopted MapReduce jobs, written in Java, for clustering the data.
Real-Time message ingestion
If the data being collected by the company includes real-time sources, the solution has to implement a real-time message ingestion system to prepare the data for stream processing. In its simplest form this is plain storage of the data, where incoming messages are dropped into a folder for further processing. Other solutions need a message ingestion layer, according to their requirements, to act as a buffer for messages and to support message delivery with queue semantics.
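A minimal sketch of this buffering pattern (illustrative only; a production ingestion layer would use a broker such as Kafka or Azure Event Hubs rather than an in-process queue, and the sensor readings shown are invented):

```python
import queue

# The queue acts as the buffer between message ingestion and processing.
buffer = queue.Queue()

def ingest(message):
    # Producer side: sensors, drones, etc. drop readings into the buffer.
    buffer.put(message)

def process_all():
    # Consumer side: drain the buffer and hand each message onward.
    processed = []
    while not buffer.empty():
        processed.append(buffer.get().upper())  # stand-in for real processing
    return processed

for reading in ["soil:0.31", "soil:0.28", "temp:17.5"]:
    ingest(reading)
results = process_all()
print(results)
```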
Analytical Data store
Different kinds of big data solution are now being prepared and used for data analysis, and these tools serve the data for processing in a semantic, structured format. Various analytic tools have been implemented to carry out the analysis through queries over the relational data warehouse. NoSQL technology, or a Hive database, has been used to provide a metadata abstraction layer over the data files in the system's distributed storage. The most traditional route used in business intelligence is an analytical data store such as Azure Synapse Analytics, which the company uses extensively to provide a managed service for big data handling (Govindan et al., 2018). The AgrBIG company also uses cloud-based data warehousing to support interactive database management for the purpose of data analysis.
The AgrBIG company has decided to implement large-scale storage of big data using a NoSQL database. This approach lets the company handle a large volume of data that can be extracted and used at high speed, with high scalability in the architecture (Eybers and Hattingh, 2017). To implement a scale-out architecture, NoSQL databases are used to serve queries even alongside cloud computing technologies. Moreover, NoSQL enables the data to be processed in large clusters, increasing capacity as computers are added to the cluster.
Another reason for using a NoSQL database is to store both unstructured and structured data with the help of flexible schemas. These schemas can easily be applied as the data is transformed and loaded into the database, and few transformations are required to store and retrieve information. In addition, the AgrBIG company prefers a NoSQL database because its flexibility makes it easy for developers to control and lets it hold many different forms of data.
This technology also makes it easy to update the data in the database by transforming the structure of the data: the values of rows and columns can be updated without disrupting the existing structure. The database has developer-friendly characteristics, which let the developer keep control of the system and its associated data structures (Ekambaram et al., 2018). It thus helps to store the data where it can be closely observed when used in the company's different applications. Moreover, getting data in and out is easier through this technology.
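The "directly or indirectly caused by" query in Task 2 amounts to a transitive traversal over the stored relationships. A minimal sketch over a toy adjacency structure (the disease and symptom names are invented for illustration; a graph-capable store would run such traversals natively with a path query):

```python
# Toy 'causes' relationships; all names are invented for illustration only.
causes = {
    "Zinc deficiency": ["Chlorosis", "Stunted growth"],
    "Chlorosis": ["Leaf blight"],
    "Stunted growth": [],
    "Leaf blight": [],
}

def caused_by(root, graph):
    # Depth-first traversal collecting everything reachable from root,
    # i.e. all conditions directly or indirectly caused by it.
    seen, stack = set(), [root]
    while stack:
        for effect in graph.get(stack.pop(), []):
            if effect not in seen:
                seen.add(effect)
                stack.append(effect)
    return seen

print(sorted(caused_by("Zinc deficiency", causes)))
```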
There are different methods that the company can use for its data analysis tasks. The company has found that MapReduce is an effective technology for the data analysis requirements of the AgrBIG company, but the technology poses various implementation challenges. MapReduce uses different stages to extract the required data, and although it is highly effective at mapping a large data set across a cluster and breaking it into many small tuples, the company prefers to use another technology for data processing. Hive can be used by the AgrBIG company to run on the parallel distributed system while staying close to the familiar SQL language. Hive runs on top of MapReduce, distributing the data, sending the mapper programs to the required locations and detecting failures so that they can be handled. As an alternative to hand-written MapReduce, Hive reduces the number of lines of code and makes them easier to understand.
The coding approach of Hadoop involves difficult functionality, making it complex and time-consuming for the business. According to Bilal and Oyedele (2020), for higher-level programming interfaces the AgrBIG company can use Hive to handle any large data set. The technology also comes with special tools for the execution and manipulation of data. Hive provides processing concepts for selecting, filtering and ordering data according to the supplied syntax, and has the flexibility to make conceptual adjustments. Thus, in most cases, those who are familiar with SQL prefer Hive over raw MapReduce as the data analysis tool for big data.
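To illustrate why Hive suits SQL-literate analysts, the sketch below pairs a hypothetical HiveQL query with equivalent plain Python over sample rows (the table name, column names and readings are all invented; the point is only the group-by/aggregate semantics that Hive compiles into MapReduce jobs behind the scenes):

```python
from collections import defaultdict

# A hypothetical HiveQL query an analyst might write instead of a
# hand-coded MapReduce job (table and column names are invented):
HIVEQL = """
SELECT region, AVG(moisture)
FROM soil_readings
GROUP BY region
"""

# Equivalent logic in plain Python over sample rows, showing the
# group-by and aggregation that the query expresses in one statement.
rows = [("north", 0.50), ("north", 0.25), ("south", 0.20)]
groups = defaultdict(list)
for region, moisture in rows:
    groups[region].append(moisture)
averages = {region: sum(vs) / len(vs) for region, vs in groups.items()}
print(averages)  # {'north': 0.375, 'south': 0.2}
```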
The AgrBIG company is concerned with implementing a scalable architecture that has the capacity to handle a large amount of distributed data on different servers across the cloud. Large data sets can be handled effectively and operated on in parallel using MapReduce. It has therefore been chosen as one of the best and most cost-effective solutions to this business requirement, able to handle and process huge amounts of data. For near real-time processing of data, the cost-effective approach of MapReduce helps to meet the requirements of the business, as the technology allows both the storage and the processing of data at an affordable cost (Becker, 2017). The AgrBIG company has found that programming with MapReduce gives effective access to different data sources, which helps to generate and extract values according to requirements, and is thus flexible for data processing and storage.
The business also gains an advantage from MapReduce programming because it supports fast access to the distributed file system, which uses a mapping structure to locate data stored in clusters. The tools that MapReduce offers allow faster processing of big data. An essential reason for recommending the technology as the optimal solution to this business requirement is security, which is a vital aspect of the application: only approved users can gain access to the system's data storage and processing.
MapReduce gives the AgrBIG company parallel distributed processing, through which tasks can be divided effectively and executed in parallel. This parallel processing technique uses multiple processes that handle the work by dividing it, so that programs can be computed and executed faster (Walls and Barnard, 2020). Moreover, data availability is protected, because the data is forwarded to various nodes in the network: if any node fails, the data can be processed from another node that holds a copy. This is provided by the fault-tolerance feature of the framework, which is vital for meeting the requirements of the business and making it more sustainable.
Defining a big data strategy for hosting the requirements of the business means synchronising the data with the business objectives the big data system is meant to serve. The strategy should align with the organisation's quality and performance goals, focusing on measurable outcomes. It should be implemented with high scalability, so that decision-making can draw on the data resources. Moreover, the data grows in volume as the organisation expands, so the right data must be chosen in order to find solutions to the various problems of the business.
Next, effective tools for handling big data must be used by the AgrBIG company to address these problems. As stated by Ajah and Nweke (2019), Hadoop is extensively used for the efficient handling of the company's structured and unstructured data. For optimisation of the data, different analytical tools can be used to meet the requirements and make predictions based on assumptions about consumer behaviour.
The whole process can ensure high availability of information to meet the requirements of the business globally, by synchronising the entire flow of data through public and private cloud provision, which also offers backups and data security. A hosting strategy for the Big Data project can thus help the organisation's management minimise risk, and help the project team discover unexpected outcomes and examine the effects of the analysis.
Ajah, I.A. and Nweke, H.F., (2019). Big data and business analytics: Trends, platforms, success factors and applications. Big Data and Cognitive Computing, 3(2), p.32. https://www.mdpi.com/2504-2289/3/2/32/pdf
Becker, D.K., (2017, December). Predicting outcomes for big data projects: Big Data Project Dynamics (BDPD): Research in progress. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 2320-2330). IEEE. https://ieeexplore.ieee.org/abstract/document/8258186/
Bilal, M. and Oyedele, L.O., (2020). Big Data with deep learning for benchmarking profitability performance in project tendering. Expert Systems with Applications, 147, p.113194. https://uwe-repository.worktribe.com/preview/5307484/Manuscript.pdf
Ekambaram, A., Sørensen, A.Ø., Bull-Berg, H. and Olsson, N.O., (2018). The role of big data and knowledge management in improving projects and project-based organizations. Procedia computer science, 138, pp.851-858. https://www.sciencedirect.com/science/article/pii/S1877050918317587/pdf?md5=fb25e51566ae00860fc3831ce4088ce0&pid=1-s2.0-S1877050918317587-main.pdf
Eybers, S. and Hattingh, M.J., (2017, May). Critical success factor categories for big data: A preliminary analysis of the current academic landscape. In 2017 IST-Africa Week Conference (IST-Africa) (pp. 1-11). IEEE. https://www.academia.edu/download/55821724/Miolo_RBGN_ING-20-1_Art7.pdf
Govindan, K., Cheng, T.E., Mishra, N. and Shukla, N., (2018). Big data analytics and application for logistics and supply chain management. https://core.ac.uk/download/pdf/188718529.pdf
Grover, V., Chiang, R.H., Liang, T.P. and Zhang, D., (2018). Creating strategic business value from big data analytics: A research framework. Journal of Management Information Systems, 35(2), pp.388-423. https://files.transtutors.com/cdn/uploadassignments/2868103_1_bda-2018.pdf
Kim, S.H., (2019). Risk Factors Identification and Priority Analysis of Bigdata Project. The Journal of the Institute of Internet, Broadcasting and Communication, 19(2), pp.25-40. https://www.koreascience.or.kr/article/JAKO201914260900587.pdf
Lai, S.T. and Leu, F.Y., (2017, July). An iterative and incremental data preprocessing procedure for improving the risk of big data project. In International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (pp. 483-492). Springer, Cham. https://link.springer.com/chapter/10.1007/978-3-319-61542-4_46
Ponsard, C., Majchrowski, A., Mouton, S. and Touzani, M., (2017). Process Guidance for the Successful Deployment of a Big Data Project: Lessons Learned from Industrial Cases. In IoTBDS (pp. 350-355). https://www.scitepress.org/papers/2017/63574/63574.pdf
Tokuç, A.A., Uran, Z.E. and Tekin, A.T., (2019). Management of Big Data Projects: PMI Approach for Success. In Agile Approaches for Successfully Managing and Executing Projects in the Fourth Industrial Revolution (pp. 279-293). IGI Global. https://www.researchgate.net/profile/Ahmet_Tekin7/publication/331079533_Management_of_Big_Data_Projects/links/5c86857ba6fdcc068187e918/Management-of-Big-Data-Projects.pdf
Walls, C. and Barnard, B., (2020). Success Factors of Big Data to Achieve Organisational Performance: Theoretical Perspectives. Expert Journal of Business and Management, 8(1). https://business.expertjournals.com/23446781-801/