In this assignment, you will implement the SON algorithm using the Apache Spark Framework. You will develop a program to find frequent itemsets in two datasets: one simulated dataset and one real-world dataset generated from the Yelp dataset. The goal of this assignment is to apply the algorithms you have learned in class to large datasets more efficiently in a distributed environment.
Programming Requirements
- You must use Python to implement all tasks.
- You are required to only use Spark RDD in order to understand Spark operations more deeply. You will not get any points if you use Spark DataFrame or DataSet.
In this assignment, you will implement the SON algorithm to solve all tasks (Tasks 1 and 2) on top of the Apache Spark Framework. You need to find all possible combinations of frequent itemsets in any given input file within the required time. You can refer to Chapter 6 of the Mining of Massive Datasets book, concentrating on Section 6.4 – Limited-Pass Algorithms. In Task 2, you will explore the Yelp dataset to find the frequent business sets (Case 1 only). You will jointly use business.json and review.json to generate the input user-business CSV file yourselves.
Apply SON algorithm - The requirements for Task 2 are similar to Task 1, but you will test your implementation on the large dataset you just generated and report the total execution time, measured from reading the input file to writing the results to the output file. You are asked to find the frequent business sets (Case 1 only) from the file you just generated.
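As a rough reference for the two-phase structure described above (MMDS Section 6.4), here is a minimal SON sketch over Spark RDDs. It is not the graded solution: the names local_frequent and son, and the assumption that each RDD element is an iterable basket of item ids, are illustrative only.

```python
from collections import defaultdict

def local_frequent(partition, support, total_count):
    """Phase 1 map: run A-Priori on one partition with a scaled-down threshold."""
    baskets = [set(b) for b in partition]
    threshold = support * len(baskets) / total_count  # p*s from MMDS 6.4
    counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            counts[frozenset([item])] += 1
    frequent = {s for s, c in counts.items() if c >= threshold}
    found, k = set(frequent), 2
    while frequent:
        # Join frequent (k-1)-itemsets to build k-item candidates
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        counts = defaultdict(int)
        for basket in baskets:
            for cand in candidates:
                if cand <= basket:
                    counts[cand] += 1
        frequent = {s for s, c in counts.items() if c >= threshold}
        found |= frequent
        k += 1
    return found

def son(baskets_rdd, support):
    total = baskets_rdd.count()
    # Phase 1: the union of locally frequent itemsets is the candidate set
    candidates = (baskets_rdd
                  .mapPartitions(lambda p: local_frequent(p, support, total))
                  .distinct()
                  .collect())
    # Phase 2: count every candidate over the full dataset; keep true frequents
    return (baskets_rdd
            .flatMap(lambda b: [(c, 1) for c in candidates if c <= set(b)])
            .reduceByKey(lambda a, b: a + b)
            .filter(lambda kv: kv[1] >= support)
            .keys()
            .collect())
```

Because SON never discards a truly frequent itemset in Phase 1 and verifies every candidate in Phase 2, it produces no false positives and no false negatives in exactly two passes.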
Implementing SON Algorithm Using Map Reduce In PySpark RDD - Get Assignment Solution
Please note that this is a sample assignment solved by our Python Programmers. These solutions are intended for research and reference purposes only. If you learn any concepts by going through the report and code, our Python Tutors would be very happy.
- Option 1 - To download the complete solution along with Code, Report and screenshots - Please visit our Python Assignment Sample Solution page
- Option 2 - Reach out to our Python Tutors to get online tutoring related to this assignment and get your doubts cleared
- Option 3 - You can check the partial solution for this assignment in this blog below
Free Assignment Solution - Implementing SON Algorithm Using Map Reduce In PySpark RDD
Spark context is the main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.
It is mandatory to set configuration properties such as spark.executor.memory before instantiating the SparkContext; once the context is running, these settings cannot be changed.
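A minimal setup sketch, assuming local mode; the app name and the 4g memory value are placeholders, not values from the assignment:

```python
from pyspark import SparkConf, SparkContext

# Set configuration (e.g. spark.executor.memory) on a SparkConf
# before the context exists; it cannot be changed afterwards.
conf = (SparkConf()
        .setAppName("SON-frequent-itemsets")   # placeholder app name
        .setMaster("local[*]")
        .set("spark.executor.memory", "4g"))   # placeholder memory size

sc = SparkContext(conf=conf)
# ... build and transform RDDs here ...
sc.stop()  # stop the active context before creating a new one in the same JVM
```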
The frozenset() is a built-in Python function that takes any iterable object and returns an immutable set. A frozenset behaves like a set, except that elements cannot be added or removed once it is created; because it is immutable, it is also hashable, so it can be used as a dictionary key or stored inside another set. The order of elements is not guaranteed to be preserved.
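A quick illustration of these properties:

```python
fs = frozenset(['b', 'a', 'c'])
print(fs == frozenset(['a', 'b', 'c']))  # True: sets ignore element order
counts = {fs: 3}                         # hashable, so usable as a dict key
# fs.add('d')  # AttributeError: frozenset has no add/remove methods
```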
I have used frozensets in the generate_k_candidates and son_implementation functions to add the exact k-candidates to the old candidate list. Freezing the new candidates makes the itemsets immutable; the main reason is to fix the elements so that their values do not change from one iteration to the next.
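The actual assignment code is not reproduced here, but a hypothetical generate_k_candidates helper along the lines described above might look like this; the signature and the prune step are assumptions:

```python
from itertools import combinations

def generate_k_candidates(frequent_prev, k):
    """Hypothetical A-Priori join step: build k-item candidates from the
    frequent (k-1)-itemsets, which are assumed to be frozensets."""
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k:
                candidates.add(frozenset(union))  # freeze so elements stay fixed
    # Prune: every (k-1)-subset of a surviving candidate must itself be frequent
    return {c for c in candidates
            if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))}
```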
To print the intermediate results, you can run in debugging mode and set breakpoints at the relevant stages. I am not sure how to do that in a Jupyter notebook, but it can be done in PyCharm.
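Outside a debugger, one alternative is to materialize a small sample of each intermediate RDD; candidates_rdd below is a hypothetical stage from the SON pipeline sketched earlier, not a name from the assignment code:

```python
# Peek at a few elements of an intermediate RDD without collecting it all
print(candidates_rdd.take(5))

# Cache a stage you inspect repeatedly so it is not recomputed each time
candidates_rdd.cache()
print(candidates_rdd.count())
```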
Get the best SON Algorithm Using Map Reduce In PySpark RDD assignment help and tutoring services from our experts now!
About The Author - Mathew Grey
Mathew Grey is a proficient Python developer with expertise in big data processing and distributed computing using Apache Spark. With a solid background in algorithm implementation and data mining, Mathew excels at solving complex problems efficiently. His experience spans working with large datasets and applying advanced algorithms like SON for real-world applications, such as analyzing Yelp data.