
Apache Spark MCQ 1

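41. What is the coalesce transformation?

Ans: The coalesce transformation is used to change the number of partitions. Whether it triggers RDD shuffling depends on its second parameter, the boolean shuffle (defaults to false). With shuffle set to false, coalesce can only reduce the number of partitions; with shuffle set to true, it can also increase it.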
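The no-shuffle case can be pictured with a small plain-Python model. This is only a sketch, not Spark code: the function name coalesce_without_shuffle is made up for illustration, partitions are modelled as lists of lists, and the way parent partitions are grouped here is an assumption (Spark's real grouping strategy may differ).

```python
from typing import List


def coalesce_without_shuffle(partitions: List[list], num_partitions: int) -> List[list]:
    """Toy model: merge whole partitions into fewer partitions, with no shuffle."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    if num_partitions >= len(partitions):
        # Without a shuffle the number of partitions cannot grow.
        return [list(p) for p in partitions]
    merged: List[list] = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Each parent partition goes, whole, to exactly one target partition,
        # so no individual record has to be re-hashed or redistributed.
        target = i * num_partitions // len(partitions)
        merged[target].extend(part)
    return merged


parts = [[1, 2], [3], [4, 5], [6]]
fewer = coalesce_without_shuffle(parts, 2)
same = coalesce_without_shuffle(parts, 8)
print(fewer)      # [[1, 2, 3], [4, 5, 6]]
print(len(same))  # 4
```

Asking for 8 partitions leaves the toy data at 4 partitions, which mirrors the fact that growing the partition count needs shuffle set to true.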

42. What is the difference between the cache() and persist() methods of an RDD?

Ans: RDDs can be cached (using RDD’s cache() operation) or persisted (using RDD’s persist(newLevel: StorageLevel) operation). The cache() operation is a synonym for persist() with the default storage level MEMORY_ONLY, whereas persist(newLevel) lets you choose the storage level yourself.

43. You have an RDD storage level defined as MEMORY_ONLY_2. What does _2 mean?

Ans: The _2 suffix in the name denotes 2 replicas: each partition is replicated on two cluster nodes.

44. What is Shuffling?

Ans: Shuffling is the process of repartitioning (redistributing) data across partitions. It may move data across JVMs, or even over the network, when the data is redistributed among executors.

Avoid shuffling where you can, since it is expensive. Think about ways to leverage existing partitions, and use partial aggregation to reduce data transfer.

45. Does shuffling change the number of partitions?

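Ans: No. By default, shuffling does not change the number of partitions, only their content. The count changes only when the operation is given an explicit number of partitions or a partitioner.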
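Questions 44 and 45 can be illustrated with a toy hash-partitioning model in plain Python. It is a sketch under stated assumptions, not Spark's implementation: shuffle_by_key is a hypothetical helper, keys are small integers so that hash(key) is deterministic, and a record counts as "moved" when its target partition differs from its source partition.

```python
from typing import List, Tuple

Record = Tuple[int, str]


def shuffle_by_key(partitions: List[List[Record]]) -> Tuple[List[List[Record]], int]:
    """Toy shuffle: route every record to partition hash(key) % n."""
    n = len(partitions)  # the number of partitions is kept as-is
    out: List[List[Record]] = [[] for _ in range(n)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            dst = hash(key) % n
            if dst != src:
                moved += 1  # this record would cross a partition boundary
            out[dst].append((key, value))
    return out, moved


before = [[(0, "a"), (1, "b")], [(0, "c"), (1, "d")]]
after, moved = shuffle_by_key(before)
print(after)                     # [[(0, 'a'), (0, 'c')], [(1, 'b'), (1, 'd')]]
print(len(before), len(after))   # 2 2
print(moved)                     # 2
```

The partition count stays at 2 while the content of each partition changes, and half of the records had to move to get there.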

46. What is the difference between groupByKey and reduceByKey?

Ans: For aggregations, avoid groupByKey and use reduceByKey or combineByKey instead.

groupByKey shuffles all the data, which is slow.

reduceByKey shuffles only the results of sub-aggregations in each partition of the data.
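The difference in shuffled volume can be sketched in plain Python. This is a model of the idea, not the Spark API: the two function names are hypothetical, partitions are lists of (key, value) pairs, and "shuffled" simply counts the records that would leave their partition.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, int]


def group_by_key_shuffled(partitions: List[List[Pair]]) -> int:
    """groupByKey-style: every single record is sent through the shuffle."""
    return sum(len(part) for part in partitions)


def reduce_by_key(partitions: List[List[Pair]],
                  func: Callable[[int, int], int]) -> Tuple[Dict[str, int], int]:
    """reduceByKey-style: combine inside each partition first, then merge."""
    shuffled: List[Pair] = []
    for part in partitions:
        local: Dict[str, int] = {}
        for key, value in part:
            # Partial (map-side) aggregation within one partition.
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())  # only the partial results are shuffled
    result: Dict[str, int] = {}
    for key, value in shuffled:
        result[key] = func(result[key], value) if key in result else value
    return result, len(shuffled)


data = [[("a", 1), ("b", 1), ("a", 1)], [("a", 1), ("b", 1), ("b", 1)]]
grouped_count = group_by_key_shuffled(data)
totals, reduced_count = reduce_by_key(data, lambda x, y: x + y)
print(grouped_count)   # 6
print(reduced_count)   # 4
print(totals)          # {'a': 3, 'b': 3}
```

On real data with many repeated keys per partition the gap is usually far larger than 6 versus 4.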

47. When you call the join operation on two pair RDDs, e.g. (K, V) and (K, W), what is the result?

Ans: When called on datasets of type (K, V) and (K, W), join returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key [68]. It is an inner join, so keys that appear in only one of the two RDDs are dropped.
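The shape of the result is easy to check with a plain-Python model of an inner join over lists of pairs. The helper name join_pairs is an assumption for illustration, and the ordering of the output is a property of this sketch only; Spark does not promise any particular order.

```python
from collections import defaultdict
from typing import Any, List, Tuple


def join_pairs(left: List[Tuple[Any, Any]],
               right: List[Tuple[Any, Any]]) -> List[Tuple[Any, Tuple[Any, Any]]]:
    """Toy inner join: (K, V) with (K, W) gives (K, (V, W)) for every match."""
    right_by_key = defaultdict(list)
    for key, w in right:
        right_by_key[key].append(w)
    # Every V is paired with every W that shares its key.
    return [(key, (v, w)) for key, v in left for w in right_by_key.get(key, [])]


left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("a", "y"), ("c", "z")]
joined = join_pairs(left, right)
print(joined)
# [('a', (1, 'x')), ('a', (1, 'y')), ('a', (3, 'x')), ('a', (3, 'y'))]
```

Keys "b" and "c" each occur on one side only, so they do not appear in the output.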

48. What is checkpointing?

Ans: Checkpointing is the process of truncating an RDD's lineage graph and saving the RDD to a reliable distributed file system (such as HDFS) or to a local file system. Reliable RDD checkpointing saves the actual intermediate RDD data to a reliable distributed file system.

You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD will be saved to a file inside the checkpoint directory (set with SparkContext.setCheckpointDir) and all references to its parent RDDs will be removed. This function has to be called before any job has been executed on this RDD.
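What "truncating the lineage" means can be shown with a toy class in plain Python. Everything here is an illustrative assumption: ToyRDD, its methods, and the rdd.json file name are invented for the sketch, and unlike Spark, which only marks the RDD and saves it when a job runs, this model writes the data immediately.

```python
import json
import os
import tempfile


class ToyRDD:
    """Toy model of an RDD: either holds data or points at a parent."""

    def __init__(self, parent=None, func=None, data=None):
        self.parent = parent
        self.func = func
        self.data = data

    def map(self, func):
        return ToyRDD(parent=self, func=func)

    def collect(self):
        if self.data is not None:
            return list(self.data)
        # Recompute from the parent by walking the lineage.
        return [self.func(x) for x in self.parent.collect()]

    def lineage_depth(self):
        return 0 if self.parent is None else 1 + self.parent.lineage_depth()

    def checkpoint(self, directory):
        path = os.path.join(directory, "rdd.json")
        with open(path, "w") as f:
            json.dump(self.collect(), f)   # save the materialized data
        with open(path) as f:
            self.data = json.load(f)       # from now on, read the saved copy
        self.parent = None                 # drop references to parent RDDs
        self.func = None


rdd = ToyRDD(data=[1, 2, 3]).map(lambda x: x * 2).map(lambda x: x + 1)
depth_before = rdd.lineage_depth()
with tempfile.TemporaryDirectory() as checkpoint_dir:
    rdd.checkpoint(checkpoint_dir)
depth_after = rdd.lineage_depth()
print(depth_before, depth_after)  # 2 0
print(rdd.collect())              # [3, 5, 7]
```

After the checkpoint the toy RDD no longer depends on its parents, so a long chain of transformations does not have to be replayed to recover it.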

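49. What is meant by dependencies in an RDD lineage graph?

Ans: A dependency is a connection between RDDs that results from applying a transformation. Dependencies are either narrow (each parent partition is used by at most one child partition, as with map) or wide (a shuffle is required, as with groupByKey).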

50. Which script will you use to launch a Spark application, rather than spark-shell?

Ans: You use the spark-submit script to launch a Spark application, i.e. to submit the application to a Spark deployment environment. The spark-shell script starts an interactive shell instead.
