Apache Spark MCQ 1

21.       How do you define an RDD?

Ans: A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. The name breaks down as follows (a minimal creation sketch follows the list):

 

·         Resilient: fault-tolerant, so missing or damaged partitions can be recomputed on node failure with the help of the RDD lineage graph.

·         Distributed: data resides on multiple nodes across a cluster.

·         Dataset: a collection of partitioned data.
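
A minimal sketch of creating an RDD, assuming a running SparkContext named sc (as in spark-shell); the data and partition count are just illustrative:

val rdd = sc.parallelize(1 to 10, numSlices = 4)   // immutable, partitioned collection
println(rdd.getNumPartitions)                      // 4 partitions spread across the cluster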

 

22.       What does lazily evaluated RDD mean?

Ans: Lazily evaluated means the data inside an RDD is not available or transformed until an action is executed, which triggers the execution.

 

23.       How would you control the number of partitions of an RDD?

Ans: You can control the number of partitions of an RDD using the repartition or coalesce operations.
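
A short sketch of both operations, assuming an existing RDD named rdd (for example the one created above); the partition counts are illustrative:

val more  = rdd.repartition(8)    // full shuffle; can increase or decrease the partition count
val fewer = rdd.coalesce(2)       // avoids a full shuffle; typically used to decrease partitions
println(more.getNumPartitions)    // 8
println(fewer.getNumPartitions)   // 2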

 

24.       What are the possible operations on an RDD?

Ans: RDDs support two kinds of operations:

·         transformations - lazy operations that return another RDD.

·         actions - operations that trigger computation and return values.

 

25.       How does an RDD help parallel job processing?

Ans: Spark executes jobs in parallel: RDDs are split into partitions that are processed and written in parallel. Inside a partition, data is processed sequentially.
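
A sketch of partition-level processing, assuming a SparkContext named sc: each partition is handled by one task, and the iterator inside a partition is consumed sequentially.

val nums = sc.parallelize(1 to 8, 4)
val tagged = nums.mapPartitionsWithIndex { (partId, iter) =>
  iter.map(x => s"partition $partId processed $x")   // sequential within the partition
}
tagged.collect().foreach(println)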

 

26.       What is a transformation?

Ans: A transformation is a lazy operation on an RDD that returns another RDD, such as map, flatMap, filter, reduceByKey, join, cogroup, etc. Transformations are not executed immediately, but only after an action has been executed.
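
A sketch of this laziness, assuming a SparkContext named sc: each transformation returns a new RDD immediately, but no job runs yet.

val lines    = sc.parallelize(Seq("a b", "c", "a"))
val words    = lines.flatMap(_.split(" "))   // transformation -> new RDD[String]
val nonEmpty = words.filter(_.nonEmpty)      // transformation -> new RDD[String]
// At this point Spark has only recorded the lineage; nothing has been computed.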

 

27.       How do you define actions?

Ans: An action is an operation that triggers execution of RDD transformations and returns a value to the Spark driver (the user program). Simply put, an action evaluates the RDD lineage graph.

 

You can think of an action as a valve: until an action is fired, the data to be processed does not even enter the pipes, i.e. the transformations. Only actions materialize the entire processing pipeline with real data.
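
A sketch of the valve idea, assuming a SparkContext named sc: the transformations only record lineage, and only the actions at the end launch a job and return values to the driver.

val counts = sc.parallelize(Seq("spark", "rdd", "spark"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)               // transformations only: no job yet
println(counts.count())             // action: triggers execution, returns a Long
counts.collect().foreach(println)   // action: results come back to the driver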

 

 

28.       How can you create an RDD for a text file?

Ans: Use SparkContext.textFile.
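
A minimal sketch, assuming a SparkContext named sc; the path is only an illustration, not taken from the original text:

val lines = sc.textFile("hdfs:///data/input.txt")   // RDD[String], one element per line
println(lines.count())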

 

 

29.       What are Preferred Locations?

Ans: A preferred location (aka locality preference or placement preference) is a block location of an HDFS file where each partition should preferably be computed.

def getPreferredLocations(split: Partition): Seq[String] specifies placement preferences for a partition in an RDD.
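
A sketch of inspecting placement preferences, assuming a SparkContext named sc and a hypothetical HDFS path: the public RDD.preferredLocations method (which delegates to getPreferredLocations) reports, for an HDFS-backed RDD, the hosts holding each partition's blocks.

val lines = sc.textFile("hdfs:///data/input.txt")
lines.partitions.foreach { p =>
  println(s"partition ${p.index} prefers ${lines.preferredLocations(p).mkString(", ")}")
}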

 

30.       What is an RDD Lineage Graph?

Ans: An RDD lineage graph (aka RDD operator graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD. An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.
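
A sketch of printing the lineage graph, assuming a SparkContext named sc: toDebugString renders the chain of parent RDDs and the transformations between them.

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)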

 
