RDD is
the acronym for Resilient Distribution Datasets – a fault-tolerant collection
of operational elements that run parallel.
The
partitioned data in RDD is immutable and distributed. There are primarily two
types of RDD:
- Parallelized Collections : The existing RDD’s running parallel with one another.
- Hadoop datasets: perform function on each file record in HDFS or other storage system.