Some utilities in Scala which can be used over RDDs:-
The first thing to note is that Scala also has its own collections --> Array, List, Seq, Set etc.
Reduce --> This reduces all the elements in a collection to a single value by repeatedly applying a binary operation such as addition or multiplication.
scala> var rdd4 = sc.parallelize(List(1,2,3,4,5,6))
rdd4: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24
scala> rdd4.reduce{(x,y)=>x*y}
res9: Int = 720
scala> var list = Array(1,2,3,4)
list: Array[Int] = Array(1, 2, 3, 4)
scala> list
res3: Array[Int] = Array(1, 2, 3, 4)
scala> list.reduce(_+_)
res4: Int = 10
scala> list.reduceLeft(_+_)
res5: Int = 10
scala> list.reduceRight(_+_)
res6: Int = 10
scala> list.reduceRight((a,b)=>{println(a+","+b);a+b})
3,4
2,7
1,9
res7: Int = 10
scala> list.foldLeft(0)(_+_)
res13: Int = 10
scala> list.foldLeft(2)(_+_)
res14: Int = 12
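Note that foldLeft starts from the given initial value (2 + 1 + 2 + 3 + 4 = 12). The same operations are available on RDDs as well; a minimal sketch, assuming the same SparkContext sc as above (rddNums is just an illustrative name):
val rddNums = sc.parallelize(List(1, 2, 3, 4))
rddNums.reduce(_ + _)   // 10, same as list.reduce(_+_)
// RDD.fold also takes an initial value, but it is applied once per partition
// and once more when merging, so it should be the identity of the operation (0 for +).
rddNums.fold(0)(_ + _)  // 10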
Persistence (caching):- Store the RDD in memory/disk so that we don't have to recompute it again and again.
import org.apache.spark.storage.StorageLevel

val result = input.map(x => x * x)
result.persist(StorageLevel.DISK_ONLY)
The storage level can be memory only, memory and disk, disk only, etc.
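For example, a minimal sketch of other storage levels (resultMem and resultMemDisk are illustrative names; an RDD can only be persisted at one storage level, so a separate RDD is used for each):
val resultMem = input.map(x => x * x)
resultMem.persist(StorageLevel.MEMORY_ONLY)          // deserialized objects in memory; same as resultMem.cache()

val resultMemDisk = input.map(x => x * x)
resultMemDisk.persist(StorageLevel.MEMORY_AND_DISK)  // partitions that don't fit in memory spill to disk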
One more question that arises from persistence is: what happens if we persist/cache too much data? Will our Spark cluster run out of memory? The answer is no, because Spark automatically evicts old partitions using a Least Recently Used (LRU) cache policy. Still, we should not cache too much data, as that can lead to eviction of useful partitions as well.
We can also remove the data from the cache by calling unpersist().
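For instance, continuing the sketch above:
result.unpersist()   // drop result's cached blocks from memory/disk; the RDD can still be recomputed from its lineage if needed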