Saturday, October 7, 2017

Spark cluster


A Spark application is launched on a set of machines using an external service called a cluster manager. As noted, Spark is packaged with a built-in cluster manager called the Standalone cluster manager. Spark also works with Hadoop YARN and Apache Mesos, two popular open source cluster managers.

Spark Tutorial - Part 2

Some utilities in Scala which can be used over RDD's:-

First thing to note is that Scala also has some collections like --> Array, List, Seq, Set etc.

Reduce --> This basically reduces all the elements in a collection to a single value by applying some operation like multiplication, addition etc.

scala> var rdd4 = sc.parallelize(List(1,2,3,4,5,6))
rdd4: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> rdd4.reduce{(x,y)=>x*y}
res9: Int = 720

scala> var list = Array(1,2,3,4)
list: Array[Int] = Array(1, 2, 3, 4)

scala> list
res3: Array[Int] = Array(1, 2, 3, 4)

scala> list.reduce(_+_)
res4: Int = 10

scala> list.reduceLeft(_+_)
res5: Int = 10

scala> list.reduceRight(_+_)
res6: Int = 10

scala> list.reduceRight((a,b)=>{println(a+","+b);a+b})
3,4
2,7
1,9

res7: Int = 10

scala> list.foldLeft(0)(_+_)
res13: Int = 10

scala> list.foldLeft(2)(_+_)

res14: Int = 12
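fold works like reduce but takes an explicit starting value (the 0 and 2 passed to foldLeft above). A small sketch of the difference, not part of the original session: fold also works on an empty collection, while reduce throws an exception on one.

val nums = List(1, 2, 3, 4)
nums.reduce(_ + _)                   // 10
nums.foldLeft(100)(_ + _)            // 110, starts counting from 100
List.empty[Int].foldLeft(0)(_ + _)   // 0
// List.empty[Int].reduce(_ + _)     // would throw java.lang.UnsupportedOperationException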


Persistence (caching):- To store the RDD in memory/disk so that we don't have to recompute the RDD again & again.

import org.apache.spark.storage.StorageLevel

val result = input.map(x => x * x)
result.persist(StorageLevel.DISK_ONLY)

The storage level can be memory only, both memory and disk, etc.

Also, one more question that arises from persistence is: what happens if we persist/cache too much data? Will our Spark cluster run out of memory? The answer is no, as Spark automatically evicts old partitions using a Least Recently Used (LRU) cache policy. Still, we should not cache too much data, as it can lead to eviction of useful data as well.

We can also remove the data from the cache by calling unpersist().
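Putting the pieces above together, here is a minimal sketch (assuming the spark-shell's sc and the storage levels from org.apache.spark.storage.StorageLevel):

import org.apache.spark.storage.StorageLevel

val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
result.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if it does not fit
result.count()                                 // first action computes the RDD and caches it
result.collect()                               // served from the cache, no recomputation
result.unpersist()                             // remove it from the cache when we are done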



Thursday, October 5, 2017

Spark Tutorial - Part 1


The ability to always recompute an RDD is actually why RDDs are called “resilient.” When a machine holding RDD data fails, Spark uses this ability to recompute the missing partitions, transparent to the user.

We can create a Spark RDD by first opening a spark-shell on EMR and then creating an RDD using the spark context in the spark-shell.

Get the spark context on EMR by typing sc in the spark-shell (the shell creates it for us).

On EMR, when we use sc.textFile("") with just a file name --> this basically searches for the file under the path /user/hadoop/ on HDFS. If the file is not there, we get an error like the one below:-



org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ip-10-162-101-19.ec2.internal:8020/user/hadoop/johri.logs

So we first need to put a file in HDFS at this location, and for this we run the hdfs copyFromLocal command. Below is the command:-

hdfs dfs -copyFromLocal  derby.log /user/hadoop/

After copying the file from the local file system into HDFS, we can check whether the file is there by running the below hdfs command:-

hdfs dfs -ls /user/hadoop/

Once the file is there, we can again open spark-shell on emr and run the below commands:-
Some common RDD operations:-


1) sc --> gives us the spark context (the spark-shell creates it for us)

2) var lines = sc.textFile("derby.log") --> this will basically create an RDD from the file on HDFS

3) lines.count() --> will count the total number of lines in the RDD






In Spark, the main thing is the RDD, and we do the following with RDDs:-

1) creating a new RDD

2) Transforming a new RDD from an existing RDD

3) Calling operations/performing actions on this new RDD



Under the hood, spark distributes the data in RDD across our cluster and performs parallel operations to process this data.



RDD --> Resilient Distributed Dataset: an immutable distributed collection of data which is split into multiple partitions that may be computed on different nodes of the cluster. We can create an RDD in two ways:-

     (a)  loading an external dataset
     (b) distributing a collection of objects (like a list) from the driver program.

Now spark has many functions which can be used. Some examples are given below:-


sc
var inputRdd = sc.textFile("derby.log") 
inputRdd.count() --> this will give total number of lines.
var osdata = inputRdd.filter(line=>line.contains("os")) --> this defines a new RDD containing only the lines with "os". Because of the lazy computation done by Spark, nothing is actually computed at this point; the work happens only when an action is called.
inputRdd.filter(line => line.contains("os")).count() --> this will give total count of lines having the word "os"
inputRdd.filter(line => line.contains("os")).first() --> this will contain the first line which has "os"
inputRdd.filter(line => line.contains("os")).take(4) --> will give an array of 4 lines having the word "os". If there are fewer matching lines, then only that smaller number of lines is returned.


RDD.persist() --> If we want to keep an RDD in the cluster memory so that we can re-use it multiple times, it makes sense to persist it. We can also persist on disk instead of memory.

To create an RDD on the fly from the spark context, we can use a function called parallelize(). An example is given below:-


var parRdd = sc.parallelize(List("Rahul","Johri","Johri123"))

and after this we can perform common RDD operations.

scala> var rdd1 = sc.parallelize(List("lion","tiger","tiger","peacock","horse"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> var rdd2 = sc.parallelize(List("lion","tiger"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> rdd1.distinct().collect()
res8: Array[String] = Array(peacock, lion, tiger, horse)

scala> rdd1.union(rdd2).collect()
res9: Array[String] = Array(lion, tiger, tiger, peacock, horse, lion, tiger)

scala> rdd1.intersection(rdd2).collect()
res10: Array[String] = Array(lion, tiger)

scala> rdd1.subtract(rdd2).collect()
res11: Array[String] = Array(peacock, horse)

scala> rdd1.cartesian(rdd2).collect()
res12: Array[(String, String)] = Array((lion,lion), (tiger,lion), (lion,tiger), (tiger,tiger), (tiger,lion), (peacock,lion), (horse,lion), (tiger,tiger), (peacock,tiger), (horse,tiger))

scala> rdd1.take(10).foreach(println)

Point to note is that in the above mentioned examples, we are using RDD operations like distinct(), union(), intersection(), subtract() and cartesian(). union(), intersection(), subtract() and cartesian() operate on two RDDs (distinct() works on a single RDD). In order to use the set-style operations (union, intersection, subtract), both RDDs should hold the same element type, like String or Integer.

The collect() function is used to retrieve the entire RDD. To use collect() on an RDD, we must make sure that the dataset is small, as collect() brings all the RDD data into a single machine's/driver's memory. So it is not suitable for a big dataset.

Functions to save an RDD are ==> saveAsTextFile() and saveAsSequenceFile()
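For example (a rough sketch; the output paths here are made up), saveAsTextFile() works on any RDD, while saveAsSequenceFile() is available on key-value (pair) RDDs:

val words = sc.parallelize(List("lion", "tiger", "horse"))
words.saveAsTextFile("/user/hadoop/words_out")            // writes one text file per partition

val pairs = sc.parallelize(List(("lion", 1), ("tiger", 2)))
pairs.saveAsSequenceFile("/user/hadoop/pairs_out")        // needs a key-value RDD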






Example of MAP:-
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))

We have used the mkString() function in the above example. This is more of a Scala function which is used to convert a Scala array/list/sequence into a string. mkString takes a separator as a parameter, which can be a new line, comma, space etc., and can also take a prefix and suffix. It also converts an integer array into a string. For more info refer to -->
https://alvinalexander.com/scala/scala-convert-array-to-string-mkstring
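A quick illustration in plain Scala (just a sketch of the variants):

val arr = Array(1, 2, 3, 4)
arr.mkString(",")              // "1,2,3,4"
arr.mkString("[", ",", "]")    // "[1,2,3,4]" with a prefix and suffix
arr.mkString("\n")             // each element on its own line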


Spark Map and Flat Map operation:-

Both of these are RDD operations and basically compute on every single element in the RDD. The difference is that map takes every single element in the RDD, performs its operation and gives exactly 1 output element. So mapping is one to one.

flatMap, on the other hand, does the same thing as map, but for every single element it can produce 0, 1 or multiple output elements.

(Diagrams: Apache Spark map transformation operation and Apache Spark flatMap transformation operation.)





scala> var rdd1 = sc.parallelize(List("Rahul","Johri","is","james","bond"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> var fltmp = rdd1.flatMap(line=>line.split(" "))
fltmp: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at flatMap at <console>:26

scala> fltmp.first()
res3: String = Rahul

scala> var rdd2 = sc.parallelize(List("Rahul Rajiv","Johri","is","james","bond"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> var fltmp2 = rdd2.flatMap(line=>line.split(" "))
fltmp2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:26

scala> fltmp2.first()
res4: String = Rahul

scala> var fltmp3 = rdd2.map(line=>line.split(" "))
fltmp3: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[4] at map at <console>:26

scala> fltmp3.first()
res5: Array[String] = Array(Rahul, Rajiv)

scala> fltmp3.take(2)
res6: Array[Array[String]] = Array(Array(Rahul, Rajiv), Array(Johri))

scala> fltmp2.take(10)
res7: Array[String] = Array(Rahul, Rajiv, Johri, is, james, bond)

scala> fltmp3.take(10)
res8: Array[Array[String]] = Array(Array(Rahul, Rajiv), Array(Johri), Array(is), Array(james), Array(bond))




Sunday, May 14, 2017

Tower of hanoi

class TowerOfHanoi{

    public static void main(String args[]){
        TowerOfHanoi obj = new TowerOfHanoi();
        int numOfDisk = 4;
        obj.solveProb(numOfDisk, 'A', 'B', 'C');
    }

    // Move num disks from peg 'start' to peg 'end', using peg 'interim' as the spare
    public void solveProb(int num, char start, char interim, char end){
        if(num == 1)
            System.out.println("Disk 1 from " + start + " to " + end);
        else{
            solveProb(num - 1, start, end, interim);   // move the top num-1 disks onto the spare peg
            System.out.println("Disk " + num + " from " + start + " to " + end);
            solveProb(num - 1, interim, start, end);   // move those num-1 disks from the spare onto the target
        }
    }

}


Friday, February 3, 2017

Open Addressing in Hash Collision

Linear Probing:-

Here we find the index from the hash function. If that index is already occupied, then we move sequentially to the next available vacant index. While searching for the next vacant index, if we come across a cell storing the same value that we are trying to store, we stop the operation right there. This is all we need to be careful about when doing insertion or search operations using linear probing.

Deletion is a special case here. Suppose we delete a value from a specific index; we then need to perform one more step where we continue scanning the table from the deletion point, and if we come across a value whose hash-function value is equal to or less than the hash value of the deleted value, we move that value into the slot that was just emptied. Without a step like this (or a special "deleted" marker), later searches could stop too early at the empty slot.
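Below is a small sketch of linear-probing insert and search in Scala, just to make the probing idea concrete. It is illustrative only (a tiny fixed-size table, no resizing and no deletion); all the names are made up.

val SIZE  = 10
val EMPTY = Int.MinValue                          // marker for a vacant slot
val table = Array.fill(SIZE)(EMPTY)

def hash(key: Int): Int = math.abs(key) % SIZE

def insert(key: Int): Unit = {
  var i = hash(key)
  while (table(i) != EMPTY && table(i) != key)    // step to the next slot until vacant (or the same key is found)
    i = (i + 1) % SIZE
  table(i) = key                                  // assumes the table never fills up completely
}

def search(key: Int): Boolean = {
  var i = hash(key)
  var probes = 0
  while (table(i) != EMPTY && probes < SIZE) {    // stop at a vacant slot or after a full sweep
    if (table(i) == key) return true
    i = (i + 1) % SIZE
    probes += 1
  }
  false
}

insert(7); insert(17); insert(27)   // all three hash to index 7, so they land in slots 7, 8 and 9
search(17)                          // true
search(37)                          // false, the probe stops at the first vacant slot after the cluster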

Saturday, January 28, 2017

Hashing & Hashtable

HashTable:-

A hash table is faster than a tree for search and insertion (and sometimes for deletion too).

The time complexity for these operations is constant, O(1), no matter how many data items there are.

Disadvantages of hash table:-

1) Hash tables are based on arrays. Arrays are difficult to expand in case we do not know the count of the input in advance.
2) Sorting the data in a hash table is not very convenient.

Basically, a hash table is a table where each value has a key, or we can say an id. This id can be treated as the address of that value in a given data structure like an array.

So if we are using an array to create our hash table, the array index can be used as the id/key. Now, we can use a function/method to calculate the key for a given value to be stored in the array acting as the hash table.


Hashing:-

What prompted hashing:-

Though many scenarios prompted hashing, one of the examples we can use is below:-

Say we want to create a dictionary of 50,000 words in computer memory. To do so, we could try the following:-

Approach 1 :-  (Not a good approach)

Create an array of length 50,000 and store every word in a cell. But the problem will be: how will we know which word lives at which index? So searching for a word will become very slow.

Approach 2 :- (Not a good approach)

We know that we generally follow different flavors of character encoding standards like ASCII, UTF-8 etc. In these character encoding standards, we basically represent each character with a number. With this approach we can sum up the numbers of the characters of a word and use that sum as the index value at which the word is stored. But that will create problems for anagrams, i.e. words made up of the same letters in a different order (like "listen" and "silent"), as they will have the same index value when stored in the array.

Approach 3 or Hashing :- (Good approach)

In hashing, we basically convert a number from a large range into a number in a small range. The smaller range corresponds to the index numbers of an array. The function performing this operation of converting a large range into a small range is called a hash function.

We can achieve this by doing the following:-

To squeeze a large range into a number in a small range, we can use the modulo operator (%), which finds the remainder when one number is divided by another. Let's understand this with an example:-

Large range :- 0 to 199
Small range :- 0 to 9

Now, size/length of small range is 10 (0,1,2,3,4,5,6,7,8,9)

Use the below formula to find the small number for a number from the large range, say 157:

small number = (large number) % (size/length of small range)
             = 157 % 10
             = 7



Now we can store the word corresponding to 157 at index 7 in an array with a size/limit of 10 (0 to 9).

Similarly, we can compute an index into the small-range array for each of the 50,000 dictionary words, which form a large range.

So, in this case, the above formula, small number = (large number) % (size/length of small range), can be called a hash function, where we compute the array index in the form of the small number.
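The same formula as a one-line Scala function (a sketch; hashIndex is just an illustrative name):

def hashIndex(largeNumber: Int, tableSize: Int): Int = largeNumber % tableSize

hashIndex(157, 10)   // 7, so the word mapped to 157 goes into index 7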

HashTable :- An array into which data is inserted using a hash function is called a hash table.




Thursday, January 26, 2017

Red Black Trees

Red-Black Trees:-

In continuation of binary search trees: searching, deleting and inserting are very quick in a BST, given that the BST is balanced or somewhat balanced. But the problem comes when the data to be inserted is sorted, either in ascending or descending order.

For example :- {3,5,7,8,9,12,28,30} -- asc sorted
               {30,28,12,9,8,7,5,3} -- desc sorted

When we try to create a BST out of sorted data, the tree becomes unbalanced and grows only in the left direction if the data is desc sorted, or only in the right direction if asc sorted. So basically, it is safe to say that it effectively becomes a linked list, and a linked list comes with its own issues.

Searching, insertion & deletion become O(N) instead of O(LogN).

So for a BST or BT (binary tree) to remain balanced, it is important that data insertion happens in random order instead of the data being sorted. Though even random insertion can make a BST/BT unbalanced if a sequence of input data in between is in sequential order -- for example, in {6,4,7,5,2,8,9,10,11,34,23,55,43} the run 8,9,10,11 will make the tree lean towards the right side to some extent.

Now, to prevent a BST/BT from becoming unbalanced at the time of insertion, there are certain techniques available while doing the insert, and the red-black tree is one such technique.

An RB-tree is basically a set of rules which should be followed while we are doing insert operations on a BST/BT. These rules ensure that the tree remains balanced.

1) Every node is either red or black
2) The root node is always black
3) If a node is red, its children must be black
4) Every path from the root to a leaf or to a null child must contain the same number of black nodes.
5) Every nil node is always black

Some light on the above rules:-

 (a) A node being red/black is basically a way to conceptually understand the technique. To identify a node as red/black, we can add a boolean flag in the Node class and set it to true or false for red/black (see the sketch just after this list).
 (b) The "null child" referred to in rule 4 is basically a potential child of a non-leaf node. For example, a potential left child of a non-leaf node that only has a right child, or the other way round.
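For illustration only, the boolean-flag idea from (a) could look something like this sketch (field names are made up):

class RBNode(var key: Int) {
  var isRed: Boolean = true    // newly inserted nodes are typically colored red
  var left: RBNode  = null
  var right: RBNode = null     // a missing (nil) child counts as black, per rule 5
}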

Black Height :- The number of black nodes encountered while traversing from the root to a leaf. Reinforcing rule 4, the black height from the root to every leaf should be the same.


Duplicate Keys :- 

In case we come across a sequence like 50,50,50 while performing inserts, these duplicate records will be inserted in the following order:-

1) Insert first node
2) Insert first duplicate as right child of the first node.
3) Insert second duplicate as left child of the first node.




Tuesday, January 24, 2017

Time Complexity of Algorithms (Big O also covered)

Time complexity of an algorithm is the time taken by the algorithm to complete its work. We need a way (a metric) to define/express the time complexity of an algorithm. For example, to express the weight of something, we use Kg/Lbs. In the world of algorithms, we use Big O to express time complexity.

When we define time complexity for any algorithm, we take into account worst-case time complexity and not best-case time complexity. 

Worst-case time complexity :- Maximum time taken by an algorithm for any input size.

Best-case time complexity :- Minimum time taken by an algorithm for any input size.

How to calculate time complexity ?

As noted above, we will be using Big O notation as the metric for expressing time complexity.

Big O -- What is that ?

Big O for an algorithm describes how the algorithm responds (in time taken) to different kinds of inputs, or in other words how much slower the algorithm gets when the input becomes very large.

Note:-
An algorithm is =>

(a) Linear -- doing something with every item once (one dimension).
(b) Quadratic -- doing something with every item against every other item (two dimensions, i.e. nested loops).
(c) Logarithmic -- dividing the items in half in every iteration.



Note:- We ignore constants in Big O

Lets jump on to coding examples to figure out Big O

Example 1 :-

public void printText (int N){

    System.out.println("We are calculating big o");   // This line is also example 2 (O(1))

    for(int i=0; i<N; i++){
        System.out.println(i);
    }

}

In the above program, System.out.println("We are calculating big o"); is a statement which is constant and will not change based on the input size. So we can ignore this statement, and statements/variables like it, when determining Big O and treat them as constants.

Time complexity for the for loop will be linear, which means the time taken by the for loop is directly proportional to the size of the input, which is N in this case.

When we say linear, that means O(N) or Order of N. The word linear comes from the graphical representation of O(N): plotting time against input size gives a straight line.

Example 2 :-

Consider the below example:-

public int printText (int N){
     
     return N;
}

Time complexity or Big O for this program will be O(1). The reason is that irrespective of how large the input is, the time taken will always be the same.


O(1) is constant time, which is also considered the best-case scenario.



Example 3 :-

Consider the below example:-

public void performOperation(int N){

    for(int i=0 ; i<N ; i++){
        for(int j=0 ; j<N ; j++){
            System.out.println("printing the value");
        }
    }
}


In the above case, there are two nested for loops, each iterating N times for the given input. So the time complexity in this case is quadratic; O(n^2) is the Big O here.

Below is the comparison graph for the above three examples:-




Example 4 :-

Consider the below example:-

public boolean performDivideAndConquer(int[] list, int target){
    int low = 0;
    int high = list.length - 1;
    while(low <= high){
        int mid = (low + high) / 2;          // look at the middle of the remaining range
        if (target < list[mid])
            high = mid - 1;                  // discard the upper half
        else if (target > list[mid])
            low = mid + 1;                   // discard the lower half
        else
            return true;                     // found the target
    }
    return false;
}

The above algorithm breaks the given input into halves to search for a particular value, like in any divide and conquer type of algorithm. The time complexity of this algorithm is logarithmic because, instead of growing by a factor of 2, the input is divided in half with each iteration.

Big O for this algorithm will be Log2 (N) or just Log (N)




Example 5 :-

Example 4 has a time complexity of Log(N). What if the program in Example 4 is itself repeated N times? In that case the time complexity of such an algorithm will be N * Log(N).

The running time consists of N loops (iterative or recursive), each doing a logarithmic amount of work, so the algorithm is a combination of linear and logarithmic.
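A small sketch of this pattern (written in Scala for brevity; countFound and the other names are made up): running the logarithmic search from Example 4 once for each of N query values gives N * Log(N) work overall.

// assumes `sorted` is a sorted array, as binary search requires
def binarySearch(sorted: Array[Int], target: Int): Boolean = {
  var low = 0
  var high = sorted.length - 1
  while (low <= high) {
    val mid = (low + high) / 2
    if (target < sorted(mid)) high = mid - 1
    else if (target > sorted(mid)) low = mid + 1
    else return true
  }
  false
}

def countFound(sorted: Array[Int], queries: Array[Int]): Int =
  queries.count(q => binarySearch(sorted, q))   // N searches, each O(log N), so O(N log N) in total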

















Monday, January 23, 2017

Logarithms


In day to day life, we often say that something grows exponentially. Say we invest some money somewhere; it grows exponentially every year. Growing exponentially can mean the money becomes double, triple, etc.

For example, 2 raised to the 4th power equals 16. This is expressed by the exponential equation 2^4 = 16.

2*2*2*2 = 16

What if the opposite happens. Opposite of exponential growth is logarithmic growth.

In plain terms, if we keep multiplying something X times it grows exponentially, and if we keep dividing something X times it shrinks logarithmically.

2^4 = 16    <=>    log2(16) = 4


Both equations describe the same relationship between the numbers 2, 4 and 16, where 2 is the base, 4 is the exponent, and 16 is the power.
The difference is that while the exponential form isolates the power, 16, the logarithmic form isolates the exponent, 4.
Here are more examples of equivalent logarithmic and exponential equations.

Logarithmic form      Exponential form
log2(8) = 3           2^3 = 8
log3(81) = 4          3^4 = 81
log5(25) = 2          5^2 = 25

Definition of a logarithm

Generalizing the examples above leads us to the formal definition of a logarithm. 
logb(a) = c    <=>    b^c = a

Both equations describe the same relationship between a, b and c:
  • b is the base,
  • c is the exponent, and
  • a is the power.

For example, log2(16) = 4 because 2^4 = 16.