Saturday, September 28, 2019

Binary search simplified

class Main {
public static void main(String[] args) {

// java signed and unsigned right shift
int test = Integer.MAX_VALUE;
System.out.println(test);
int t1 = test + 1;
System.out.println(t1);
int t2 = (test + test) >>> 1;
System.out.println(t2);
int t3 = (test + test) >> 1;
System.out.println(t3);

int[] a = {3,3,3,20,20,20};
int k = 3;

k = 3;
System.out.println("any(" + k + ")" + any(a,k));

k = 4;
System.out.println("any(" + k + ")" + any(a,k));

k = 20;
System.out.println("any(" + k + ")" + any(a,k));

k = 3;
System.out.println("firstGte(" + k + ")" + firstGte(a,k));

k = 20;
System.out.println("firstGte(" + k + ")" + firstGte(a,k));

k = 19;
System.out.println("firstGte(" + k + ")" + firstGte(a,k));

k = 21;
System.out.println("firstGte(" + k + ")" + firstGte(a,k));

k = 3;
System.out.println("firstGt(" + k + ")" + firstGt(a,k));

k = 20;
System.out.println("firstGt(" + k + ")" + firstGt(a,k));

k = 19;
System.out.println("firstGt(" + k + ")" + firstGt(a,k));

k = 3;
System.out.println("lastLte(" + k + ")" + lastLte(a,k));

k = 20;
System.out.println("lastLte(" + k + ")" + lastLte(a,k));

k = 19;
System.out.println("lastLte(" + k + ")" + lastLte(a,k));

k = 3;
System.out.println("lastLt(" + k + ")" + lastLt(a,k));

k = 20;
System.out.println("lastLt(" + k + ")" + lastLt(a,k));

k = 19;
System.out.println("lastLt(" + k + ")" + lastLt(a,k));

k = 21;
System.out.println("lastLt(" + k + ")" + lastLt(a,k));
}

private static Integer lastLt(int[] a, int k) {
// last LT means the element just at the left of k(or virtual k)
int l = -1;
int r = a.length;

Integer ans = null;
while(r > l + 1) { // r - l will overflow if r = Int.MAX and l = -1
int m = (r+l) >>> 1; // r + l will overflow if r = Int.Max and l > 0
//System.out.println("in " + l +" " + r +" " +m);
if(a[m] < k) {
l = m;
} else {
r = m;
}
//System.out.println("out " + l +" " + r +" " +m);
}
return l;
}

private static Integer lastLte(int[] a, int k) {
// last LTE means (the last k) or (element just left of virtual k)
int l = -1;
int r = a.length;

Integer ans = null;
while(r > l + 1) { // r - l will overflow if r = Int.MAX and l = -1
int m = (r+l) >>> 1; // r + l will overflow if r = Int.Max and l > 0
//System.out.println("in " + l +" " + r +" " +m);
if(a[m] <= k) {
l = m;
} else {
r = m;
}
//System.out.println("out " + l +" " + r +" " +m);
}
return l;
}

private static Integer firstGt(int[] a, int k) {
//firstGt means the element just right of k (or virtual k)
int l = -1;
int r = a.length;

Integer ans = null;
while(r > l + 1) { // r - l will overflow if r = Int.MAX and l = -1
int m = (r+l) >>> 1; // r + l will overflow if r = Int.Max and l > 0
//System.out.println("in " + l +" " + r +" " +m);
if(a[m] > k) {
r = m;
} else {
l = m;
}
//System.out.println("out " + l +" " + r +" " +m);
}
return r;
}

private static Integer firstGte(int[] a, int k) {
//firstGte means (the first k) or (the element just right of virtual k)
int l = -1;
int r = a.length;

//exact match
Integer ans = null;
while(r > l + 1) { // r - l will overflow if r = Int.MAX and l = -1
int m = (r+l) >>> 1; // r + l will overflow if r = Int.Max and l > 0
//System.out.println("in " + l +" " + r +" " +m);
if(a[m] >= k) {
r = m;
} else {
l = m;
}
//System.out.println("out " + l +" " + r +" " +m);
}
return r;
}

private static Integer any(int[] a, int k) {
int l = -1;
int r = a.length;

//exact match
Integer ans = null;
while(r > l + 1) { // r - l will overflow if r = Int.MAX and l = -1
int m = (r+l) >>> 1; // r + l will overflow if r = Int.Max and l > 0
//System.out.println("in " + l +" " + r +" " +m);
if(a[m] == k) {
ans = m; break;
}
else if(a[m] > k) {
r = m;
} else {
l = m;
}
//System.out.println("out " + l +" " + r +" " +m);
}
return ans;
}
}

Thursday, September 19, 2019

MapReduce paper Notes

Paper is here.

MR programming model - Map - written by the user - takes an input key/value pair (file_name/file_contents) and emits intermediate pairs like (word, 1) for Wordcount. MR library combines all intermediate pairs and gives them to a reduce task which gives the final output for that intermediate key.

Handling master failure - just abort the MR task since there is only one master and hence failure is unlikely. Other option is to periodically save state of the master and pick up from there.

Locality - to preserve bandwidth, the MR master tries to schedule map task on a machine having the copy of the data or near to a replica of the data.

Backup tasks for Stragglers - When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes.

Custom partitioning - default is hash(key) mod R, at times it's not enough. For e.g. if it's web links and we want all links with same hostname in same output, then hash(hostname(key)) mod R is right hashing approach.

Combiner function - after the Map task is executed a big payload is sent over the network to the reducer. To reduce the size we can combine the intermediate keys which come out from the map. Reduce and Combine will most likely have the same code - just that output of Reduce will be written to an output file whereas output of the Combiner will be sent over the network to Reduce.

Debugging - there is a way to sequentially execute the MR tasks to debug them better.

Counters - there are ways to keep counts of specific things. They are sent from worker machines to master (piggybacking on the ping response).

Implementing Sort - TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1.

Some numbers - Cluster of 1800 machines with each one having two 2GHz Intel Xeon processors with HyperThreading enabled, 4GB of memory, two 160GB IDE disks and a gigabit Ethernet link.
Grep which scanned through 10^10 100 bytes records (1TB in total) took 150 seconds.
Sorting was around 1000 seconds.

Advantages - 
Code to deal with fault tolerance/distribution and parallelization is removed from the core task code.
Conceptually simpler.
MapReduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault-tolerance.

















Thursday, September 5, 2019

CAP theorem

http://ksat.me/a-plain-english-introduction-to-cap-theorem/ 

CA => Consistent and available

C => same state
A => every request should receive a response
P => any number of messages in intercom can be lost

If
C => same state
AND
A => every request should receive a response, so every write request should be Success or Failure
THEN
!P => since you can't lose intercom messages

CP
If
C => same state
AND
P => any number of messages in intercom can be lost
THEN
!A => can't respond to write requests since there is no assurance of being C as the messages are lost

AP
A=> take write requests and respond
AND
P => any number of messages in intercom can be lost
THEN
!C => in absence of communication they will be in different states.


Blog Archive