Monday, October 28, 2019

mathematical mindset notes


Books by Keith Devlin
A mathematician grappling with his century
Flannery 2002 - how the puzzles with her dad helped her math


KENKEN puzzles
Youcubed.org -> number sense/ Fluency without fear/
Mark Driscoll's books, Ruth Parker's mathematics problems, 2 curricula from England 1) Points of Departure 2) SMILE (Secondary mathematics individualized learning experience)
Multilink cubes
Fencing: given 36 fence segments (say of unit length), which shape encloses the largest area? How does sin(x) help? Each side of a regular 36-gon subtends 10 degrees at the centre, hence sin(10); the limiting shape is a circle (a quick numeric check follows below).
Book ref - What's math got to do with it?
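
A quick numeric check of the fencing question, assuming 36 unit-length segments (the exact length doesn't matter for the comparison); the regular 36-gon already gets within about 0.2% of the circle's area:

import math

def regular_polygon_area(n_sides, side_len=1.0):
    # area of a regular n-gon with the given side length
    return n_sides * side_len ** 2 / (4 * math.tan(math.pi / n_sides))

def circle_area(circumference):
    # the circle is the limiting case as the number of sides grows
    r = circumference / (2 * math.pi)
    return math.pi * r ** 2

print(regular_polygon_area(36))  # ~102.9
print(circle_area(36))           # ~103.1, the upper bound for a fixed perimeter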

Don't seem useful: apps and games (all seem to be paid) - Motion Math, Number Rack (Math Learning Center), mathbreakers.com, Wuzzit Trouble.

Thursday, October 17, 2019

biggest streak of good numbers

Given a list of bad numbers and a range [lo, hi], find the biggest streak of consecutive good numbers.

class Solution {
    public int driver() {
        int[] bad = new int[]{1,15,27};
        int lo = 27;
        int hi = 30;
        return bigStreak(bad, lo, hi);
    }
   
    private int bigStreak(int[] bad, int lo, int hi) {
        int max = 0;
        int i = 0;
        for(; i<bad.length-1; i++) {
            int curr = 0;  
            if (hi < bad[i]) {
                   curr = hi - lo + 1;
                   if(curr > max) max = curr;
                   break;
            } else if (hi == bad[i]) {
                curr = hi - lo;
                if (curr > max) max = curr;
                break;
            } else if (hi > bad[i] && hi <= bad[i + 1]) {
                if (lo < bad[i]) {
                    int s1 = bad[i] - 1 - lo + 1;
                    int s2 = hi - (bad[i] + 1) + 1;
                    if (hi == bad[i + 1]) --s2;
                    curr = Math.max(s1, s2);
                } else if (lo == bad[i]) {
                    curr = hi - (bad[i] + 1) + 1;
                    if (hi == bad[i + 1]) --curr;
                } else if (lo > bad[i]) {
                    curr = hi - lo + 1;
                    if (hi == bad[i + 1]) --curr;
                }
                if (curr > max) max = curr;
                break;
            } else if (hi > bad[i + 1]){
                if (lo <= bad[i]) {
                    int s1 = bad[i] - lo;
                    int s2 = bad[i + 1] - bad[i] - 1;
                    curr = Math.max(s1, s2);
                    lo = bad[i + 1] + 1;
                } else if (lo > bad[i] && lo <= bad[i + 1]) {
                    curr = bad[i + 1] - lo;
                    lo = bad[i + 1] + 1;
                } else if ( lo > bad[i + 1]) {
                   ;
                }
                if (curr > max) max = curr;
            }
        }
       
        if ( i == bad.length - 1) {
            int curr = hi - lo + 1;
            if ( curr > max ) max = curr;
        }
        return max;
    }
}
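
The Java above enumerates the cases explicitly. A rough Python sketch of the same idea (not the code I'd ship, just the shortcut): keep only the bad numbers inside [lo, hi] and measure the gaps between consecutive ones, treating lo - 1 and hi + 1 as virtual bad numbers.

def big_streak(bad, lo, hi):
    # keep only bad numbers inside [lo, hi]; the answer is the largest gap
    # between consecutive bad numbers (with lo - 1 and hi + 1 as sentinels)
    points = sorted(b for b in bad if lo <= b <= hi)
    best, prev = 0, lo - 1
    for b in points + [hi + 1]:
        best = max(best, b - prev - 1)
        prev = b
    return best

print(big_streak([1, 15, 27], 27, 30))  # 3 -> the good numbers 28, 29, 30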

Tuesday, October 15, 2019

hotstar peak handling how

1. The WC 2019 IND vs NZ semifinal resulted in 25.3M concurrent users, using around 10 Tbps of bandwidth, which is about 70% of India's capacity.
2. Scale proactively - you can't scale in real time. Autoscaling sometimes results in insufficient-capacity errors.
3. Graceful degradation - turn off less critical services when there is too much traffic. Reduce video quality for users. If the auth system is having issues (e.g. the backend DB is slow), allow users to watch without logging in.
4. Peak traffic shift - as soon as Dhoni got out people went to the home screen. Home screen calls a lot of APIs - recommendation/watch history/general content.
5. Push notifications result in traffic peaks.
6. Prewarm infra before match time.
7. No dumb clients - clients should add jitter to their requests so as not to overwhelm the backend when there are issues (a rough retry sketch follows this list).
8. Reject early - drive away a lot of spurious traffic - don't let it come to your backend.
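
A rough sketch of point 7, assuming a generic request_fn; this is exponential backoff with full jitter, not Hotstar's actual client code:

import random
import time

def call_with_jitter(request_fn, max_retries=5, base_delay=0.5, cap=30.0):
    # exponential backoff with full jitter: each client sleeps a random
    # fraction of the backoff window so retries don't arrive in lockstep
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))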


Video

Blog

Facebook live streaming - split the input feed into smaller chunks and deliver them via a hierarchy of cache servers. Don't let too many requests reach the origin server: the first request is a cache miss, and the rest are held until the first one comes back.

Monday, October 14, 2019

reservoir sampling

good answer by sumitra narayan

Algorithm R from wikipedia:

(* S has items to sample, R will contain the result *)
ReservoirSample(S[1..n], R[1..k])
// fill the reservoir array
for i = 1 to k
   R[i] := S[i]
// replace elements with gradually decreasing probability
for i = k+1 to n
   j := random(1, i) // important: inclusive range
   if j <= k R[j] := S[i]
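
A direct Python translation of the pseudocode above (0-indexed, so the random range becomes [0, i]):

import random

def reservoir_sample(stream, k):
    # fill the reservoir with the first k items, then replace items with
    # gradually decreasing probability so each element survives with prob k/n
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)  # inclusive range, as in the pseudocode
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1, 101), 10))  # 10 items drawn uniformly from 1..100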

String search algorithms - Naive vs Rabin Karp vs KMP

The naive algorithm is O(n) on average, since the probability of a random single-character match is low (1/26), and even lower for two consecutive characters (1/26^2). For a 1-million-character text and a 1000-character word that's about 1.04 million comparisons in the average case. The worst case is O(m*n), so for the same example about 1 billion comparisons.

Rabin-Karp - used in plagiarism detection. Uses a rolling hash - e.g. if the text is ABCD and we want to search for a word of size 3, we first compare the hash of ABC, which is a*7^0 + b*7^1 + c*7^2, against the hash of the word. If it doesn't match, we roll forward: 1. drop A => prev_hash - a*7^0; 2. divide the remainder by 7 => (prev_hash - a*7^0)/7; 3. add D => (prev_hash - a*7^0)/7 + d*7^2. Worst case O(m*n).
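
A small Python sketch of that exact divide-by-the-base rolling scheme (a production Rabin-Karp would use modular arithmetic; character codes and base 7 here are just illustrative):

def window_hashes(text, m, base=7):
    # hash of t[i..i+m-1] = t[i]*base^0 + t[i+1]*base^1 + ... + t[i+m-1]*base^(m-1)
    vals = [ord(c) for c in text]
    h = sum(v * base ** p for p, v in enumerate(vals[:m]))
    hashes = [h]
    for i in range(m, len(vals)):
        # drop the leftmost char, divide by the base, add the new char
        h = (h - vals[i - m]) // base + vals[i] * base ** (m - 1)
        hashes.append(h)
    return hashes

def rabin_karp(text, word, base=7):
    m = len(word)
    target = sum(ord(c) * base ** p for p, c in enumerate(word))
    # verify char by char on a hash hit to rule out collisions
    return [i for i, h in enumerate(window_hashes(text, m, base))
            if h == target and text[i:i + m] == word]

print(rabin_karp("ABCDBCD", "BCD"))  # [1, 4]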

KMP - Knuth-Morris-Pratt - uses a pre-computed table to decide where to resume the search after a mismatch. Worst case O(n + m): O(m) to build the table, O(n) to scan the text.
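
A sketch of that pre-computed table (the prefix function) and the search loop; table[i] is the length of the longest proper prefix of word[:i+1] that is also its suffix:

def kmp_table(word):
    table = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k > 0 and word[i] != word[k]:
            k = table[k - 1]          # fall back to the next shorter border
        if word[i] == word[k]:
            k += 1
        table[i] = k
    return table

def kmp_search(text, word):
    table, k, hits = kmp_table(word), 0, []
    for i, ch in enumerate(text):
        while k > 0 and ch != word[k]:
            k = table[k - 1]          # resume from the table instead of restarting
        if ch == word[k]:
            k += 1
        if k == len(word):
            hits.append(i - k + 1)
            k = table[k - 1]
    return hits

print(kmp_search("ababcabcab", "abcab"))  # [2, 5]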

Saturday, September 28, 2019

Binary search simplified

class Main {
    public static void main(String[] args) {

        // java signed and unsigned right shift
        int test = Integer.MAX_VALUE;
        System.out.println(test);
        int t1 = test + 1;
        System.out.println(t1);
        int t2 = (test + test) >>> 1;
        System.out.println(t2);
        int t3 = (test + test) >> 1;
        System.out.println(t3);

        int[] a = {3,3,3,20,20,20};
        int k = 3;

        k = 3;
        System.out.println("any(" + k + ")" + any(a,k));

        k = 4;
        System.out.println("any(" + k + ")" + any(a,k));

        k = 20;
        System.out.println("any(" + k + ")" + any(a,k));

        k = 3;
        System.out.println("firstGte(" + k + ")" + firstGte(a,k));

        k = 20;
        System.out.println("firstGte(" + k + ")" + firstGte(a,k));

        k = 19;
        System.out.println("firstGte(" + k + ")" + firstGte(a,k));

        k = 21;
        System.out.println("firstGte(" + k + ")" + firstGte(a,k));

        k = 3;
        System.out.println("firstGt(" + k + ")" + firstGt(a,k));

        k = 20;
        System.out.println("firstGt(" + k + ")" + firstGt(a,k));

        k = 19;
        System.out.println("firstGt(" + k + ")" + firstGt(a,k));

        k = 3;
        System.out.println("lastLte(" + k + ")" + lastLte(a,k));

        k = 20;
        System.out.println("lastLte(" + k + ")" + lastLte(a,k));

        k = 19;
        System.out.println("lastLte(" + k + ")" + lastLte(a,k));

        k = 3;
        System.out.println("lastLt(" + k + ")" + lastLt(a,k));

        k = 20;
        System.out.println("lastLt(" + k + ")" + lastLt(a,k));

        k = 19;
        System.out.println("lastLt(" + k + ")" + lastLt(a,k));

        k = 21;
        System.out.println("lastLt(" + k + ")" + lastLt(a,k));
    }

    private static Integer lastLt(int[] a, int k) {
        // last LT means the element just at the left of k (or virtual k)
        int l = -1;
        int r = a.length;

        while (r > l + 1) { // r - l will overflow if r = Int.MAX and l = -1
            int m = (r + l) >>> 1; // r + l will overflow if r = Int.MAX and l > 0
            //System.out.println("in " + l + " " + r + " " + m);
            if (a[m] < k) {
                l = m;
            } else {
                r = m;
            }
            //System.out.println("out " + l + " " + r + " " + m);
        }
        return l;
    }

    private static Integer lastLte(int[] a, int k) {
        // last LTE means (the last k) or (element just left of virtual k)
        int l = -1;
        int r = a.length;

        while (r > l + 1) { // r - l will overflow if r = Int.MAX and l = -1
            int m = (r + l) >>> 1; // r + l will overflow if r = Int.MAX and l > 0
            //System.out.println("in " + l + " " + r + " " + m);
            if (a[m] <= k) {
                l = m;
            } else {
                r = m;
            }
            //System.out.println("out " + l + " " + r + " " + m);
        }
        return l;
    }

    private static Integer firstGt(int[] a, int k) {
        // firstGt means the element just right of k (or virtual k)
        int l = -1;
        int r = a.length;

        while (r > l + 1) { // r - l will overflow if r = Int.MAX and l = -1
            int m = (r + l) >>> 1; // r + l will overflow if r = Int.MAX and l > 0
            //System.out.println("in " + l + " " + r + " " + m);
            if (a[m] > k) {
                r = m;
            } else {
                l = m;
            }
            //System.out.println("out " + l + " " + r + " " + m);
        }
        return r;
    }

    private static Integer firstGte(int[] a, int k) {
        // firstGte means (the first k) or (the element just right of virtual k)
        int l = -1;
        int r = a.length;

        while (r > l + 1) { // r - l will overflow if r = Int.MAX and l = -1
            int m = (r + l) >>> 1; // r + l will overflow if r = Int.MAX and l > 0
            //System.out.println("in " + l + " " + r + " " + m);
            if (a[m] >= k) {
                r = m;
            } else {
                l = m;
            }
            //System.out.println("out " + l + " " + r + " " + m);
        }
        return r;
    }

    private static Integer any(int[] a, int k) {
        int l = -1;
        int r = a.length;

        // exact match
        Integer ans = null;
        while (r > l + 1) { // r - l will overflow if r = Int.MAX and l = -1
            int m = (r + l) >>> 1; // r + l will overflow if r = Int.MAX and l > 0
            //System.out.println("in " + l + " " + r + " " + m);
            if (a[m] == k) {
                ans = m; break;
            } else if (a[m] > k) {
                r = m;
            } else {
                l = m;
            }
            //System.out.println("out " + l + " " + r + " " + m);
        }
        return ans;
    }
}

Thursday, September 19, 2019

MapReduce paper Notes

Paper is here.

MR programming model - Map, written by the user, takes an input key/value pair (e.g. file_name/file_contents) and emits intermediate pairs such as (word, 1) for word count. The MR library groups all intermediate values by key and passes each group to a reduce task, which produces the final output for that intermediate key.
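
A toy single-process sketch of the model for word count (the real library distributes the map and reduce tasks and shuffles intermediate pairs across machines; the file name here is made up):

from collections import defaultdict

def map_fn(filename, contents):
    # user-written Map: emit intermediate (key, value) pairs
    for word in contents.split():
        yield word, 1

def reduce_fn(word, values):
    # user-written Reduce: one result per intermediate key
    yield word, sum(values)

def run_mapreduce(inputs):
    # what the library does conceptually: run maps, group by key, run reduces
    grouped = defaultdict(list)
    for filename, contents in inputs.items():
        for key, value in map_fn(filename, contents):
            grouped[key].append(value)
    return dict(kv for key in grouped for kv in reduce_fn(key, grouped[key]))

print(run_mapreduce({"a.txt": "to be or not to be"}))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}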

Handling master failure - just abort the MR job, since there is only one master and its failure is unlikely. The other option is to periodically checkpoint the master's state and resume from the last checkpoint.

Locality - to preserve bandwidth, the MR master tries to schedule map task on a machine having the copy of the data or near to a replica of the data.

Backup tasks for Stragglers - When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes.

Custom partitioning - the default is hash(key) mod R, which at times is not enough. E.g. if the keys are web links and we want all links with the same hostname in the same output file, then hash(hostname(key)) mod R is the right partitioning function.

Combiner function - after a map task executes, a big payload is sent over the network to the reducer. To reduce its size we can combine the intermediate values coming out of the map on the same machine. Reduce and Combine will most likely share the same code - the difference is that the output of Reduce is written to an output file, whereas the output of the Combiner is sent over the network to Reduce.

Debugging - there is a way to sequentially execute the MR tasks to debug them better.

Counters - there are ways to keep counts of specific things. They are sent from worker machines to master (piggybacking on the ping response).

Implementing Sort - TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1.
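
A sketch of that sampled-key partitioner as a binary search over the sorted samples (the sample values below are invented):

import bisect

def partition(key, samples):
    # samples is the sorted list of N-1 sampled keys; all keys with
    # samples[i-1] <= key < samples[i] go to reduce i
    return bisect.bisect_right(samples, key)

samples = ["g", "p"]   # 2 sampled keys -> 3 reduce tasks
for key in ["apple", "grape", "melon", "zebra"]:
    print(key, "-> reduce", partition(key, samples))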

Some numbers - Cluster of 1800 machines with each one having two 2GHz Intel Xeon processors with HyperThreading enabled, 4GB of memory, two 160GB IDE disks and a gigabit Ethernet link.
Grep, which scanned through 10^10 100-byte records (1 TB in total), took 150 seconds.
Sorting was around 1000 seconds.

Advantages - 
Code to deal with fault tolerance/distribution and parallelization is removed from the core task code.
Conceptually simpler.
MapReduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault-tolerance.

Thursday, September 5, 2019

CAP theorem

http://ksat.me/a-plain-english-introduction-to-cap-theorem/ 

CA => Consistent and available

C => same state
A => every request should receive a response
P => any number of messages between the nodes (the intercom in the linked post's analogy) can be lost

If
C => same state
AND
A => every request should receive a response, so every write request should be Success or Failure
THEN
!P => since you can't lose intercom messages

CP
If
C => same state
AND
P => any number of messages in intercom can be lost
THEN
!A => can't respond to write requests since there is no assurance of being C as the messages are lost

AP
A=> take write requests and respond
AND
P => any number of messages in intercom can be lost
THEN
!C => in absence of communication they will be in different states.


Wednesday, August 7, 2019

trinity piano grade 1 enchanted garden notes


Enchanted garden notes:
Bar 1-8: https://www.youtube.com/watch?v=NUKosaWmJVc&list=LLKd4SDfv1ujxDFLpKOh4fIw
Notice treble clef for both the hands.

Right hand:
AD AD BBCB AD AD AD BBCB A

Left hand:
GF GF GE GF GF  GF GE GF
D_  D_ D_  D_ D_ D_  D_ D_
Bar 9-12 make up the phrase: (notice left hand switching to bass clef)

https://www.youtube.com/watch?v=TQa8BOJUuuw&list=LLKd4SDfv1ujxDFLpKOh4fIw

Right hand:
A_CFG_AGA_CFG_AGF_ (finger 3 on A and ends with 1st finger on F)

Left hand:
F C R | Bf F R | F C R | Bf F R (play with fingers 1,5 or 2,5 based on hand size)

Thursday, May 2, 2019

Advanced features

https://www.coursera.org/learn/competitive-data-science/lecture/mpCps/statistics-and-distance-based-features

Statistics and distance based features: groupby and nearest neighbor methods
Neighbors - for e.g. to predict rental prices, features could be number of schools/hospitals in a radius.
CTR example - ad price, ad position, user_id, page_id - you can use group by on user/page to add new features. Or even the previous history of the user.

Bray curtis metric.
-------------------------------------
Matrix factorizations: documents/words - dimensionality reduction.

mean encoding

A very popular/important video:
https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization

Mean encoding regularization
CV loop
LOO - Leave one out - using target variable to generate the new feature makes our encoding biased.

Smoothing (a rough sketch follows this list).
Noise.
Expanding mean.
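
A toy pandas sketch of the smoothing idea (column names and alpha are made up); the encoded value shrinks towards the global mean when a category has few rows:

import pandas as pd

df = pd.DataFrame({"cat":    ["a", "a", "a", "b", "b", "c"],
                   "target": [ 1,   0,   1,   0,   0,   1 ]})

global_mean = df["target"].mean()
alpha = 10  # smoothing strength

stats = df.groupby("cat")["target"].agg(["mean", "count"])
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)
df["cat_mean_enc"] = df["cat"].map(smoothed)
print(df)
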
------------------
generalizations and extensions of mean encodings: for regression/multiclass.
Many to many relations: for e.g. classification of users based on the apps installed on their phones. Each user can have multiple apps, each app can be installed by many users. Hence, many-to-many relation.

In this case, convert data to long representation. So that, each row will have <user_id, app_id, target> like <uid1, app_id1, target(0 or 1)>. Now you can take mean of targets for every app. But how to map it back to users?

Interactions and numerical features -?


Monday, April 29, 2019

Regression/Classification metrics optimization

https://www.coursera.org/learn/competitive-data-science/lecture/SQ9Uq/regression-metrics-optimization

MAE (L1 metric/L1 loss) - some methods don't support this metric since second derivative is zero.
RMSLE

Classification metrics optimization

https://www.coursera.org/learn/competitive-data-science/lecture/hvDC5/classification-metrics-optimization-i

logloss - very popular(like MSE for regression) - all NNs by default optimize logloss for classification. RFs turn out to be very bad in terms of logloss but they can be made better.

logloss requires models to output posterior probabilities - but what does it mean? logloss is easy to implement.

accuracy -



Friday, April 26, 2019

installing dropbox on centos linux

yum install docker -y
alias dropbox="docker exec -it dropbox dropbox"
alias dropbox-start="docker run -d --restart=always --name=dropbox -v /home/val/Dropbox:/dbox/Dropbox -v /home/val/.dropbox:/dbox/.dropbox -e DBOX_UID=1000 -e DBOX_GID=100 janeczku/dropbox"
sudo systemctl start docker
dropbox-start
docker ps (will give you the container id, e.g. a6f6a1d0866f)
docker logs --follow a6f6a1d0866f (now you will see a web link to connect dropbox to your pc)

Thursday, April 4, 2019

metrics and loss function

Sometimes both are the same, e.g. MSE. But at times they will be different, since it is hard to define a loss function for some metrics. In XGBoost you can write custom loss functions, but they should have smooth derivatives, otherwise training goes crazy.

Early stopping - in case we don't know how to write a loss function for the target metric, there is a simple solution: early stopping. Keep optimizing the loss function, but stop when the model starts overfitting as per the target metric.

area under curve (AUC) simple explanation

Best explanation I have found is here:
https://www.coursera.org/learn/competitive-data-science/lecture/EhJzY/classification-metrics-review

1) Area under curve
For every threshold plot number of TPs on y-axis and FPs on x-axis. If
dataset can be clearly separated by a threshold then the AUC will be
1(max value).

2) Pair ordering
Consider all possible pairs with one positive and one negative example. AUC is the probability that the positive is ranked higher than the negative, i.e. the fraction of correctly ordered pairs (so that some threshold can put the positives on one side and the negatives on the other).
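
The pair-ordering view translates directly into a quadratic, illustration-only computation:

def auc_by_pairs(scores, labels):
    # fraction of (positive, negative) pairs where the positive scores higher;
    # ties count as half a correctly ordered pair
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    ordered = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return ordered / (len(pos) * len(neg))

print(auc_by_pairs([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0, perfectly separable
print(auc_by_pairs([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75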

Monday, April 1, 2019

Classification metrics

Accuracy
logloss, hard predictions, soft predictions
AUC?
Cohen's Kappa - error weight matrix - linear weighted or quadratic
weighted kappa.

Thursday, March 28, 2019

Metrics DS

RMSE - Root mean squared error
MSE - Mean squared error

MSE and RMSE are similar but behave differently for gradient based methods.
MAE - Mean absolute error

MAE optimal target constant is median - hence handles outliers better.
MSE optimal target constant is mean.

MAPE and MSPE are weighted versions of above.
P is for percentage.
They penalize based on the target absolute value.
E.g. MSE and MAE treat an error of 1 the same whether the target is 10 (prediction 9) or 1000 (prediction 999). But MAPE and MSPE don't: they penalize the 9-vs-10 case more, since its relative error is higher.

Another one is RMSLE - uses logs. It penalizes relative errors and also unbiased towards smaller targets unlike MAPE and MSPE.
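
A minimal numeric check of the 9-vs-10 and 999-vs-1000 point (numpy assumed; both predictions are off by exactly 1):

import numpy as np

y_true = np.array([9.0, 999.0])
y_pred = np.array([10.0, 1000.0])

mae   = np.mean(np.abs(y_true - y_pred))              # 1.0, both errors weighted equally
mse   = np.mean((y_true - y_pred) ** 2)               # 1.0, same story
mape  = np.mean(np.abs((y_true - y_pred) / y_true))   # dominated by the 9-vs-10 error
mspe  = np.mean(((y_true - y_pred) / y_true) ** 2)
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
print(mae, mse, mape, mspe, rmsle)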

Sunday, March 24, 2019

cross validation strategies

leave one out cross validation - used when very few samples are available.
k fold - regular train/validate split(also called hold out) k times.

stratified cross validation

Advantages:
While validating, split the data in such a way that all classes are
represented in both train/validation sets.

Good for small and unbalanced datasets.

Disadvantages:
1. Even for unbiased or balanced algorithms, a fold that contains no samples of some class (or only a single sample) cannot be used to learn or evaluate that class, so generalization can neither be learned nor measured properly for it.

2. Supervised stratification also compromises the technical purity of the evaluation: the labels of the test data should not affect training, but stratification uses them when selecting the training instances.

tl;dr
Stratification is recommended if very few samples are available. For
large datasets, law of large numbers kicks in i.e. samples in
train/validation data will also be huge and representative of the
actual data distribution.
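
A minimal stratified split sketch, assuming scikit-learn is available; with 16 negatives and 4 positives, each validation fold keeps the 4:1 class ratio:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)   # small, unbalanced dataset

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    print(y[val_idx].sum(), "positive(s) in this validation fold")  # 1 each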

Sunday, March 3, 2019

Cartesian product code in python php

---php----
$a[0] = ['please', 'could you', ''];
$a[1] = ['create', 'make', ''];
$a[2] = ['a', 'an', ''];
$a[3] = ['quick', 'short', ''];

$combined = array();
for ($i = 0; $i < count($a); ++$i) {
    $combined = combine($combined, $a[$i]);
}

print_r($combined);

function combine($a, $b) {
    if (!$a) return $b;
    if (!$b) return $a;
    $out = array();
    for ($i = 0; $i < count($a); ++$i) {
        for ($j = 0; $j < count($b); ++$j) {
            $curr = trim($a[$i].' '.$b[$j]);
            $out[] = $curr;
        }
    }
    return $out;
}

---python 1 ---
def combine(a, b):
    if not a:
        return b
    if not b:
        return a
    out = []
    for ai in a:
        for bi in b:
            out.append((ai + ' ' + bi).strip())
    return out

arr = []
arr.append(['please', 'cortana please', ''])
arr.append(['create', 'make', ''])
arr.append(['a', 'an', ''])
arr.append(['quick', 'short', ''])
arr.append(['note'])

combined = []
for ci in arr:
    combined = combine(combined, ci)
print(len(combined))
for e in combined:
    print(e)
--------------------------

Python 2:
def find(all1, ans, i):
    if i == len(all1):
        print(ans)
        return
    j = 0
    while j < len(all1[i]):
        ans.append(all1[i][j])
        find(all1, ans, i + 1)
        ans.pop()
        j += 1


all1 = []
all1.append(["hi", "hey"])
all1.append(["there", "here"])
all1.append(["cute", "handsome"])

ans = []
find(all1, ans, 0)
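
For reference, Python's standard library already does the heavy lifting here; a sketch using itertools.product on the same lists (the join/split dance just collapses the empty-string options):

from itertools import product

arr = [['please', 'cortana please', ''],
       ['create', 'make', ''],
       ['a', 'an', ''],
       ['quick', 'short', ''],
       ['note']]

combined = [' '.join(' '.join(parts).split()) for parts in product(*arr)]
print(len(combined))   # 3 * 3 * 3 * 3 * 1 = 81
for phrase in combined:
    print(phrase)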

Monday, February 25, 2019

mouse pointer getting stuck in bottom right corner

Windows 10
Dual monitor setup
Mouse: Microsoft Sculpt Ergonomic
Problem: Mouse pointer stuck in bottom right corner of screen
Solution: Remove the batteries and put them back.

A few days ago, I wasn't able to move my mouse pointer - so I changed the batteries and it started working.

Again today the same thing. But today I noticed that the pointer was stuck in the bottom right corner.
Maybe it was the same earlier too but I didn't notice due to the dual monitor setup.

So I just removed the batteries and put them back again - it somehow resets something in the mouse and it starts working.

Tuesday, January 29, 2019

c# matching multiple word boundaries in a single string - regex OR pattern

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Linq;


class MainClass {
    public static void Main (string[] args) {
        var input = "my bull dog is missing cat";
        var topics = new List<string>();
        topics.Add("cat");
        topics.Add("bull dog");
        string pattern = string.Join("|", topics.Select(x => @"\b" + @x + @"\b"));

        var regex = new Regex(pattern, RegexOptions.IgnoreCase);

        var m = regex.Matches(input);
        foreach (Match match in m)
        {
            int i = match.Index;
            Console.WriteLine(i);
        }
    }
}

Tuesday, January 8, 2019

datetime and co-ordinate features

For datetime, new features which can be generated are: applying periodicity, calculating time passed since particular event, and date differences between two datetime features.

For coordinates, we should recall extracting interesting samples from train and test data, using places from additional data, calculating distances to centers of clusters, and adding aggregated statistics for the surrounding area.

Categorical and ordinal features

First, ordinal is a special case of categorical feature but with values sorted in some meaningful order.
 - for e.g. 1st class, 2nd class in railways.

Second, label encoding, basically replace the unique values of categorical features with numbers.
 - either by sorting them alphabetically or assigning a code in order of appearance.

Third, frequency encoding - maps unique values to their frequencies.
- for e.g. how many times 1st class occurred.

Fourth, label encoding and frequency encoding are often used for tree-based methods.

Fifth, One-hot encoding is often used for non-tree-based-methods.

And finally, applying one-hot encoding to combinations of categorical features allows non-tree-based models to take feature interactions into account, and improve.
 - for e.g. in titanic dataset - you could create a new categorical feature by combining sex and pclass.

If pclass = 1,2,3 and sex = M,F
then features could be:
1M, 1F, 2M, 2F, 3M, 3F and we could use one-hot encoding here.
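
A toy pandas sketch of that pclass/sex combination followed by one-hot encoding (column names made up to mirror the Titanic example):

import pandas as pd

df = pd.DataFrame({"pclass": [1, 2, 3, 1], "sex": ["M", "F", "F", "M"]})

# combined categorical feature: '1M', '2F', '3F', ...
df["pclass_sex"] = df["pclass"].astype(str) + df["sex"]
print(pd.get_dummies(df["pclass_sex"], prefix="pclass_sex"))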

One-hot encodings can be stored as sparse matrices (which store the data efficiently when fewer than half of the values are non-zero).

Sunday, January 6, 2019

Numeric features - Competitive DS course

Preprocessing:
1. Tree based models - preprocessing doesn't matter for numeric features. Since they are only trying to split features irrespective of the scale.

2. Non-tree based models - numeric feature preprocessing matters. For e.g. in kNN - scaling one feature alone would result in completely different distances and hence predictions. Same goes for NNs and Linear models.

Most often used preprocessings (a rough sketch follows this list):
MinMaxScaler => (value - min)/(max - min)
StandardScaler => Using Mean/Std
Rank => sets spaces between sorted values to be equal (handles outliers well)
log(1+x) sqrt(1+x)
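
A quick sketch of the first two scalers on a feature with an outlier, assuming scikit-learn:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [100.0]])     # one feature with an outlier

print(MinMaxScaler().fit_transform(x).ravel())    # (x - min) / (max - min), squashed into [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # (x - mean) / std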

Feature generation:
Fraction of price => for product pricing, e.g. what's the impact of the fractional part of the price? So the fractional price is a new feature: .49 in 2.49.

Social media post interval => humans won't post at regular intervals of 1 second.

