Thursday, November 29, 2018

R tutorial + Stats + Types of Inferences

Random variables - Continuous, Discrete, Categorical
Their probability distributions can be discrete or continuous.

First type of Inference - Computing Population mean from a Sample mean:
Problem: find mean weight of all football players in the world.
Input data: Weights of 45 players of your college. Mean = 173, StanDev = 15, N(number of samples) = 45
From this input data you have to compute the output.

The sampling distribution of the sample mean is approximately normal, with mu estimated by x bar = sample mean
Sigma = StanDev of the sampling distribution = Standard error, which can be estimated from the StanDev of the sample = Sample StanDev/Sqrt(N)

So if sample mean = 173
number of data points = N = 45
then Sampling Distribution mean estimate = 173

If StanDev for Sample = 15
then Standard error = 15/Sqrt(45) = 2.24

So with about 95% (2 Sigma) confidence we can say that the mean of the population lies in 173 +- 2*2.24, i.e. between roughly 168.5 and 177.5
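
The football example above can be checked in R. This is only a sketch from the summary numbers in these notes (173, 15, 45); the variable names are mine. It also shows the exact 95% multiplier from qnorm(0.975), which is slightly below 2.

```r
# Summary statistics from the football example
sample_mean <- 173
sample_sd   <- 15
n           <- 45

# Standard error of the mean
std_error <- sample_sd / sqrt(n)

# "2 sigma" interval used in the notes
ci_2sigma <- sample_mean + c(-2, 2) * std_error

# Same interval with the exact 95% normal multiplier (about 1.96)
ci_exact <- sample_mean + c(-1, 1) * qnorm(0.975) * std_error

print(round(std_error, 2))
print(round(ci_2sigma, 2))
print(round(ci_exact, 2))
```

The 2 sigma interval is a little wider than the exact one, so it errs on the safe side.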
-------------------------
Second type of Inference - Population Proportion - Identifying the population % - Election polling - Yes/No type of variable
Find out the winning chances of this candidate.
Pollster picks a sample of 2000 voters.
1100 say Yes for this candidate. 55%.
The standard error formula is different here since it's the population % case.
 = Sqrt(p*(1-p)/N) = Sqrt(0.55*0.45/2000) = 0.011
Here p = 0.55 as computed above (use the proportion, not the percentage)
Standard Error = 0.011 = about 1.1 percentage points

2 Sigma = about 2.2 percentage points here
Summary: 55% of people support the candidate, and we can say with about 95% (2 sigma) confidence that this number is off by at most about 2.2 percentage points.
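
A quick R check of the polling numbers, as a sketch using only the counts from the example (1100 out of 2000). The variable names are my own.

```r
# Polling example: 1100 "yes" answers out of 2000 voters
yes <- 1100
n   <- 2000

p_hat     <- yes / n                        # sample proportion
std_error <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
margin    <- 2 * std_error                  # "2 sigma" margin of error

print(p_hat)
print(round(std_error, 4))
# Interval in percent
print(round(100 * c(p_hat - margin, p_hat + margin), 1))
```

Working with the proportion (0.55) instead of the percentage (55) keeps the standard error in the same units as p.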
---------------------------------
Third - verifying whether the population mean is equal to a certain value - Hypothesis testing for population mean
A medical study
Let's verify this hypothesis: The average life expectancy of an Indian College Graduate is 70 years.
Let's take a sample.
N = 100
Mean life expectancy of the sample = 65
SD of life expectancy of that sample = 10

So:
Standard Error(the best estimate for the SD of the Sampling Distribution) = 10/Sqrt(100) = 1
Also,
Best point estimate of the population mean = Sample mean = 65

Now, let's perform a test of significance. What does that mean?
We need to figure out whether the difference between 65 and 70 is only due to chance, or whether it is there because the population mean is actually different from 70.
The first one is the Null Hypothesis.
The second one is the Alternative Hypothesis.
If we fail to reject the Null Hypothesis - it means the data is consistent with the mean being 70 (it does not prove that the mean is 70).
If we reject it in favour of the Alternative Hypothesis - it means we conclude that the mean is not 70.

For doing this:
Compute the probability of seeing a difference at least this large if the Null Hypothesis were true (the p-value).
If that probability is too low - reject the Null Hypothesis. Else do not reject it.

For doing this:
We will use Z-Statistic = Normalized distance between the Sample mean(65) and the Hypothesized population mean(70) = (Sample mean - Hypothesized population mean)/Standard Error = (65-70)/1 = -5
What we are doing here is standardizing: if the Null Hypothesis is true, Z follows (approximately) a standard normal distribution with mean = 0 and StanDev = 1.
So now we need to check P(|Z| > 5) = P(the sample mean is more than 5 standard errors away from 70) = Area under the curve beyond -5 and +5.

In R, compute like this: 1 - (pnorm(5) - pnorm(-5)) = 5.73e-07, which is very low. So we reject the Null Hypothesis.
What if this value was just under 0.1? Then we could reject the Null Hypothesis at the 10% significance level (90% confidence).
If it was under 0.05 we could say the same thing at 95% confidence.
If it was under 0.01 then 99% confidence.
and so on....
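
The whole life expectancy test fits in a small R function. This is a sketch, not a library routine: z_test_mean is a name I made up, and it assumes the sample is large enough for the normal approximation.

```r
# Two-sided z-test for a population mean, from summary statistics only
# (z_test_mean is a hypothetical helper, not part of base R)
z_test_mean <- function(sample_mean, null_mean, sample_sd, n) {
  std_error <- sample_sd / sqrt(n)
  z <- (sample_mean - null_mean) / std_error
  # Area in both tails beyond |z|
  p_value <- 2 * pnorm(-abs(z))
  list(z = z, p_value = p_value)
}

result <- z_test_mean(sample_mean = 65, null_mean = 70, sample_sd = 10, n = 100)
print(result$z)
print(result$p_value)
# Reject the null at the 5% level?
print(result$p_value < 0.05)
```

Using 2 * pnorm(-abs(z)) gives the same two-tail area whether z is negative or positive.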
-----------------------
4 - Hypothesis testing for population proportion
Verifying whether the population % is equal to a certain value
Example: 40% of the employees check Facebook first thing in the morning.

% of the sample who said yes = 43
N = 100 (number of samples)
Standard Error = Sqrt(p(1-p)/n) = Sqrt(.43*.57/100) = .05
So p-hat = .43, Standard Error = .05
Null hypothesis p = .4
Alternative hypothesis p <> .4
Null and Alternate hypothesis should be MECE(Mutually Exclusive and Collectively Exhaustive)
Z-statistic = (p-hat - p)/Standard Error = (.43 - .40)/.05 = .03/.05 = .6
P(|Z| > .6) = p-value = probability of a difference at least this large if the null hypothesis is true
Using R:
> print(1 - pnorm(.6) + pnorm(-.6)  )
[1] 0.5485062

So the p-value is large: a difference like this is quite likely under the Null Hypothesis, so we cannot reject it.


pnorm(x) gives the area under the curve from x = -infinity to the given value of x.
So 1 - (pnorm(.6) - pnorm(-.6) ) = Area under the curve before -.6 and after .6 .
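
To convince myself about the tail areas, here are three equivalent ways to get the two-tail probability in R. It only uses pnorm and its lower.tail argument; the variable names are mine.

```r
z <- 0.6

# 1 minus the area between -z and z
two_tail_a <- 1 - (pnorm(z) - pnorm(-z))

# Twice the lower tail, using symmetry of the normal curve
two_tail_b <- 2 * pnorm(-z)

# Twice the upper tail, asking pnorm for the upper tail directly
two_tail_c <- 2 * pnorm(z, lower.tail = FALSE)

print(c(two_tail_a, two_tail_b, two_tail_c))
print(isTRUE(all.equal(two_tail_a, two_tail_b)))
```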
----------------------
5 - A/B testing - comparing the means of two populations
Two layouts of a website. Old A and new B.
Null hypothesis: Average minutes/visit on A <= B. B is no worse than A.
Alternate: A > B. B is worse than A.
It's the first time the Null hypothesis is an inequality. In all earlier cases the Null was an equality and the Alternate was "not equal". A test with an inequality Null is called a one-sided test; the earlier ones are two-sided tests.

A has 5 mins per visit.
B has 3 mins per visit.

Null hypothesis says difference between A and B is due to chance - hence not significant.
Alternate hypothesis says it's significant.

Z-statistic in this case(when we are comparing means of 2 distributions) = (X1bar - X2bar)/Sqrt(Sigma_x1_square + Sigma_x2_square)
Sigma = standard error = StanDev/Sqrt(number_of_samples)
Here the standard errors are 0.5 for A and 0.4 for B
= (5-3)/sqrt(.5^2 + .4^2) = 2/sqrt(.25+.16) = 3.12
P (Z > 3.12) -notice only one sided test, no absolute value
> print(1 - pnorm(3.12)  )
[1] 0.0009042552
It's same as
pnorm(-3.12) because of symmetry of normal distribution
Probability is too low. Reject the null hypothesis.
So new layout is indeed worse.
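
The A/B comparison as a small R function. A sketch only: z_test_two_means is a hypothetical helper name, and it takes the two standard errors (0.5 and 0.4) as given in the notes.

```r
# One-sided z-test comparing two means, given each mean's standard error
# (z_test_two_means is a hypothetical helper, not part of base R)
z_test_two_means <- function(mean_a, mean_b, se_a, se_b) {
  z <- (mean_a - mean_b) / sqrt(se_a^2 + se_b^2)
  # One-sided: only the upper tail, P(Z > z)
  p_value <- pnorm(z, lower.tail = FALSE)
  list(z = z, p_value = p_value)
}

ab <- z_test_two_means(mean_a = 5, mean_b = 3, se_a = 0.5, se_b = 0.4)
print(round(ab$z, 2))
print(ab$p_value)
```

The p-value here is computed from the unrounded z, so it comes out slightly different from the value for z = 3.12.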
----------------
6 - Verifying whether 2 population % are different
Customer Surveys
Which users are more satisfied - mobile app or web
A = 100 people who use only the app
B = 100 people who use only the website
% of happy customers in A = 67%
% of happy customers in B = 63%

Standard Error = Sqrt(p*(1-p)/100) = .047 for A and .048 for B
Null hypothesis: % happy on App <= website
Alternate: App > website

Z-stat = (.67-.63)/sqrt(.047^2 + .048^2) = .04/sqrt(.002211 + .002331) = .04/.067 = .597
> print(1 - pnorm(.597)  )
[1] 0.2752537

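So the p-value is about 0.28, which is far too high to reject the null hypothesis.
We have no evidence that app users are more satisfied than website users.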
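
The survey comparison computed in R without rounding the intermediate values. It is a sketch using the numbers from the example (67% and 63% happy, 100 people each); the variable names are mine. The unrounded z comes out near 0.59, a bit below the hand-computed .597.

```r
# App vs website satisfaction, 100 users in each group
n_a <- 100
n_b <- 100
p_a <- 0.67
p_b <- 0.63

# Standard error of each sample proportion
se_a <- sqrt(p_a * (1 - p_a) / n_a)
se_b <- sqrt(p_b * (1 - p_b) / n_b)

# One-sided test: is the app proportion higher?
z <- (p_a - p_b) / sqrt(se_a^2 + se_b^2)
p_value <- pnorm(z, lower.tail = FALSE)

print(round(c(se_a, se_b), 3))
print(round(z, 3))
print(round(p_value, 2))
```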


Tuesday, November 27, 2018

Thalassemia intermedia is characterized by Hb level between 7 and 10
g/dl, MCV between 50 and 80 fl and MCH between 16 and 24 pg.
Thalassemia minor is characterized by reduced MCV and MCH, with
increased Hb A2 level.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2893117/

Wednesday, November 21, 2018

R and statistics

https://www.safaribooksonline.com/videos/learn-by-example/9781788996877


r <- c(1,2,3,4,5,5,6,6,8,9)
range(r)

Method 1:
bins <- seq(0,11,by=2)
intervals <- cut(r,bins,right=FALSE)
print(intervals)
print(bins)
t <- table(intervals)
print(t)
plot(t,type="h",main="rabit",xlab="intervals",ylab="count")

Method 2:
hist(r, breaks=bins, right=FALSE)  # right=FALSE so the bins match Method 1

Mean,Median,Mode:
print(mean(r))
print(median(r))
print(sort(table(r), decreasing = TRUE)[1])
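
One caveat with the mode line above: this dataset has a tie (5 and 6 both appear twice) and picking the first element of the sorted table shows only one of them. A small sketch that returns all tied modes, with variable names of my own:

```r
r <- c(1,2,3,4,5,5,6,6,8,9)

# Frequency of each distinct value
counts <- table(r)

# Keep every value whose count equals the highest count
modes <- as.numeric(names(counts)[counts == max(counts)])
print(modes)
```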

Histogram is the bar chart of the frequency table.
-------------
Measuring data spread:

computing IQR - Inter quartile range - a measure of data spread

Divide your sorted data into 4 equal parts. The 3 partition boundaries inside it are Q1,Q2, Q3. 
Q3 - Q1 is IQR.
Q2 is effectively median of the data.
Q1 is the median of the first half of data.
Q3 is the median of the second half of data.
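
In R the quartiles come from quantile(). A sketch comparing it with the "median of each half" rule described above; note that quantile() by default interpolates between data points, so Q1 can differ a little from the median-of-halves value. The variable names are mine.

```r
r <- c(1,2,3,4,5,5,6,6,8,9)

# Quartiles using R's default (interpolating) method
q <- quantile(r, probs = c(0.25, 0.5, 0.75))
print(q)
print(unname(q[3] - q[1]))   # IQR

# "Median of each half" rule from the notes
sorted <- sort(r)
half <- length(sorted) %/% 2
lower_half <- sorted[1:half]
upper_half <- sorted[(length(sorted) - half + 1):length(sorted)]
print(c(median(lower_half), median(upper_half)))
```

For this dataset quantile() gives Q1 = 3.25 while the median of the lower half is 3; Q3 is 6 both ways.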

-------------
Box and whisker plots help us in visualizing IQR and outliers.
How to draw:
Draw a box starting at Q1 and ending at Q3. So width of this box is IQR. Mark the median inside the box.
From the two edges of the box draw whiskers in both directions (left from Q1 and right from Q3). Each extends to the farthest data point that is within 1.5*IQR of the box.
Now any data points beyond the end of the whiskers are outliers.
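
The outlier rule can be computed by hand in R. This sketch uses a made-up dataset (the value 30 is a deliberately planted outlier) and my own variable names. boxplot.stats() is what boxplot() uses internally; it works from hinges rather than quantile(), so its fences can differ slightly, but it flags the same point here.

```r
# Hypothetical data with one obvious outlier
x <- c(1,2,3,4,5,5,6,6,8,30)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Whiskers may reach at most 1.5*IQR beyond the box edges
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
print(c(lower_fence, upper_fence))
print(outliers)

# Same check using the helper behind boxplot()
print(boxplot.stats(x)$out)
```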
---------

Next measure of dispersion is: Standard Deviation(SD) - way more popular than IQR.
In financial markets - it's called volatility.

Essentially - how far away from the mean the points are.

SD = SQRT(Variance)
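Variance = sum((mean - datapoint)^2)/n = mean of (squares of deviations)
(R's var() and sd() divide by n-1 instead of n, i.e. they compute the sample variance and sample SD)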

Why is variance better than the mean of absolute values?

1. Variance is more sensitive:
For the deviations -2,4 and -3,3
Mean of abs = 3 for both
But variance = ((-2)^2 + 4^2)/2 = 10
and ((-3)^2 + 3^2)/2 = 9
So variance is more for -2,4 as opposed to -3,3 which is good!

2. Variance is computationally cheaper:
An if condition(to check whether the number is negative or not) is more expensive than a square.

3. Variance has cool mathematical properties and fundamentally tied with Normal distribution.

4. Unlike IQR, Variance is sensitive to outliers. Like mean is sensitive to outliers but median is not.
So if you see median/IQR combo is not sensitive to outliers but Mean/Variance combo is sensitive.

Drawback of the variance: it's not in the same units as the dataset and the mean - since it's squared.
That's why we have SD = Sqrt(Variance)
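
A sketch of the variance and SD formulas in R on the dataset used in these notes. It computes the divide-by-n version by hand and compares it with R's built-in var(), which divides by n-1. The variable names are mine.

```r
r <- c(1,2,3,4,5,5,6,6,8,9)
n <- length(r)

# Variance as defined in the notes: mean of squared deviations (divide by n)
pop_var <- sum((r - mean(r))^2) / n
pop_sd  <- sqrt(pop_var)

# R's built-ins use n-1 in the denominator
sample_var <- var(r)
sample_sd  <- sd(r)

print(c(pop_var, sample_var))
print(c(pop_sd, sample_sd))

# The two variances differ only by the factor n/(n-1)
print(isTRUE(all.equal(sample_var, pop_var * n / (n - 1))))
```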
-----------------
In R, summary() method gives a good summary of the dataset.
r <- c(1,2,3,4,5,5,6,6,8,9)

print(summary(r))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.00    4.90    6.00    9.00 

IQR(r) method:
> print(IQR(r))
[1] 2.75
----------------------
Visually compare 2 datasets using boxplot and IQR
r1 <- c(1,1,2,4,5,5,6,6,9,10)
r2 <- c(1,2,3,4,5,5,6,6,8,9)
print(IQR(r1))
print(IQR(r2))
print(mean(r1))
print(mean(r2))
combined <- cbind(r1,r2)
boxplot(combined)
-----------------
Method for computing SD:
sd(r)
-------------






Wednesday, November 14, 2018

Hard-learned lessons in leading design at MailChimp - Aarron Walter VP of Design Education, InVision

1. Your legacy is the people you hire.
2. Bad things happen when people feel left out. Communicate early and
often, rather than communicating the grand design at the end.
3. You don't become VP by keeping your headphones on.
4. Your legs are the best design tool - visit the customers.
5. Try to draw from many designs - ask everyone to draw and put them on the wall.
6. You need to start formalizing a refinement process—something at
MailChimp we called "Guns forward, guns backward." That might mean
having a team dedicated specifically to product refinement, constantly
sanding off the edges.

"The best product companies in the world have figured out how to make
constant quality improvements part of their essential DNA." –Phil
Libin, former CEO of Evernote

Monday, November 12, 2018

Building an image classifier quickly

Updated on 24th June 2020

Here are my steps:
1. Installed Ubuntu subsystem on my Windows 10 machine.
sudo -i
apt update
apt install python-pip
sudo rm -rf /etc/apt/apt.conf.d/20snapd.conf (had to run this due to an error)
apt install python-pip
apt install tensorflow 
apt update
apt install tensorflow
apt install python-pip
apt install tensorflow
apt install python3-pip
pip3 install --user --upgrade tensorflow
pip install numpy
pip install tensorflow
git clone https://github.com/googlecodelabs/tensorflow-for-poets-2 (HEAD: bc96088a4de86729920e120111f5b208f7f1cbb1)
cd tensorflow-for-poets-2
mkdir -p tf_files/images/useful
mkdir -p tf_files/images/useless
Put images in above folders

Training:
IMAGE_SIZE=224
ARCHITECTURE="mobilenet_0.50_${IMAGE_SIZE}"
python -m scripts.retrain \
  --bottleneck_dir=tf_files/bottlenecks \
  --model_dir=tf_files/models/"${ARCHITECTURE}" \
  --summaries_dir=tf_files/training_summaries/"${ARCHITECTURE}" \
  --output_graph=tf_files/retrained_graph.pb \
  --output_labels=tf_files/retrained_labels.txt \
  --architecture="${ARCHITECTURE}" \
  --image_dir=tf_files/images

Testing:
python -m scripts.label_image  --graph=tf_files/retrained_graph.pb  --image=YOUR_PATH_TO_IMAGE_HERE


I built it for classifying WhatsApp images into personal vs generic (good morning messages etc). It works quite well.

Shell script to test and move images into respective folders:

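#!/bin/bash
# Classify each image and move it into the folder named after its predicted label
for filename in /mnt/c/images/actual/*.jpg; do
        echo "$filename"
        # Take the first word of the 4th output line as the predicted label
        dest=$(python -m scripts.label_image --graph=tf_files/retrained_graph.pb --image="$filename" | head -n4 | tail -n1 | grep -o '^\S*')
        echo "$dest"
        # Quoting keeps file names with spaces intact
        mv "$filename" "/mnt/c/images/out/$dest"
done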

Friday, November 9, 2018

design of design essays

Rationalist vs Empiricist - in Software the latter wins

Constraints:
are good.
General-purpose artifact design is harder than special-purpose design.
What's the budgeted resource - may not be dollars. Time/chip size/number of pins/UX real estate.
That resource can change too.


Design divorce:
Design as a separate thing is a fairly recent phenomenon.

User Model - Better wrong than vague - Truth will sooner come out of error than from confusion.

Telecollaboration:

Low tech is often good : document + phone call
Face to face time is crucial. Investing money in it makes sense.


What to design:

The hardest part of design is deciding what to design.

A chief service of a designer is helping clients discover what they want designed.

Requirements:

Any attempt to formulate all possible requirements at the start of a project will fail and would cause considerable delays.

Why does waterfall model still persist:
Contracts

Role of Process:

The trick is to hold "process" off long enough to permit great design
to occur, so that the lesser issues can be debated once the great
design is on the table—rather than smothering it in the cradle.

Thus, product processes are, properly, designed for follow-on
products. For innovation, one must step outside of process.

Thursday, November 8, 2018

Azure Centos 7.3 setup with apache + php7.3 +mongodb

#disable selinux
vim /etc/selinux/config
setenforce 0

#install php7.3
yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install http://rpms.remirepo.net/enterprise/remi-release-7.rpm
yum install yum-utils
yum-config-manager --enable remi-php73
yum install php php-mcrypt php-cli php-gd php-curl php-mysql php-ldap php-zip php-fileinfo
php -v

#httpd(apache)
yum install httpd
vim /etc/httpd/conf/httpd.conf
netstat -punta | grep LISTEN
pkill -9 httpd
service httpd restart

#install mongodb
vim /etc/yum.repos.d/mongodb-org-4.0.repo (copied from mongodb website)
sudo yum install -y mongodb-org
sudo service mongod start
cat /var/log/mongodb/mongod.log |grep waiting
sudo chkconfig mongod on
mongo (client)
mongorestore -d master mongo_dump/master/
#php mongo extension
yum -y install gcc php-pear php-devel
pecl install mongodb
vim /etc/php.ini (enable mongodb.so extension)
php -m | grep -i mongo
service httpd restart
mongo
setsebool -P httpd_can_network_connect on (not required if you disable selinux)

