Thursday, November 29, 2018

R tutorial + Stats + Types of Inferences

Random variables - Continuous, Discrete, Categorical
Their probability distributions can be discrete or continuous.

First type of Inference - Estimating the Population mean from a Sample mean:
Problem: find mean weight of all football players in the world.
Input data: Weights of 45 players of your college. Mean = 173, StanDev = 15, N(number of samples) = 45
From this input data you have to compute the output.

The sampling distribution of the sample mean is approximately a normal distribution with mu = x bar = sample mean
Sigma = StanDev of the sampling distribution = Standard error, which is estimated from the StanDev of the sample = Sample StanDev/Sqrt(N)

So if sample mean = 173
number of data points = N = 45
then Sampling Distribution mean estimate = 173

If StanDev for Sample = 15
then Standard error = 15/Sqrt(45) = 2.24

So with roughly 95% (2 Sigma) confidence we can say that the mean of the population lies in 173 +- 2*2.24, i.e. between about 168.5 and 177.5
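
The arithmetic above can be sketched in R as follows. This is a minimal sketch; the variable names are my own, and the last part uses qnorm(0.975) (about 1.96) as the exact multiplier for 95% instead of the rounded value 2.

```r
# Inputs from the football players example
sample_mean <- 173
sample_sd   <- 15
n           <- 45

# Standard error = StanDev of the sample / sqrt(N)
std_error <- sample_sd / sqrt(n)

# "2 sigma" interval, as used in the notes
ci_2sigma <- sample_mean + c(-2, 2) * std_error

# Interval with the exact 95% multiplier from the standard normal
z_95  <- qnorm(0.975)
ci_95 <- sample_mean + c(-1, 1) * z_95 * std_error

print(round(std_error, 2))   # about 2.24
print(round(ci_2sigma, 2))   # about 168.53 to 177.47
print(round(ci_95, 2))       # slightly narrower than the 2 sigma interval
```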
-------------------------
Second type of Inference - Population Proportion - Identifying the population % - Election polling - Yes/No type of variable
Find out the winning chances of this candidate.
Pollster picks a sample of 2000 voters.
1100 say Yes for this candidate. 55%.
The Standard Error is computed differently here since it's a population % case.
 = Sqrt(p*(1-p)/N) = Sqrt(0.55*0.45/2000) = 0.011
Here p = 0.55 as computed above (use the proportion, not the percentage number 55)
Standard Error = 0.011 = 1.1 percentage points (about 110 basis points)

2 Sigma = about 2.2 percentage points here
Summary: 55% of people support the candidate and we can say with 95% (2 sigma) confidence that this number is off by at most about 2.2 percentage points.
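
A minimal R sketch of the polling calculation, assuming the same 2 sigma rule for 95%. The variable names are illustrative only.

```r
# Inputs from the polling example
yes_votes <- 1100
n         <- 2000

# Sample proportion, as a fraction rather than a percentage
p_hat <- yes_votes / n

# Standard error of a proportion = sqrt(p * (1 - p) / N)
std_error <- sqrt(p_hat * (1 - p_hat) / n)

# 2 sigma margin of error and the resulting interval
margin   <- 2 * std_error
interval <- p_hat + c(-1, 1) * margin

print(p_hat)                 # 0.55
print(round(std_error, 4))   # about 0.0111
print(round(interval, 3))    # roughly 0.528 to 0.572
```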
---------------------------------
Third - verifying whether the population mean is equal to a certain value - Hypothesis testing for population mean
A medical study
Let's verify this hypothesis: The average life expectancy of an Indian College Graduate is 70 years.
Let's take a sample.
N = 100
Mean life expectancy of the sample = 65
SD of life expectancy of that sample = 10

So:
Standard Error(the best estimate for the SD of the Sampling Distribution) = 10/Sqrt(100) = 1
Also,
Best point estimate of the population mean = Sample mean = 65

Now, let's perform a test of significance. What does that mean?
We need to figure out whether the difference between 65 and 70 is only due to chance or that difference is because the population mean is actually very different from 70.
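First one is Null Hypothesis (the mean is 70 and the difference is due to chance).
Second one is Alternative hypothesis (the mean is not 70).
If we fail to reject the Null Hypothesis - it means the sample is consistent with the mean being 70.
If we reject it in favour of the Alternative Hypothesis - it means the sample indicates that the mean is not 70.

For doing this:
Compute the p-value: the probability of seeing a difference at least this large if the Null hypothesis were true.
If that probability is too low - reject the Null hypothesis. Else do not reject it.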

For doing this:
We will use Z-Statistic = Normalized distance between the Sample mean(65) and the Hypothesized population mean(70) = (Sample mean - Population mean)/Standard Error = (65-70)/1 = -5
What we are doing here is standardizing the difference: if the Null hypothesis is true, Z approximately follows a standard normal distribution with mean = 0 and StanDev = 1.
So now we need to check P(|Z| > 5) = P(the difference is more than 5 standard errors in either direction) = Area under the curve below -5 and above 5.

In R, compute like this: 1 - (pnorm(5) - pnorm(-5)) = 5.73e-07 which is very low. So we reject the Null Hypothesis.
What if this value was just under 0.1? Then we could reject the Null Hypothesis at the 90% confidence level.
If it was under 0.05 we could say the same thing at the 95% confidence level.
If it was under 0.01 then at the 99% confidence level.
and so on....
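
The life expectancy test can be sketched in R like this. It is a minimal sketch with made-up variable names; alpha = 0.05 is an assumed significance level (95% confidence), not something fixed by the example.

```r
# Inputs from the life expectancy example
hypothesized_mean <- 70
sample_mean       <- 65
sample_sd         <- 10
n                 <- 100

# Standard error and Z-statistic
std_error <- sample_sd / sqrt(n)
z <- (sample_mean - hypothesized_mean) / std_error

# Two-sided p-value: area below -|z| plus area above |z|
p_value <- 2 * pnorm(-abs(z))

# Assumed significance level (95% confidence)
alpha <- 0.05
reject_null <- p_value < alpha

print(z)            # -5
print(p_value)      # about 5.73e-07
print(reject_null)  # TRUE
```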
-----------------------
4 - Hypothesis testing for population proportion
Verifying whether the population % is equal to a certain value
Example: 40% of the employees check Facebook first thing in the morning.

% of the sample who said yes = 43
N = 100 (sample size)
Standard Error of the sample proportion = Sqrt(p(1-p)/n) = Sqrt(.43*.57/100) = .05
So p-hat = .43, SE = .05
Null hypothesis p = .4
Alternative hypothesis p <> .4
Null and Alternative hypotheses should be MECE (Mutually Exclusive and Collectively Exhaustive)
Z-statistic = (p-hat - p)/SE = (.43 - .40)/.05 = .03/.05 = .6
P(|Z| > .6) = p-value = probability of a difference at least this large if the null hypothesis is true
Using R:
> print(1 - pnorm(.6) + pnorm(-.6)  )
[1] 0.5485062

So the p-value is large: the sample is consistent with the Null Hypothesis and we do not reject it.


pnorm(x) gives the area under the curve from x = -infinity to the given value of x.
So 1 - (pnorm(.6) - pnorm(-.6) ) = Area under the curve before -.6 and after .6 .
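
Here is the same proportion test as a small R sketch, without rounding the standard error to .05 first, so Z comes out slightly above .6. The variable names are my own. It follows the approach in these notes and computes the standard error from the sample proportion; many textbooks use the hypothesized proportion (.4) there instead, which gives almost the same number in this case.

```r
# Inputs from the Facebook example
p_null <- 0.40   # hypothesized population proportion
p_hat  <- 0.43   # sample proportion
n      <- 100

# Standard error computed from the sample proportion (as in the notes)
std_error <- sqrt(p_hat * (1 - p_hat) / n)

# Z-statistic and two-sided p-value
z <- (p_hat - p_null) / std_error
p_value <- 1 - (pnorm(abs(z)) - pnorm(-abs(z)))

print(round(std_error, 4))  # close to .05
print(round(z, 2))          # close to .6
print(p_value)              # well above 0.05, so the null is not rejected
```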
----------------------
5 - A/B testing - comparing the means of two populations
Two layouts of a website. Old A and new B.
Null hypothesis: Average minutes/visit on A <= B. B is no worse than A.
Alternate: A > B. B is worse than A.
This is the first time the Null hypothesis has an inequality. In all earlier cases the Null hypothesis was an equality and the Alternate was "not equal", which is called a 2-sided test. With an inequality it's called a one-sided test.

A has 5 mins per visit.
B has 3 mins per visit.

Null hypothesis says difference between A and B is due to chance - hence not significant.
Alternate hypothesis says it's significant.

Z-statistic in this case(when we are comparing means of 2 distributions) = (X1bar - X2bar)/Sqrt(Sigma_x1_square + Sigma_x2_square)
Sigma = standard error of each mean = StanDev/Sqrt(number_of_samples)
Here the standard errors are 0.5 for A and 0.4 for B
Z = (5-3)/sqrt(.5^2 + .4^2) = 2/sqrt(.25+.16) = 3.12
P (Z > 3.12) -notice only one sided test, no absolute value
> print(1 - pnorm(3.12)  )
[1] 0.0009042552
It's same as
pnorm(-3.12) because of symmetry of normal distribution
Probability is too low. Reject the null hypothesis.
So new layout is indeed worse.
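
A minimal R sketch of this one-sided comparison, taking the two standard errors (0.5 and 0.4) as given. The variable names are illustrative; pnorm(z, lower.tail = FALSE) is the same as 1 - pnorm(z).

```r
# Inputs from the website layout example
mean_a <- 5     # average minutes per visit on layout A
mean_b <- 3     # average minutes per visit on layout B
se_a   <- 0.5   # standard error of the mean for A
se_b   <- 0.4   # standard error of the mean for B

# Z-statistic for the difference between two means
z <- (mean_a - mean_b) / sqrt(se_a^2 + se_b^2)

# One-sided p-value: P(Z > z), upper tail only
p_value <- pnorm(z, lower.tail = FALSE)

print(round(z, 2))   # about 3.12
print(p_value)       # below 0.001
```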
----------------
6 - Comparing two population proportions - verifying whether 2 population % are different
Customer Surveys
Which users are more satisfied - mobile app or web
A = 100 people who use only the app
B = 100 people who use only the website
% of happy customers in A = 67%
% of happy customers in B = 63%

Standard Error = Sqrt(p*(1-p)/100) = .047 for A and .048 for B
Null hypothesis: % of happy customers on App <= website
Alternate: App > website

Z-stat = (.67-.63)/sqrt(.047^2 + .048^2) = .04/sqrt(.002211 + .002331) = .04/.067 = .597
> print(1 - pnorm(.597)  )
[1] 0.2752537

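So p-value = 0.275, which is too high to reject the null hypothesis.
The 4 point gap could easily be due to chance, so we cannot conclude that app users are more satisfied than website users.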
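
The app vs website comparison can be sketched in R as below. This is a minimal sketch with my own variable names, and it keeps full precision in the standard errors, so Z comes out marginally lower than the hand-rounded .597.

```r
# Inputs from the customer survey example
p_app <- 0.67   # share of happy customers among app users
p_web <- 0.63   # share of happy customers among website users
n_app <- 100
n_web <- 100

# Standard error of each sample proportion
se_app <- sqrt(p_app * (1 - p_app) / n_app)
se_web <- sqrt(p_web * (1 - p_web) / n_web)

# Z-statistic for the difference and one-sided p-value P(Z > z)
z <- (p_app - p_web) / sqrt(se_app^2 + se_web^2)
p_value <- 1 - pnorm(z)

print(round(c(se_app, se_web), 3))  # 0.047 0.048
print(round(z, 3))                  # a little under 0.6
print(p_value)                      # far above 0.05, so the null is not rejected
```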

