Wednesday, November 21, 2018

R and statistics

https://www.safaribooksonline.com/videos/learn-by-example/9781788996877


r <- c(1,2,3,4,5,5,6,6,8,9)
range(r)

Method 1:
bins <- seq(0,11,by=2)
intervals <- cut(r,bins,right=FALSE)
print(intervals)
print(bins)
t <- table(intervals)
print(t)
plot(t,type="h",main="rabbit",xlab="intervals",ylab="count")

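Method 2 (hist() uses (a,b] intervals by default; right=FALSE makes it use the same [a,b) intervals as cut() in Method 1):
hist(r, breaks=bins, right=FALSE)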
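
As a quick check that the two methods describe the same frequency table, the counts can be compared directly. This is only a sketch: cut_counts and hist_counts are names chosen here, and plot = FALSE is used just to get the counts without drawing anything.

```r
r <- c(1,2,3,4,5,5,6,6,8,9)
bins <- seq(0, 11, by = 2)   # 0 2 4 6 8 10

# Method 1: counts from cut() + table(), intervals of the form [a, b)
cut_counts <- as.vector(table(cut(r, bins, right = FALSE)))

# Method 2: counts from hist(), with matching [a, b) intervals and no plot
hist_counts <- hist(r, breaks = bins, right = FALSE, plot = FALSE)$counts

print(cut_counts)
print(hist_counts)
```

With the default right = TRUE, hist() would put a value such as 2 in the first bin instead of the second, so the counts would differ.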

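Mean, Median, Mode:
print(mean(r))
print(median(r))
# R has no built-in function for the statistical mode, so use the frequency table.
# This data has two modes (5 and 6), so keep every value that ties for the top count.
counts <- table(r)
print(counts[counts == max(counts)])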

A histogram is a bar chart of the frequency table.
-------------
Measuring data spread:

Computing the IQR (interquartile range), a measure of data spread:

Divide the sorted data into 4 equal parts. The 3 partition boundaries are Q1, Q2 and Q3.
IQR = Q3 - Q1.
Q2 is the median of the data.
Q1 is the median of the first half of the data.
Q3 is the median of the second half of the data.
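
The median-of-halves rule is one common convention, and it does not always match what R reports: quantile() (and therefore summary() and IQR()) interpolates between neighbouring points by default. A small sketch comparing the two on the same data, with variable names chosen here for illustration:

```r
r <- sort(c(1,2,3,4,5,5,6,6,8,9))
n <- length(r)

# Median-of-halves rule (n is even here, so the halves split cleanly)
lower_half <- r[1:(n / 2)]
upper_half <- r[(n / 2 + 1):n]
q1_halves <- median(lower_half)   # 3
q2        <- median(r)            # 5
q3_halves <- median(upper_half)   # 6

# R's default quantile rule (type 7) interpolates instead
q_default <- quantile(r, c(0.25, 0.5, 0.75))

print(c(q1_halves, q2, q3_halves))
print(q_default)                  # Q1 comes out as 3.25 rather than 3
```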

-------------
Box and whisker plots help us in visualizing IQR and outliers.
How to draw:
Draw a box starting at Q1 and ending at Q3, so the length of the box is the IQR. Mark the median inside the box.
From each end of the box (not from the median), draw a whisker outward to the most extreme data point that lies within 1.5*IQR of the box.
Any data points beyond the ends of the whiskers are outliers.
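
The usual 1.5*IQR rule measures the cut-offs from the box edges Q1 and Q3. A minimal sketch of flagging outliers by hand, assuming a made-up dataset x in which 30 was added as an obvious outlier (lower_fence and upper_fence are names chosen here):

```r
x <- c(1,2,3,4,5,5,6,6,8,30)   # hypothetical data; 30 is the planted outlier

q   <- unname(quantile(x, c(0.25, 0.75)))   # Q1 and Q3
iqr <- q[2] - q[1]

# Points further than 1.5 * IQR from the box count as outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)

# boxplot() applies the same idea; boxplot.stats() exposes what it flags.
# It uses hinges rather than quantile(), so the fences can differ slightly.
print(boxplot.stats(x)$out)
```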
---------

The next measure of dispersion is the standard deviation (SD), which is far more widely used than IQR.
In financial markets, it's called volatility.

Essentially, it measures how far the points typically are from the mean.

SD = SQRT(Variance)
Variance = sum((mean - datapoint)^2)/n = mean of (squares of deviations)
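
The formula above can be written out directly in R. One thing to be aware of: the formula divides by n (population variance), whereas R's var() and sd() divide by n - 1 (sample variance). A short sketch, with pop_var and sample_var as names chosen here:

```r
r <- c(1,2,3,4,5,5,6,6,8,9)
n <- length(r)

# Population variance: mean of the squared deviations (divide by n)
pop_var <- sum((r - mean(r))^2) / n
pop_sd  <- sqrt(pop_var)

# R's built-in var() divides by n - 1 instead
sample_var <- var(r)

print(pop_var)                    # 5.69
print(sample_var)                 # about 6.32
# The two differ only by the factor (n - 1) / n
print(sample_var * (n - 1) / n)
```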

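Why is variance better than the mean of absolute deviations?

1. Variance is more sensitive to large deviations:
Take two sets of deviations: -2, 4 and -3, 3.
Mean of absolute values = 3 in both cases.
But mean of squares = ((-2)^2 + 4^2)/2 = 10
and ((-3)^2 + 3^2)/2 = 9.
So the squared measure is larger for -2, 4 than for -3, 3: it penalizes the single large deviation more heavily, which is good!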
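
The same arithmetic can be checked in R. This is just a sketch with d1 and d2 as names chosen here for the two sets of deviations; it also shows why the parentheses around negative numbers matter when squaring.

```r
d1 <- c(-2, 4)
d2 <- c(-3, 3)

# Mean of absolute values: identical for both sets
print(mean(abs(d1)))   # 3
print(mean(abs(d2)))   # 3

# Mean of squares: larger for the set with the single big deviation
print(mean(d1^2))      # 10
print(mean(d2^2))      # 9

# Precedence pitfall: in R, ^ binds tighter than unary minus
print(-3^2)            # -9
print((-3)^2)          # 9
```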

2. Variance is arguably computationally cheaper:
Taking an absolute value can require a conditional check (is the number negative or not), which can cost more than a multiplication.

3. Variance has cool mathematical properties and is fundamentally tied to the normal distribution.

4. Unlike IQR, variance is sensitive to outliers, just as the mean is sensitive to outliers but the median is not.
So the median/IQR combo is robust to outliers, while the mean/variance combo is sensitive to them.
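
A small experiment makes the difference visible. This sketch assumes a made-up variant of the data, r_out, in which the largest value 9 is replaced by 90; everything else is unchanged.

```r
r     <- c(1,2,3,4,5,5,6,6,8,9)
r_out <- c(r[-length(r)], 90)   # hypothetical: swap the last value (9) for 90

# Median and IQR do not move at all
print(c(median(r), median(r_out)))
print(c(IQR(r), IQR(r_out)))

# Mean and SD shift a lot because of the single outlier
print(c(mean(r), mean(r_out)))
print(c(sd(r), sd(r_out)))
```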

Drawback of the variance: it is not in the same units as the data and the mean, since the deviations are squared.
That's why we have SD = Sqrt(Variance), which is back in the units of the data.
-----------------
In R, the summary() function gives a good summary of the dataset.
r <- c(1,2,3,4,5,5,6,6,8,9)

print(summary(r))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.00    4.90    6.00    9.00 

The IQR() function:
print(IQR(r))
[1] 2.75
----------------------
Visually compare 2 datasets using boxplot and IQR. Both have the same mean (4.9), but r1 is more spread out (IQR 3.5 vs 2.75):
r1 <- c(1,1,2,4,5,5,6,6,9,10)
r2 <- c(1,2,3,4,5,5,6,6,8,9)
print(IQR(r1))
print(IQR(r2))
print(mean(r1))
print(mean(r2))
combined <- cbind(r1,r2)
boxplot(combined)
-----------------
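Function for computing SD (note: sd() divides by n-1, i.e. it gives the sample SD, not the divide-by-n version in the formula earlier):
sd(r)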
-------------