Wednesday, December 30, 2015

3. In the src folder created above, unzip the OpenCV source zip obtained from https://github.com/Itseez/opencv. src should then contain the 3rdparty, apps, cmake, etc. folders.

4. Get the opencv_contrib sources from https://github.com/Itseez/opencv_contrib. Unzip them and copy the required modules (e.g. face) from its modules folder to src/modules above.

5. create directory build at the same level as src above.

6. download cmake for windows. open cmake gui. select build and src folders created above.

7. Click Configure. Select the generator Visual Studio 12 2013 Win64 (this matches my environment above; if yours is different, pick the generator that matches it).

8. Once configuration is done, select the required modules (e.g. face) and hit Generate.

9. In the build folder you will see OpenCV.sln is created. Open it with Visual Studio.

10. Locate ALL_BUILD in Solution Explorer. Right click it and select Build.

11. cv2.pyd is created in build/lib/Release. Copy this to C:\Anaconda\Lib\site-packages

12. Add the complete path of build\bin\Release to your PATH environment variable.
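A quick way to check the result of the steps above is to try importing the module from Python. This is only a sketch: the helper name module_version is made up for illustration, and it assumes that a missing cv2.pyd or a DLL that cannot be found on PATH shows up as an ImportError.

```python
import importlib


def module_version(name):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        # Typical causes here: cv2.pyd not in site-packages,
        # or build\bin\Release missing from PATH.
        return None
    return getattr(module, "__version__", "unknown")


print(module_version("cv2"))
```

If this prints None, recheck steps 11 and 12 before rebuilding.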

Saturday, December 26, 2015

python notes

1. if you re-declare a method in the same file, the previous definition gets overwritten, without any error.

2. for importing a module from a different directory, add __init__.py to that folder : http://stackoverflow.com/a/21995949/49560
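Note 1 above can be seen in a few lines. The function name greet is just an example; the second def silently rebinds the name.

```python
def greet():
    return "first"


def greet():  # same name again: no error, the earlier definition is replaced
    return "second"


print(greet())  # calls the second definition
```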

Installing wxPython on Windows with Anaconda

get the binary from here : http://www.wxpython.org/download.php and start installing.

Select the installation path as : C:\Anaconda\Lib\site-packages

Thursday, December 24, 2015

list vs array :
List can hold heterogeneous objects while array can't. For doing math on arrays, prefer numpy.
Source : http://stackoverflow.com/questions/176011/python-list-vs-array-when-to-use

list vs tuple :
Tuples are fixed size and faster. Some tuples can be used as dictionary keys (specifically, tuples that contain only immutable values like strings, numbers, and other such tuples). Lists can never be used as dictionary keys, because lists are mutable.

But unless you have a specific reason to use a tuple, use a list.

test_list = []

test_tuple = ()

test_dict = {}
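A small sketch of the dictionary key point above. The variable names and the sample coordinates are made up for illustration.

```python
locations = {}

# A tuple of numbers is hashable, so it works as a dictionary key.
locations[(40.7, -74.0)] = "New York"

# A list is not hashable, so using it as a key raises TypeError.
try:
    locations[[40.7, -74.0]] = "New York"
except TypeError as error:
    list_key_error = str(error)

# A tuple that contains a list is not hashable either.
try:
    locations[(1, [2, 3])] = "nested"
except TypeError as error:
    nested_key_error = str(error)

print(locations)
print(list_key_error)
print(nested_key_error)
```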

Wednesday, December 23, 2015

Warning:java: source value 1.5 is obsolete and will be removed in a future release

Intellij idea 14 fixes :

File -> Settings -> Build, Execution, Deployment -> Compiler -> Java Compiler -> Target bytecode version -> 1.8
File -> Project Structure -> Modules -> Sources -> Language level -> 8

XGBoost Linear Regression output incorrect

Source

I am a newbie to XGBoost so pardon my ignorance. Here is the python code :

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
X_train = df.drop('y',axis=1)
Y_train = df['y']
T_train_xgb = xgb.DMatrix(X_train, Y_train)

params = {"objective": "reg:linear"}
gbm = xgb.train(dtrain=T_train_xgb,params=params)
Y_pred = gbm.predict(xgb.DMatrix(pd.DataFrame({'x':[4,5]})))
print Y_pred

Output is :

[ 24.126194  24.126194]

As you can see the input data is simply a straight line. So the output I expect is [40,50]. What am I doing wrong here?

Answer :

It seems that XGBoost uses regression trees as base learners by default. XGBoost (or gradient boosting in general) works by combining multiple of these base learners. Regression trees cannot extrapolate the patterns in the training data, so any input above 3 or below 1 will not be predicted correctly in your case. Your model is trained to predict outputs for inputs in the interval [1,3]; an input higher than 3 will have the same output as 3, and an input less than 1 will be given the same output as 1.

Additionally, regression trees do not really see your data as a straight line, as they are non-parametric models, which means they can theoretically fit any shape that is more complicated than a straight line. Roughly, a regression tree works by assigning your new input data to some of the training data points it has seen during training, and produces the output based on that.

This is in contrast to parametric regressors (like linear regression), which actually look for the best parameters of a hyperplane (a straight line in your case) to fit your data. Linear regression does see your data as a straight line with a slope and an intercept.
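The difference can be sketched without XGBoost at all. The code below is a toy illustration, not how XGBoost is implemented: fit_step_function is a hypothetical stand-in for a fully grown regression tree (one leaf per training point, splits at the midpoints, distinct x values assumed), and fit_line is ordinary least squares on one feature.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def fit_step_function(xs, ys):
    """Toy tree stand-in: split at midpoints, one constant leaf per point."""
    points = sorted(zip(xs, ys))
    thresholds = [(a[0] + b[0]) / 2 for a, b in zip(points, points[1:])]
    leaves = [y for _, y in points]
    return thresholds, leaves


def predict_step(model, x):
    thresholds, leaves = model
    for threshold, leaf in zip(thresholds, leaves):
        if x < threshold:
            return leaf
    return leaves[-1]  # everything beyond the last split gets the last leaf


xs, ys = [1, 2, 3], [10, 20, 30]
slope, intercept = fit_line(xs, ys)
tree_like = fit_step_function(xs, ys)

for x in (4, 5):
    print(x, slope * x + intercept, predict_step(tree_like, x))
```

The line keeps growing past x = 3, while the piecewise-constant model returns the value of its last leaf for both 4 and 5.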

You can change the base learner of your XGB model to a GLM (generalised linear model) by adding a booster parameter to your model params :

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
X_train = df.drop('y',axis=1)
Y_train = df['y']
T_train_xgb = xgb.DMatrix(X_train, Y_train)

params = {"objective": "reg:linear", "booster":"gblinear"}
gbm = xgb.train(dtrain=T_train_xgb,params=params)
Y_pred = gbm.predict(xgb.DMatrix(pd.DataFrame({'x':[4,5]})))
print Y_pred

In general, to debug why your XGBoost model is behaving in a particular way, see the model parameters :

gbm.get_dump()

If your base learner is linear model, the get_dump output is :

['bias:\n4.49469\nweight:\n7.85942\n']

In your code above, since you use tree base learners, the output will be :

['0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=2.85\n\t\t4:leaf=5.85\n\t2:leaf=8.85\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=1.995\n\t\t4:leaf=4.095\n\t2:leaf=6.195\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=1.3965\n\t\t4:leaf=2.8665\n\t2:leaf=4.3365\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.97755\n\t\t4:leaf=2.00655\n\t2:leaf=3.03555\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.684285\n\t\t4:leaf=1.40458\n\t2:leaf=2.12489\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.478999\n\t\t4:leaf=0.983209\n\t2:leaf=1.48742\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.3353\n\t\t4:leaf=0.688247\n\t2:leaf=1.04119\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.23471\n\t\t4:leaf=0.481773\n\t2:leaf=0.728836\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.164297\n\t\t4:leaf=0.337241\n\t2:leaf=0.510185\n',
 '0:[x<2] yes=1,no=2,missing=1\n\t1:leaf=0.115008\n\t2:[x<3] yes=3,no=4,missing=3\n\t\t3:leaf=0.236069\n\t\t4:leaf=0.357129\n']

Tip : I actually prefer to use the xgb.XGBRegressor or xgb.XGBClassifier classes, since they follow the scikit-learn API. Because scikit-learn has so many machine learning algorithm implementations, using XGB as an additional library does not disturb my workflow as long as I use the scikit-learn interface of XGB.

Thursday, December 17, 2015

python print object memory address

print hex(id(object))
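A slightly longer sketch of the same idea, written for Python 3. It relies on id() returning the memory address, which is a CPython implementation detail; the variable names are only examples.

```python
a = [1, 2, 3]
b = a        # second name for the same object
c = list(a)  # a new list with equal contents

print(hex(id(a)))
print(id(a) == id(b))  # same object, same id
print(id(a) == id(c))  # equal contents, but a different object
```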

Pandas chained indexing - example

Here we want to set all age > 2 values to NaN.    

import pandas as pd
import numpy as np

df = pd.DataFrame({'age':[1,2,3,4,5], 'income':[10,20,30,40,50]})

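# guaranteed to work without warnings/errors
df.loc[df.age > 2, 'age'] = np.nan
print(df)

# won't work: the assignment goes to a temporary copy, so df is left
# unchanged (pandas gives a SettingWithCopyWarning rather than an error)
df[df["age"] > 2]["age"] = np.nan

# may work, but can give a warning
df["age"][df["age"] > 2] = np.nan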

Friday, December 11, 2015

Installing Anaconda and xgboost on Ubuntu

wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.3.0-Linux-x86_64.sh

And after download is finished do:

bash Anaconda-2.3.0-Linux-x86_64.sh

Source

Once Anaconda is installed, install xgboost :

git clone https://github.com/dmlc/xgboost.git

cd xgboost; make; cd wrapper; python setup.py install

Thursday, December 10, 2015

Installing xgboost on Windows for Python (While using Anaconda)

1) Download Visual Studio. You can download the community
edition @ https://www.visualstudio.com/en-us/news/vs2013-community-vs.aspx.
There is a "Free Visual Studio" button in the upper right corner.

2) Copy all content from
https://github.com/dmlc/xgboost/tree/master/windows and open the
existing project in Visual Studio.

3) There are a couple of drop-down menus where you need to select
"Release" and "X64", and then select build --> build all from the upper
menu.

4) if you see the message ========== Build: 3 succeeded, 0 failed, 0
up-to-date, 0 skipped ==========, it is all good

5) (this step was missing in the link below) Go to xgboost/python-package and run python setup.py install

source : https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13043/run-xgboost-from-windows-and-python

Errors running apt-get on aws ubuntu instance

267463#267463

ML notes : some data sources

https://research.stlouisfed.org/fred2/

https://www.quandl.com/data/DOE/RWTC-WTI-Crude-Oil-Spot-Price-Cushing-OK-FOB

http://www.pewresearch.org/about/

https://www.icpsr.umich.edu/icpsrweb/landing.jsp

Tuesday, December 8, 2015

Great explanation of how async/non blocking I/O in node.js works

https://www.future-processing.pl/blog/on-problems-with-threads-in-node-js/

Monday, December 7, 2015

ML Notes : Objective function

Objective function = Training Loss + Regularization

Training Loss measures how well the model fits the training data.

Regularization measures the complexity of the model.
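As a concrete instance, here is a sketch that assumes a one-feature linear model, mean squared error as the training loss, and an L2 penalty on the weight as the regularization. These choices and the names (objective, lam) are only for illustration; the note above does not fix a particular loss or penalty.

```python
def training_loss(w, b, xs, ys):
    """Mean squared error of the line w * x + b on the training data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def regularization(w, lam):
    """L2 penalty: larger weights count as a more complex model."""
    return lam * w ** 2


def objective(w, b, xs, ys, lam):
    return training_loss(w, b, xs, ys) + regularization(w, lam)


xs, ys = [1, 2, 3], [10, 20, 30]
print(objective(10, 0, xs, ys, lam=0.0))  # perfect fit, no penalty
print(objective(10, 0, xs, ys, lam=0.1))  # perfect fit, penalized weight
print(objective(9, 0, xs, ys, lam=0.1))   # worse fit, smaller penalty
```

With lam = 0 the objective is just the training loss; raising lam trades some fit for a smaller weight.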

ML notes : good explanation of gradient descent

http://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/

https://www.quora.com/What-is-an-intuitive-explanation-of-gradient-descent

ML Notes : Precision vs Recall (true positive rate)

Precision is a counterpart to Recall (true positive rate) in the following sense: while true positive rate is the proportion of predicted positives among the actual positives, precision is the proportion of actual positives among the predicted positives.

Peter Flach.
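The two definitions can be checked on a small made-up example. This is a sketch assuming binary labels where 1 means positive; the function name and the data are illustrative only.

```python
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # precision: actual positives among the predicted positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    # recall: predicted positives among the actual positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here two predictions are positive and one of them is right (precision 1/2), while only one of the three actual positives is found (recall 1/3).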

ML notes : linear regression vs logistic regression

In the linear regression model the dependent (output) variable y is
considered continuous, whereas in logistic regression it is
categorical, i.e., discrete. In application, the former is used in
regression settings while the latter is used for binary classification
or multi-class classification (where it is called multinomial logistic
regression)

http://stats.stackexchange.com/questions/29325/what-is-the-difference-between-linear-regression-and-logistic-regression
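A minimal sketch of the difference in outputs, assuming a single feature and already known parameters w and b (no fitting is done here, and the function names are made up). The linear model returns the raw value, while the logistic model passes the same linear score through a sigmoid and thresholds the resulting probability to get a class.

```python
import math


def linear_predict(w, b, x):
    """Continuous output: any real number."""
    return w * x + b


def logistic_predict_proba(w, b, x):
    """Probability of the positive class, always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def logistic_predict_class(w, b, x, threshold=0.5):
    """Discrete output: class 1 or class 0."""
    return 1 if logistic_predict_proba(w, b, x) >= threshold else 0


print(linear_predict(10, 0, 4))
print(logistic_predict_proba(1.0, -2.0, 3.0))
print(logistic_predict_class(1.0, -2.0, 3.0))
print(logistic_predict_class(1.0, -2.0, 1.0))
```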

ML Notes : false positive, false negative, true positive, true negative

A good way to think of this is to remember that positive/negative
refers to the classifier's prediction, and true/false refers to
whether the prediction is correct or not. So, a false positive is
something that was incorrectly predicted as positive, and therefore an
actual negative (e.g., a ham email misclassified as spam, or a healthy
patient misclassified as having the disease in
question).

Peter Flach
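The naming rule can be written down directly. The sketch below assumes a binary spam/ham problem with "spam" as the positive class; the function name and the sample labels are made up for illustration.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="spam"):
    counts = Counter()
    for a, p in zip(actual, predicted):
        # positive/negative comes from the classifier's prediction
        sign = "positive" if p == positive else "negative"
        # true/false says whether that prediction was correct
        truth = "true" if a == p else "false"
        counts[truth + " " + sign] += 1
    return counts


actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "spam", "ham", "ham", "ham"]
counts = confusion_counts(actual, predicted)
print(dict(counts))
```

The second email is ham predicted as spam, so it is counted as a false positive, matching the example in the quote.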

ML Notes : Area Under Curve (AUC) vs Overall Accuracy

http://stats.stackexchange.com/a/69944/73383

The area under the curve (AUC) is equal to the probability that a
classifier will rank a randomly chosen positive instance higher than a
randomly chosen negative example. It measures the classifier's skill in
ranking a set of patterns according to the degree to which they belong
to the positive class, but without actually assigning patterns to
classes.

The overall accuracy also depends on the ability of the classifier to
rank patterns, but also on its ability to select a threshold in the
ranking used to assign patterns to the positive class if above the
threshold and to the negative class if below.

Thus the classifier with the higher AUROC statistic (all things being
equal) is likely to also have a higher overall accuracy as the ranking
of patterns (which AUROC measures) is beneficial to both AUROC and
overall accuracy. However, if one classifier ranks patterns well, but
selects the threshold badly, it can have a high AUROC but a poor
overall accuracy.
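The last point can be illustrated with made-up scores. In this sketch, auc computes the ranking probability described above by comparing every positive/negative pair (ties count as half), and accuracy applies a threshold to the same scores; the data and the function names are only for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold):
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

print(auc(scores, labels))             # ranking is perfect
print(accuracy(scores, labels, 0.5))   # sensible threshold
print(accuracy(scores, labels, 0.05))  # badly chosen threshold
```

The ranking is the same in all three cases, so the AUC does not change; only the accuracy drops when the threshold puts every example in the positive class.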

Thursday, December 3, 2015

memory being used by httpd

ps -C httpd -O rss | gawk '{ count++; sum += $2 } END { count--; print "Number of processes =", count; print "Memory usage per process =", sum/1024/count, "MB"; print "Total memory usage =", sum/1024, "MB" }'

Software Troubles and Troubleshooting