Wednesday, December 30, 2015

python unix style pathname pattern expansion

import glob

# iterate over every .jpg file in the faces directory
for imagePath in glob.glob('faces/*.jpg'):
    # note : the method is endswith (all lowercase); this guard is
    # redundant here, since the glob pattern already matches only .jpg
    if not imagePath.endswith('.jpg'):
        continue

Sunday, December 27, 2015

Building OpenCV with face module (opencv_contrib) from sources on Windows 7 64 bit


My env : 
1. Windows 7 64 bit
2. Anaconda 64 bit python
3. Visual Studio 12 2013

Steps : 
1. mkdir opencv
2. cd opencv; mkdir build; mkdir src;
3. In the src folder created above, unzip the OpenCV sources obtained from https://github.com/Itseez/opencv. src should now contain the 3rdparty, apps, cmake etc. folders.
4. Get the opencv_contrib sources from https://github.com/Itseez/opencv_contrib. Unzip and copy the required modules (e.g. face) from its modules folder into src/modules above.
5. Download CMake for Windows and open the CMake GUI. Select the build and src folders created above.
6. Click Configure. Select the generator Visual Studio 12 2013 Win64 (note my env above; if you have a different env, tinker a bit).
7. Once configuration is done, select the required modules (e.g. face) and hit Generate.
8. In the build folder you will see that OpenCV.sln has been created. Open it with Visual Studio.
9. Locate ALL_BUILD in the Solution Explorer. Right click and Build.
10. cv2.pyd is created in build\lib\Release. Copy it to C:\Anaconda\Lib\site-packages.
11. Add the complete path of build\bin\Release to your PATH environment variable.
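
To verify that the contrib module made it into the build, a quick sanity check from Python (a minimal sketch; the face attribute is present only if the face module was compiled in):

import cv2

print cv2.__version__          # should print the version you just built
print hasattr(cv2, 'face')     # True if the contrib face module is present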

Saturday, December 26, 2015

python notes

1. if you re-define a function in the same file, the previous definition gets silently overwritten, without any error (see the example below).

2. for importing a module from a different directory, add __init__.py to that folder : http://stackoverflow.com/a/21995949/49560
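
A quick illustration of point 1:

def greet():
    return "first"

def greet():          # silently replaces the first definition
    return "second"

print greet()         # prints : second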

Installing wxPython on Windows with Anaconda

get the binary from here : http://www.wxpython.org/download.php and start installing.

Select the installation path as : C:\Anaconda\Lib\site-packages

Thursday, December 24, 2015

python in-built data types

list vs array :
A list can hold heterogeneous objects, while an array (the array module) can't. For doing math on arrays, prefer numpy.
Source : http://stackoverflow.com/questions/176011/python-list-vs-array-when-to-use

list vs tuple :
tuples are fixed size and faster. Some tuples can be used as dictionary keys (specifically, tuples that contain immutable values like strings, numbers, and other tuples). Lists can never be used as dictionary keys, because lists are not immutable.

but unless you have a specific reason, use a list.

test_list = []
test_tuple = ()
test_dict = {}
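
For example, a tuple works as a dictionary key while a list does not:

coords = {}
coords[(40.7, -74.0)] = "NYC"      # fine : tuples are hashable
try:
    coords[[40.7, -74.0]] = "NYC"  # raises TypeError
except TypeError as e:
    print e                        # unhashable type: 'list'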

Wednesday, December 23, 2015

Warning:java: source value 1.5 is obsolete and will be removed in a future release

IntelliJ IDEA 14 fixes : 

  1. File -> Settings -> Build, Execution, Deployment -> Compiler -> Java Compiler -> Target Bytecode Version -> 1.8
  2. File -> Project Structure -> Modules -> Sources -> Language level -> 8

XGBoost Linear Regression output incorrect

I am a newbie to XGBoost so pardon my ignorance. Here is the python code :
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
X_train = df.drop('y',axis=1)
Y_train = df['y']
T_train_xgb = xgb.DMatrix(X_train, Y_train)

params = {"objective": "reg:linear"}
gbm = xgb.train(dtrain=T_train_xgb,params=params)
Y_pred = gbm.predict(xgb.DMatrix(pd.DataFrame({'x':[4,5]})))
print Y_pred
Output is :
[ 24.126194  24.126194]
As you can see the input data is simply a straight line. So the output I expect is [40,50]. What am I doing wrong here?

Answer : 

It seems that XGBoost uses regression trees as base learners by default. XGBoost (or gradient boosting in general) works by combining multiple of these base learners. Regression trees cannot extrapolate the patterns in the training data, so any input above 3 or below 1 will not be predicted correctly in your case. Your model is trained to predict outputs for inputs in the interval [1,3]; an input higher than 3 will be given the same output as 3, and an input less than 1 will be given the same output as 1.
Additionally, regression trees do not really see your data as a straight line, as they are non-parametric models, which means they can theoretically fit any shape more complicated than a straight line. Roughly, a regression tree works by assigning your new input data to some of the training data points it has seen during training, and producing the output based on those.
This is in contrast to parametric regressors (like linear regression), which actually look for the best parameters of a hyperplane (a straight line in your case) to fit your data. Linear regression does see your data as a straight line with a slope and an intercept.
You can change the base learner of your XGB model to a GLM (generalised linear model) by adding a booster parameter to your model params :
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
X_train = df.drop('y',axis=1)
Y_train = df['y']
T_train_xgb = xgb.DMatrix(X_train, Y_train)

params = {"objective": "reg:linear", "booster":"gblinear"}
gbm = xgb.train(dtrain=T_train_xgb,params=params)
Y_pred = gbm.predict(xgb.DMatrix(pd.DataFrame({'x':[4,5]})))
print Y_pred
In general, to debug why your XGBoost model is behaving in a particular way, look at the model dump :
gbm.get_dump()
If your base learner is a linear model, the get_dump output is :
['bias:\n4.49469\nweight:\n7.85942\n']
In your code above, since you use tree base learners, the output will be :
['0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=2.85\n\t\t4:leaf=5.85\n\t2:leaf=8.85\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=1.995\n\t\t4:leaf=4.095\n\t2:leaf=6.195\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=1.3965\n\t\t4:leaf=2.8665\n\t2:leaf=4.3365\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.97755\n\t\t4:leaf=2.00655\n\t2:leaf=3.03555\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.684285\n\t\t4:leaf=1.40458\n\t2:leaf=2.12489\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.478999\n\t\t4:leaf=0.983209\n\t2:leaf=1.48742\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.3353\n\t\t4:leaf=0.688247\n\t2:leaf=1.04119\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.23471\n\t\t4:leaf=0.481773\n\t2:leaf=0.728836\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.164297\n\t\t4:leaf=0.337241\n\t2:leaf=0.510185\n',
 '0:[x<2] yes=1,no=2,missing=1\n\t1:leaf=0.115008\n\t2:[x<3] yes=3,no=4,missing=3\n\t\t3:leaf=0.236069\n\t\t4:leaf=0.357129\n']
Tip : I actually prefer to use the xgb.XGBRegressor or xgb.XGBClassifier classes, since they follow the scikit-learn API. And because scikit-learn has so many machine learning algorithm implementations, using XGB as an additional library does not disturb my workflow when I use its scikit-learn interface.
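
For completeness, a minimal sketch of the same model through the scikit-learn style wrapper (which constructor arguments are accepted varies across xgboost versions, so treat the parameters as illustrative):

import pandas as pd
from xgboost import XGBRegressor

df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
model = XGBRegressor(objective="reg:linear")
model.fit(df.drop('y', axis=1), df['y'])
print model.predict(pd.DataFrame({'x':[4,5]}))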

Thursday, December 17, 2015

python print object memory address

obj = "some object"
print hex(id(obj))  # in CPython, id() is the object's memory address

Pandas chained indexing - example

Here we want to set all age > 2 values to NaN.    

import pandas as pd
import numpy as np
df = pd.DataFrame({'age':[1,2,3,4,5], 'income':[10,20,30,40,50]})
#guaranteed to work without warnings/errors  
df.loc[df.age > 2, 'age'] = np.NaN
print df
#won't work : chained indexing assigns to a temporary copy,
#so df itself is left unchanged (SettingWithCopyWarning)
df[df["age"] > 2]["age"] = np.NaN
#will work here, but may still give SettingWithCopyWarning
df["age"][df["age"] > 2] = np.NaN
  

Friday, December 11, 2015

Installing Anaconda and xgboost on Ubuntu

wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.3.0-Linux-x86_64.sh
And after download is finished do:
bash Anaconda-2.3.0-Linux-x86_64.sh

Once Anaconda is installed, install xgboost : 

git clone https://github.com/dmlc/xgboost
cd xgboost; make; cd wrapper; python setup.py install

Thursday, December 10, 2015

Installing xgboost on Windows for Python (While using Anaconda)

1) Download Visual Studio. You can download the community
edition @ https://www.visualstudio.com/en-us/news/vs2013-community-vs.aspx.
There is a "Free Visual Studio" button in the upper right corner.

2) Copy all content from
https://github.com/dmlc/xgboost/tree/master/windows and open the
existing project in Visual Studio.

3) There are a couple of drop-down menus where you need to select
"Release" and "x64", and then select Build --> Build All from the
top menu.

4) If you see the message ========== Build: 3 succeeded, 0 failed, 0
up-to-date, 0 skipped ==========, it is all good.

5) (this step was missing in the source link) Go to xgboost/python-package and run python setup.py install

source : https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13043/run-xgboost-from-windows-and-python

Errors running apt-get on aws ubuntu instance

ML notes : some data sources

Monday, December 7, 2015

ML Notes : Objective function

Objective function = Training Loss + Regularization

Training Loss measures how well the model fits the training data.
Regularization measures the complexity of the model.
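
A toy sketch of this decomposition for ridge-style linear regression (squared-error loss plus an L2 penalty; the names are illustrative):

import numpy as np

def objective(w, X, y, lam):
    training_loss = np.sum((X.dot(w) - y) ** 2)   # fit to training data
    regularization = lam * np.sum(w ** 2)         # model complexity penalty
    return training_loss + regularization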

ML notes : good explanation of gradient descent

ML Notes : Precision vs Recall (true positive rate)

Precision is a counterpart to Recall (true positive rate) in the following sense: while the true positive rate is the proportion of predicted positives among the actual positives, precision is the proportion of actual positives among the predicted positives.
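
A tiny worked example (numbers made up): the classifier predicts 8 positives, of which 6 are actually positive, and there are 10 actual positives in total:

tp, fp, fn = 6, 2, 4
precision = tp / float(tp + fp)  # 6/8  = 0.75 : actual positives among predicted positives
recall = tp / float(tp + fn)     # 6/10 = 0.60 : predicted positives among actual positives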

ML notes : linear regression vs logistic regression

In the linear regression model the dependent (output) variable y is
considered continuous, whereas in logistic regression it is
categorical, i.e., discrete. In application, the former is used in
regression settings while the latter is used for binary classification
or multi-class classification (where it is called multinomial logistic
regression).

http://stats.stackexchange.com/questions/29325/what-is-the-difference-between-linear-regression-and-logistic-regression
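
A minimal contrast with scikit-learn on toy data (purely illustrative):

from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]
LinearRegression().fit(X, [10.0, 20.0, 30.0, 40.0])  # continuous target
LogisticRegression().fit(X, [0, 0, 1, 1])            # categorical target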

ML Notes : false positive, false negative, true positive, true negative

A good way to think of this is to remember that positive/negative
refers to the classifier's prediction, and true/false refers to
whether the prediction is correct or not. So, a false positive is
something that was incorrectly predicted as positive, and therefore an
actual negative (e.g., a ham email misclassified as spam, or a healthy
patient misclassified as having the disease in
question).

Peter Flach

ML Notes : Area Under Curve (AUC) vs Overall Accuracy

http://stats.stackexchange.com/a/69944/73383

The area under the curve (AUC) is equal to the probability that a
classifier will rank a randomly chosen positive instance higher than a
randomly chosen negative example. It measures the classifier's skill in
ranking a set of patterns according to the degree to which they belong
to the positive class, but without actually assigning patterns to
classes.

The overall accuracy also depends on the ability of the classifier to
rank patterns, but also on its ability to select a threshold in the
ranking used to assign patterns to the positive class if above the
threshold and to the negative class if below.

Thus the classifier with the higher AUROC statistic (all things being
equal) is likely to also have a higher overall accuracy as the ranking
of patterns (which AUROC measures) is beneficial to both AUROC and
overall accuracy. However, if one classifier ranks patterns well, but
selects the threshold badly, it can have a high AUROC but a poor
overall accuracy.

Thursday, December 3, 2015

memory being used by httpd

# the count-- in END discards the header line printed by ps
ps -C httpd -O rss | gawk '{ count++; sum += $2 }; END { count--; print "Number of processes =", count; print "Memory usage per process =", sum/1024/count, "MB"; print "Total memory usage =", sum/1024, "MB"; }'

Monday, November 2, 2015

bootstrap notes

1. headings are also available as classes : class="h1".
2. <h1> <small>, <h2> <small> are other combinations.
3. <p class="lead"> for a paragraph which stands out from others.
4. <ul class="list-inline" for inline list
5. class = table, table-striped, table-bordered
6. tr class success,active,info
7. inline form
8. sr-only class for accessibility
9. code tag
10. img responsive class
11. class img-rounded, img-thumbnail, img-circle
12. nav tabs pills stacked justified
13. tabs with dropdowns
14. navbar, breadcrumbs, Glyphicons
15. jumbotron, well, page header
16. badge, label
17. multi color progress bars
18. panel, panel with tables
19. thumbnail
20. pagination
21. list group, button group, button toolbar
22. split button dropdown
23. justified button group
24. alert, radio button, checkbox
25. media object
26. modal, carousel, scrollspy, accordion collapse
27. popover, tooltip
28. www.startbootstrap.com
29. www.bootswatch.com
30. www.blacktie.co
31. bootstrapzero.com
32. bootplus
33. Fbootstrapp
34. Bootmetro
35. Font awesome library
36. Jasny bootstrap
37. fuel ux, bootsnipp
38. bootdey

Wednesday, October 14, 2015

filling leadership circle form

$(":radio[value=]").each(function(){ $(this).prop('checked',true);$(this).parent().addClass('checked'); });$(":submit[value=Continue]").click();

mysql create user

GRANT ALL ON twitter.* TO 'twitter'@'localhost' IDENTIFIED BY 'twitter';
FLUSH PRIVILEGES;

Tuesday, October 13, 2015

Elasticsearch notes

1. autocomplete
2. more like this
3. random journals, nausea appearing before vomiting (keyword ordering using script) etc

Monday, October 12, 2015

elasticsearch notes

1. What is constant score query?
2. Autocomplete feature : suggestion modes -> popular,missing,always
3. CORS didn't work.

Sunday, October 11, 2015

Elasticsearch notes

1. Use marvel like a scratchpad for trying out GET/POST/PUT etc
2. PUT needs an id, POST generates one
3. ES auto generates field type mapping, can be overridden
4. Analyzers : whitespace, camelCase,regexp, Hindi etc
5. boost a field
6. search queries can be very structured, AND, OR of multiple fields
7. index is divided into shards. each shard can have replicas.
8. document routing - which document goes to which shard
9. prefix search, range search
10. filter query : doesn't compute score, hence fast
11. match query : computes score, so slower (see the sketch below)
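
A rough sketch of the two query shapes (ES 1.x DSL, as Python dicts; the index fields here are made up):

# scored full-text query : computes a relevance score for every match
match_query = {
    "query": {"match": {"title": "earthquake"}}
}

# filtered query : a yes/no check, no scoring, cacheable, hence faster
filtered_query = {
    "query": {"filtered": {"filter": {"term": {"status": "published"}}}}
}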



Sunday, May 3, 2015

experiments with twitter nlp

I wanted a good tool for named entity extraction on tweets. I came across this : https://github.com/aritter/twitter_nlp.

Here is how I installed it : 
1. download the zip of the master branch from the repo above
2. unzip master
3. cd master
4. yum install glibc-static
5. sh build.sh

To run it, go to the directory containing python folder. 
1. export TWITTER_NLP=./ 
2. cat test.1k.txt | python python/ner/extractEntities2.py

I made another sample file, test2, with these 2 tweets

I live in Jodhpur.
usgs reports a m0.46 #earthquake 13km nw of jodhpur, rajasthan on 5/1/15 @ 15:50:46 utc http://t.co/mqgsvgnkbo #quake

and then :
 cat test2 | python python/ner/extractEntities2.py --classify --pos --event

Results:
I/O/PRP/O live/O/VBP/B-EVENT in/O/IN/O Jodhpur/B-person/NNP/O ./O/./O
usgs/O/NNP/O reports/O/VBZ/B-EVENT a/O/DT/O m/O/NN/O 0.46/O/HT/O #earthquake/O/HT/O 13km/O/HT/O nw/O/NN/O of/O/IN/O jodhpur/O/NN/O ,/O/,/O rajasthan/O/VBN/O on/O/IN/O 5/1/15/O/CD/O @/O/IN/O 15:50/O/CD/O :/O/:/O 46/O/CD/O utc/O/:/O http://t.co/mqgsvgnkbo/O/URL/O #quake/O/HT/O

So Jodhpur is classified as a person though it should be a location.

I don't know what I am doing wrong here.

Monday, April 13, 2015

most recent files in a directory

find $1 -type f -exec stat --format '%Y :%y %n' "{}" \; | sort -nr | cut -d: -f2- | head

Sunday, April 5, 2015

Beautiful Soup

You didn't write that awful page. You're just trying to get some data
out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround screen
scraping projects.

http://www.crummy.com/software/BeautifulSoup/

Saturday, April 4, 2015

Installing PostGIS on centos 6.5

sudo rpm -ivh http://yum.postgresql.org/9.4/redhat/rhel-6-x86_64/pgdg-centos94-9.4-1.noarch.rpm

yum install postgresql94 postgresql94-server postgresql94-libs postgresql94-contrib postgresql94-devel

sudo yum install postgis2_94

yum install pgrouting_94

yum install php-pgsql

service postgresql-9.4 initdb
----------------------------------------------------------------
connecting :

su postgres

psql -p 5432

SELECT setting FROM pg_settings WHERE name = 'config_file';

postgres=# create database shops_new;

CREATE DATABASE db1;
\l - show databases
\d <table> - describe table
\connect db1 - connect to db1
\dt - show tables


postgres=# \connect db_new;

You are now connected to database "db_new" as user "postgres".

db_new=# create extension postgis;

CREATE EXTENSION

db_new=#

ALTER TABLE your_table ADD COLUMN id BIGSERIAL PRIMARY KEY;


db_new=# ALTER TABLE table1 ADD COLUMN name character varying(128) NOT NULL DEFAULT '';

db_new=# ALTER TABLE table1 ADD COLUMN address character varying(1024) NOT NULL DEFAULT '';

ALTER TABLE table1 ADD COLUMN male int;

SELECT AddGeometryColumn ('','shops','location',4326,'POINT',2);


INSERT INTO shops (name, address, location, male, female, free, rating)
VALUES ('name1', 'address1', ST_GeomFromText('POINT(-71.060316 48.432044)', 4326), 1, 1, 1, 1);
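
The same insert from Python, as a minimal psycopg2 sketch (the connection parameters are placeholders for illustration):

import psycopg2

conn = psycopg2.connect(dbname="db_new", user="postgres")  # placeholder credentials
cur = conn.cursor()
cur.execute(
    "INSERT INTO shops (name, address, location, male, female, free, rating) "
    "VALUES (%s, %s, ST_GeomFromText(%s, 4326), %s, %s, %s, %s)",
    ("name1", "address1", "POINT(-71.060316 48.432044)", 1, 1, 1, 1))
conn.commit()
cur.close()
conn.close()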

http://www.cyberciti.biz/faq/psql-fatal-ident-authentication-failed-for-user/

Monday, March 30, 2015

twitter bulk unfollow

Open "Following" page on Twitter and scroll down to have all following accounts showed in the browser.
Next run the following JavaScript code in your development console:
$('.button-text.unfollow-text').trigger('click');

http://www.quora.com/How-can-you-unfollow-everyone-on-Twitter-at-once
