Software Troubles and Troubleshooting: January 2019

Tuesday, January 29, 2019

c# matching multiple word boundaries in a single string - regex OR pattern

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class MainClass {
    public static void Main(string[] args) {
        var input = "my bull dog is missing cat";
        var topics = new List<string> { "cat", "bull dog" };

        // Wrap each topic in word boundaries and join them with | (regex OR).
        // Regex.Escape keeps topics that contain regex metacharacters literal.
        string pattern = string.Join("|", topics.Select(x => @"\b" + Regex.Escape(x) + @"\b"));

        var regex = new Regex(pattern, RegexOptions.IgnoreCase);

        // Print the start index of every match in the input.
        foreach (Match match in regex.Matches(input)) {
            Console.WriteLine(match.Index);
        }
    }
}

Sunday, January 27, 2019

installing xgboost python windows anaconda

conda install -c anaconda py-xgboost

Tuesday, January 8, 2019

datetime and co-ordinate features

For datetime features, new features can be generated by applying periodicity (e.g. day of week, month, hour), calculating the time passed since a particular event, and taking the difference between two datetime features.
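A minimal sketch of these three kinds of datetime features, using only the Python standard library. The function name `datetime_features`, the feature names, and the sample dates are assumptions made for illustration.

```python
import math
from datetime import datetime

def datetime_features(ts, event, other):
    """Build a few datetime features for a single timestamp."""
    hour_angle = 2 * math.pi * ts.hour / 24
    return {
        # Periodicity: calendar parts plus a cyclic encoding of the hour
        "day_of_week": ts.weekday(),  # Monday = 0
        "month": ts.month,
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        # Time passed since a particular event, in whole days
        "days_since_event": (ts - event).days,
        # Difference between two datetime features, in whole days
        "days_between": (other - ts).days,
    }

# Hypothetical sample values
features = datetime_features(
    ts=datetime(2019, 1, 8, 6, 0),
    event=datetime(2019, 1, 1),
    other=datetime(2019, 1, 29),
)
print(features)
```

The sine/cosine pair is one common way to keep hour 23 close to hour 0; plain integer parts are often enough for tree-based models.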

For coordinates, useful features include: extracting interesting points from the train/test data, using places from additional data, calculating distances to the centers of clusters, and adding aggregated statistics for the surrounding area.
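As a rough illustration of the "distance to cluster centers" idea, the sketch below computes great-circle (haversine) distances from a point to a set of centers. The centers, the point, and the helper names are made up for the example; in practice the centers might come from a clustering step or from an external list of places.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distance_features(point, centers):
    """Distance to every center, plus the name and distance of the nearest one."""
    dists = {name: haversine_km(*point, *c) for name, c in centers.items()}
    nearest = min(dists, key=dists.get)
    feats = {f"dist_to_{name}_km": d for name, d in dists.items()}
    feats["nearest_center"] = nearest
    feats["dist_to_nearest_km"] = dists[nearest]
    return feats

# Hypothetical cluster centers and query point, as (lat, lon)
centers = {"a": (0.0, 0.0), "b": (0.0, 1.0)}
feats = distance_features((0.0, 0.1), centers)
print(feats)
```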

Categorical and ordinal features

First, an ordinal feature is a special case of a categorical feature whose values are sorted in some meaningful order.
- e.g. 1st class, 2nd class in railways.

Second, label encoding replaces the unique values of a categorical feature with numbers,
- either by sorting them alphabetically or by assigning codes in order of appearance.

Third, frequency encoding maps unique values to their frequencies.
- e.g. how often 1st class occurs.

Fourth, label encoding and frequency encoding are often used for tree-based methods.

Fifth, one-hot encoding is often used for non-tree-based methods.

And finally, applying one-hot encoding to combinations of categorical features allows non-tree-based models to take interactions between features into account, and so improve.
- e.g. in the Titanic dataset, you could create a new categorical feature by combining sex and pclass.

If pclass = 1,2,3 and sex = M,F
then features could be:
1M, 1F, 2M, 2F, 3M, 3F and we could use one-hot encoding here.
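The encodings above can be sketched in plain Python as follows. The six sample rows and the helper names are invented for illustration (they are not taken from the real Titanic data). pandas and scikit-learn ship their own encoders, which would normally be used instead.

```python
from collections import Counter

# Hypothetical sample rows
pclass = [3, 1, 3, 2, 1, 3]
sex = ["M", "F", "F", "M", "M", "M"]

def label_encode(values):
    """Map each unique value to an integer code, in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

def frequency_encode(values):
    """Map each value to the fraction of rows in which it occurs."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

sex_labels = label_encode(sex)          # F -> 0, M -> 1
pclass_freq = frequency_encode(pclass)  # 3 occurs in half of the rows -> 0.5

# Interaction feature: combine pclass and sex, then one-hot encode it
combined = [f"{p}{s}" for p, s in zip(pclass, sex)]
categories, rows = one_hot(combined)
print(categories)
print(rows)
```

Only combinations that actually occur in the data get a column here, so this small sample yields five columns rather than all six.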

One-hot encodings can be stored as sparse matrices (which use storage efficiently when the number of non-zero values is less than half of the total number of values).
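To see why sparse storage pays off for one-hot data, here is a toy coordinate-list representation that keeps only the positions of the non-zero entries. The `to_sparse` helper and the sample matrix are illustrative assumptions; a real project would more likely use a library such as `scipy.sparse`.

```python
def to_sparse(rows):
    """Keep only the (row, column) positions of non-zero entries."""
    return [(i, j) for i, row in enumerate(rows) for j, v in enumerate(row) if v != 0]

# Hypothetical one-hot matrix: exactly one 1 per row
dense = [
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
]

coords = to_sparse(dense)
total = sum(len(r) for r in dense)
density = len(coords) / total  # fraction of non-zero values
print(coords, density)
```

With k categories, a one-hot row has a single non-zero value out of k, so the density drops as the number of categories grows.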

Sunday, January 6, 2019

Numeric features - Competitive DS course

Preprocessing:
1. Tree-based models: scaling numeric features doesn't matter, since a tree only looks for split points on each feature, and those don't depend on the scale.

2. Non-tree-based models: numeric feature preprocessing matters. E.g. in kNN, scaling one feature alone changes the distances, and hence the predictions, completely. The same goes for NNs and linear models.
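A tiny sketch of the kNN point: with the made-up points below, multiplying only the first feature by 10 changes which training point is nearest to the query. The data and the `nearest` helper are assumptions for illustration.

```python
import math

def nearest(query, points):
    """Index of the point closest to query (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))

# Hypothetical training points and query, as (feature_1, feature_2)
points = [(1.0, 10.0), (3.0, 1.0)]
query = (1.5, 0.0)

before = nearest(query, points)

# Rescale only the first feature by a factor of 10
scaled_points = [(x * 10, y) for x, y in points]
scaled_query = (query[0] * 10, query[1])

after = nearest(scaled_query, scaled_points)
print(before, after)
```

A tree-based model would be unaffected by the same rescaling, because the ordering of values within each feature stays the same.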

Most often used preprocessings:
MinMaxScaler => (value - min) / (max - min)
StandardScaler => (value - mean) / std
Rank => sets spaces between sorted values to be equal (handles outliers well)
log(1 + x), sqrt(1 + x) => compress large values
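The transforms listed above, written out in plain Python as a sketch. The function names are my own, the rank transform ignores ties for simplicity, and the standard scaling uses the population standard deviation; scikit-learn's `MinMaxScaler` and `StandardScaler` are the usual tools in practice.

```python
import math
from statistics import mean, pstdev

def min_max_scale(xs):
    """(value - min) / (max - min): maps values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standard_scale(xs):
    """(value - mean) / std: zero mean, unit variance."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def rank_transform(xs):
    """Replace each value by its position in sorted order (ties not handled)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks

def log_transform(xs):
    """log(1 + x): pulls large values closer together."""
    return [math.log1p(x) for x in xs]

# Hypothetical feature with one outlier
values = [1.0, 2.0, 3.0, 1000.0]
print(min_max_scale(values))   # the outlier squeezes the rest near 0
print(rank_transform(values))  # the outlier is just "the largest"
print(log_transform(values))
```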

Feature generation:
Fraction of price => for product pricing, the fractional part of the price can carry signal of its own, so it becomes a new feature, e.g. .49 in 2.49.

Social media post interval => humans won't post at regular intervals of exactly 1 second, so the intervals between posts can help separate automated accounts from humans.
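Both generated features can be sketched in a few lines. The helper names, the sample prices, and the timestamps (in seconds) are invented for the example; rounding to two decimals assumes prices are quoted in cents.

```python
import math
from statistics import pstdev

def fractional_part(price):
    """Fractional part of a price, e.g. 0.49 for 2.49 (rounded to cents)."""
    return round(price - math.floor(price), 2)

def interval_features(timestamps):
    """Mean and spread of the gaps between consecutive posts."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return {"mean_gap": sum(gaps) / len(gaps), "gap_std": pstdev(gaps)}

# Hypothetical prices
fractions = [fractional_part(p) for p in [2.49, 10.0, 0.99]]

# Hypothetical post times in seconds
bot_like = interval_features([0, 1, 2, 3, 4])
human_like = interval_features([0, 40, 45, 300, 1200])
print(fractions, bot_like, human_like)
```

A gap spread of zero means perfectly regular posting, which is the pattern a human is unlikely to produce.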