Online learning doesn't require learning rate configuration?
ceiling analysis  machine learning pipeline etc
Monday, February 27, 2017
Anomaly detection - Andrew Ng
Anomaly detection vs. supervised learning: when anomalous (y = 1)
examples are too few to train a classifier, go for anomaly detection.
Anomaly detection, choosing features: features should have a roughly
Normal distribution. Plot a histogram and see. If not, try log(x),
log(x+c), x^0.5, x^0.2 etc. Try combinations of features: CPU/Network
traffic, CPU^2/Network traffic etc.
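A rough sketch of the "plot a histogram and try a transform" step, using sample skewness as a stand-in for eyeballing the histogram. The data is made up (a seeded log-normal sample standing in for a right-skewed feature), and `skewness` is a helper written for this example, not something from the course.

```python
import math
import random


def skewness(values):
    """Sample skewness: third standardized moment (about 0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical right-skewed feature (made-up data, not from the notes).
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
logged = [math.log(v) for v in raw]   # candidate transform log(x)
rooted = [v ** 0.5 for v in raw]      # candidate transform x^0.5

skew_raw = skewness(raw)
skew_root = skewness(rooted)
skew_log = skewness(logged)
print("skewness raw / x^0.5 / log(x):", skew_raw, skew_root, skew_log)
```

The transform whose skewness is closest to zero is the one whose histogram looks most bell shaped; for real features the histogram is still worth a look.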
Multivariate Normal distribution: let's say memory is unusually high
for a given CPU load. Both values individually have a good enough
probability of occurring, but they sit on opposite sides of their
respective bell curves. Per-feature Gaussians would miss this, so we
would go for the multivariate Normal distribution.
Modelling each feature independently as a Gaussian and multiplying the
densities is the same as a multivariate Gaussian whose axes are
aligned, i.e. all off-diagonal components of the covariance matrix are
zero.
Multivariate captures correlations between features automatically.
Otherwise you have to create those unusual features manually.
But the original model is computationally cheaper and scales to a
large number of features. In MV, you have to do large matrix
operations.
In MV, m > n is required, i.e. the number of examples should be more
than the number of features; otherwise the covariance matrix cannot
be inverted. The original model has no such requirement.
In MV, the covariance matrix (Sigma) should be invertible. It will not
be invertible if there are redundant features, i.e. you have duplicate
or linearly dependent features like x2 = x1 or x3 = x4 + x5 etc.
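A minimal two-feature sketch of the points above, in plain Python. The means, covariances and the test point are made-up numbers; `gaussian_pdf` and `mv_gaussian_pdf_2d` are helper names invented for this example. It checks that the independent model matches the multivariate one when the off-diagonal entries are zero, that a correlated covariance makes the "low CPU, high memory" point far less likely, and that a redundant feature makes the covariance matrix non-invertible.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def mv_gaussian_pdf_2d(x, mu, cov):
    """Density of a 2-D Gaussian; cov is a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix is not invertible")
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^-1 (x - mu) via the closed-form 2x2 inverse.
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


mu = (0.0, 0.0)
point = (-1.5, 1.5)  # (cpu, memory) in standardized units: low CPU, high memory

# Original model: one Gaussian per feature, densities multiplied.
independent = gaussian_pdf(point[0], 0.0, 1.0) * gaussian_pdf(point[1], 0.0, 1.0)
# Multivariate model with zero off-diagonal entries: same answer.
axis_aligned = mv_gaussian_pdf_2d(point, mu, [[1.0, 0.0], [0.0, 1.0]])
# Multivariate model where CPU and memory normally rise together.
correlated = mv_gaussian_pdf_2d(point, mu, [[1.0, 0.9], [0.9, 1.0]])

# Redundant feature (x2 = x1) gives a singular covariance matrix.
try:
    mv_gaussian_pdf_2d(point, mu, [[1.0, 1.0], [1.0, 1.0]])
    singular_rejected = False
except ValueError:
    singular_rejected = True

print(independent, axis_aligned, correlated, singular_rejected)
```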
Wednesday, February 15, 2017
neural network notes
Study back propagation and implement gradient descent.
Implement dropout.
Cross entropy is an alternative to quadratic cost function for faster learning.
Softmax is a different activation(output) function. An alternative to Sigmoid. Sum of outputs is always 1. Hence can be thought of as a probability distribution. In a Sigmoid layer, output activations won't always sum to 1.
2 good combinations in NN are: Softmax + Log likelihood cost & Sigmoid + Cross entropy cost
Usually Softmax + Log Likelihood is good for multi class classification problems.
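A small sketch of the softmax vs. sigmoid point: softmax outputs sum to 1, sigmoid outputs generally do not. The weighted inputs in `z` are made-up numbers and the function names are chosen for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Softmax over a list of weighted inputs (shifted by the max for stability)."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood_cost(probs, target_index):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[target_index])


z = [2.0, 1.0, -1.0]             # hypothetical weighted inputs of an output layer
soft = softmax(z)
sig = [sigmoid(v) for v in z]

print("softmax:", soft, "sum =", sum(soft))
print("sigmoid:", sig, "sum =", sum(sig))
print("cost if class 0 is correct:", log_likelihood_cost(soft, 0))
print("cost if class 2 is correct:", log_likelihood_cost(soft, 2))
```

The log likelihood cost is small when the network puts high probability on the correct class and large otherwise, which is why it pairs naturally with softmax.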
Validation_data vs. test_data
Validation_data for tuning hyper parameters like learning rate
test_data for evaluation
Avoiding overfitting
Best way to avoid overfitting is to have larger training sets.
Regularization is another way to prevent overfitting since it pushes towards smaller weights. Small weights mean small changes in inputs will yield small changes in output. If the weights are large, small changes in input may result in large changes in output. So it's helping the model avoid the effects of noise.
L2 Regularization: add weight^2 (sum of squared weights) to the cost
L1 Regularization: add |weight| (sum of absolute weights) to the cost
You could train multiple Neural networks and do a voting on their results.
Similarly, there is Dropout, in which you randomly remove half the hidden neurons at a time during training; the effect is like averaging over many different networks.
Expand the data set: for images add rotations/scaling/elastic distortions; for speech vary the speed up/down, add noise
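A sketch of two of the techniques listed above: the regularization penalties and dropout. The scaling factors (lam/2n for the squared-weight penalty, lam/n for the absolute-weight penalty) follow one common convention and are an assumption here, as is the "inverted" form of dropout that rescales the surviving activations. All numbers are made up.

```python
import random


def l2_penalty(weights, lam, n):
    """Squared-weight penalty: (lam / 2n) * sum(w^2), added to the cost."""
    return lam / (2 * n) * sum(w * w for w in weights)


def l1_penalty(weights, lam, n):
    """Absolute-weight penalty: (lam / n) * sum(|w|), added to the cost."""
    return lam / n * sum(abs(w) for w in weights)


def dropout(activations, keep_prob=0.5, rng=random):
    """Inverted dropout: zero each activation with probability 1 - keep_prob
    and scale the survivors by 1 / keep_prob so the expected value is unchanged."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


weights = [0.5, -1.5, 2.0]       # hypothetical weights
l2 = l2_penalty(weights, lam=0.1, n=10)
l1 = l1_penalty(weights, lam=0.1, n=10)

rng = random.Random(1)           # fixed seed so the run is repeatable
hidden = [1.0] * 1000            # hypothetical hidden-layer activations
dropped = dropout(hidden, keep_prob=0.5, rng=rng)
zeros = sum(1 for v in dropped if v == 0.0)

print("L2 penalty:", l2, "L1 penalty:", l1)
print("neurons dropped out of 1000:", zeros)
```

Dropout is applied only while training; at evaluation time the full network is used.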
Weight Initialization
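Explore Gaussian initialization with mean 0 and standard deviation 1/sqrt(n_in), where n_in is the number of inputs to the neuron.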
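A sketch of that initialization, assuming the expression above is the usual rule of drawing each weight from a Gaussian with mean 0 and standard deviation 1/sqrt(n_in). The layer sizes and the name `init_weights` are made up for the example.

```python
import math
import random


def init_weights(n_in, n_out, rng):
    """One weight row per output neuron, each weight ~ N(0, 1/n_in)."""
    std = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(42)          # fixed seed so the run is repeatable
n_in = 400                       # hypothetical layer sizes
weights = init_weights(n_in, 30, rng)

flat = [w for row in weights for w in row]
sample_std = math.sqrt(sum(w * w for w in flat) / len(flat))

# Weighted input of each neuron when every input is 1: stays of order 1
# instead of growing like sqrt(n_in), so sigmoid neurons are less likely to saturate.
z_values = [sum(row) for row in weights]

print("target std:", 1.0 / math.sqrt(n_in), "sample std:", sample_std)
print("largest |z| for an all-ones input:", max(abs(z) for z in z_values))
```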
Vanishing Gradient
In a multi-layer NN, learning in the initial layers can vanish or explode: the gradients reaching the early layers may be far smaller or far larger than those in later layers, so the layers learn at very different speeds.
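A toy illustration of why this happens, assuming a chain with one sigmoid neuron per layer: the gradient reaching the first layer contains one factor w * sigmoid'(z) per later layer, and sigmoid'(z) is at most 0.25. The weights and depth are made-up values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(weights, zs):
    """Product of the w * sigmoid'(z) factors along a chain of single-neuron layers."""
    scale = 1.0
    for w, z in zip(weights, zs):
        scale *= w * sigmoid_prime(z)
    return scale


depth = 10
# Moderate weights: each factor is at most 0.25, so the product vanishes.
vanishing = gradient_scale([1.0] * depth, [0.0] * depth)
# Very large weights: each factor is 25, so the product explodes.
exploding = gradient_scale([100.0] * depth, [0.0] * depth)

print("vanishing:", vanishing)
print("exploding:", exploding)
```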
Convolutional networks
- Local receptive fields, stride length
- Shared weights and biases: all neurons in one feature map of a hidden layer have the same weights and bias, so that all of them detect the same feature at different locations. This gives robustness to translation: an image of something shifted slightly to the right or left is still an image of the same thing.
- The map from the input layer to a hidden layer is called a feature map. The shared weights and bias constitute a kernel/filter.
- One input layer can be mapped to multiple feature maps. That enables detection of multiple features.
- Later layers could be pooling layers: map each 2x2 region of a feature map to one neuron/pixel.
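A bare-bones sketch of the pieces above in plain Python: one shared kernel slid over the image with a given stride to build a feature map, followed by 2x2 max pooling. The image and kernel are made-up numbers, the activation function is left out to keep it short, and max pooling is just one possible pooling choice.

```python
def feature_map(image, kernel, bias, stride=1):
    """Slide one shared kernel (local receptive field) over the image."""
    k = len(kernel)
    rows = (len(image) - k) // stride + 1
    cols = (len(image[0]) - k) // stride + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            total = bias
            for i in range(k):
                for j in range(k):
                    # Same weights and bias at every position (r, c).
                    total += kernel[i][j] * image[r * stride + i][c * stride + j]
            row.append(total)
        out.append(row)
    return out


def max_pool_2x2(fmap):
    """Map each non-overlapping 2x2 region to its maximum."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 1, 1],
          [1, 1, 1],
          [1, 1, 1]]             # hypothetical 3x3 filter

fmap = feature_map(image, kernel, bias=0)
pooled = max_pool_2x2(fmap)
print(fmap)
print(pooled)
```

Using several kernels on the same input gives several feature maps, one per feature to detect.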
Recurrent Neural Networks(RNNs)
- Output of a neuron might be determined by its earlier value. Time based. Might fit speech and natural language problems.
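A one-neuron sketch of "output determined by its earlier value": the hidden state at each time step depends on the current input and on the previous state. The tanh activation, the weights and the input sequence are assumptions picked for the example.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """New state from the current input and the neuron's previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0                      # initial state
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# A single pulse followed by silence: the state keeps a decaying memory of it.
states = run_rnn([1.0, 0.0, 0.0])
silent = run_rnn([0.0, 0.0, 0.0])
print(states)
print(silent)
```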
Deep Belief Networks (DBNs)
- Generative: not only recognize digits, but able to produce them as well.
- Able to do unsupervised learning too.
- Restricted Boltzmann machines are a key component of DBNs.
What's going on with NNs
- Playing video games
- NLP
Tuesday, February 14, 2017
Neural network notes - 2
http://neuralnetworksanddeeplearning.com/chap1.html#eqtn3
Feed forward vs Recurrent nets
Feed-forward nets simply pass each layer's output as input to the next layer.
Recurrent nets can give output to a previous layer and that could come back as input after some time to the same layer.
But as of now learning algorithms for recurrent nets are not as powerful as those for feed-forward nets.
Weights and biases/Gradient Descent
Neural networks notes 1
http://neuralnetworksanddeeplearning.com/chap1.html
Perceptrons can compute anything since they can implement NAND gates.
A NN is a network of perceptrons which can adjust weights and biases,
hence better than a conventionally laid out circuit. Their inputs and
outputs are 0/1. Output = 1 if w.x + b > 0, else 0, where w.x is the
dot product of weights and inputs and b is the bias. The bias is the
negative of the threshold.
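A quick check of the NAND claim, using the usual convention that a perceptron outputs 1 when w.x + b > 0 and 0 otherwise. The weights (-2, -2) and bias 3 are one choice that works; XOR is then built from NAND gates only, to show that more complex logic can be composed.

```python
def perceptron(inputs, weights, bias):
    """Output 1 if w.x + b > 0, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0


def nand(a, b):
    return perceptron((a, b), (-2, -2), 3)


def xor(a, b):
    """XOR wired from four NAND perceptrons."""
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "NAND:", nand(a, b), "XOR:", xor(a, b))
```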
But inputs to/outputs from Sigmoid neurons can be 0.683, i.e. anything
between 0 and 1. Output activation function = 1/(1 + e^(-z)) where z =
w.x + b. If we plot it, it's a smoothed version of the step function
used by the Perceptron, which gives it the property that small changes
in inputs result in small changes in output, unlike the Perceptron.
This property is helpful in tuning a NN; otherwise small changes in
inputs would result in significant changes down the line.
Still, Sigmoids and Perceptrons are similar in the sense that for
large positive z the output is close to 1 and for very negative z the
output is close to 0.
Essentially, Δoutput is approximately a linear function of the changes
Δwj and Δb in the weights and bias.
We can use other activation functions too, but σ(z) ≡ 1/(1+e^(-z)) is
popular since the exponential has nice differential properties.
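A numeric check of the "Δoutput is roughly linear in Δw and Δb" statement for a single sigmoid neuron. The inputs, weights and the small changes are made-up numbers; the partial derivatives used are d(output)/d(w_j) = σ'(z) x_j and d(output)/d(b) = σ'(z), with σ'(z) = σ(z)(1 - σ(z)).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(w, x, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


x = (0.5, 0.8)                   # hypothetical inputs
w = (0.4, -0.6)                  # hypothetical weights
b = 0.1
dw = (0.01, -0.02)               # small changes to the weights
db = 0.005                       # small change to the bias

out = neuron_output(w, x, b)
new_w = tuple(wi + dwi for wi, dwi in zip(w, dw))
actual_change = neuron_output(new_w, x, b + db) - out

# Linear estimate: sum_j (d out / d w_j) * dw_j + (d out / d b) * db
slope = out * (1.0 - out)        # sigmoid'(z)
linear_estimate = slope * (sum(dwi * xi for dwi, xi in zip(dw, x)) + db)

print("actual change:  ", actual_change)
print("linear estimate:", linear_estimate)
```

With a step function instead of the sigmoid, the same small change would either do nothing or flip the output completely.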
letsencrypt interface not found
rm -rf /root/.local/share/letsencrypt
wget https://raw.githubusercontent.com/letsencrypt/letsencrypt/master/letsencrypt-auto
chmod +x letsencrypt-auto
./letsencrypt-auto --debug renew
exponential function and e
d/dx(e^x) = e^x, i.e. the rate of change at any x is e^x
integral of 1/x = ln(x), i.e. the area under the curve 1/x from x = 1
to x = k is ln(k); for k = e, the area is 1.
slope at x=0 of e^x is 1
compound interest: (1 + 1/n)^n approaches e as n approaches infinity
e = sum over n >= 0 of 1/n!
e^x = sum over n >= 0 of x^n/n!
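The facts above can be checked numerically. This is only a sketch: the step counts, the number of series terms and the helper names are arbitrary choices made for the example.

```python
import math


def compound(n):
    """(1 + 1/n)^n, which approaches e as n grows."""
    return (1.0 + 1.0 / n) ** n


def exp_series(x, terms=30):
    """Partial sum of x^n / n! for n = 0 .. terms - 1."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= x / (n + 1)
    return total


def area_under_reciprocal(k, steps=10000):
    """Midpoint-rule area under 1/x from x = 1 to x = k."""
    h = (k - 1.0) / steps
    return sum(h / (1.0 + (i + 0.5) * h) for i in range(steps))


def slope_of_exp(x, h=1e-6):
    """Central-difference estimate of the slope of e^x at x."""
    return (math.exp(x + h) - math.exp(x - h)) / (2.0 * h)


print("compound interest limit:", compound(10 ** 6))
print("series for e:           ", exp_series(1.0))
print("series for e^2:         ", exp_series(2.0), "vs", math.exp(2.0))
print("area under 1/x, 1 to e: ", area_under_reciprocal(math.e))
print("slope of e^x at x = 0:  ", slope_of_exp(0.0))
```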
Monday, February 13, 2017
Sunday, February 12, 2017
Numpy vs Tensorflow Matrix multiplication
Numpy is much, much faster in this small test (Note: used the CPU version of TensorFlow, not the GPU version). The TensorFlow timing includes session start-up and one sess.run call per multiplication.

import time

import numpy
import tensorflow as tf


def getTestData():
    A = [[1., 2., 3., 4.], [3., 4., 5., 6.], [7., 8., 9., 10.], [11., 12., 13., 14.]]
    return 6, A


def tfMatMul():
    n, A = getTestData()
    A = tf.constant(A)
    sess = tf.Session()  # TensorFlow 1.x session API
    for num in range(1, n):
        A = tf.matmul(A, A)
        output = sess.run(A)
        A = tf.convert_to_tensor(output)
    sess.close()
    return output


def numPyMatMul():
    n, A = getTestData()
    for num in range(1, n):
        A = numpy.matmul(A, A)
    return A


def timedRun(methodToRun):
    start = time.time()
    result = methodToRun()
    end = time.time()
    diff = end - start
    print("Time Taken :" + str(diff))
    print(result)


timedRun(numPyMatMul)
timedRun(tfMatMul)
Saturday, February 11, 2017
conda commands
conda create --name test35 python=3.5
conda info --envs (list)
activate test35
deactivate test35
conda remove --name test35 --all
conda install -y scipy
PyCharm with Anaconda - Using a specific environment
Search for "Project Interpreter"
Click on Settings Icon(Wheel)
Add Local
Choose python.exe in your env, path is like Anaconda3\envs\<env_name>\python.exe
Thursday, February 9, 2017
scala notes
https://www.safaribooksonline.com/library/view/scalaforthe/9780134510613/
for yield => map
for yield with guard clause => filter
reduceLeft
closures: how are they implemented? In Scala, they are objects which capture the method and the bindings of free variables.
Expression evaluation (Recursive data structures)

abstract class Expr
case class Num(value: Int) extends Expr
case class Sum(left: Expr, right: Expr) extends Expr
case class Product(left: Expr, right: Expr) extends Expr

val e = Product(Num(3), Sum(Num(4), Num(5)))

def eval(e: Expr): Int = e match {
  case Num(v) => v
  case Sum(l, r) => eval(l) + eval(r)
  case Product(l, r) => eval(l) * eval(r)
}

eval(e)  // 3 * (4 + 5) = 27
Expression evaluation (Recursive data structures) - OOP version

abstract class Expr {
  def eval: Int
}
class Num(val data: Int) extends Expr {
  def eval: Int = data
}
class Product(val left: Expr, val right: Expr) extends Expr {
  def eval: Int = left.eval * right.eval
}
class Sum(val left: Expr, val right: Expr) extends Expr {
  def eval: Int = left.eval + right.eval
}

val e = new Product(new Num(3), new Sum(new Num(4), new Num(5)))
e.eval

So what to use, the polymorphism version or case classes?
Use case classes when your cases are bounded, like here: there is a finite set of expressions.
Use polymorphism when the set of cases is open-ended and new kinds may be added later.
Wednesday, February 8, 2017
scala notes
https://www.safaribooksonline.com/library/view/scalaforthe/9780134510613/
packages => nesting can be all at one place, no need to have similar source directories
imports are flexible, can import specific classes, can hide a class, can alias, can import anywhere (lexical scoping)
Traits (like Java interfaces)
But much more powerful
Traits cannot have construction parameters, otherwise they are same as classes.
traits can be mixed into individual objects at construction, not just in class declarations
traits can invoke others in a prior layer (ConsoleLogger, TimestampLogger, ShortLogger)
Thursday, February 2, 2017
Changing the port for react app (create-react-app)
Built with create-react-app
Edit node_modules/react-scripts/scripts/start.js
Search for DEFAULT_PORT and modify as follows:
var DEFAULT_PORT = process.env.PORT || 80;