CS182/282A Spring 2023, published 1/23/23

Author: Homing So

Standard Optimization-based paradigm for supervised learning


Ingredients

  • Training Data $\left( i = 1, \ldots, n \right)$
  • $X_i$: inputs, covariates
  • $Y_i$: outputs, labels
  • Model $f_\theta\left( \cdot \right)$, $\theta \Leftarrow$ parameters

Training via Empirical Risk Minimization

$$\hat{\theta} = \mathop{\arg\min}\limits_{\theta} \frac{1}{n} \sum\limits_{i=1}^n l_{train}\left( y_i, f_\theta\left( x_i \right) \right)$$

  • choose a $\hat{\theta}$ that we can learn from the data and that minimizes something
  • $l\left( y, \hat{y} \right)$ returns a real number: $l$ is a loss that compares $y$ to some prediction $\hat{y}$ and always returns a real number (a measure of the difference, even when the training data are vectors rather than scalars) that we can minimize
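
As a concrete sketch of this setup (not from the lecture; it assumes a linear model with a squared-error training loss and a crude gradient-descent minimizer, with illustrative names):

```python
import numpy as np

# Minimal ERM sketch: a linear model f_theta(x) = theta . x with a
# squared-error training loss, minimized by plain gradient descent.

def f(theta, X):
    """Model predictions f_theta(x_i) for every row x_i of X."""
    return X @ theta

def l_train(y, y_hat):
    """Surrogate training loss: squared error, one real number per example."""
    return (y - y_hat) ** 2

def empirical_risk(theta, X, y):
    """(1/n) * sum_i l_train(y_i, f_theta(x_i))"""
    return np.mean(l_train(y, f(theta, X)))

# Toy data: n = 100 examples, d = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Crude minimizer: gradient descent on the empirical risk.
theta_hat = np.zeros(3)
for _ in range(500):
    grad = -2 * X.T @ (y - X @ theta_hat) / len(y)   # d(empirical risk)/d(theta)
    theta_hat = theta_hat - 0.1 * grad

print(theta_hat, empirical_risk(theta_hat, X, y))
```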

True Goal is Real World Performance on unseen $X$

mathematical proxy:

  • $\exists\, P\left( X, Y \right)$: assume a probability distribution
  • want a low $E_{X, Y}$ (expectation over $X$ and $Y$) of the loss $l\left( Y, f_\theta\left( X \right) \right)$

Complication

  1. We have no access to $P\left( X, Y \right)$

We want to do well on average on data we haven't seen. We assume that such an average makes sense, i.e. that there is some underlying distribution, but we don't know it.

Solution:

  • $\left( X_{test, i}, Y_{test, i} \right)_{i=1}^{n_{test}}$: collect a test set of held-back data
  • $\text{Test Error} = \frac{1}{n_{test}} \sum\limits_{i=1}^{n_{test}} l\left( y_{test, i}, f_{\theta}\left( x_{test, i} \right) \right)$

We collect this test set so that it is a somewhat faithful representation of what we expect to see in the real world, and we hope that the real world behaves the way probability distributions do, so that averaging and sampling give us some predictive power about what will actually happen.
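
A minimal sketch of the held-back test set in code (assuming the same kind of linear model as above; names and data are illustrative):

```python
import numpy as np

# Sketch of the held-back test set idea: split the data once, fit on the
# training part, and report the average loss on data the optimizer never saw.

def test_error(theta, X_test, y_test, loss):
    """(1/n_test) * sum_i loss(y_test_i, f_theta(x_test_i)) for a linear model."""
    return np.mean(loss(y_test, X_test @ theta))

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=120)

# Hold back the last 20 examples; never touch them during training.
X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

theta_hat = np.linalg.lstsq(X_train, y_train, rcond=None)[0]  # fit on training data only
print(test_error(theta_hat, X_test, y_test, lambda y, yh: (y - yh) ** 2))
```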

The attitude is: I don't know how to evaluate the true expected loss, but I do know how to compute this test error, so I'll just do that.

  2. The loss $l_{true}\left( \cdot, \cdot \right)$ that we care about is incompatible with our optimizer

Computing this arg min requires some algorithm to go and search for the minimizing $\theta$. That algorithm has to do the work, and it will only work if certain conditions hold. It might be that the loss you care about doesn't let it do what it needs to do.

You may actually care about some loss that's not differentiable, because it's what's practically relevant for your problem. But your minimizer is going to be using derivatives, and so it can't work with that loss.

Solution:

  • $l_{train}\left( \cdot, \cdot \right)$: use a surrogate loss that we can work with.

Classic Example:

  1. $y \in \lbrace \text{cat}, \text{dog} \rbrace$, $l_{true}$: Hamming loss

  2. $y \rightarrow \mathbf{R}$, where training data are mapped to:

     $$\left\{ \begin{array}{lr} \text{cat} \rightarrow -1 & \\ \text{dog} \rightarrow +1 & \end{array} \right.$$

  3. $l_{train}$: squared error

When we're evaluating test error, we use $l_{true}$.

The purpose of evaluating test error is to get a sense of how well you might do on real-world data. It is an evaluation of a specific model that has already been optimized; no optimization is going to happen on the test error.
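
A small sketch of this example, assuming the $\pm 1$ encoding above (function names are illustrative):

```python
import numpy as np

# Surrogate-loss sketch: train with squared error on labels mapped to {-1, +1},
# but evaluate with the 0/1 (Hamming) loss we actually care about.

labels = {"cat": -1.0, "dog": +1.0}

def l_train(y, y_hat):
    """Differentiable surrogate: squared error on the +/-1 encoding."""
    return (y - y_hat) ** 2

def l_true(y, y_hat):
    """Hamming / 0-1 loss: 1 if the sign of the prediction is wrong, else 0."""
    return float(np.sign(y_hat) != np.sign(y))

# A prediction of +0.3 for a "dog" (+1) example:
y, y_hat = labels["dog"], 0.3
print(l_train(y, y_hat))  # 0.49 -> what the optimizer minimizes
print(l_true(y, y_hat))   # 0.0  -> what we report at test time
```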

Examples You should know:

  • binary classification: logistic loss, hinge loss
  • multi-class classification: cross-entropy loss
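
For reference, here is one common way these losses are written out (conventions vary; this sketch assumes labels in $\{-1, +1\}$ for the binary losses and a one-hot target for cross-entropy):

```python
import numpy as np

def logistic_loss(y, score):
    """log(1 + exp(-y * score)) for y in {-1, +1}."""
    return np.log1p(np.exp(-y * score))

def hinge_loss(y, score):
    """max(0, 1 - y * score) for y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * score)

def cross_entropy(y_onehot, logits):
    """-sum_k y_k * log softmax(logits)_k for multi-class problems."""
    logits = logits - np.max(logits)                    # for numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -np.sum(y_onehot * log_probs)

print(logistic_loss(+1, 2.0), hinge_loss(+1, 2.0))
print(cross_entropy(np.array([0, 1, 0]), np.array([0.1, 2.0, -1.0])))
```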

Aside:

$\frac{1}{n} \sum\limits_{i=1}^{n} l_{true}\left( y_{i}, f_{\hat{\theta}}\left( x_{i} \right) \right)$, evaluated on the training set, is different from $\frac{1}{n} \sum\limits_{i=1}^n l_{train}\left( y_i, f_{\theta}\left( x_{i} \right) \right)$.

Practically speaking, for everyone working on these things, this object is a debugging tool.

We want to use it to understand whether optimizing our training loss is doing anything reasonable with respect to the thing we actually care about, and to see how well we are actually doing. If there is a grotesque mismatch between the objective you told the optimizer to minimize and your performance on the thing you were actually moving towards, then maybe something is wrong.

$$\frac{1}{n_{test}} \sum\limits_{i=1}^{n_{test}} l\left( y_{test,i}, f_{\theta}\left( x_{test,i} \right) \right)$$

You want this to be a faithful measurement of how things might work in practice. But if you look at it and say "oh wait, I should have changed this", and then go back and say "let me look at this again", you might be running an optimization loop with you as the optimizer, one that is actually looking at the held-back data, so this data isn't being held back anymore. And because it isn't being held back, you might not trust how well things will work in practice. (This is one perspective on the phenomenon of overfitting.)

No such care is needed for $\frac{1}{n} \sum\limits_{i=1}^{n} l_{train}\left( y_{i}, f_{\theta}\left( x_{i} \right) \right)$, because you're already using this data to evaluate how well you're doing, in the sense that your optimization algorithm is looking at it all the time. So whether you choose to take other views of it is cost-free.

  3. You run your optimizer with your surrogate loss and you get "crazy" values for $\hat{\theta}$ out of the optimizer, and/or you get really bad test performance. (Another perspective on the phenomenon of overfitting.)

Solution:

  • Add an explicit regularizer during training: $\hat{\theta} = \mathop{\arg\min}\limits_{\theta} \left( \frac{1}{n} \sum\limits_{i=1}^{n} l_{train}\left( y_{i}, f_{\theta}\left( x_{i} \right) \right) + R_{\lambda}\left( \theta \right) \right)$, e.g. ridge regression: $R\left( \theta \right) = \lambda \| \theta \|^2$
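
A small sketch of this regularized objective for a linear model, using the closed-form ridge solution (illustrative names, not the lecture's code):

```python
import numpy as np

# Regularized ERM for a linear model with the ridge penalty
# R(theta) = lam * ||theta||^2, solved in closed form.

def ridge_fit(X, y, lam):
    """argmin_theta (1/n) * ||y - X theta||^2 + lam * ||theta||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 2.0]) + 0.1 * rng.normal(size=50)

print(ridge_fit(X, y, lam=0.0))   # unregularized least squares
print(ridge_fit(X, y, lam=1.0))   # heavier penalty shrinks theta toward 0
```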

Notice: we added another parameter $\lambda$. How do we choose it?

Naive treatment of hyperparameters: $\hat{\theta}, \hat{\lambda} = \mathop{\arg\min}\limits_{\theta,\ \lambda \geq 0} \left( \cdots \right)$ over the same objective does not work (for the ridge penalty, the optimizer would simply drive $\lambda$ to $0$).

  • Split parameters into "normal parameters $\theta$" and "hyperparameters $\lambda$"

"Hyper parameter is a parameter that if you let the optimizer just work with it, it would go crazy, so you have to segregate it out"

Hold out additional data (a validation set), and use that to optimize the hyperparameters.

When you do hyperparameter optimization using the validation set, you might use a different kind of optimizer than the one you use for finding your normal parameters. In the context of deep learning, that inner optimization is almost always some variation of gradient descent. But for hyperparameter search you might do a brute-force grid search, or searches based on ideas related to multi-armed bandits or other zeroth-order optimization techniques; for some hyperparameter searches you can also use gradient-based approaches when they apply.
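
A minimal sketch of brute-force grid search over $\lambda$ using a held-out validation set (the `ridge_fit` helper is the hypothetical one from the regularization sketch above):

```python
import numpy as np

# The inner optimizer fits theta for a fixed lambda; an outer grid search over
# lambda picks the value with the lowest validation error.

def ridge_fit(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 2.0]) + 0.5 * rng.normal(size=150)

X_train, y_train = X[:100], y[:100]   # used by the inner optimizer
X_val, y_val = X[100:], y[100:]       # held out, used only to pick lambda

best_lam, best_err = None, np.inf
for lam in [0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0]:        # grid of hyperparameters
    theta = ridge_fit(X_train, y_train, lam)           # normal parameters
    val_err = np.mean((y_val - X_val @ theta) ** 2)    # validation error
    if val_err < best_err:
        best_lam, best_err = lam, val_err

print(best_lam, best_err)
```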

Alternative solution:

  • Simplify model: "Reduce model order"

Further Complication

  • The optimizer might have its own parameters, e.g. the learning rate

Generally, optimizers might have their own tunable knobs. And in practice, as someone trying to do deep learning, you're also going to have a discrete choice of which optimizer to use.

So you see two subtly different perspectives here.

  • Most basic/root optimizer approach: Gradient Descent

Gradient descent is an iterative optimization approach where you make improvements, and you make them locally.

  • Idea: change the parameters a little bit at a time.

All you care about is how your loss behaves in the neighborhood of the parameters you're currently at, so you look at the local neighborhood of the loss around that point.

$$\theta_{t+1} = \theta_{t} + \eta\left( -\nabla_{\theta} L_{train, \theta} \right), \qquad L_{train, \theta} = \frac{1}{n} \sum\limits_{i=1}^{n} l_{train}\left( y_{i}, f_{\theta}\left( x_{i} \right) \right) + R\left( \cdot \right)$$

This is a Discrete-time Dynamic System

$\eta \leftarrow$ "step size"/"learning rate"; this $\eta$ controls the stability of the system

$\eta$ too large: the dynamics go unstable (it oscillates)

$\eta$ too small: it takes too long to converge
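
A tiny illustration of this step-size effect, treating the update as a discrete-time dynamical system on the toy loss $L(\theta) = \theta^2$ (this example is mine, not the lecture's):

```python
# Gradient descent theta_{t+1} = theta_t - eta * dL/dtheta on L(theta) = theta^2,
# whose gradient is 2 * theta. Steps with eta near or above 1 overshoot the minimum.

def run_gd(eta, steps=10, theta0=1.0):
    theta = theta0
    trajectory = [theta]
    for _ in range(steps):
        grad = 2 * theta              # dL/dtheta for L(theta) = theta^2
        theta = theta - eta * grad    # one step of the dynamical system
        trajectory.append(theta)
    return trajectory

print(run_gd(eta=0.1))   # too small: converges, but slowly
print(run_gd(eta=0.9))   # large: oscillates in sign while (barely) shrinking
print(run_gd(eta=1.1))   # too large: oscillates and blows up
```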

$$L_{train}\left( \theta_{t} + \Delta\theta \right) \approx L_{train}\left( \theta_{t} \right) + \underbrace{\left. \frac{\partial}{\partial\theta} L_{train} \right|_{\theta_{t}}}_{\text{"row"}} \Delta\theta$$

The transpose of this "row" is called the gradient.
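
As a quick numerical sanity check of this first-order expansion (toy data and illustrative names):

```python
import numpy as np

# Compare L(theta_t + delta) with L(theta_t) + gradient(theta_t) . delta for
# L(theta) = mean((y - X theta)^2); the gradient is the transpose of the "row"
# of partial derivatives.

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

def L(theta):
    return np.mean((y - X @ theta) ** 2)

def grad_L(theta):
    return -2 * X.T @ (y - X @ theta) / len(y)

theta_t = rng.normal(size=3)
delta = 1e-3 * rng.normal(size=3)

exact = L(theta_t + delta)
first_order = L(theta_t) + grad_L(theta_t) @ delta   # row derivative times delta
print(exact, first_order)   # agree up to O(||delta||^2)
```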