Simple illustration of training vs validation

Hello,

Scilab allows us to write compact code to illustrate very important notions in machine learning. For example, here is how we teach our UTC undergraduate students the interplay between training and validation with simple polynomial regression.

First, download the two vectors \mathbf{t},\mathbf{y} and plot the data points (t_i,y_i)_{i=1\dots n}:

// download the dataset and load the two column vectors t and y
http_get("https://moodle.utc.fr/mod/resource/view.php?id=109121","data1.sod",follow=%t)
load("data1.sod")
plot(t,y,'^');

To fit a polynomial P(t)=\sum_{k=0}^{p} \theta_{k+1}t^k to this data in the least-squares sense, we first have to construct a matrix \mathbf{A} such that the i-th residual r_i=P(t_i)-y_i is the i-th component of the vector \mathbf{r}=\mathbf{A}\mathbf{\Theta}-\mathbf{y}, where \mathbf{\Theta}=(\theta_1,\dots,\theta_{p+1}). In other words, row i of \mathbf{A} is (1,t_i,\dots,t_i^p), i.e. \mathbf{A} is a Vandermonde matrix. For example, let’s do this for p=2:

A = [];
for k = 0:2
    A = [A, t.^k];  // append the column (t_i^k)_{i=1..n}
end

Then, finding \mathbf{\Theta} such that the residual sum of squares \Vert\mathbf{r}\Vert^2 = \Vert\mathbf{A}\mathbf{\Theta}-\mathbf{y}\Vert^2 is minimal can be done in several ways (see Linear least squares - Wikipedia); Scilab does it with a numerically stable computation (see QR decomposition - Wikipedia):

θ = A\y;

Usually the left division operator is used to compute the solution of square systems of equations, when A is square and invertible, but in our case it computes a solution θ in the least-squares sense, i.e. such that norm(A*θ-y)^2 is minimal.
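For the curious, here is a small check (my own addition, not needed in practice) that the backslash solution coincides with the textbook least-squares formulas:

// Purely illustrative: the backslash solution matches the normal-equations
// and pseudo-inverse solutions, but A\y is preferred because it avoids
// forming A'*A explicitly.
θ_ne = (A'*A) \ (A'*y);   // normal equations
θ_pi = pinv(A) * y;       // Moore-Penrose pseudo-inverse
disp([norm(θ - θ_ne), norm(θ - θ_pi)]);  // both should be tiny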
It is then straightforward to compute the values of the polynomial at (t_i)_{i=1\dots n} as they are given by the product \mathbf{A\Theta}, and plot its graph:

plot(t,A*θ,"r")

[Figure data2: the data points with the fitted degree-2 polynomial]
The polynomial does not fit the data very well. Increasing its degree will improve the fit, but how large can we take it?

Before giving an answer, data scientists will tell us to split our data into two sets: a training set T\subset \{1,2,\dots,n\} of indices of our data points, and a validation set given by its complement V=\{1,2,\dots,n\}\backslash T:

n = size(t,1);
T = round(n/6:5*n/6);   // training indices: the central two thirds of the data
V = setdiff(1:n,T);     // validation indices: the rest

Here we kept in T the two thirds of our dataset located in the center. Let us plot the two sets of points:

clf
plot(t(T),y(T),'^',t(V),y(V),'^');

[Figure data3: training points (center) and validation points]
Then, they will tell us to compute \mathbf{\Theta} by using only the training data, which we do in Scilab as:

θ = A(T,:)\y(T) 

The whole residual vector is computed in Scilab with

r = A*θ - y;

and the sums of squares restricted to the validation data and to the training data as

RSS_V = norm(r(V))^2;
RSS_T = norm(r(T))^2;

The validation RSS is the only relevant quantity to evaluate whether the fit is able to generalize to new data.
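To make this concrete, we can simply print both quantities (a small addition of mine):

mprintf("training RSS = %g, validation RSS = %g\n", RSS_T, RSS_V);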

If an adequate degree p for our polynomial has to be chosen, then we just have to choose the one giving the lowest validation RSS.

The following script considers degrees 0 to 7, plots the graph of each fitted polynomial together with the data, and finally plots both RSS as functions of the degree of the polynomial:

clf
A = [];
RSS_T = zeros(8,1); RSS_V = zeros(8,1);   // one entry per degree 0..7
for d = 0:7
    A = [A t.^d];                  // append the column of degree d
    θ = A(T,:)\y(T);               // fit using the training data only
    r = A*θ-y;                     // residuals on the whole dataset
    RSS_T(d+1) = norm(r(T))^2;
    RSS_V(d+1) = norm(r(V))^2;

    subplot(3,3,d+1);
    plot(t(T),y(T),'^',t(V),y(V),'v');
    gca().auto_scale="off";        // freeze the axes before drawing the polynomial
    h = plot(t,A*θ,'r',"thickness",2);
    legend(h,"deg(P)="+string(d),boxed=%f);
end

subplot(3,3,9);
semilogy(0:7,[RSS_T RSS_V]);
legend("training RSS","validation RSS",2,boxed=%f);

[Figure train: the eight fits for degrees 0 to 7 and both RSS curves]

Clearly, the best choice is p=3.
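We can also let Scilab pick this degree for us (a small addition of mine, using the RSS_V vector computed in the script above):

// index of the smallest validation RSS; the corresponding degree is kmin-1
[RSSmin, kmin] = min(RSS_V);
mprintf("best degree = %d (validation RSS = %g)\n", kmin-1, RSSmin);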

We would not have been able to make a choice by looking only at the training RSS, which is strictly decreasing.

In general, choosing T and V is not that easy, and to prevent biases resulting from an arbitrary choice, cross-validation is the way to go (see Cross-validation (statistics) - Wikipedia).
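As a teaser, here is a minimal sketch of how K-fold cross-validation could look in Scilab for the same degree-selection problem (my own illustration, not part of the lesson above; the names K, folds and CV are mine, and grand is only used to shuffle the indices):

// Minimal K-fold cross-validation sketch: for each degree, average the
// validation RSS over K folds obtained from a random shuffle of the indices.
K = 5;
perm = grand(1, "prm", (1:n)');        // random permutation of the indices (column)
folds = modulo((0:n-1)', K) + 1;       // fold label (1..K) of each permuted point
CV = zeros(8, 1);
A = [];
for d = 0:7
    A = [A t.^d];
    for k = 1:K
        Vk = perm(folds == k);         // validation indices of fold k
        Tk = setdiff(1:n, Vk);         // training indices
        θ = A(Tk,:) \ y(Tk);
        CV(d+1) = CV(d+1) + norm(A(Vk,:)*θ - y(Vk))^2;
    end
    CV(d+1) = CV(d+1) / K;             // averaged validation RSS for degree d
end
[CVmin, kmin] = min(CV);               // the retained degree is kmin-1

The retained degree is then the one minimizing the averaged validation RSS, which is less sensitive to how the data happen to be split.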


Excellent … we need more of that.
