[Statistic] Show Index number of Outliner in vector

Hello,

I have a matrix of measurement data, and one of the columns contains temperature measurements. Some of the data points contain outliers or anomalies. I’m currently using the Scilab Toolbox “samplestat” to filter out these outliers with the following function:

function [outlierfree, outlier] = ST_outlier(v, mod)

My goal is to obtain the indices of these “outliers.” I need these indices to remove the corresponding rows from the other columns in the matrix. For example, I want to align the good temperature values (outlierfree) with the time vector.

I’ve attempted to find all the outliers in the temperature vector using the find() function. However, there’s a possibility that the outlier value may occur multiple times in the temperature vector. Consequently, I end up with multiple indices for the same outlier value, and this could lead to the unintentional deletion of good data points.

Ideally, I’m looking for a function that can provide me with the indices of the outlier values.

I hope this clarifies my issue, and I’m hoping that someone can assist me with this. Best regards.

Hello,

Please provide an example, with small matrices.

S.

Firstly, I want to express my gratitude for your prompt response. Your support serves as a great source of motivation for me.

Below, you will find a matrix:

If you plot column 8 against column 4, you will notice the presence of two outliers.

Now, let’s consider the use of the “Get Index in Plot” function:

% If I start with ST_outlier(Matrix(:,4)); , there will be no outliers found
[outlierfree, outlier] = ST_outlier(Matrix(10:end, 4)); 

--> outlier
outlier  = 

   25.
   8.

% Obtain the index of outliers
for i = 1:size(outlier, '*')
    index(i) = find(Matrix(10:end, 4) == outlier(i));
end

% The issue arises here if the values of the outliers are found in the matrix, even if they are not outliers.

--> index  = 

   11.
   15.

Now, what would be the most effective way to delete all rows corresponding to these indices? For example:

for i = 1:size(index)
     Matrix(index(i), :) = [];
end

Is it possible to delete all rows if I have used struct.row1, struct.row2, and so on?

I hope this example provides clarity on my predicament.

79.0	5.0	182.0	6.0	80.0	377.0	95.0	0.0
79.0	5.0	182.0	8.0	80.0	377.0	96.0	0.01
79.0	5.0	182.0	12.0	80.0	377.0	97.0	0.02
79.0	5.0	182.0	20.0	80.0	377.0	98.0	0.03
79.0	5.0	183.0	25.0	80.0	377.0	99.0	0.04
79.0	5.0	183.0	32.0	80.0	377.0	0.0	0.05
79.0	5.0	183.0	40.0	80.0	377.0	1.0	0.06
79.0	5.0	183.0	50.0	80.0	377.0	2.0	0.07
79.0	5.0	183.0	62.0	80.0	377.0	3.0	0.08
79.0	5.0	185.0	75.0	80.0	377.0	4.0	0.09
79.0	5.0	185.0	88.0	80.0	377.0	5.0	0.1
79.0	5.0	185.0	100.0	80.0	377.0	6.0	0.11
79.0	5.0	185.0	101.0	80.0	377.0	7.0	0.12
79.0	5.0	185.0	101.0	80.0	377.0	8.0	0.13
79.0	5.0	185.0	101.0	80.0	377.0	9.0	0.14
79.0	5.0	185.0	100.0	80.0	377.0	10.0	0.15
79.0	5.0	185.0	102.0	80.0	377.0	11.0	0.16
79.0	5.0	185.0	102.0	80.0	377.0	12.0	0.17
79.0	5.0	185.0	102.0	80.0	377.0	13.0	0.18
79.0	5.0	185.0	25.0	80.0	377.0	14.0	0.19
79.0	5.0	183.0	100.0	80.0	377.0	15.0	0.2
79.0	5.0	183.0	105.0	80.0	377.0	16.0	0.21
79.0	5.0	183.0	105.0	80.0	377.0	17.0	0.22
79.0	5.0	183.0	8.0	80.0	377.0	18.0	0.23
79.0	5.0	183.0	102.0	80.0	377.0	19.0	0.24
79.0	5.0	185.0	101.0	80.0	377.0	20.0	0.25
79.0	5.0	185.0	101.0	80.0	377.0	22.0	0.27
79.0	5.0	183.0	101.0	80.0	377.0	23.0	0.28
79.0	5.0	183.0	100.0	80.0	377.0	24.0	0.29
79.0	5.0	183.0	102.0	80.0	377.0	26.0	0.31
79.0	5.0	183.0	102.0	80.0	377.0	27.0	0.32
79.0	5.0	183.0	102.0	80.0	377.0	28.0	0.33
79.0	5.0	183.0	102.0	80.0	377.0	29.0	0.34
79.0	5.0	183.0	102.0	80.0	377.0	30.0	0.35
79.0	5.0	183.0	103.0	80.0	377.0	31.0	0.36
79.0	5.0	183.0	103.0	80.0	377.0	32.0	0.37
79.0	5.0	185.0	103.0	80.0	377.0	33.0	0.38
79.0	5.0	183.0	101.0	80.0	377.0	35.0	0.4
79.0	5.0	183.0	105.0	80.0	377.0	36.0	0.41
79.0	5.0	183.0	105.0	80.0	377.0	37.0	0.42
79.0	5.0	183.0	105.0	80.0	377.0	38.0	0.43

image

Is it possible to extract the indices from the figure environment?

Hello,

I understand the problem now. It will be faster to hack the ST_outlier function in order to make it return the indexes of outliers…

S.

Thank you, Stéphane. I will give it a try.

data1 = data(:, 4);

The results for ‘[of, o] = ST_pearsonhartley(data1);’ and ‘[of, o] = ST_outlier(data1);’ are identical.

I have managed to obtain the indices of the outliers directly. I made a minor modification to the code in ‘ST_personhartley.sci’:

function [outlierfree, outlier, outlier_idx] = ST_pearsonhartley(v, p)

// Previous code

apifun_checklhs("ST_pearsonhartley", lhs, 1:3); // Output args

// Previous code

for i = 1:n
    q = abs((v(i) - X) / S);
    if (q > qcritval)
        outlier(j) = v(i);
        outlier_idx(j) = i;  // Index of the outlier
        j = j + 1;

        continue;
    end
    outlierfree(k) = v(i);
    k = k + 1;
end

This modification works perfectly!

However, there is still a question. How can I extend ‘qtable’ to handle a vector with more than 1000 values? But this will be solved after studying some statistic methods

"Pearson-Hartley is only applicable for sample distributions with more than 3\n" + ..
"and fewer than 1000 values.")); 
/////////////////////////////////////////////////////////////////////////////////////////

I have also made changes to ‘ST_outlier.sci’:
I will certainly explore deeper if necessary, but for now, I am content with speeding up the process. Sometimes, you don’t need to be an expert in everything.

My modified version of ‘ST_outlier.sci’ works perfectly, even though it is suggested using ‘ST_personhartley.sci’. GIT Hub scilab-samplestat Next weekend, I will attempt to understand the difference between these two functions, in order to unterstand the ‘ST_personhartley.sci’ and the ‘qtable’ aspect.

Here is my current solution:

size(measurement)  // Matrix size 1306 x 8
data = measurement(:, 4); // Only interested in finding outliers in column 4
[of, o, o_idx] = ST_outlier(data); // Find outliers and their indices in o_idx
measurement(o_idx, :) = []; // Remove all rows that match the indices in o_idx

OK great. But I am quite doubtfull about the use of Pearson Hartley method on time series if they are non-stationary (cf. the example you gave above).

Hello once more, do you possess any expertise in this particular domain as well?
What would you suggest for a dataset that is not stationary? I’ve actually found the ST_outlier function to be quite effective. That’s one of the reasons I initiated the filtering process even when the dataset is in a stationary state.

I would use a local median filter to compute a smoothed version x_f(k) of the signal x(k) and substract it from the original one to form the error e(k)=x(k)-x_f(k), then compute the z-scores (e(k)-\mu)/\sigma (\mu is the mean error and \sigma its standard deviation), and finally apply a threshold of \pm 3 to find the indexes of the outliers (even \pm 2 should be ok since your signal is quite clean). Here is a possible implementation:

x=measurement(:,4);
t=1:size(x,1);
// filter with median filter of bandwidth 2*p+1
// value of p can be adjusted as needed.
p=2;
xf=x;
for i=p+1:length(x)-p
    xf(i)=median(x(i-p:i+p));
end

// compute the error
e=x-xf;

// compute z-scores and find outliers
s=stdev(e);
mu=mean(e);
z=(e-mu)/s;
k=find(abs(z)>3);
plot(t,x,t,xf,t,e);
plot(t(k),x(k),'o')
legend("$x$","$x_f$","error","outliers",-1)

Mh,

if you have a smooth signal like this, you could also use:

d_signal = diff(signal); // with signal beeing the data you are interested in

Background:
At the position of the outliners, there is a big difference between conscutive data points.
To be more precise: There is a big negative difference between conscutive data points.

I guess with a threshold value for the d_signal it is possible to find the appropriate outliner positions.

Note:
“d_signal” has one less datapoint than “signal”, so you might need to adjust the index obtained by:
index = find(d_signal < threshold); // with threshold beeing a negative value

Example:

x_axis = Data(:,8);
signal = Data(:,4);
d_signal = diff(signal);
thresh = -20;
index = find(d_signal < thresh);

// shift the index to the correct position
index = index+1;

// plot the data
plot(x_axis, signal,‘-bl’);
plot(x_axis(index), signal(index),‘obl’);

grafik

Hello,

Like you said, this method needs a smooth signal, but in addition would not be adequate if the signal is natively increasing or decreasing (and the threshold would not be easy to tune either). Substracting an average/mean allows to obtain a zero-mean signal on which you can apply classical statistics methods (e.g. z-scores) to detect outliers.

Hello Stéphane,

you are absolutely right. The approach won’t work, if there are more than one outliners next to each other, since the difference between the outliners could be small.