Vectorized functions for string operations

Dear Scilab users,

I came across something that bothers me. Either there is no function that easily does what I want, or I just can't find a simple way to do it.

I have a table where one column is filled with strings, and some codes are embedded in these strings. Something like this:
[546] code 1
[007] code 2
[89] code 3
[546] code 1

The thing is, my file is quite large (500000+ lines), and I would like to extract these codes in a vectorized way, by calling functions directly on the whole array rather than looping over it. The problem is that strsplit and some other functions seem to work only on a single string.

Is there a way to do this, or is it not implemented?

Thank you all!

Hello, and welcome to Scilab's Discourse!

Can you post a small example?

S.

Hello,
It feels nice to get an answer!
I wrote a small script to clarify the situation:

function M = create_dumb_table(n_elements)
	Date = hours(1:n_elements)'
	code = sample(n_elements, ["[12] Hazard", "[7] Emergency", "[131] Brake", "[999] miscellaneous code", "[56] Future maintenance required", "[4] OK", "[0] N\A", "[33311] lunch break"])'
	M=table(Date, code, "VariableNames", ["Date" "code"]);
endfunction

M = create_dumb_table(30) // much more complex in my case but the core is here

// I would like to do something like
[splits, _] = strsplit(M("code"), "]");
M("code extract") = uint16(part(splits(1,:), 2:$)); // or strsplit(splits(1,:), "["); ...

// Which I can't, because strsplit doesn't seem to allow vectors of strings, and I didn't find any convenient method

// So I ended up with something like that, which is very slow on large arrays

function list_floats = find_floats_in_brackets(str_array)
	n = size(str_array,1)
	list_floats = zeros(n,1)
	for i = 1: n
		[str_number, _] = strsplit(str_array(i), "]")
		str_number = part(str_number(1), 2:$)
		list_floats(i) = uint16(str_number)
	end
endfunction

M("code extract") = find_floats_in_brackets(M("code"));

Is there a more practical way to do this?
I had thought about using strchr to find where each string should be cut. However, although the part function works on vectors, it doesn't seem to allow a different cut length for each row of the string array.

  • Small bonus question: here my codes are integers. If they were floats, would I be forced to use evstr? Since they are plain numbers with no expressions to evaluate, maybe there is something faster.

Thank you in advance!

UCO

Basics are always your best friends:

msscanf(-1,M.code,"[%d]")

 ans = [30x1 double]

   56.
   0.
   0.
   7.
   56.
   33311.
   4.
   7.
   131.
   999.
   999.
   4.
   56.
   131.
   33311.
   12.
   33311.
   56.
   7.
   131.
   7.
   7.
   999.
   33311.
   131.
   131.
   12.
   7.
   56.
   999.

On my MacBook the timing seems OK for a large table:

--> M = create_dumb_table(500000);

--> tic;

--> i = msscanf(-1,M.code,"[%d]");

--> toc

 ans = 

   0.0657260
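
On the bonus question about floats, one option that avoids evstr is strtod, which accepts a matrix of strings and converts the number at the start of each one, ignoring whatever follows. The sketch below is only an illustration under that assumption: the sample strings and variable names are made up, and it assumes every row starts with "[" followed directly by a number.

```scilab
// Illustrative data (not from the thread): bracketed float codes
s = ["[12.5] Hazard"; "[7] Emergency"; "[0.25] Brake"];

// Drop the leading "[" on every row. The index range runs up to the longest
// string; for shorter rows the extra indices add nothing strtod would read.
tail = part(s, 2:max(length(s)));

// strtod converts the leading number of each string and stops at the
// first character that cannot be part of a number (here the "]")
x = strtod(tail);
```

Because strtod stops at the closing bracket by itself, no per-row cut length is needed.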

Thank you, you have completely answered my question. I was looking for functions in the strings section of the help documentation and didn't think to look elsewhere.

I must admit I am still struggling, because in reality there are exceptions in my file: sometimes there are no brackets at all, and sometimes the brackets contain text. It is very rare, but it happens, and msscanf stops at the first discrepancy. I will try to figure out how to handle these cases.

Hello,

It sounds like you need to match, extract, and handle the different cases in your vector of strings separately. Have a look at grep with a regular expression to find the matching rows (the needle in the haystack).

Best,

David
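
A minimal sketch of that idea, assuming the well-formed rows start with an integer in brackets: grep with the "r" flag returns the indices of the rows that match a regular expression, and msscanf is then applied to those rows only. The function name extract_bracket_codes and the sample strings are hypothetical, and marking the non-matching rows with %nan is just one possible convention.

```scilab
// Hypothetical helper (not from the thread): extract "[<digits>]" codes,
// leaving %nan where a row does not start with a bracketed integer.
function codes = extract_bracket_codes(str_array)
    str_array = str_array(:);                      // force a column vector
    codes = %nan * ones(size(str_array, "*"), 1);  // default: no code found
    // Indices of the rows that start with "[", one or more digits, then "]"
    idx = grep(str_array, "/^\[[0-9]+\]/", "r");
    if ~isempty(idx) then
        // Only well-formed rows reach msscanf, so it is not stopped by
        // the exceptions
        codes(idx) = msscanf(-1, str_array(idx), "[%d]");
    end
endfunction

// Made-up sample with both kinds of exception
s = ["[12] Hazard"; "no brackets here"; "[abc] text code"; "[7] Emergency"];
c = extract_bracket_codes(s);
```

The rows left at %nan can afterwards be located with find(isnan(c)) and handled case by case.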