Fastest csv file to StringArray method challenge.

Forums for specific tips, techniques and example code
ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Fastest csv file to StringArray method challenge.

Postby ConvertFromOldNGs » Fri Aug 07, 2009 2:17 pm

by Leigh Roberts >> Tue, 17 Aug 1999 0:25:01 GMT

I have a need to rework the convertCsv method below that takes a line from a comma seperated variable file and converts it to an array of strings. I didn't write the code but I can see where I can make some improvements. The method is used in parsing files with 10's of thousands of lines, so I want it to run as fast as possible.

It also handles the situation where commas in the text are surrrounded by quotes. I think that this very rare situation could be dealt with by finding out if there are any quotes at all in the line and then performing the more expensive checks to weed the comma out.

Any comments on the relative speed of techniques would be appreciated.

Thanks

Leigh



convertCsv(): StringArray updating;

vars
i, j, pos :Integer;
len :Integer;
sa :StringArray;begin

create sa transient;
pos := 1;
i := 1;

while pos <= self.length do
// Locate next comma
len := self.pos (",",pos);

len := len - pos;
if self.pos (",",pos) = 0 and pos <= self.length then
len := self.length - pos + 1;
endif;
// Put string into array
sa := self [pos:len];
// If comma inside quotes, concatenate with rest of
// string
while sa [1] = '"' and sa [sa .length] <> '"' do
pos := pos + len + 1;
len := self.pos(",",pos);
len := len - pos;
sa := sa & "," & self [pos:len];
endwhile;
// Remove quotes from the string
if sa [1] = '"' and sa [sa .length] = '"' then
sa := sa [i] [2:sa [i].length-2];
endif;
sa [i] := sa [i].trimRight;
// Increment position to just after the last comma
pos := pos + len + 1;
i := i + 1;
endwhile;

// sa contains full line, and can now be used for display, etc

return sa;
end;

ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: Fastest csv file to StringArray method challenge.

Postby ConvertFromOldNGs » Fri Aug 07, 2009 2:18 pm

by Krull >> Tue, 17 Aug 1999 1:17:47 GMT

Leigh,

You will gain some immediate performance improvements by postponing adding the tokens to the array until the last minute. Replace all usages of sa with a local string variable and use sa.add to add the token to sa instead of a subscripted variable. These simple changes should provide significant perfromance improvements for several reasons. Firstly, each time you use a subscripted array variable sa this is actually a method call; operations on a local string variable are faster. Secondly, since arrays are currently implemented using a linked-list structure, the performance of array indexing operations degrades as the array gets larger. Thirdly, it is faster to use the sa.add to add an entry to the end of the array than sa. Try these changes and let us know what sort of improvement you get.

ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: Fastest csv file to StringArray method challenge.

Postby ConvertFromOldNGs » Fri Aug 07, 2009 2:18 pm

by Leigh Roberts >> Tue, 17 Aug 1999 2:10:33 GMT

Thanks Krull

Exactly as I suspected, there are better ways to do things in Jade if you know how Jade is implemented. It's a shame that these aren't documented in a performance tips section in the manuals.

Leigh

ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: Fastest csv file to StringArray method challenge.

Postby ConvertFromOldNGs » Fri Aug 07, 2009 2:18 pm

by Sean Dick >> Sun, 22 Aug 1999 8:02:59 GMT

I second that motion. Tell is which things are faster and which are slower but better for other uses.

ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: Fastest csv file to StringArray method challenge.

Postby ConvertFromOldNGs » Fri Aug 07, 2009 2:18 pm

by Krull >> Tue, 17 Aug 1999 2:29:03 GMT

Leigh,

Some further comments:

This method is presumably called for every line in the file, the method creates a transient StringArray and leaves the caller to delete it. It would be better to receive the string array as an input parameter, which provides the caller the opportunity of reusing the same string array by using array.clear() instead of a deleting and creating it each time. If you can do that, you avoid the relatively costly operation of creating a new transient array for every line; in addition an Array::clear is less expensive than a delete since it retains the collection header and root block.

You could also improve the overall efficiency by restructuring the method to make a single pass across the source string (self) while inlining the pos and trimRight methods.

ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: Fastest csv file to StringArray method challenge.

Postby ConvertFromOldNGs » Fri Aug 07, 2009 2:18 pm

by Leigh Roberts >> Tue, 17 Aug 1999 2:55:05 GMT
You could also improve the overall efficiency by restructuring the method to make a single pass across the source string (self) while inlining the pos and trimRight methods.


I take it you mean something like

self.pos(",",index).trimRight

In general is the rule that the more code you get in one line the more efficient?

Leigh

ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: Fastest csv file to StringArray method challenge.

Postby ConvertFromOldNGs » Fri Aug 07, 2009 2:18 pm

by Krull >> Tue, 17 Aug 1999 3:24:33 GMT
You could also improve the overall efficiency by restructuring the method tomake a single pass across the source string (self) while inlining the pos and trimRight methods.
I take it you mean something like

self.pos(",",index).trimRight

What I meant by inlining was actually replacing the calls to pos and trimRight by inline code. Normally, you wouldn't want to do this, but where performance is an issue it gives you another option to investigate. The pos method does more than you need since it scans for a substring whereas you are only looking for a character; and in the absence of a "'" in the source it must scan the entire string. You can restructure the algorithm so that it does the entire job in one pass of the source string without calling outside methods.
In general is the rule that the more code you get in one line the more efficient?

No, not really. You might save some a bit by combining expressions, but the gains won't be significant.

ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: Fastest csv file to StringArray method challenge.

Postby ConvertFromOldNGs » Fri Aug 07, 2009 2:18 pm

by Craig Shearer >> Tue, 17 Aug 1999 1:13:28 GMT

Of course, you'd probably get even better performance if you wrote the method in C++ (or some other compiled language) as an external method.

Another avenue would be to rewrite some of the costly JADE methods (eg. the internal String::pos method in C++). Remarkably, some of the JADE supplied primitive methods are written in JADE!

Craig.

ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: Fastest csv file to StringArray method challenge.

Postby ConvertFromOldNGs » Fri Aug 07, 2009 2:18 pm

by Allistar Melville >> Mon, 16 Aug 1999 18:08:59 GMT

Indeed it is faster to implement this in a C++ dll. We have Jade tasks that call this method 10's of thousands of times and even a saving of
1 - 10 ms makes a big difference.

Try this:

extern "C" __declspec(dllexport) long JOMAPI cPos(const Character string[], const Character ch)
{
char *pdest;
int result;
pdest = strchr( string, ch );
result = pdest - string + 1;
if( result < 0 )
result=0;
return result;
}

Regards,
Allistar.

------------------------------------------------------------------
Allistar Melville (BSc) Home: allistar@ihug.co.nz \_
Software Developer Work: allistar@focussoft.co.nz </'
Auckland, NEW ZEALAND /)
(/`
"Science built the Academy, superstition the inquisition."
[Robert G. Ingersoll] ------------------------------------------------------------------

ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: Fastest csv file to StringArray method challenge.

Postby ConvertFromOldNGs » Fri Aug 07, 2009 2:18 pm

by John Porter >> Mon, 6 Dec 1999 22:58:12 GMT

No one has posted any benchmark results for this, so I will.

First, the original method as posted (actually I first fixed the bug where it crashes if the last field contains a comma). Using this test data

"hello",918,"hell,o",918,744.12,"line 1",918,744.12,"line, 1"

and 100 repetitions, the original method took 41.31ms per repetition
on my speedy Pentium 120 with Jade 5.0.14 in single user mode.

Secondly, I took Krull's advice and created sa in the caller, to
avoid the recreates. That saved almost 3ms per rep, bringing the time down to 38.58. Note that this only works if every record has the same number of fields - otherwise the sa.clear takes about the same amount
of time as the create. Passing back the number of fields found would
be faster than clearing the array though, if a little more clumsy.
The remaining tests assume all records have the same number of
fields.

Then I implemented another of Krull's suggestions and used a
temporary variable instead of sa until ready to add it to the
array. This saved a whopping 23.62ms, bringing the time per rep down
to 14.96ms.

Replacing the sa:=sat with sa.add(sat) did not help, however. It increased the time marginally to 15.18. Perhaps using the add is more than offset by having to do the clear (whether or not the records
contain the same number of fields).

There were several places self.length was referenced, so I saved off
the length in a temporary variable and used that later. This produced
a tiny saving, resulting in 14.92ms per rep.

Then I replaced this code

len:=self.pos(",",pos);
len:=len-pos;
if self.pos(",",pos)=0 and pos<=self.length then
len:=self.length-pos+1;
endif;

with this

len:=self.pos(",",pos);
if len=0 and pos<=slen then
len:=slen-pos+1;
else
len:=len-pos;
endif;

This cut in half the number of calls to pos (except for commas within quotes). This saved another 4ms, yielding 10.95ms per rep.

The purpose of the trimRight is unclear, unless there is an application-specific purpose. I cannot see anywhere this method is
adding spaces so I removed the trimRight, saving 2ms and bringing the time down to 8.93.

Finally, I "inlined" everything (ie, completely re-wrote it),
resulting in the source shown below. This more than cut the prior
time in half, giving a final result of 3.92ms per repetition. If all records do not contain the same number of fields, the number of
fields found should be passed back.

Net result: 41.31ms drops to 3.92ms, 10.5 times as fast. It would be interesting to see results from a C++ routine.

Cheers,
John P
==================================================================================

convertCsv2(sa:StringArray io);

vars
i,ndx,pos,slen:Integer;begin

slen:=self.length;
i:=1;
ndx:=1;
while i<=slen do :bigLoop
if self='"' then // found string token
i:=i+1;
pos:=i;
while self<>'"' do
i:=i+1;
if i>slen then // dangling quote at end of record
break bigLoop;
endif;
endwhile;
sa[ndx]:=self[pos:i-pos];
while self<>"," do
i:=i+1;
if i>slen then
break bigLoop;
endif;
endwhile;
else
pos:=i; // found numeric token
while self<>"," do
i:=i+1;
if i>slen then
sa[ndx]:=self[pos:end];
break bigLoop;
endif;
endwhile;
sa[ndx]:=self[pos:i-pos];
endif;
ndx:=ndx+1;
i:=i+1; // position beyond the comma
endwhile bigLoop;
end;


Return to “Tips and Techniques”

Who is online

Users browsing this forum: No registered users and 4 guests