transactions and imports

For questions and postings not covered by the other forums
ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

transactions and imports

Postby ConvertFromOldNGs » Fri Aug 07, 2009 12:44 pm

by John Munro >> Wed, 15 Dec 2004 15:30:18 GMT

Let's say you're installing your application on a new client site. They want to start with a database that is pre-populated from their legacy system. You have a hundred-megabyte CSV file with millions of lines in it. When you import it, millions of objects will be created. Do you

(a) have one massive transaction?
(b) one transaction per line?
(c) a transaction every x lines?

Whichever option you choose will have a big effect on how long it will take to do the import because of the sheer number of lines and therefore changes to the database.

The up side of (a) is that if the import fails, the whole thing will be rolled back and you won't have a partially complete database. The down sides are that the more you do in a single transaction the slower it gets (so by the millionth object each change will be taking a very long time), and you will need a lot of disk space to hold the massive journal file.

The up side of (b) is that if the import fails part way through, all of the lines processed before the failure will already have been committed, and once you repair the faulting line you can continue the import from there. The down side is that entering and leaving transaction state a million times is very slow.

The up side of (c) is that you can test the import with various values of x to find an optimum number of lines per transaction, which should give you the best possible speed. This value will depend on what your import does for each line, so you will have to discover this optimum value for each import you have - there is no universal best number. The down side of option (c) is that if the import fails part way through a transaction, when you restart it you will have to re-import all of the lines within that transaction up to the failing one, and working out what line that is can be tricky. It's also conceptually inelegant because it invalidates the whole atomic-change transaction concept for the sake of speed.

At present we have implemented (c), and we try to avoid the down sides by running a validation pass which first checks (as best it can) that the data is clean and the import won't fail before we actually start importing.
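Roughly, the batched loop I have in mind looks something like the sketch below (createObjectFromLine, the File handling and the batch size of 2000 are placeholders for illustration, not our actual code):

importLegacyFile(fileName: String) updating;
// Sketch of option (c): commit every batchSize lines.
// createObjectFromLine stands in for whatever parses one CSV line
// and creates the corresponding persistent objects.
vars
    file : File;
    line : String;
    count, batchSize : Integer;
begin
    batchSize := 2000;
    create file transient;
    file.fileName := fileName;
    file.openInput();
    beginTransaction;
    while not file.endOfFile() do
        line := file.readLine();
        self.createObjectFromLine(line);
        count := count + 1;
        if count >= batchSize then
            commitTransaction;      // close off this batch
            beginTransaction;       // and start the next one
            count := 0;
        endif;
    endwhile;
    commitTransaction;              // commit the final partial batch
epilog
    file.close();
    delete file;
end;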

Opinions/advice/comments?

--
John Munro

FileVision UK Ltd.


Atlantic House
Imperial Way
Reading
RG2 0TD

Telephone: +44 (0) 118 903 6066
Fax: +44 (0) 118 903 6100
Email: john.munro@filevision.com
Web: http://www.filevision.com


ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: transactions and imports

Postby ConvertFromOldNGs » Fri Aug 07, 2009 12:44 pm

by allistar >> Wed, 15 Dec 2004 20:13:13 GMT

Hi John,
For something like populating a blank database I wouldn't be too concerned about not having an atomic transaction. If the import has issues you can always restore a backup of the system directory quite quickly to get back to the beginning. Rolling back to a fresh state that way would be quicker in both development hours and run-time hours.

Given that, I would go with c) myself. b) would incur a huge amount of disk I/O and would therefore be a lot slower than c). Instead of batching c) on the number of objects created, you could do it based on time, say a commit every 30 seconds. I would probably do it based on the number of objects and set that number to somewhere between 1000 and 5000, depending on the size of the objects. (Updating a BLOB or SLOB is essentially like updating another object, so I would count that as an extra object for the purposes of committing.)
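As a very rough sketch of the time-based approach, the check inside the import loop could be something like the following (I'm assuming an app.clock() style call that returns elapsed milliseconds - substitute whatever timing call your JADE version provides):

// Sketch only: commit roughly every 30 seconds instead of every n objects.
if app.clock() - lastCommitTime >= 30000 then
    commitTransaction;
    beginTransaction;
    lastCommitTime := app.clock();
endif;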

The best number to choose is dependent on a number of factors, including hardware configuration. Unless you must squeeze every minute out of the import process, I would pick a number, like 2000, and stick with that. The difference between 2000 and 3000 for a million objects would only be a few per cent of the total time (if that).

Regards,
Allistar.
--
------------------------------------------------------------------
Allistar Melville
Software Developer, Analyst allistar@silvermoon.co.nz
Auckland, NEW ZEALAND

Silvermoon Software
Specialising in JADE development and consulting
Visit us at: http://www.silvermoon.co.nz
*NEW* Simple web access to Jade at: www.silvermoon.co.nz/jhp.html
------------------------------------------------------------------

ConvertFromOldNGs
Posts: 5321
Joined: Wed Aug 05, 2009 5:19 pm

Re: transactions and imports

Postby ConvertFromOldNGs » Fri Aug 07, 2009 12:44 pm

by cnwjhp1 >> Thu, 16 Dec 2004 3:06:40 GMT

John,

We do this fairly regularly, sometimes with hundreds of millions of records. We developed a C++ DLL to scan the line and return a field at a time, and have a method on the class to populate the attributes:

// getVal is the external C++ routine; each call returns the next field from the line s
coynum := call getVal(s,i,t).Integer;
maint := call getVal(s,i,t);
name_1 := call getVal(s,i,t);

This greatly reduced CPU consumption for large quantities of data.
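For anyone who hasn't used external methods, the declaration for something like getVal would look roughly like this (the library name, entry point and parameter usages here are made up for illustration, not our actual code):

// Hypothetical external method declaration for the C++ parsing routine.
getVal(s: String; i: Integer io; t: String io): String is getVal in fieldparse;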

We use your option C, usually with commits set to something between 30 seconds and 10 minutes. The optimum value does depend on the data being loaded, so you just have to experiment if load time is important.

As Allistar suggested, just back up the database before the load and reload if anything goes wrong. This also lets you turn on write caching on the disk. Turn it off again after the load if you don't want it on for production. Hopefully you are not using RAID 5 disks, as they are slow in a bulk-load situation.

We have a restart function and exercise it regularly, e.g. to change commit times. If you save your restart info (line count or file.currentOffset) in the database itself, Jade will back it out when it backs out the transaction, so you can easily restart at the right spot.
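As a rough sketch, assuming a made-up ImportControl persistent singleton with linesDone and savedOffset properties that you create before the load:

// Inside each batch, record progress in the same transaction as the data,
// so Jade backs the marker out along with the objects if the transaction
// is backed out.
importControl.linesDone := count;
importControl.savedOffset := file.currentOffset;
commitTransaction;
beginTransaction;

On a restart you read those values back, reposition the input file, and resume from the first uncommitted line.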

You will want [JadeClient].ObjectCacheSizeLimit set fairly large, e.g. 200 MB. If you set it too large, it may slow down after a couple of hours. Apparently there is a performance issue (CPU consumption) when a single user has a very large number of buffers allocated, so you can't use larger values, e.g. 1000 MB, like you can in multi-user situations. The optimal setting depends on your object sizes - the issue is with the number of buffers, but you specify the cache size in bytes.
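For example, in the client ini file that would be something along the lines of (the value is specified in bytes):

[JadeClient]
ObjectCacheSizeLimit=209715200     ; roughly 200 MB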

Setting DefFileGrowthIncrement to a reasonable size can reduce fragmentation, which can improve load times. The default is 2 KB or 4 KB, which is way too small for big map files. We use 1 MB to 10 MB for big map files, and maybe 100 KB for smaller ones.

Cheers,
John P

