by John Munro >> Wed, 15 Dec 2004 15:30:18 GMT
Lets say you're installing your application on a new client site. They want to start with a database that is pre-populated from their legacy system. You have a hundred megabyte csv file with millions of lines in it. When you import it, millions of objects will be created. Do you
(a) have one massive transaction?
(b) one transaction per line?
(c) a transaction every x lines?
Whichever option you choose will have a big effect on how long it will take to do the import because of the sheer number of lines and therefore changes to the database.
The up side of (a) is that if the import fails, the whole thing will be rolled back and you won't have a partially-complete database. The down sides are that the more you do in a transaction the slower it gets (so by the millionth object it will be taking a very long time per change), and you will need a lot of disk space to hold the massive journal file.
The up side of (b) is that if the import fails part way through, all of the lines that imported successfully will have imported, and once you repair the faulting line you can continue the import from there. The down side is that going into transaction state and coming out again a million times is very slow.
The up side of (c) is that you can test the import with various values of x to find an optimum number of lines per transaction, which should give you the best possible speed. This value will depend on what your import does for each line, so you will have to discover this optimum value for each import you have - there is no universal best number. The down side of option (c) is that if the import fails part way through a transaction, when you restart it you will have to re-import all of the lines within that transaction up to the failing one, and working out what line that is can be tricky. It's also conceptually inelegant because it invalidates the whole atomic-change transaction concept for the sake of speed.
At present we have implemented (c) and are trying to avoid the down sides by having a validation pass which first checks (as best it can) that the data is clean and the import won't fail, before it actually starts importing.
Opinions/advice/comments?
--
John Munro
FileVision UK Ltd.
Atlantic House
Imperial Way
Reading
RG2 0TD
Telephone: +44 (0) 118 903 6066
Fax: +44 (0) 118 903 6100
Email: john.munro@filevision.com
Web: http://www.filevision.com
The contents of this communication are confidential and are only intended to be read by the addressee. We apologize if you receive this communication in error and ask that you contact FileVision UK Ltd. immediately to arrange for its return. The use of any information contained in this communication by an unauthorized person is strictly prohibited. FileVision UK Ltd. cannot accept responsibility for the accuracy or completeness of this communication as it is being transmitted over a public network. If you suspect this message may have been intercepted or amended, please inform FileVision UK Ltd.