by
allistar >> Fri, 13 Dec 2002 10:07:49 GMT
This is an interesting algorithm. Thanks for sharing it, it was fun to interpret and implement it.
Generating the tree isnt so slow if you stop at the end of each word (when you hit "whitespace"). That way, the maximum depth of the tree is the largest word and your tree size is roughly proportional to the vocabulary.
Good idea, you could also cut out punctuation as well.
The search tree size is several times bigger then the document itself though (mostly due to all the links back to the occurances in the source document), although you could cut dramatically it if you only held indexed words (not every possible sub string), and if you only held the actual pointers to the
"occurances" at the last letter of the word in the search tree. Although if you limit it like this, you might be better off just using a basic word dictionary like you describe before.
Wouldnt it also be much more effeciently handled in jade if it was stored in
some kind of flat structure in a binary field? That way you wouldnt be holding and retrieving thousands of objects per document.
You could store it in a large binary and implement the structure as a hash table, which could improve performance during searching but would greatly complicate adding and in particular altering the original strings.
Using a tree structure, a search for a 5 digit substring would require exactly 5 "getAtKey" calls (which would be 5 trips to the server to lock the respective dictionaries - although I would make it a server execution method to minimise this overhead). The drawback of a large binary is that the entire binary would need to e shipped to the client when it works on it. That binary would be huge and this fact alone would probably prevent the binary way of doing it from being useful.
Using a tree structure with objects is nice from a maintenance point of view. If the original string were to be deleted or altered updating the tree structure would be stright forward.
I'm pleased someone gave it a go and got it to work. How did it perform during searches?
Allistar.