Page 1 of 1

searching for a substring in a larger string

Posted: Fri Aug 07, 2009 12:17 pm
by ConvertFromOldNGs
by pillick >> Mon, 9 Dec 2002 4:01:45 GMT

well, I have to write an algorithm for this problem, and since each search will be run over many thousands of strings I guess I had better write a good one...

I'm thinking the boyer-moore algorithm would be appropriate. Has anyone done this allready? Or know of a better algorithm to use?

Re: searching for a substring in a larger string

Posted: Fri Aug 07, 2009 12:17 pm
by ConvertFromOldNGs
by allistar >> Mon, 9 Dec 2002 8:08:13 GMT

The Boyer-Moore algoeithm (and derivitives of it) are pretty much the norm for searching standard text.

If you are searching for a whole word in a series of words (i.e. a sentence, paragraph or other piece of large english text) then what we have done in Jade is break the text into words and add those to a sorted collection. Then when you are searching for a particular word it is simply a matter of doing a .getAtKey on the dictionary to find all occurrances of the word.

Of course that only works when you are searching for words, not substrings.

If you want to minimise search time and don't mind about bloating the database the other way is to use a hashtable to spend up the lookups. In Jade I would store this as a trie, like this:

Class: WordSearcherNode
properties:
letter: Character;
allSuffixNodes: WordSearcherNodeDict; (keyed by "letter")
myParentNode: WordSearcherNode; (inversed to "allSuffixNodes")
allSourcePointers: (a pointer to the location of this string)

Once the trie has been built (which would be quite expensive, but only needs doing once and can be made a background process) to search for a word (like "apple") you would do this:

- Look for the node in the collection of root nodes for "a".
(if not found then the substring doesn't exist)
- in that "a" node look in "allSuffixNode" for a "p".
(if not found then the substring doesn't exist)
- in that "p" node look in "allSuffixNode" for a "p".
(if not found then the substring doesn't exist)
- in that "p" node look in "allSuffixNode" for a "l".
(if not found then the substring doesn't exist)
- in that "l" node look in "allSuffixNode" for a "e".
(if not found then the substring doesn't exist)

Now you have the "WordSearcherNode" object that represents "apple". It has a collection of all the places where it occurs.

In this example you have only done 5 .getAtKey searches to search every string that you have indexed (which could be a lot indeed - millions).

I say again that this will take up a LOT of database space, but will result in VERY fast substring searches.

Hope this helps in some way,
Allistar.

Re: searching for a substring in a larger string

Posted: Fri Aug 07, 2009 12:17 pm
by ConvertFromOldNGs
by allistar >> Tue, 10 Dec 2002 22:32:53 GMT

Another way of doing this that wouldn't require so much database storage as the above proposal is:

for each string you want to be able to search on break it up into all possible 5 character substrings.

So the string "1234567890" would consist of:
12345
23456
34567
45678
56789
67890
7890
890
90
0

These substrings then get indexed into a collection with a reference back to the location they come from (something identifying the source object, property name and string position).

Then searching for a substring in the entire database is a very simple ...getAtKey. Using this method you could search for substrings of any length, the .getAtKey would return the closest match for a 5 length matching substring. Using iteration would reveal other shorter substring matches, and for longer substring matches you would have to look at the source of the string.

If you make the 5 in the above example a larger it would be quicker to search for larger sub-strings, but will take up more database space. The smaller that number the slower the search, but the less database space will be taken up.

This method would still consume copious amounts of database space, but would result in very faster substring searches over a potentially HUGE search space.

The nicest thing is that you don't have to use Boyer-Moore over and over on every string in the search space, which would be twice as slow for twice the search space.

I may implement this to see how is works in reality when I get some spare time (yeah right!).

Regards,
Allistar.

Re: searching for a substring in a larger string

Posted: Fri Aug 07, 2009 12:17 pm
by ConvertFromOldNGs
by wxv >> Wed, 11 Dec 2002 22:38:46 GMT

This is an interesting algorithm. Thanks for sharing it, it was fun to interpret and implement it.

Generating the tree isnt so slow if you stop at the end of each word (when you hit "whitespace"). That way, the maximum depth of the tree is the largest word and your tree size is roughly proportional to the vocabulary.

The search tree size is several times bigger then the document itself though (mostly due to all the links back to the occurances in the source document), although you could cut dramatically it if you only held indexed words (not every possible sub string), and if you only held the actual pointers to the "occurances" at the last letter of the word in the search tree. Although if you limit it like this, you might be better off just using a basic word dictionary like you describe before.

Wouldnt it also be much more effeciently handled in jade if it was stored in some kind of flat structure in a binary field? That way you wouldnt be holding and retrieving thousands of objects per document.

Wilfred.

Re: searching for a substring in a larger string

Posted: Fri Aug 07, 2009 12:17 pm
by ConvertFromOldNGs
by allistar >> Fri, 13 Dec 2002 10:07:49 GMT
This is an interesting algorithm. Thanks for sharing it, it was fun to interpret and implement it.

Generating the tree isnt so slow if you stop at the end of each word (when you hit "whitespace"). That way, the maximum depth of the tree is the largest word and your tree size is roughly proportional to the vocabulary.

Good idea, you could also cut out punctuation as well.
The search tree size is several times bigger then the document itself though (mostly due to all the links back to the occurances in the source document), although you could cut dramatically it if you only held indexed words (not every possible sub string), and if you only held the actual pointers to the
"occurances" at the last letter of the word in the search tree. Although if you limit it like this, you might be better off just using a basic word dictionary like you describe before.

Wouldnt it also be much more effeciently handled in jade if it was stored in
some kind of flat structure in a binary field? That way you wouldnt be holding and retrieving thousands of objects per document.

You could store it in a large binary and implement the structure as a hash table, which could improve performance during searching but would greatly complicate adding and in particular altering the original strings.

Using a tree structure, a search for a 5 digit substring would require exactly 5 "getAtKey" calls (which would be 5 trips to the server to lock the respective dictionaries - although I would make it a server execution method to minimise this overhead). The drawback of a large binary is that the entire binary would need to e shipped to the client when it works on it. That binary would be huge and this fact alone would probably prevent the binary way of doing it from being useful.

Using a tree structure with objects is nice from a maintenance point of view. If the original string were to be deleted or altered updating the tree structure would be stright forward.

I'm pleased someone gave it a go and got it to work. How did it perform during searches?

Allistar.

Re: searching for a substring in a larger string

Posted: Fri Aug 07, 2009 12:17 pm
by ConvertFromOldNGs
by wxv >> Mon, 16 Dec 2002 20:22:43 GMT
Using a tree structure, a search for a 5 digit substring would require exactly 5 "getAtKey" calls (which would be 5 trips to the server to lock the respective dictionaries - although I would make it a server execution method to minimise this overhead). The drawback of a large binary is that the entire binary would need to e shipped to the client when it works on it. That binary would be huge and this fact alone would probably prevent the binary way of doing it from being useful.

Good point. I was stuck on the idea of having to retrieve the whole index, but thats not really relevant when your only ever lookups over a small portion.
I'm pleased someone gave it a go and got it to work. How did it perform during searches?

Lightning quick, as is to be expected i guess. By stopping at whitespace (and punctuation) the index didnt take very long to generate either.

Cheers,

Wilfred.