Improved searching logic help

sideshow · Postby **sideshow** » Tue Apr 09, 2013 2:46 pm

I hope you can help me to speed up my logic for searching in Jade.

We have a set up where:
A Customer has multiple regions they can subscribe to
A Customer has multiple classification they can subscribe to
A Customer has a flag as Active.

We have a product that can be assigned multiple regions and classifications.

Collections
allActiveCustomersByOrg – From Region Class
allActiveCustomersByOrg - From Classification Class

The way we currently search to find a corresponding match between the product and customer is:
Iterate a collection of allActiveCustomersByOrg from the Region Class matching the product regions and then do an includes to see if the customer includes a matching classification.

This is fine for small searches but we are iterating around 5 million customers and 3 million don’t match due to not matching a classification. The results are much the same if we iterate allActiveCustomersByOrg from the Classification Class and do the includes the other way.

What ideas do you have to try to improve the searching?
This is a very simplified version of events but it holds the main point of what I am trying to do.

Thanks

ghosttie · Postby **ghosttie** » Wed Apr 10, 2013 1:26 am

You could have a collection of active Customers by Classification on Region and a collection of active Customers by Region on Classification (you haven't mentioned what Org or product are).

You could have a collection of all Customers by Region and Classification, although it may not be a good idea to have a global collection for locking reasons.

allistar · Postby **allistar** » Wed Apr 10, 2013 9:12 am

How many regions and how many classifications are there? The answer to this would help to come up with a good solution.

When I face performance issues walking large sets of data the solution is most often to introduce additional data structures to improve search time. It's the typical trade off between disk space and speed.

sideshow · Postby **sideshow** » Wed Apr 10, 2013 2:51 pm

Hi Guys,

Thanks for the replies.

There are 700 Classifications and 150 Regions
Around 500,000 customers and the size of allActiveCustomersByOrg collection on each Classification totaled together is 15 million.
The matching routine takes 6 – 10 hours to complete. This is after doubling the performance by tweeking as much as i could inside the current setup without big data changes.

The allActiveCustomersByOrg Collections are ordered by the Organisation that supplies the product.
So a customer subscribes to an organisation and the region and classification they would like the product for.

As for creating the collections:
The references between the customer and classification is a many to many join. The same for the customer and region. The Join between the product and the region and classification is a many to many.
This means using member key dictionary I cannot add the rall…. to the keys.

I like what you are saying Allistar. I could potentially add another class so I could remove the many to many joins.

For some background context we are much like trademe – Matching customers with organisations products using regions and classifications.

allistar · Postby **allistar** » Thu Apr 11, 2013 9:46 am

Hi sideshow,

So the search takes a customer and an organisation and finds all products that match, based on the classifications and regions the customer is assigned to, and the classifications and regions a product is for? So the search always takes the customer and organisation as input, and provides a collection of products as output?

You know this already, but 6 to 10 hours is an eternity for a query.

Is the product/organisation/classification/customer data fairly stable? I.e. it's less common to be creating these instances than it is to be performing the search?

Suggestion 1: The sledgehammer approach:

Create a class that represents a Customer/Product/Organisation combination:

Code: Select all

OrganisationCustomerProduct:
myCustomer (child) (inversed to allCustomerProducts on Customer)
myProduct (child) (inversed to allCustomerProducts on Product)
myOrganisation (child) (inversed to allCustomerProducts on Organisation)

Write a script that creates instances of this class based on the products that a customer can see. You'll need to update these objects as customers/regions/classifications etc are changed.

The upside: searching is now instant - you have the collection of products for a customer/organisation.

The downside: it'll take a long time to create those instances, there will be a LOT of them, and you'll need to manage contention when refreshing the data. I would do this with a "single writer process" mechanism such as a server background process that is the only thing allowed to update these objects, using updateLocks.

Suggestion 2: Somewhere in between suggestion 1 (full caching of results) and what you have now (no caching of results).

Create a class that represents all products for a particular organisation/classification combination, like this:

Code: Select all

OrganisationClassification:
myOrganisation (child) inversed to allOrgClassRegions on Organisation
myClassification (child) inversed to allOrgClassRegions on Classification
allProductRegions keyed by myRegion (peer-peer) inversed to allOrgClassRegions on Classification

ProductRegion:
allProducts
myRegion
myOrgansationClassification

There would be an MKD holding the OrganisationClassification objects at some global level keyed by:

myOrganisation
myClassification

JADEish pseudocode for searching is something like this:

Code: Select all

foreach classification in customer.allInterectedClassifications do
  orgClassification := root.allOrganisationClassifications.getAtKey(searchedForOrganisation, classification);

  if (orgClassification = null) then continue;
  iter := orgClassification.allProductRegions.createIterator();

  foreach region in customer.allInterestedRegions do
    productRegion := orgClassification.allProductRegions.getAtKey(region);
    if (productRegion = null) then continue;
    iter.startAtObject(productRegion);
    
    while (iter.next(productRegion) and (productRegion.myRegion = region)) do
      //copy the products on this ProductRegion instance to our results
      //may need to filter out duplicates.
      productRegion.allProducts.copy(queryResult);
    endwhile;
  endforeach;
endforeach;

This approach is still caching results, but at a lower lever, so the time to build these cached structures will be lower and the database bloat will be less than the first solution. Searching speed will be slightly slower than the first solution but should be no where near as slow as what you're seeing now. To relieve contention during updates have a single writer thread responsible for updating these structures.

Anyway, I hope this helps you find a way forward.

sideshow · Postby **sideshow** » Mon Apr 15, 2013 5:00 pm

Cool thanks. Great help.
Exactly what i was looking for.

allistar · Postby **allistar** » Mon Apr 15, 2013 9:44 pm

Out of interest, if you implement it this way I would love to see the improvement it has made to search times.

Forums

Improved searching logic help

Improved searching logic help

Re: Improved searching logic help

Re: Improved searching logic help

Re: Improved searching logic help

Re: Improved searching logic help

Re: Improved searching logic help

Re: Improved searching logic help

Who is online