Maintaining collections that are frequently updated

Discussions about design and architecture principles, including native JADE systems and JADE interoperating with other technologies

Maintaining collections that are frequently updated

Postby ConvertFromOldNGs » Fri Aug 07, 2009 11:26 am

by geoff >> Mon, 18 Mar 2002 22:21:15 GMT

Take an order entry system as an example. The root object is likely to have the following collections:

* allCustomers
* allProducts
* allOrders

All the collections are large but 'allOrders' could potentially become a bottleneck if orders come in thick and fast. The bottleneck would result in lock exceptions on the clients as they all attempt to update the 'allOrders' collection at the same time.

I first look at a typical system design that does nothing about this bottleneck, then suggest an alternative approach.

A typical design might have the following references defined in the Order class:

* myCustomer (with an inverse allOrders in the Customer class)
* myProduct (with an inverse allOrders in the Product class)
* myRoot (with an inverse allOrders in the Root class)

The first two references make a lot of sense in terms of the model and make good use of JADE's automatic inverse maintenance. I am less happy about every Order object (and every Product and Customer object) having a 'myRoot' reference. Why set the 'myRoot' reference of an object when the target is guaranteed to be 'app.myRoot'? The justification is to invoke JADE's automatic inverse maintenance. However, why not simply add the object directly to the collection on the Root object?

What I am suggesting is replacing the following instruction in the method that sets initial values for an Order:

    self.myRoot := app.myRoot;

with:

    app.myRoot.allOrders.add(self);

To cater for deletions of Order objects, the following instruction is needed in the destructor:

    app.myRoot.allOrders.delete(self);
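
Put in context, a minimal sketch of the Order lifecycle methods under this scheme might look like the following (assuming 'allOrders' is a collection property on Root; whether removal is spelt 'delete' or 'remove' depends on the collection class used):

    // Order::create - constructor, runs when an Order is created
    create() updating;
    begin
        // add directly to the root collection instead of setting a
        // myRoot reference and relying on inverse maintenance
        app.myRoot.allOrders.add(self);
    end;

    // Order::delete - destructor; keeps the collection consistent
    delete() updating;
    begin
        app.myRoot.allOrders.delete(self);
    end;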

The previous suggestion does nothing to alleviate the bottleneck itself: the transaction that creates an Order object must still update the 'allOrders' collection on the Root object.

The alternative approach I am suggesting separates the creation of an Order object from the related update to the 'allOrders' collection. With this approach, a client can create an Order object in parallel with other clients because there is no single-threading caused by contention over the 'allOrders' collection on the Root object.

A background task running on the server will fix up the 'allOrders' collection by adding the new Order objects as quickly as it can. A drawback to this approach is that the collection is not always up-to-date. That is the price you pay for increased parallelism. How critical is a slight lag in updating the collection? That is the question you must answer before adopting this approach.

Here is an outline of a possible implementation of this alternative approach.

When an Order object is created, instead of adding to the 'allOrders' collection, create an instance of a new class called 'AddToOrdersCollection'. The 'AddToOrdersCollection' instance would have a reference called 'order' to the newly created order.

A background task on the server would periodically scan the virtual collection 'AddToOrdersCollection.instances' to see if any orders need adding to the 'allOrders' collection. Once an order has been added, the corresponding instance of 'AddToOrdersCollection' would be deleted. Note the use of the virtual collection 'AddToOrdersCollection.instances' (normally, and rightly, frowned upon in a production system): if a proper collection of AddToOrdersCollection instances were maintained for this purpose, we would simply be substituting contention on that new collection for contention on the 'allOrders' collection.
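
A rough sketch of both halves follows (the method names, the transaction bracketing, and deleting entries while scanning the virtual collection are all simplifications):

    // Order::create - never touches app.myRoot.allOrders, so there
    // is no contention on it between concurrent order-entry clients
    create() updating;
    vars
        fixup : AddToOrdersCollection;
    begin
        create fixup persistent;
        fixup.order := self;    // plain (non-inversed) reference to the new Order
    end;

    // background task, run periodically on the server
    sweepNewOrders() updating;
    vars
        fixup : AddToOrdersCollection;
    begin
        beginTransaction;
        // the virtual collection of all AddToOrdersCollection instances
        foreach fixup in AddToOrdersCollection.instances do
            app.myRoot.allOrders.add(fixup.order);
            delete fixup;       // this work item is done
        endforeach;
        commitTransaction;
    end;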


Re: Maintaining collections that are frequently updated

Postby ConvertFromOldNGs » Fri Aug 07, 2009 11:26 am

by cnwdad1 >> Tue, 19 Mar 2002 0:42:00 GMT

Geoff,

Do you need a single collection for all of the Orders? Why not just have an exclusive collection of Orders on the Customer class?

regards,
darrell.


Re: Maintaining collections that are frequently updated

Postby ConvertFromOldNGs » Fri Aug 07, 2009 11:26 am

by cdshearer >> Tue, 19 Mar 2002 2:12:03 GMT

Yeah, but most businesses have an order number that identifies an order. It's convenient to be able to look up an order using the order number without having to know who the customer is.

I think Geoff was illustrating a general architectural principle of reducing contention on a single global collection. I've used the technique in the past, but mostly to separate out a long-running update from the main object creation. For example, a user enters an Offer to buy/sell into the system, targeted at lots of different users. Each user has a collection of offers targeted at them to look at... so potentially there are lots of collections to update. I broke the transaction into two parts: the submission of the offer in one transaction, and a server application that processed new offers and sent them to the targeted users.

In the above example, I had a separate reference on my Offer object which populated a collection of unprocessed offers on the Root object. The server application listened for notifications on update of the collection and fired off the targeting process.
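
Something like the following, perhaps (the event and response constants, the notification signatures, and the names 'allUnprocessedOffers', 'myUnprocessedRoot' and 'distributeToTargets' are all assumptions for illustration, not Craig's actual code):

    // server app startup: subscribe to updates on the unprocessed-offers collection
    initialize() updating;
    begin
        beginNotification(app.myRoot.allUnprocessedOffers,
                          Object_Update_Event,      // assumed event constant
                          Response_Continuous, 0);
    end;

    // fired when the collection changes; runs the targeting process
    sysNotification(eventType: Integer; theObject: Object;
                    eventTag: Integer; info: Any) updating;
    vars
        offer : Offer;
    begin
        beginTransaction;
        foreach offer in app.myRoot.allUnprocessedOffers do
            offer.distributeToTargets();        // fan out to each targeted user
            offer.myUnprocessedRoot := null;    // inverse maintenance removes it
        endforeach;                             // (a real version would snapshot the
        commitTransaction;                      // membership before updating it)
    end;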

I've been trying to think of a way of avoiding the .instances virtual collection. My initial reaction to this problem would be to make setting the root reference the very last thing in the transaction, so that the collection is locked for the minimum time and going down this path can be avoided. Still, there is the time taken for the transaction to commit, which could be a bottleneck if many users are attempting to update that collection. I guess the bottom line is that if the collection is locked for 200 ms, you can accept no more than one order every 200 ms, i.e. five orders per second.

I also think that there is some danger in this with people designing multiple global collections on Order with different sort orders. It's not uncommon to see 3 or more global collections of this type keyed on different things... (or more if the developer has a relational database background...)

One comment I'd make on the advisability of maintaining the collection end of an inverse: it doesn't matter much from an efficiency perspective, but it's more code to maintain. "Update the singular end of the relationship and have the multiple end automatically maintained" is a common pattern, so I'd advise sticking to it unless there is a really good reason not to. Incidentally, even if the myRoot reference always points to app.myRoot, having it is good practice anyway. What if, in the future, you want to move to a multi-company system and Root is no longer a singleton?

Good stuff, anyway...

Craig


Re: Maintaining collections that are frequently updated

Postby ConvertFromOldNGs » Fri Aug 07, 2009 11:26 am

by cnwkpd1 >> Thu, 21 Mar 2002 23:07:45 GMT

We've found a need to do the same thing in Jasper, to reduce the contention on some large collections we have.

We haven't come up with a mechanism to remove BOTH the runtime contention AND the Class::instances problem, but our solution uses instances only at startup time of the background application, and then has no runtime contention on the objects created to do the large collection updating.

It works like this:

The background app does:

    process OfflineActor.allInstances    // to pick up any objects created while the app was down (there shouldn't be any)
    register for OfflineActor create notifications
    process OfflineActor.allInstances    // to cover the potential timing hole of objects created before registering
    go to the idle state, waiting for creates

sysNotification just processes the OfflineActor (we have multiple subclasses to do different offline tasks), then deletes it.

The online application does whatever updates/creates it normally does, then creates an appropriate OfflineActor subclass instance which has all the necessary references to do its job. In Geoff's original example, it would have a myOrder and myRoot reference, and the "processing" method would just be myOrder.setMyRoot(myRoot). These references are not inversed, so there's no contention in creating the OfflineActor objects, but of course the processing needs to handle exceptions due to invalid object references.
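
As a sketch, the background app's main line and handler might look like this (the class-level create subscription, the event constants, and the 'performTask' method are assumptions layered over Kevin's outline; the exact notification API varies by JADE version):

    // background app main line
    run() updating;
    begin
        processAllInstances();                  // step 1: catch up after downtime
        beginNotification(OfflineActor,         // step 2: assumed class-level
                          Object_Create_Event,  //   create notification
                          Response_Continuous, 0);
        processAllInstances();                  // step 3: close the timing hole
        // step 4: go idle; sysNotification handles the rest
    end;

    processAllInstances() updating;
    vars
        actor : OfflineActor;
    begin
        foreach actor in OfflineActor.allInstances do
            beginTransaction;
            actor.performTask();    // e.g. myOrder.setMyRoot(myRoot); each
            delete actor;           // subclass reimplements performTask
            commitTransaction;
        endforeach;
    end;

    // one OfflineActor per create notification
    sysNotification(eventType: Integer; theObject: Object;
                    eventTag: Integer; info: Any) updating;
    begin
        beginTransaction;
        theObject.OfflineActor.performTask();   // cast, then process
        delete theObject;
        commitTransaction;
    end;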

Kevin


Re: Maintaining collections that are frequently updated

Postby ConvertFromOldNGs » Fri Aug 07, 2009 11:27 am

by cdshearer >> Fri, 22 Mar 2002 0:27:47 GMT

I wonder if all this offline updating of collections will be necessary in JADE 6.0 when there will be partial locking of collections during updating (so I'm told).


Re: Maintaining collections that are frequently updated

Postby ConvertFromOldNGs » Fri Aug 07, 2009 11:27 am

by nik >> Fri, 22 Mar 2002 10:38:02 GMT

> I wonder if all this offline updating of collections will be necessary in JADE 6.0 when there will be partial locking of collections during updating (so I'm told).

Depends on what you mean by 'partial locking' of collections. If you simply replace locking the entire collection with locking collection blocks for the transaction duration, it won't improve concurrency for the sequential-insert scenario discussed in this thread. In the order entry example, let's say the allOrders dictionary was keyed by order number and order numbers are assigned sequentially. In that usage scenario, inserts will always occur in the rightmost block of the BTree, so a block-level locking solution will incur equivalent contention.

Anyway, whatever this talked-about 'partial locking' solution is, I have a sneaking suspicion that it won't make JADE 6.0.


Re: Maintaining collections that are frequently updated

Postby ConvertFromOldNGs » Fri Aug 07, 2009 11:27 am

by cdshearer >> Sun, 24 Mar 2002 21:32:16 GMT

> I wonder if all this offline updating of collections will be necessary in JADE 6.0 when there will be partial locking of collections during updating (so I'm told).
>
> Depends on what you mean by 'partial locking' of collections. If you simply replace locking the entire collection with locking collection blocks for the transaction duration, it won't improve concurrency for the sequential-insert scenario discussed in this thread. In the order entry example, let's say the allOrders dictionary was keyed by order number and order numbers are assigned sequentially. In that usage scenario, inserts will always occur in the rightmost block of the BTree, so a block-level locking solution will incur equivalent contention.


Yes, of course! I guess the only time this would be of benefit would be if a hashing algorithm (or even a generated order number that wasn't sequential) were used, so that sequentially inserted objects end up in different collection blocks. But then collection updates would be less efficient, as a different collection block would need to be loaded into cache each time.

Still, partial locking would reduce contention on the collection when it was simply being looked up.

Ahhh... there are no easy solutions!

> Anyway, whatever this talked-about 'partial locking' solution is, I have a sneaking suspicion that it won't make JADE 6.0.

So, what WILL (or MIGHT if you like) be in 6.0?


Re: Maintaining collections that are frequently updated

Postby ConvertFromOldNGs » Fri Aug 07, 2009 11:27 am

by nik >> Mon, 25 Mar 2002 10:54:37 GMT

> Yes, of course! I guess the only time this would be of benefit would be if a hashing algorithm (or even a generated order number that wasn't sequential) were used, so that sequentially inserted objects end up in different collection blocks. But then collection updates would be less efficient, as a different collection block would need to be loaded into cache each time.
>
> Still, partial locking would reduce contention on the collection when it was simply being looked up.

If you want to avoid contention between readers and updaters, an alternative is to implement a form of MVCC (multi-version concurrency control). With an MVCC solution, multiple readers would see a consistent view of the collection while allowing a single concurrent updater to update its version of the collection. Only when the updater committed its changes would the changes become visible to readers. This strategy can support higher levels of concurrency between sequential readers (e.g. reporting tasks) and random updaters.

> Ahhh... there are no easy solutions!

If there was a known easy solution, someone would have implemented it some time ago.

Prior to the 5.2 release, the various BTree-based 'physical indexes' in the database engine employed block-level locking, held for the transaction duration. That strategy, which was required to support index-block-level redo/undo for recovery and transaction rollback, incurred a degree of transaction queueing at commit and in some cases could result in deadlock exceptions on commit. The 5.2 release now only locks index blocks for the duration of a BTree insert or remove operation.

Some of the issues that add complexity to the collection case are:
a) Collections are distributed across nodes, so any concurrency mechanism has to deal with keeping collections consistent across nodes.
b) Certain behavioural semantics impact on the locking strategy, for example handling 'no duplicates' constraints. If you choose to handle this pessimistically, then your locking strategy has to ensure that concurrent transactions can't violate the constraint. Sound simple? Consider: transaction T1 removes an entry with key k1, and transaction T2 attempts to add an entry with key k1. Can these two transactions overlap before they commit? If not, what locking strategy is appropriate to preserve the constraint while providing high concurrency?


Re: Maintaining collections that are frequently updated

Postby ConvertFromOldNGs » Fri Aug 07, 2009 11:27 am

by cdshearer >> Mon, 25 Mar 2002 21:17:27 GMT

> Yes, of course! I guess the only time this would be of benefit would be if a hashing algorithm (or even a generated order number that wasn't sequential) were used, so that sequentially inserted objects end up in different collection blocks. But then collection updates would be less efficient, as a different collection block would need to be loaded into cache each time.
>
> Still, partial locking would reduce contention on the collection when it was simply being looked up.
>
> If you want to avoid contention between readers and updaters, an alternative is to implement a form of MVCC (multi-version concurrency control). With an MVCC solution, multiple readers would see a consistent view of the collection while allowing a single concurrent updater to update its version of the collection. Only when the updater committed its changes would the changes become visible to readers. This strategy can support higher levels of concurrency between sequential readers (e.g. reporting tasks) and random updaters.

This sounds like something I've just implemented, actually. I have a transient "cache" object that makes transient copies of persistent collections for its clients. Clients can request the same collection multiple times, and the "cache" will hand back the transient copy instead of going to the persistent collection. Updates to the persistent collection cause the cache object to receive a notification that it needs to refresh. Each time a client requests the collection, a refresh flag is checked, and if set, the transient collection is refreshed from the persistent collection.
Looks like I have a formal name for this now...
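
In sketch form, with illustrative names (the cache class, its 'refreshNeeded' flag, and the notification wiring are reconstructions, not the actual implementation):

    // CollectionCache::create - build the transient copy and subscribe
    create() updating;
    begin
        create cachedOrders transient;          // transient copy of allOrders
        beginNotification(app.myRoot.allOrders,
                          Object_Update_Event,  // assumed event constant
                          Response_Continuous, 0);
        refreshNeeded := true;
    end;

    // mark the copy stale; the actual refresh is deferred (lazy)
    sysNotification(eventType: Integer; theObject: Object;
                    eventTag: Integer; info: Any) updating;
    begin
        refreshNeeded := true;
    end;

    // clients call this instead of touching the persistent collection
    getOrders(): OrderColl updating;
    vars
        order : Order;
    begin
        if refreshNeeded then
            cachedOrders.clear();
            foreach order in app.myRoot.allOrders do
                cachedOrders.add(order);        // refresh from the persistent collection
            endforeach;
            refreshNeeded := false;
        endif;
        return cachedOrders;
    end;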


> Ahhh... there are no easy solutions!
>
> If there was a known easy solution, someone would have implemented it some time ago.

Yes, although I would note that there are a lot of things that would be easy to implement that nobody has the resources to dedicate to implementing... :-)

> Prior to the 5.2 release, the various BTree-based 'physical indexes' in the database engine employed block-level locking, held for the transaction duration. That strategy, which was required to support index-block-level redo/undo for recovery and transaction rollback, incurred a degree of transaction queueing at commit and in some cases could result in deadlock exceptions on commit. The 5.2 release now only locks index blocks for the duration of a BTree insert or remove operation.
>
> Some of the issues that add complexity to the collection case are:
> a) Collections are distributed across nodes, so any concurrency mechanism has to deal with keeping collections consistent across nodes.
> b) Certain behavioural semantics impact on the locking strategy, for example handling 'no duplicates' constraints. If you choose to handle this pessimistically, then your locking strategy has to ensure that concurrent transactions can't violate the constraint. Sound simple? Consider: transaction T1 removes an entry with key k1, and transaction T2 attempts to add an entry with key k1. Can these two transactions overlap before they commit? If not, what locking strategy is appropriate to preserve the constraint while providing high concurrency?

I agree, this sounds like a problem... I'm glad you guys have to think about this stuff, and not me!


Re: Maintaining collections that are frequently updated

Postby ConvertFromOldNGs » Fri Aug 07, 2009 11:27 am

by wxv >> Tue, 26 Mar 2002 21:29:40 GMT

> This sounds like something I've just implemented, actually. I have a transient "cache" object that makes transient copies of persistent collections for its clients.....
>
> Looks like I have a formal name for this now...

Jade scalability problems?

You lose all the benefits of the JADE OO model (a single integrated data model) if you have to cache everything in transient structures (and what about large collections?).

Would another option possibly be to manually model the root collections in your own custom data structure, e.g. a linked list or a BTree? There would then be no locking on built-in JADE collections (because there aren't any). Clumsier and less efficient than the built-in collections, but tidier than caching.
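
For instance, a hand-rolled linked list might look like the sketch below (purely hypothetical names; note that the Root object's tail reference still serializes writers, so this avoids contention on collection blocks rather than locking as such):

    // hypothetical: Order has a non-inversed 'nextOrder' reference;
    // Root keeps plain 'firstOrder' and 'lastOrder' references
    appendOrder(order: Order) updating;
    begin
        if app.myRoot.lastOrder = null then
            app.myRoot.firstOrder := order;         // first entry in the list
        else
            app.myRoot.lastOrder.nextOrder := order;
        endif;
        app.myRoot.lastOrder := order;              // Root itself is still updated,
    end;                                            // so writers still queue on it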

