[Bizgres-general] Statement Queuing take II - Resource Scheduling

Mark Kirkwood mkirkwood at greenplum.com
Wed Jul 5 00:53:26 UTC 2006


RESOURCE SCHEDULER - Design Review
==================


Discussion
----------

Over the last few weeks the basic internals for locking and queuing [1] 
have been built, resulting a roughly runnable framework to test the 
basic scheduling functions - lock, wait, release, wakeup a process 
requesting a resource [2].

During the build process, when data structure and architectural 
challenges presented themselves, various decisions were made and now it 
seems like a good idea to present these for discussion and critique. A 
self critique is presented as well, to highlight what I see the main 
issues to be.


Design Description
------------------

1/ Data Structure Relationships

These are shown using some ascii art E-R diagrams "|------<" being 
"one-to-many" and "|------|" meaning "1-1"):


role |------< resource-limits [ >------| limit-types ]
                   -
                   |
                   |
                   |
                   |
                   -
scheduler |------< queue |------| resource-counter and limit
                   -
                   |
                   |
                   |
                   |
                   -
                  resource-lock |------< resource-proclock



2/ System Initialization

A resource-limit is something like "active statement number = 10" or 
"active statement cost = 100000". At system initialization a queue is 
created for each one of these. The queue has additional elements - in 
particular the current value for the limit.  (in the current code, the 
role to resource-limit and resource-limit to queue setups are TODO).

There are a certain (fixed) number of possible limit types, relating to 
the type of resource-limit the queue implements. Two types will be 
provided initially - "active statement count" and "active statement 
cost" [3]. (in the current code the second of these is still TODO).

At system start one resource-scheduler structure is created in shared 
memory and the queues are attached to it as an array.

Resource-lock and Resource-proclock hashes are initialized in shared 
memory at system startup. The Resource-locks are structurally identical 
to standard locks, however the resource-proclocks (will) have additional 
elements (in the current code, they are the same - see "Queue Unlock" 
below).

There will be a new lockmethod entry for these resource-locks (in the 
current code, they share the standard lock table like userlocks, purely 
to ease debugging initially - for instance so they show up in pg_locks).


3/ Session Initialization

At session start, the queue for each possible limit-type, together with 
the role it belongs to are identified and cached.

There is still some discussion about how best to do this (in the current 
code this is still TODO - although Critique nos. 3 and 4 seem to suggest 
a good solution).


4/ Statement Execution

Immediately before a statement is executed, each of the queues is locked 
(in some fixed order), and they are then unlocked after execution. (in 
the current code this is still TODO).


5/ Queue Lock

Very similar to standard lock acquisition (LockAcquire), except the 
conflict resolution code is replaced by a check on the current and 
threshold values for the corresponding queue. Another difference is that 
the range of lockmodes used for standard locks are not required here, 
however a mode of "ExclusiveLock" is used - future proofing for 
comparison with standard locks in deadlock checks.

If a resource lock cannot be granted and the process sleeps, the process 
title is set to "queuing" instead of "waiting".


6/ Queue Unlock.

Very similar to standard lock release (LockRelease), except that the 
conflict resolution is replaced as in the lock case.

In a similar manner, waiter wakeup checks the current and threshold 
values for the corresponding queue for each process waiting to lock this 
queue. (in the current code, this only needs to test a trivial increment 
of the "max active" counter - for the other limit types, we need to know 
the potential increment that the waiting process will increase the 
counter by - at the moment the intention is to store this in the 
extended part of the resource proclock, altho there is a comment in the 
code that incorrectly mentions the proc structure).


7/ Transaction End.

At transaction end, all resource locks are released. (in the current 
code, sub-transaction end is still TODO).


Design Critique
---------------

1/ The queues should be a hash table rather than an array, this will 
speed up finding "my queues" at session initialization (see 2).

2/ The queues and resource-locks could be merged into the resource-lock 
structure as they are 1 -> 1. The idea being that instead of needing to 
access a resource-lock *and* a queue, only a resource-lock is needed. 
This would make resource-lock no longer identical to lock in structure, 
similarly to the resource-proclock. Comments?

3/ The business of assembling resource limits from potentially many 
inherited roles could make it difficult to keep track of what effective 
limits the system is running under.
  A possibility is to create a resource queue object, and attach the 
limits to it, and roles can specify a resource queue - and only the 
current role's resource queue is *ever* used. This means instead of 
having to specify limits for each (login) role, you can have a few 
resource queues and specify one for each role (login role at least). 
This would require a new catalog object ("resource queue").


4/ Having to acquire locks on many queues just to execute 1 statement is 
less than desirable (they are probably all being acquired while holding 
the lock table's lwlock, so it may not be as bad as it seems).
  A possibility is to make the lock granularity at the "resource queue" 
level - (see above), so that only 1 lock operation is required for 
statement execution.
  This would mean that various "queues" would be subsumed into 1 
"resource queue", which would track counters for all the 
resource-limit(s). The new relationships would be:


role >------| resource-queue |------< resource-limits [ >------| 
limit-types ]
                   -
                   |
                   |
                   |
                   |
                   -
scheduler |------< queue |------< resource-counters and limits
                   -
                   |
                   |
                   |
                   |
                   -
                  resource-lock |------< resource-proclock


  This seems like the right approach, as it solves two problems - 
(roles-to-limits and multiple locks), and yet is still flexible. Comments?


5/ The potential increment to each limit type that a new statement will 
cause needs to be remembered for *each* process. The current thinking is 
to store this array of increments in the resource-proclock, so that they 
are readily available for checking by the wakeup checks. This *seems* 
like the correct place to store these - comments?


6/ The codebase location is src/backend/utils/resscheduler for this work 
- as it seemed more 'utils' than 'storage' (where the standard lockcode 
is). Comments?


7/ In the quest to avoid touching the existing lock code, several 
ancillary functions that are process related (ProcSleep, ProcWakeup, 
ProcLockWakeup) have essentially been duplicated (as ResProcSleep etc) 
in the scheduler code base (i.e in 
src/backend/utils/resscheduler/resqueue.c as opposed to proc.c). I am 
thinking that:

  - these should actually be in proc.c so they can use the signal 
setting code, (amongst other things - probably a no-brainer... ) and,

  - in the cases where the modifications are minimal, the original 
functions should be reused. In cases where they are significant (like 
for instance ResProcLockWakeup), then clearly it makes sense to call a 
new function. Comments?


8/ In a similar quest for separation the lock type functions 
(ResLockAcquire, ResLockRelease etc) are coded in 
src/backend/utils/resqueue.c and so cannot see any of the static 
variables in the standard lock code. Now in the light of the need for 
deadlock code sharing, it makes sense for some of  these to be 
accessible, so should the new functions actually be in the same file are 
the standard lock code? (i.e. have I taken the separation too far?).


9/ The various lockish housekeeping stuff like lockmode (always 
ExclusiveLock), lockmask/holdmask etc are being maintained even tho they 
are not really applicable for resource locking. I thought it might ease 
comparison with standard locks for deadlock detection code (for 
instance, the current code picks up lock to resource-lock deadlocks 
under some circumstances already), but is it worth the clutter, and is 
it the "right thing"(tm) to do?


10/The lockmethodid and lockmode parameters disappear from most of the 
function prototypes. in hindsight, it might be best to  put them in and 
use wrapper functions (like LockRelation does for LockAcquire) - the 
thought being that there could be more than one sort of resource-lock 
lockmethod...Comments?


11/The locallock optimization has not been implemented for resource 
locks, as it is envisaged that it will *always* be necessary to examine 
the shared counters for the respective resources. Comments?


Ongoing Questions
-----------------


1/ Identifying statement start and stop appears to be a little subtle, 
as the code works with Lists of statements that are parsed and planned 
and executed - we want to be careful not to take a lock for side-effect 
statements such as triggered or selected SQL procedures (log_statement 
and log_duration logic will help here). In addition cursors need to be 
handled (suggest taking lock on the EXECUTE event, and holding the lock 
across all the FETCHes until the cursor is closed)? Also some simple 
sanity checking is required - we don't want to take a lock for 
statements like BEGIN or COMMIT.

2/ Deadlocking with combinations of resource and standard locks still 
needs some further discussion. Due to the non deterministic nature of 
the resource lock conflict rules, in many cases a deadlock will actually 
clear after a time as other processes release slots for the resource 
locks. More sophisticated deadlocks which are not self clearing are 
possible (e.g. a scenario where *all* current processes are waiting for 
locks that other ones have), but the area in between is still a bit grey.


References
----------

[1] Resource Scheduling proposal 
http://pgfoundry.org/pipermail/bizgres-general/2006-June/000492.html
[2] WIP against Bizgres CVS 2006-06-14 
http://homepages.paradise.net.nz/markir/download/bizgres/bizgres-resschedular-06-29.patch 

[3] Cost based limit rule 
http://pgfoundry.org/pipermail/bizgres-general/2006-June/000497.html




More information about the Bizgres-general mailing list