[Bizgres-general] Statement Queuing take II - Resource Scheduling
Mark Kirkwood
mkirkwood at greenplum.com
Wed Jul 5 00:53:26 UTC 2006
RESOURCE SCHEDULER - Design Review
==================
Discussion
----------
Over the last few weeks the basic internals for locking and queuing [1]
have been built, resulting a roughly runnable framework to test the
basic scheduling functions - lock, wait, release, wakeup a process
requesting a resource [2].
During the build process, when data structure and architectural
challenges presented themselves, various decisions were made and now it
seems like a good idea to present these for discussion and critique. A
self critique is presented as well, to highlight what I see the main
issues to be.
Design Description
------------------
1/ Data Structure Relationships
These are shown using some ascii art E-R diagrams "|------<" being
"one-to-many" and "|------|" meaning "1-1"):
role |------< resource-limits [ >------| limit-types ]
-
|
|
|
|
-
scheduler |------< queue |------| resource-counter and limit
-
|
|
|
|
-
resource-lock |------< resource-proclock
2/ System Initialization
A resource-limit is something like "active statement number = 10" or
"active statement cost = 100000". At system initialization a queue is
created for each one of these. The queue has additional elements - in
particular the current value for the limit. (in the current code, the
role to resource-limit and resource-limit to queue setups are TODO).
There are a certain (fixed) number of possible limit types, relating to
the type of resource-limit the queue implements. Two types will be
provided initially - "active statement count" and "active statement
cost" [3]. (in the current code the second of these is still TODO).
At system start one resource-scheduler structure is created in shared
memory and the queues are attached to it as an array.
Resource-lock and Resource-proclock hashes are initialized in shared
memory at system startup. The Resource-locks are structurally identical
to standard locks, however the resource-proclocks (will) have additional
elements (in the current code, they are the same - see "Queue Unlock"
below).
There will be a new lockmethod entry for these resource-locks (in the
current code, they share the standard lock table like userlocks, purely
to ease debugging initially - for instance so they show up in pg_locks).
3/ Session Initialization
At session start, the queue for each possible limit-type, together with
the role it belongs to are identified and cached.
There is still some discussion about how best to do this (in the current
code this is still TODO - although Critique nos. 3 and 4 seem to suggest
a good solution).
4/ Statement Execution
Immediately before a statement is executed, each of the queues is locked
(in some fixed order), and they are then unlocked after execution. (in
the current code this is still TODO).
5/ Queue Lock
Very similar to standard lock acquisition (LockAcquire), except the
conflict resolution code is replaced by a check on the current and
threshold values for the corresponding queue. Another difference is that
the range of lockmodes used for standard locks are not required here,
however a mode of "ExclusiveLock" is used - future proofing for
comparison with standard locks in deadlock checks.
If a resource lock cannot be granted and the process sleeps, the process
title is set to "queuing" instead of "waiting".
6/ Queue Unlock.
Very similar to standard lock release (LockRelease), except that the
conflict resolution is replaced as in the lock case.
In a similar manner, waiter wakeup checks the current and threshold
values for the corresponding queue for each process waiting to lock this
queue. (in the current code, this only needs to test a trivial increment
of the "max active" counter - for the other limit types, we need to know
the potential increment that the waiting process will increase the
counter by - at the moment the intention is to store this in the
extended part of the resource proclock, altho there is a comment in the
code that incorrectly mentions the proc structure).
7/ Transaction End.
At transaction end, all resource locks are released. (in the current
code, sub-transaction end is still TODO).
Design Critique
---------------
1/ The queues should be a hash table rather than an array, this will
speed up finding "my queues" at session initialization (see 2).
2/ The queues and resource-locks could be merged into the resource-lock
structure as they are 1 -> 1. The idea being that instead of needing to
access a resource-lock *and* a queue, only a resource-lock is needed.
This would make resource-lock no longer identical to lock in structure,
similarly to the resource-proclock. Comments?
3/ The business of assembling resource limits from potentially many
inherited roles could make it difficult to keep track of what effective
limits the system is running under.
A possibility is to create a resource queue object, and attach the
limits to it, and roles can specify a resource queue - and only the
current role's resource queue is *ever* used. This means instead of
having to specify limits for each (login) role, you can have a few
resource queues and specify one for each role (login role at least).
This would require a new catalog object ("resource queue").
4/ Having to acquire locks on many queues just to execute 1 statement is
less than desirable (they are probably all being acquired while holding
the lock table's lwlock, so it may not be as bad as it seems).
A possibility is to make the lock granularity at the "resource queue"
level - (see above), so that only 1 lock operation is required for
statement execution.
This would mean that various "queues" would be subsumed into 1
"resource queue", which would track counters for all the
resource-limit(s). The new relationships would be:
role >------| resource-queue |------< resource-limits [ >------|
limit-types ]
-
|
|
|
|
-
scheduler |------< queue |------< resource-counters and limits
-
|
|
|
|
-
resource-lock |------< resource-proclock
This seems like the right approach, as it solves two problems -
(roles-to-limits and multiple locks), and yet is still flexible. Comments?
5/ The potential increment to each limit type that a new statement will
cause needs to be remembered for *each* process. The current thinking is
to store this array of increments in the resource-proclock, so that they
are readily available for checking by the wakeup checks. This *seems*
like the correct place to store these - comments?
6/ The codebase location is src/backend/utils/resscheduler for this work
- as it seemed more 'utils' than 'storage' (where the standard lockcode
is). Comments?
7/ In the quest to avoid touching the existing lock code, several
ancillary functions that are process related (ProcSleep, ProcWakeup,
ProcLockWakeup) have essentially been duplicated (as ResProcSleep etc)
in the scheduler code base (i.e in
src/backend/utils/resscheduler/resqueue.c as opposed to proc.c). I am
thinking that:
- these should actually be in proc.c so they can use the signal
setting code, (amongst other things - probably a no-brainer... ) and,
- in the cases where the modifications are minimal, the original
functions should be reused. In cases where they are significant (like
for instance ResProcLockWakeup), then clearly it makes sense to call a
new function. Comments?
8/ In a similar quest for separation the lock type functions
(ResLockAcquire, ResLockRelease etc) are coded in
src/backend/utils/resqueue.c and so cannot see any of the static
variables in the standard lock code. Now in the light of the need for
deadlock code sharing, it makes sense for some of these to be
accessible, so should the new functions actually be in the same file are
the standard lock code? (i.e. have I taken the separation too far?).
9/ The various lockish housekeeping stuff like lockmode (always
ExclusiveLock), lockmask/holdmask etc are being maintained even tho they
are not really applicable for resource locking. I thought it might ease
comparison with standard locks for deadlock detection code (for
instance, the current code picks up lock to resource-lock deadlocks
under some circumstances already), but is it worth the clutter, and is
it the "right thing"(tm) to do?
10/The lockmethodid and lockmode parameters disappear from most of the
function prototypes. in hindsight, it might be best to put them in and
use wrapper functions (like LockRelation does for LockAcquire) - the
thought being that there could be more than one sort of resource-lock
lockmethod...Comments?
11/The locallock optimization has not been implemented for resource
locks, as it is envisaged that it will *always* be necessary to examine
the shared counters for the respective resources. Comments?
Ongoing Questions
-----------------
1/ Identifying statement start and stop appears to be a little subtle,
as the code works with Lists of statements that are parsed and planned
and executed - we want to be careful not to take a lock for side-effect
statements such as triggered or selected SQL procedures (log_statement
and log_duration logic will help here). In addition cursors need to be
handled (suggest taking lock on the EXECUTE event, and holding the lock
across all the FETCHes until the cursor is closed)? Also some simple
sanity checking is required - we don't want to take a lock for
statements like BEGIN or COMMIT.
2/ Deadlocking with combinations of resource and standard locks still
needs some further discussion. Due to the non deterministic nature of
the resource lock conflict rules, in many cases a deadlock will actually
clear after a time as other processes release slots for the resource
locks. More sophisticated deadlocks which are not self clearing are
possible (e.g. a scenario where *all* current processes are waiting for
locks that other ones have), but the area in between is still a bit grey.
References
----------
[1] Resource Scheduling proposal
http://pgfoundry.org/pipermail/bizgres-general/2006-June/000492.html
[2] WIP against Bizgres CVS 2006-06-14
http://homepages.paradise.net.nz/markir/download/bizgres/bizgres-resschedular-06-29.patch
[3] Cost based limit rule
http://pgfoundry.org/pipermail/bizgres-general/2006-June/000497.html
More information about the Bizgres-general
mailing list