[Bizgres-general] Statement Queuing take II - Resource Scheduling

Mark Kirkwood mkirkwood at greenplum.com
Mon Jun 12 02:59:03 UTC 2006


STATEMENT QUEUING (TAKE II) - RESOURCE SCHEDULING
===========================


Discussion
----------

It is common that database servers are sized and configured for average
or expected loads and usage. While this makes sense from a cost and
practicality viewpoint, when extreme loads or usage arrive the result is
often a drastic reduction in performance, or even service failure (e.g.
due to the OOM killer if on Linux).

In typical OLTP type applications TP monitors and connection pools are
often enough to prevent these extreme situations developing.

In many DSS type applications the situation is somewhat different. In
some cases the query tools do not support or work well with connection
pools, and even for ones that do, the nature of the typical DSS workload
is such that any reasonable pool size will allow extreme loading to develop.

The reason for this workload difference from the OLTP case is that there
is typically a much larger range of possible queries in a DSS system -
particularly if arbitrary user constructed ones are allowed. In addition
the nature of DSS queries means that they *may* scan anything from a
small to a significant percentage of the *entire* database - this
contrasts with a typical OLTP system where the queries are all known,
and typically crafted to scan as small a dataset as possible.

This huge variance of query dataset size means that it is not sufficient
to merely limit the number of concurrent users for the DSS case.


Requirement
-----------

Thus - for DSS systems, some additional mechanism to control workload
and usage (over and above limiting concurrent users) is required.

This mechanism should not be too intrusive with respect to performance,
and should be intelligent about when to apply controls.


Design Overview
---------------

The approach is:

1/ Limit the the number of simultaneously executing queries.
2/ Add other useful limiters incrementally over time.


Limit the the number of simultaneously executing queries:

The idea here is to prevent all currently connected sessions from
executing (expensive) queries simultaneously, via a queuing system of
some kind.

The granularity of the queue is the significant thing to decide, the two
most obvious contenders being:

- Transaction level.
- Statement level.

Transaction level is attractive, because it is considerably simpler to
implement and avoids any deadlock issues[1]. However it is highly
desirable to be able to make additional decisions based on the nature of
the statements contained in the transaction, which is not possible using
this level of granularity alone. Also there is the possibility of of
idle backends occupying a slot in the queue until they timeout, which is
undesirable.

Statement level suffers from complexity and the necessity to handle
deadlocks with the lock manager (as it increases the likelihood of
them), and the meaning of wait limits may become a little confusing, as
a new source of waiting is introduced. However it allows the possibility
of examining the statement(s) and making decisions based on attributes
thereof (e.g. COST).

It is felt that the statement level granularity is the one to pursue at
this point. It will remain to be seen how obtrusive the deadlocking
issues are! However, given the nature of typical DSS workloads (i.e
mainly SELECT queries) locking issues may not be all that serious for
the intended workloads. In addition the ability to inspect statements
provides a mechanism to side-step most obvious source of deadlocks, by
(for instance) skipping the need to queue for UPDATE or INSERT
statements, or for any statements with a low estimated COST.

For those end-user query tools that submit single statement
transactions, the difference in granularities is to some extent moot[1].


Add other useful limiters incrementally over time:

The idea here is to prevent a query executing if doing so would cause
system resources to be exhausted or over-allocated (e.g. cpu, memory,
swap, possibly IO, network/interconnect for MPP).

Clearly a granularity of statement level makes sense here, as at
transaction start there is not yet any information about what statements 
are to be executed.

A challenge for this sort of resource scheduling is in obtaining the
information from the host OS platform in order to determine the limits - 
a plugin or port may be needed for each one.

It is envisaged that the initial development for the resource scheduler
will leave this section as a stub, for subsequent work to implement.


Design
------

Resource scheduling is enabled by a configuration parameter
(resource_scheduler=on|*off*). This requires a server restart to change.

A global parameter (max_resource_queues) determines how many distinct
queues to track waiters can be created.

To control the simultaneously active queries CREATE|ALTER ROLE command
will be amended to allow the maximum number of active statements to be
specified (active_statements=M|*-1*). The default will be (-1) i.e - no
statement limiting.

There is an additional control to allow small COST statements to execute 
immediately (active_statement_cost_threshold=h|*0*). There should be no 
need for additional timeout and deadlock parameters, as the already 
existing ones can be used for this purpose.

A request for a active statement slot will be made after planning, and
will be held until the statement is finished.

If there are no free slots, then statements will queued (FIFO order),
until there are some free slots (or one of the various timeouts is reached).

It is envisaged that there will be new parameters for each additional 
limiter (e.g. max_work_memory=K for a memory limiter), these will also 
queue similarly to the active statement case.


Implementation
--------------

A construct is required that works like:

Grant me if condition == true
otherwise sleep

This looks a little like a standard Postgres lock, but unlike it in
detail - for instance standard locks have deterministic conflict rules,
whereas these will be determined by the conditions at the time (e.g. in
conflict if active_statement_count == M).

Because the lock manager is critical for system performance, it seems
wise to avoid touching it as much as possible [2], however there *has*
to be interaction with the deadlock detector, because our statement
level granularity will create the possibility of deadlocks [1][2].

Therefore a new resource object will be created in shared memory,
together with a lwlock to control access to its structures, and its own
conflict routines.

To handle deadlocks, there will be code added into the deadlock detector
to work out if a standard lock and a resource manager lock are deadlocked.

To handle the queuing of backends for limited resources a set of queues
will be created in shared memory (possibly max_resource_queues of them
added to the end of the resource object itself).

It is envisaged that the new lock objects will be structurally similar
to the standard LOCK structures, complete with a LOCKTAG (say
LOCKTAG_RESOURCE_QUEUE). One id used within the tag will identify the
resource (e.g. simultaneous statements vs say total cpu), plus there
will be a another id for the ROLE - to make it fast to identify which
queue and what the limit value is for this lock.


Discussion Points
-----------------

1/ This resource management functionality is targeted at DSS/Data
Warehouse workloads/systems - is it suitable for batch or financial 
year-end type workloads too? (It is envisaged that there *will* be a 
performance hit if enabled on a typical OLTP systems needing a high 
level concurrent activity).

2/ Is there a need for separate timeout parameters for these resource
locks (or is two very similar parameters even more likely to confuse
than a possible behavior change to the existing ones?).

3/ The resource locks will be released at statement finish - is this
possible to detect reliably?

4/ Do we want this ROLE related control, or should there be a global
parameter that controls all connections except the superuser?

5/ There is to be one lock per resource limit *and* ROLE, is this
necessary? - could we work out the ROLE whenever we examine the queue?
(would that be too big a performance hit)?

6/ Is the decision to make the resource locks distinct from the standard 
locks a good one? For instance it may make the deadlock detection code 
more complex, and it probably requires an additional LOCKTABLE hash. On 
the other had it could well increase robustness by not touching the lock 
code.

7/ Catalog changes and additions need to be thought about (e.g. for just 
simultaneous statement limits, we could add an element to the ROLE 
catalog, but it may be better to create a new ROLE_RESOURCES catalog). 
Ideas?

No doubt there are other things to discuss that I have not thought of :-).

References
----------
[1] Transactional Statement Queuing
[http://lists.pgfoundry.org/pipermail/bizgres-general/2006-February/000397.html]

[2] Private mail from Kurt Harriman where a collection of ideas like
this where discussed.





More information about the Bizgres-general mailing list