[Bizgres-general] Off-list Re: [ENG] Re: Statement Queuing take II - Resource Scheduling (Running with cost and cursors)

Gavin Sherry swm at alcove.com.au
Mon Aug 7 02:04:50 UTC 2006


On Fri, 4 Aug 2006, Luke Lonergan wrote:

> We also need to be able to time-share statement workload within different
> statement queues:
>
> Consider the situation where we set up a queue for ad-hoc queries that
> recognize the use of sorting in statements and that queue allows one query
> to run at a time.  There may be other queues for lighter weight queries.
> During normal use, ad-hoc queries will run consecutively which is OK when
> they are all of reasonable duration, say ten minutes each.
>
> Now someone runs an ad-hoc query that is going to take a much longer time to
> run, say one day.  Every following ad-hoc query will wait the whole day for
> the queue to clear.
>
> The proven way to remove this issue is to time share the single queue slot
> for some number of runnable entries, say two.  During each time slot, say
> one minute, each runnable statement in the queue will rotate into execution.
> After it's time is done, a SIGSUSPEND is sent to it's PID and the next
> runnable statement rotates in and so on.

There are a few issues. This is a classic scheduling problem. The problem
is, the workloads we're talking about are pretty pathological. The reason
is, they are resource hungry *and* long running.

The long running bit is maybe the hardest to solve. What that implies is
holding locks for a long time. Thankfully, PostgreSQL is MVCC so that
problem isn't as bad as it could be. But, if you suspend a process while
it is holding a buffer lock (or pretty much any of the key LW locks), the
system will stop running -- since the running processes will not be able
to get low level locks which usually take only a few hundred cycles to
obtain. Luckily, we already address this structurally in the code.

To take advantage this for scheduling, we'd have to do the following.
For simplicity, say there's a new backend which looks after scheduling.
This process decides that a given backend's time is over. It takes the
procArrayLock and sets a flag in the backend's PGPROC structure. It then
signals the backend with something like SIGUSR2. The backend is
interrupted but by default just registers that it has been signalled and
determines that it's time slice is over. When next it calls
CHECK_FOR_INTERRUPTS() it will signal itself with SIGSUSPEND and it will
stop running. It is guaranteed that at this time it is not in a critical
section.

That doesn't mean it's not in the middle of a serialized mode UPDATE. But,
I do not think we need to handle conflict serialized writes for DW
scenarios -- or at least we shouldn't.

Here's the the problem: it's not actually that simple :-(. This thread
explains the problem:

http://archives.postgresql.org/pgsql-patches/2006-07/msg00039.php

Actually, this system is exactly the system we're talking about. Charles
is a engineer at Fujitsu and the system he is looking after is a heavily
utilised reporting system. There are very long queries and there are
ad-hoc queries. Part of the application is a report designer which
unskilled users can access. They have a tendency to design reports which
hammer the reporting server so there is a rule that any query from these
users is killed after X seconds. The problem is, the queries were not
dying. The cause is the lack of CHECK_FOR_INTERRUPTS() in the in memory
sort code.

Now, it turns out that we cannot just long jump our of the qsort()
comparator because it will leak potentially very large amounts of memory.
That's such an issue for us because we plan to take up processing later
on.

This, however, leads to another problem (which I see Mark Kirkwood has
raised). If we pause a qsort() we might be consuming close to two times
work_mem. How? Well, this is from qsort() in glibc:

      if (size / pagesize > (size_t) phys_pages)
        _quicksort (b, n, s, cmp);
      else
        {
          /* It's somewhat large, so malloc it.  */
          int save = errno;
          char *tmp = malloc (size);
          if (tmp == NULL)
            {
              /* Couldn't get space, so use the slower algorithm
                 that doesn't need a temporary array.  */
              __set_errno (save);
              _quicksort (b, n, s, cmp);
            }
          else
            {
              __set_errno (save);
              msort_with_tmp (b, n, s, cmp, tmp);
              free (tmp);
            }
        }

The first branch is taken if we think that the data to be sorted will
exceed main memory (this actually seems like a pretty poor test, but
anyway). If we think it will, we call _quicksort() which is quick sort
with no temporary memory required. Generally, however, we take the second
path where we allocate temporary space the size of the data to be sorted.
Hence, two times work_mem (I bring this up not just because it is
relevant to this discussion but because it is relevant to anyone working
on resource management).

So, what can we do about this. There are only two things, as far as I can
tell. Either we count this memory use more accurately and use the
information in our scheduler or we swap data out to disk.

The implementation of the former would see us accurately counting memory
usage for each query and limiting newly queued queries. In effect, we may
recalculate work_mem on the fly for new queries if we thought they were
going to have a fairly short run time. There are two ways we could count
memory usage: either count up all allocations via palloc(), which implies
having our own qsort()[1] implementation or we could plugin to the host OS
at a low level and ask the VM how much memory is allocated to a particular
process. I'm pretty sure most OSes would allow us to do this some how.

The problem with this approach is that new queries would be affected by
queries already running. They would be starved of resources. This is a
pretty poor solution to the problem of scheduling.

So, we look at swapping data out to disk. This isn't trivial to do in user
space. What we would have to do is identify memory intensive areas of the
code where we may be stopped (like, during a sort). We would have to
engineer that code to make it restartable. This wouldn't be so hard, I
think, with our own sort code. Equally, we can exert control over hash
aggregation and make it spill to disk in such a way as we can easily
restart it.

The thing is, this looks a lot like we're designing an OS inside of the
database engineer. Even though on paper these things seem like Just a
Matter of Programming, they're going to be pretty hard to implement.

Finally, and this may have been addressed already but I didn't see it, how
are we going to handle prioritisation of the queue. The minute queue
scheduling works, people are going to ask for that. The easiest way I can
see is to implement it on a per role basis or via GUC.

Thanks,

Gavin


[1] qsort() in glibc is not actually quicksort. That's the problem. If we
have free memory then qsort becomes a merge sort.


More information about the Bizgres-general mailing list