[Bizgres-general] pg_kill

Gavin Sherry swm at alcove.com.au
Mon Aug 21 05:47:20 UTC 2006


On Mon, 21 Aug 2006, Mark Kirkwood wrote:

> Gavin Sherry wrote:
> > Mark,
> >
> > I saw your checkin of this. What do you propose doing to address the
> > underlying issues pg_terminate_backend() actually had?
> >
> >
> >
>
> At the time we discussed this (June), the intention was to make this
> available for those folks that really needed a remote kill - along with
> the warning.
>
> I see that this has come up on -hackers again, and I think the idea of
> setting up a workload to drive out any potential issues is the way to
> go. Whether we do this or wait for someone else (!) is a good question.

I think the problem could/should be attacked slightly differently.
Firstly, it would be useful to understand in which regions of the code
SIGINT will have an overall different effect from SIGTERM. Generally
speaking, when a SIGINT is received and the query is (eventually)
terminated, the backend will exit after a while anyway (assuming the
application is behaving properly). So, there are two issues: the amount of
time it takes to respond to a SIGINT, and the time it takes for the backend
to exit when it is idle in transaction.

Now, the only way to address the first issue is to add more
CHECK_FOR_INTERRUPTS() calls around the place. There is pretty good coverage
at the moment. One area which is poor is qsort() in tuplesort_performsort().
We cannot process interrupts there because we might leak memory (some glibc
qsort() implementations use malloc()d memory). In my experience, this is
one of the places that frustrate people -- because with a very large
work_mem we might sort for some time. When the backend is being terminated,
though, leaking backend-local memory is not an issue because we're about to
die. If anyone knows of other places where long-running queries sit for a
while without a CHECK_FOR_INTERRUPTS(), let me know.
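
To make the "poll at safe points" idea concrete, here is a minimal sketch in
Python rather than backend C. It is not how the backend sorts; the names
InterruptState, QueryCancelled and interruptible_sort are made up for
illustration. The point is only that work split into bounded runs can notice a
pending cancel request between runs, whereas a single opaque qsort() call
cannot.

```python
import heapq


class QueryCancelled(Exception):
    """Raised when a pending cancel request is noticed at a safe point."""


class InterruptState:
    """Toy stand-in for a backend's 'interrupt pending' flag (hypothetical)."""

    def __init__(self):
        self.pending = False

    def request_cancel(self):
        # A signal handler would do only this: record the request and return.
        self.pending = True

    def check_for_interrupts(self):
        # Called only where it is safe to abandon the current work.
        if self.pending:
            self.pending = False
            raise QueryCancelled()


def interruptible_sort(items, state, run_size=1024):
    """Sort `items`, polling for a cancel request every `run_size` elements."""
    runs = []
    for start in range(0, len(items), run_size):
        state.check_for_interrupts()  # safe point between runs
        runs.append(sorted(items[start:start + run_size]))

    result = []
    for i, value in enumerate(heapq.merge(*runs)):
        if i % run_size == 0:
            state.check_for_interrupts()  # safe point during the merge
        result.append(value)
    return result
```

The response time to a cancel is then bounded by the cost of one run, which is
the trade-off being discussed: more polling points mean faster response, but
each one has to be a place where abandoning the work is safe.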

In the second case, the problem is that a SIGINT will not terminate a
transaction block. The backend will remain idle in transaction, holding
locks, etc. What we really need is a SIGINT which does a proc_exit() after
CleanupTransaction(). I think if we can make a TERM'd backend do an
AbortTransaction() in this situation, we should be able to address the
dangling resource issues.
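
The difference between the two signals in the idle-in-transaction case can be
sketched with a toy model. This is Python, not backend code, and ToyBackend
and its methods are hypothetical names; it models only the behaviour described
above (a cancel leaves the transaction block and its locks alone, while the
proposed terminate path aborts the transaction before exiting).

```python
class ToyBackend:
    """Hypothetical model of a backend; not PostgreSQL's actual state machine."""

    def __init__(self):
        self.in_transaction = False
        self.running_query = False
        self.locks = set()
        self.exited = False

    def begin(self):
        self.in_transaction = True

    def lock(self, name):
        # Locks are held until the transaction ends.
        self.locks.add(name)

    def abort_transaction(self):
        # Release everything the transaction was holding.
        self.locks.clear()
        self.in_transaction = False
        self.running_query = False

    def on_sigint(self):
        # Cancel request: only stops a query that is currently running.
        # An idle-in-transaction backend keeps its transaction and its locks.
        if self.running_query:
            self.running_query = False

    def on_sigterm(self):
        # Proposed behaviour: abort any open transaction first, then exit.
        if self.in_transaction:
            self.abort_transaction()
        self.exited = True
```

For example, after begin() and lock("t1"), calling on_sigint() leaves the lock
in place, while on_sigterm() releases it and marks the backend as exited.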

The key problem, however, is that it's not entirely clear where SIGTERM
will cause issues. The proposal on -hackers is to 'test thoroughly'. We all
know that we could have one thousand machines running for one thousand
years and not create a condition in which a problem with SIGTERM is
demonstrated. That is, to paraphrase Bruce's response to Tom, it's hard to
disprove that a bug exists. The other problem is, it is hard to provide a
formal proof or model of something as complex as PostgreSQL and
demonstrate that we will not hit a problem.

So, what to do? Well, the easiest thing would be to take a concurrent
workload with good coverage (like the regression test system) and randomly
SIGTERM concurrent backends. The question is, what are we looking for
here? How will we tell that something has gone wrong?
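
A rough sketch of what such a harness could look like, in Python. Everything
here is an assumption for illustration: the workers are idle child processes
standing in for backends under load, and the helper names (spawn_workers,
terminate_random, still_running, stop_all) are made up. The only check it
makes is the weakest possible one, that the workers which were not signalled
are still running; deciding which stronger invariants to check is exactly the
open question.

```python
import random
import subprocess
import sys


def spawn_workers(count):
    """Start `count` idle child processes standing in for busy backends."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(count)]


def terminate_random(workers, how_many, rng):
    """Terminate `how_many` randomly chosen workers and wait for them to exit."""
    victims = rng.sample(workers, how_many)
    for proc in victims:
        proc.terminate()  # sends SIGTERM on POSIX systems
    for proc in victims:
        proc.wait(timeout=10)
    return victims


def still_running(workers):
    """Return the workers that have not exited."""
    return [proc for proc in workers if proc.poll() is None]


def stop_all(workers):
    """Clean up every remaining worker at the end of a run."""
    for proc in workers:
        if proc.poll() is None:
            proc.kill()
        proc.wait(timeout=10)
```

Passing a seeded random.Random as `rng` makes a failing run repeatable, which
matters if a problem only shows up for one particular choice of victims.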

> It is stretching my memory a bit, but I believe that the thinking back
> in April (when we discussed this at Greenplum) was that given our
> typical workload is typically composed of relatively few concurrent
> users mainly running SELECT queries, the risk is lessened.

Well, SELECTs take out locks, interact with the buffer manager, consume
resources and stuff too :-).

Thanks,

Gavin

