There has been a long-standing issue where the dsource server starts eating all of its resources. I happen to believe the cause is in my modifications to Trac that allow dsource to host multiple projects and such, and I will get on to fixing that real soon now. I'll probably upgrade to Trac 0.11 and Genshi while I'm at it.
Maide and I used gdb to track it down to a probable bug in _sre.c, the Python regexp code. It seems that an endless loop happens here, basically hosing the Apache / mod_python process (or if compiled with threads, the thread).
Still, even though we may have found a bug in Python (and we're looking to upgrade to Python 2.5 because the changelog says some work has been done on this front), it's not good to have introduced the code to Trac that exposed this.
So, until I can get some time away from ${dayjob}, the issue remains. What to do? How about coming up with a brutal hack that works but is embarrassing? This script parses the output of 'uptime', and if the short- and medium-term load averages are both over their thresholds, it stops Apache, waits for it to die, and then restarts it. OMFG:
#!/usr/bin/env python
import commands
import os, sys
from time import localtime, sleep, strftime

DEV = False  # set to True to force a restart regardless of load
MAX_ATTEMPTS = 24
LOGFILE = "/var/log/restarts.log"

def send_oh_shit_mail():
    SENDMAIL = "/usr/sbin/sendmail" # sendmail location
    p = os.popen("%s -t" % SENDMAIL, "w")
    p.write("To: admin@dsource.org\n")
    p.write("Subject: dsource screwed!\n")
    p.write("\n") # blank line separating headers from body
    p.write(":(\n")
    sts = p.close()  # None on success, otherwise the exit status
    if sts is not None:
        print "Sendmail exit status", sts

def stop_apache():
    result = commands.getstatusoutput("/etc/init.d/apache2 stop")
    print result

def wait_for_apache_to_die():
    cmd = "ps -ef | grep apache | grep -v grep | wc -l"
    count = 2
    attempts = 0
    # poll every 5 seconds, giving up after MAX_ATTEMPTS tries
    while count > 1 and attempts < MAX_ATTEMPTS:
        attempts += 1
        result = commands.getstatusoutput(cmd)
        count = int(result[1])
        print "%s - apache instances: %s" \
            % (strftime("%a, %d %b %Y %H:%M:%S", localtime()), count)
        sleep(5) # seconds
    if count > 1:
        # Apache refused to die in time
        send_oh_shit_mail()

def start_apache():
    result = commands.getstatusoutput("/etc/init.d/apache2 start")
    print result

def write_to_log(msg):
    f = open(LOGFILE, 'a')
    try:
        f.write(msg)
    finally:
        f.close()

def get_nums():
    # the three load averages follow "load average: " in the uptime output
    uptime = commands.getstatusoutput("uptime")[1]
    nums = uptime[uptime.find("load average: ")+14:].split(", ")
    return [float(num) for num in nums]

def main():
    try:
        short, medium, long_term = get_nums()
        print short, medium, long_term
        if short > 3 or DEV:
            if medium > 2 or DEV:
                dt = strftime("%a, %d %b %Y %H:%M:%S", localtime())
                write_to_log("restarting Apache: %s %s %s - %s\n" \
                    % (short, medium, long_term, dt))
                stop_apache()
                wait_for_apache_to_die()
                start_apache()
    except Exception, e:
        sys.stderr.write("error: %s\n" % str(e))
        sys.exit(1)

if __name__ == "__main__":
    main()
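
For anyone who wants to poke at the logic without touching a live Apache, here is a rough sketch of the two pieces that matter: pulling the load averages out of a line of 'uptime' output, and polling for something to finish with a cap on attempts. It is written for Python 3, unlike the Python 2 script above, and the names parse_load_averages, should_restart and wait_until are made up for this illustration. The thresholds (3 and 2) and the 24 attempts at 5 seconds are the ones from the script; using os.getloadavg() instead of shelling out to 'uptime' is a swapped-in alternative, not what the script does.

```python
import os
import re
import time

# Thresholds from the script above: restart only when both the
# short-term and medium-term load averages are too high.
SHORT_MAX = 3.0
MEDIUM_MAX = 2.0


def parse_load_averages(uptime_output):
    """Extract the three load averages from a line of `uptime` output."""
    match = re.search(r"load averages?:\s*(.+)$", uptime_output.strip())
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_output)
    # Linux separates the values with ", "; some systems use spaces only.
    parts = re.split(r"[,\s]+", match.group(1).strip())
    return [float(num) for num in parts]


def should_restart(short, medium, short_max=SHORT_MAX, medium_max=MEDIUM_MAX):
    """True when both load averages exceed their thresholds."""
    return short > short_max and medium > medium_max


def wait_until(predicate, max_attempts=24, delay=5.0, sleep=time.sleep):
    """Poll predicate() up to max_attempts times, sleeping between tries.

    Returns True as soon as predicate() holds, False if it never does.
    The sleep function is a parameter so the loop can be tested quickly.
    """
    for _ in range(max_attempts):
        if predicate():
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    # os.getloadavg() (Unix only) returns the same three numbers that
    # `uptime` prints, without any string parsing.
    short, medium, long_term = os.getloadavg()
    print(short, medium, long_term, should_restart(short, medium))
```

Keeping the decision in a plain function like should_restart means the thresholds can be tried against made-up numbers, which is a gentler way to test than flipping a DEV flag on the production box.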