Bug#536823: I also see these losses of the uptimed database

December 07th, 2010 - 07:40 am ET by Martin Steigerwald

Ted, I cc'd you. Could you please have a look at the save_records function
in the middle of my mail and tell us whether it's safe to use, on Ext4 at
least? I understand there might be a problem when using it on XFS, as XFS
doesn't cover the rename case. Thanks.


Hi!

It ate it, about 13 days ago - on my ThinkPad T42:

shambhala:~> uprecords | cut -c1-66
     #               Uptime | System
----------------------------+-------------------------------------
     1    10 days, 21:01:41 | Linux 2.6.37-rc3-tp42     Fri Nov 26
     2     2 days, 02:09:03 | Linux 2.6.37-rc3-tp42     Wed Nov 24
     3     0 days, 13:59:05 | Linux 2.6.37-rc3-tp42     Tue Nov 23
     4     0 days, 06:40:23 | Linux 2.6.36-tp42-gtt-vr  Tue Nov 23
->   5     0 days, 02:04:05 | Linux 2.6.37-rc3-tp42
     6     0 days, 00:41:55 | Linux 2.6.37-rc3-tp42     Tue Nov 23
----------------------------+-------------------------------------
1up in     0 days, 04:36:19 | at Tue Dec 7
no1 in    10 days, 18:57:37 | at Sat Dec 18
    up    13 days, 22:36:12 | since Tue Nov 23
  down     0 days, 00:06:49 | since Tue Nov 23
   %up               99.966 | since Tue Nov 23

I don't remember what might have happened at that time.

It's not the first time. I already restored it from a backup in October:

shambhala:~> ls -l /var/spool/uptimed
total 28
-rw-r--r-- 1 daemon daemon   11 Dec  7 10:50 bootid
-rw-r--r-- 1 root   root    254 Dec  7 12:35 records
-rw-r--r-- 1 daemon daemon 9806 Mar  3  2010 records-2010-03-03-aus-dem-rsync-backup
-rw-r--r-- 1 daemon daemon 1450 Mar  9  2010 records-2010-03-09-unvollstaendig
-rw-r--r-- 1 daemon daemon  254 Dec  7 12:30 records.old

As you can see, the last working backup here is 9806 bytes, far bigger than
the current file at 254 bytes.

This is on a

shambhala:~> df -hT /var/spool/uptimed
Filesystem                   Type  Size  Used Avail Use% Mounted on
/dev/mapper/shambhala-debian ext4   20G   14G  5.5G  72% /

and a quite recent kernel (2.6.36 / 2.6.37-rc3), which has the Ext4
safeguard for the rename and truncate cases. I believe it was introduced in
2.6.30: it flushes the written data *before* renaming the file. But
according to libuptimed/urec.c:

void save_records(int max, time_t log_threshold) {
	FILE *f;
	Urec *u;
	int i = 0;

	f = fopen(FILE_RECORDS".tmp", "w");
	if (!f) {
		printf("uptimed: cannot write to %s\n", FILE_RECORDS);
		return;
	}

	for (u = urec_list; u; u = u->next) {
		/* Ignore everything below the threshold */
		if (u->utime >= log_threshold) {
			fprintf(f, "%lu:%lu:%s\n", (unsigned long)u->utime, (unsigned long)u->btime, u->sys);
			/* Stop processing when we've logged the max number specified. */
			if ((max > 0) && (++i >= max)) break;
		}
	}
	fclose(f);
	rename(FILE_RECORDS, FILE_RECORDS".old");
	rename(FILE_RECORDS".tmp", FILE_RECORDS);
}

uptimed does use the rename case. So I do not understand *why* it ate my
old records again.

Nonetheless, I think there should be a safeguard, like using the old file
if the current one is empty.
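
Such a safeguard could be sketched roughly as follows. This is not uptimed
code: choose_records_file is a hypothetical helper made up for this example,
and it assumes a POSIX system. It only catches a current file that is
missing or zero bytes long, not one that was rewritten with just a few
entries.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper, not part of uptimed: decide which file to load.
 * Returns `primary` if it exists and is non-empty, otherwise `fallback`
 * if that exists and is non-empty, otherwise NULL. */
const char *choose_records_file(const char *primary, const char *fallback)
{
	struct stat st;

	assert(primary != NULL && fallback != NULL);

	/* Prefer the current records file when it has any content. */
	if (stat(primary, &st) == 0 && st.st_size > 0)
		return primary;

	/* Otherwise fall back to the previous copy, e.g. "records.old". */
	if (stat(fallback, &st) == 0 && st.st_size > 0)
		return fallback;

	return NULL;
}
```

The loader would then open whatever this returns, for example
choose_records_file(FILE_RECORDS, FILE_RECORDS".old"), and start with an
empty list only when it returns NULL.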

I would also keep more than one backup given the small size of this file.
Maybe logrotate can do this while keeping the original file instead of
truncating it.

I have the following configuration:

shambhala:~> cat /etc/uptimed.conf
# Uptimed configuration file.

# Interval to write the logfile with in seconds.
UPDATE_INTERVAL=300

# Maximum number of entries in logfile. Set to 0 for unlimited.
LOG_MAXIMUM_ENTRIES=0

# Minimum uptime that must be reached for it to be considered a record.
LOG_MINIMUM_UPTIMED=1h
[...]

An option to call fsync() would be good, so that people here can easily
test whether fsync helps in this case.
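
As a rough illustration of what such an fsync() variant could look like,
here is a minimal sketch. It is not uptimed's code: save_durably and its
signature are made up for this example, and it assumes a POSIX system.
Unlike save_records above, it also checks the result of every write, so a
failed or short write (a full disk, say) never gets renamed over the
existing file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, not part of uptimed: write `data` to "<path>.tmp",
 * force it to disk, and only then rename it over `path`.
 * Returns 0 on success, -1 on any error (leaving `path` untouched). */
int save_durably(const char *path, const char *data)
{
	char tmp[4096];
	FILE *f;
	int n;

	assert(path != NULL && data != NULL);

	n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
	if (n < 0 || (size_t)n >= sizeof tmp)
		return -1;

	f = fopen(tmp, "w");
	if (!f)
		return -1;

	if (fputs(data, f) == EOF ||
	    fflush(f) != 0 ||            /* stdio buffer -> kernel */
	    fsync(fileno(f)) != 0) {     /* kernel -> stable storage */
		fclose(f);
		remove(tmp);
		return -1;
	}
	if (fclose(f) != 0) {
		remove(tmp);
		return -1;
	}

	/* Only a complete, synced file ever replaces the old one. */
	if (rename(tmp, path) != 0) {
		remove(tmp);
		return -1;
	}
	return 0;
}
```

Whether the extra fsync() is what prevents the loss is exactly what would
need testing. The sketch does not sync the containing directory, which a
stricter version might also do.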

Then there is the slight chance that uptimed gets confused at runtime and
writes out an empty records file by accident. But I find this highly
unlikely.

I will restore as much as possible from my backup. It is easily possible to
combine the contents of a backup and a new records file.

I also lost the records during a Lenny => Squeeze upgrade on my Dell
workstation at work. So that makes three losses within just a few months.
In its current state, uptimed is hardly usable for me.

For now I have set up a backup for myself as fcrontab jobs:

# Backup of the uptimed database
@ 1d cp -p /var/spool/uptimed/records ~/Backup/uptimed/records-$(date +%Y-%m-%d)
@ 30d find ~/Backup/uptimed/ -name "records-*" -and -mtime +30 -delete

Something like that should go into uptimed or a cron-job that comes with
the package. Could be a cron.daily or at least cron.weekly job (using some
directory in /var for backups).

So, I hope this was enough constructive feedback to show what can be done
about it. If you want, I can write a cron job for the uptimed package that
does the backup. I am not that much into C programming currently, but
eventually I could come up with a patch for uptimed as well.

But I think this bug needs to be acknowledged as serious, because data loss
is involved. Just denying that there is a problem doesn't help us get any
further. A user of uptimed IMHO rightly does not care whether it's a
problem in the kernel, the filesystem, or the userspace program.

Ciao,
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7






To UNSUBSCRIBE, email to debian-bugs-dist-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org

Replies

#1 Martin Steigerwald
December 11th, 2010 - 07:50 am ET

On Tuesday, 07 December 2010, Martin Steigerwald wrote:
> I will restore as much as possible from my backup. It is easily possible
> to combine the contents of a backup and a new records file.



I forgot to copy the records file out of my backup before creating the next
backup. And since I do not yet store multiple revisions in my backup, I
lost lots of entries.

I tried to combine what's left with

cat old-file > records-new
cat old-file2 >> records-new
cat records >> records-new

sort -n -t ":" -k2,2 records-new | uniq > records-sorted
sort -rn -t ":" -k1,1 records-sorted > records-final

cp -p records records-2010-12-11.bak
mv records-final records

But I lost too much:

:~> uprecords | egrep "( up | down |%up)" | cut -c 1-66
up 504 days, 01:46:37 | since Thu Sep 18
down 309 days, 22:35:13 | since Thu Sep 18
%up 61.924 | since Thu Sep 18

Thus I can start from scratch. I had over 92% of uptime before. Hmm, I
don't know how this can happen, since the last backup I could restore was
from October. It can't have 200 days of downtime. Maybe I will try merging
again. Maybe I did something wrong with the merge commands above.
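
One thing that may matter for a merge: plain uniq only drops lines that are
completely identical, while the same boot can show up in two files with
different uptimes. Below is a sketch of a merge that keys on the boot time
instead and keeps the longest uptime per boot. It assumes the
utime:btime:sys line format written by save_records above. merge_record,
merge_file and the fixed-size table are made up for this example and are
not part of uptimed.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_RECS 4096
#define SYS_LEN  256

struct rec {
	unsigned long utime;   /* uptime in seconds */
	unsigned long btime;   /* boot time, used as the merge key */
	char sys[SYS_LEN];     /* system description */
};

/* Insert or update one record in recs[0..*n). Two entries with the same
 * boot time are treated as the same boot; the longer uptime wins.
 * Returns 0 on success, -1 if the table is full. */
int merge_record(struct rec *recs, size_t *n, unsigned long utime,
                 unsigned long btime, const char *sys)
{
	size_t i;

	assert(recs != NULL && n != NULL && sys != NULL);

	for (i = 0; i < *n; i++) {
		if (recs[i].btime == btime) {
			if (utime > recs[i].utime)
				recs[i].utime = utime;
			return 0;
		}
	}
	if (*n >= MAX_RECS)
		return -1;

	recs[*n].utime = utime;
	recs[*n].btime = btime;
	snprintf(recs[*n].sys, SYS_LEN, "%s", sys);
	(*n)++;
	return 0;
}

/* Merge every well-formed "utime:btime:sys" line of `path` into the table.
 * Malformed lines are skipped. Returns 0 on success, -1 on error. */
int merge_file(struct rec *recs, size_t *n, const char *path)
{
	char line[512], sys[SYS_LEN];
	unsigned long utime, btime;
	int rc = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof line, f)) {
		/* %255[^\n] takes the rest of the line as the system string. */
		if (sscanf(line, "%lu:%lu:%255[^\n]", &utime, &btime, sys) != 3)
			continue;
		if (merge_record(recs, n, utime, btime, sys) != 0) {
			rc = -1;
			break;
		}
	}
	fclose(f);
	return rc;
}
```

Calling merge_file once per input file (the backups first, then the current
records) and writing the table back out in the same format would give one
line per boot.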

I consider losing all records from time to time to be an epic failure of
either uptimed or the filesystem. But it's an Ext4 with those delayed
allocation work-arounds that should help most applications.

I wonder whether I am better off purging uptimed and being done with it,
since I do not understand why this happens when the rename case should be
safe, and I also do not know what to do about it.

I think I will start with a new records file and put my backup cronjobs
everywhere where uptimed is running, for now.

Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7





