[RFC PATCH 0/4 V2] introduce: livedump

May 25th, 2012 - 05:40 am ET by YOSHIDA Masanori

Changes in V2:
- A few more comments are added.
- Operation using tools/livedump/livedump is simplified.
- Previous 5 patches are arranged to 4 patches.
([3/5] and [4/5] are merged)
- The patchset is rebased onto v3.4.
- crash-6.0.6 is required (which was 6.0.1 previously).


The following series introduces a new memory dumping mechanism, Live Dump,
which lets users obtain a consistent memory dump without stopping a running
system.

Such a mechanism is especially useful where very important systems are
consolidated onto a single machine via virtualization. Assume a KVM host
runs multiple important VMs and one of them fails: the other VMs have to
keep running. At the same time, however, an administrator may want to
obtain a memory dump of not only the failed guest but also the host,
because the cause of the failure may lie not in the guest but in the host
or the hardware under it.

Live Dump is based on the copy-on-write technique. Processing is performed
in the following order.
(1) Suspend processing on all CPUs.
(2) Make the pages that you want to dump read-only.
(3) Resume all CPUs.
(4) On a page fault, dump the page that contains the faulting address.
(5) Finally, dump the remaining pages, which have not been updated.
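
The five steps above can be modelled in user space. The following is only a
sketch under stated assumptions, not the code in this series: the "pages"
are tiny byte arrays, the read-only state is a plain flag instead of a
page-table bit, and the "page fault" is an explicit check in model_write().
All names (snapshot_model, snapshot_begin, model_write, snapshot_finish) are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES  4
#define PAGE_SZ 16      /* tiny "pages" keep the model readable */

struct snapshot_model {
	unsigned char mem[NPAGES][PAGE_SZ];   /* live memory */
	unsigned char dump[NPAGES][PAGE_SZ];  /* dump buffer */
	bool read_only[NPAGES];               /* stands in for the PTE write bit */
	bool dumped[NPAGES];
};

/* Steps (1)-(3): while nothing else runs, mark every page read-only. */
void snapshot_begin(struct snapshot_model *m)
{
	for (size_t i = 0; i < NPAGES; i++) {
		m->read_only[i] = true;
		m->dumped[i] = false;
	}
}

/* Copy the current content of one page into the dump and unprotect it. */
void dump_page(struct snapshot_model *m, size_t page)
{
	memcpy(m->dump[page], m->mem[page], PAGE_SZ);
	m->dumped[page] = true;
	m->read_only[page] = false;
}

/* Step (4): a write to a protected page "faults"; the old content is
 * dumped first, then the write is allowed to proceed. */
void model_write(struct snapshot_model *m, size_t page, size_t off,
		 unsigned char val)
{
	assert(page < NPAGES && off < PAGE_SZ);
	if (m->read_only[page])
		dump_page(m, page);
	m->mem[page][off] = val;
}

/* Step (5): sweep the pages that nobody wrote to in the meantime. */
void snapshot_finish(struct snapshot_model *m)
{
	for (size_t i = 0; i < NPAGES; i++)
		if (!m->dumped[i])
			dump_page(m, i);
}
```

After snapshot_finish() the dump buffer holds the content every page had at
snapshot_begin(), regardless of how many writes happened in between, which
is the consistency property the mechanism relies on.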

Currently, Live Dump is just a simple prototype and has many limitations.
The important ones are listed below.
(1) It write-protects only the kernel's straight mapping area. Therefore,
memory updates made through vmap areas and user space don't cause page
faults, and pages corresponding to these areas are not dumped consistently.
(2) It supports only the x86-64 architecture.
(3) It can only handle 4K pages. As we know, most pages in kernel space are
mapped via 2M or 1G large page mappings. Therefore, the current
implementation of Live Dump splits all large pages into 4K pages before
setting up write protection.
(4) It allocates about 50% of physical RAM to store dumped pages. Currently,
Live Dump first saves all dumped data in memory, and only after that can a
user access the dumped data. Live Dump itself has no feature to save
dumped data onto a disk or any other storage device.
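
To give a feeling for the cost of limitation (3), the arithmetic of
splitting one large mapping can be written down as follows. This is a
sketch based only on the x86-64 page-table format (4K tables holding 512
8-byte entries), not on the code in this series; the helper names are
hypothetical.

```c
#include <stdint.h>

#define SZ_4K (UINT64_C(1) << 12)
#define SZ_2M (UINT64_C(1) << 21)
#define SZ_1G (UINT64_C(1) << 30)
#define PTES_PER_TABLE 512   /* 4 KiB table / 8-byte entries on x86-64 */

/* Number of 4K mappings that replace one large mapping of `size` bytes. */
uint64_t split_pte_count(uint64_t size)
{
	return size / SZ_4K;
}

/* Page-table pages needed to split one large mapping:
 * 2M -> one PTE table; 1G -> 512 PTE tables plus one PMD table. */
uint64_t split_table_pages(uint64_t size)
{
	uint64_t pte_tables = split_pte_count(size) / PTES_PER_TABLE;

	return size == SZ_1G ? pte_tables + 1 : pte_tables;
}
```

So one 2M mapping becomes 512 4K entries in a single new table, while one
1G mapping becomes 262144 entries spread over 513 new table pages.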

This series consists of 4 patches.

The 1st patch adds a notifier call chain to do_page_fault. This is the only
modification to an existing code path of the upstream kernel.

The 2nd patch introduces the "livedump" misc device.

The 3rd patch introduces write protection management. This enables users to
turn on write protection for kernel space and to install a hook function
that is called every time a page fault occurs on a protected page.

The last patch introduces the memory dumping feature. This patch installs
the function that dumps the content of the protected page on a page fault.
It also lets users access the dumped data via the misc device interface.


***How to test***
To test this patchset, you have to apply the attached patch to the source
code of crash[1]. The patch applies to version 6.0.6 of crash. In addition,
you have to build your kernel with CONFIG_DEBUG_INFO enabled.

[1]crash, http://people.redhat.com/anderson/c...0.6.tar.gz

First, run the script tools/livedump/livedump as follows.
# livedump dump

At this point, the entire memory image has been saved (in memory). You can
then analyze the image by running the patched crash as follows.
# crash /dev/livedump /boot/System.map /boot/vmlinux.o

The following command releases all resources held by livedump.
# livedump release



YOSHIDA Masanori (4):
livedump: Add memory dumping functionality
livedump: Add write protection management
livedump: Add the new misc device "livedump"
livedump: Add notifier-call-chain into do_page_fault


arch/x86/Kconfig | 29 ++
arch/x86/include/asm/traps.h | 2
arch/x86/include/asm/wrprotect.h | 47 +++
arch/x86/mm/Makefile | 2
arch/x86/mm/fault.c | 7
arch/x86/mm/wrprotect.c | 618 ++++++++++++++++++++++++++++++++++++++
kernel/Makefile | 1
kernel/livedump-memdump.c | 237 +++++++++++++++
kernel/livedump-memdump.h | 45 +++
kernel/livedump.c | 129 ++++++++
tools/livedump/livedump | 28 ++
11 files changed, 1145 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/wrprotect.h
create mode 100644 arch/x86/mm/wrprotect.c
create mode 100644 kernel/livedump-memdump.c
create mode 100644 kernel/livedump-memdump.h
create mode 100644 kernel/livedump.c
create mode 100755 tools/livedump/livedump


diff --git a/filesys.c b/filesys.c
index 5c45a8f..80f5918 100755
--- a/filesys.c
+++ b/filesys.c
@@ -167,6 +167,7 @@ memory_source_init(void)
return;

if (!STREQ(pc->live_memsrc, "/dev/mem") &&
+ !STREQ(pc->live_memsrc, "/dev/livedump") &&
STREQ(pc->live_memsrc, pc->memory_device)) {
if (memory_driver_init())
return;
@@ -187,6 +188,9 @@ memory_source_init(void)
strerror(errno));
} else
pc->flags |= MFD_RDWR;
+ } else if (STREQ(pc->live_memsrc, "/dev/livedump")) {
+ if ((pc->mfd = open("/dev/livedump", O_RDONLY)) < 0)
+ error(FATAL, "/dev/livedump: %s", strerror(errno));
} else if (STREQ(pc->live_memsrc, "/proc/kcore")) {
if ((pc->mfd = open("/proc/kcore", O_RDONLY)) < 0)
error(FATAL, "/proc/kcore: %s",
diff --git a/main.c b/main.c
index 5a5e19c..8628cde 100755
--- a/main.c
+++ b/main.c
@@ -436,6 +436,19 @@ main(int argc, char **argv)
pc->writemem = write_dev_mem;
pc->live_memsrc = argv[optind];

+ } else if (STREQ(argv[optind], "/dev/livedump")) {
+ if (pc->flags & MEMORY_SOURCES) {
+ error(INFO,
+ "too many dumpfile arguments");
+ program_usage(SHORT_FORM);
+ }
+ pc->flags |= DEVMEM;
+ pc->dumpfile = NULL;
+ pc->readmem = read_dev_mem;
+ pc->writemem = write_dev_mem;
+ pc->live_memsrc = argv[optind];
+ pc->program_pid = 1;
+
} else if (is_proc_kcore(argv[optind], KCORE_LOCAL)) {
if (pc->flags & MEMORY_SOURCES) {
error(INFO,

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies

#1 YOSHIDA Masanori
May 25th, 2012 - 05:40 am ET
Introduces the new misc device "livedump".
This device will be used as interface between livedump and user space.
Right now, the device only has empty ioctl operation.

***ATTENTION PLEASE***
I think debugfs is more suitable for this feature, but currently livedump
uses the misc device for simplicity. This will be fixed in the future.

Signed-off-by: YOSHIDA Masanori
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc:
Cc: Kevin Hilman
Cc: "Rafael J. Wysocki"
Cc: Peter Zijlstra
Cc:


arch/x86/Kconfig | 15 ++++++++++
kernel/Makefile | 1 +
kernel/livedump.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 99 insertions(+), 0 deletions(-)
create mode 100644 kernel/livedump.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c9866b0..4c97583 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1729,6 +1729,21 @@ config CMDLINE_OVERRIDE
This is used to work around broken boot loaders. This should
be set to 'N' under normal conditions.

+config LIVEDUMP
+ bool "Live Dump support"
+ depends on X86_64
+ help
+ Set this option to 'Y' to allow the kernel support to acquire
+ a consistent snapshot of kernel space without stopping system.
+
+ This feature regularly causes small overhead on kernel.
+
+ Once this feature is initialized by its special ioctl, it
+ allocates huge memory for itself and causes much more overhead
+ on kernel.
+
+ If in doubt, say N.
+
endmenu

config ARCH_ENABLE_MEMORY_HOTPLUG
diff --git a/kernel/Makefile b/kernel/Makefile
index cb41b95..f095e7a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
obj-$(CONFIG_PADATA) += padata.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_LIVEDUMP) += livedump.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/livedump.c b/kernel/livedump.c
new file mode 100644
index 0000000..3103292
--- /dev/null
+++ b/kernel/livedump.c
@@ -0,0 +1,83 @@
+/* livedump.c - Live Dump's main
+ * Copyright (C) 2012 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+
+#define DEVICE_NAME "livedump"
+
+#define LIVEDUMP_IOC(x) _IO(0xff, x)
+
+static long livedump_ioctl(
+ struct file *file, unsigned int cmd, unsigned long arg)
+{
+ switch (cmd) {
+ default:
+ return -ENOIOCTLCMD;
+ }
+}
+
+static int livedump_open(struct inode *inode, struct file *file)
+{
+ if (!try_module_get(THIS_MODULE))
+ return -ENOENT;
+ return 0;
+}
+
+static int livedump_release(struct inode *inode, struct file *file)
+{
+ module_put(THIS_MODULE);
+ return 0;
+}
+
+static const struct file_operations livedump_fops = {
+ .unlocked_ioctl = livedump_ioctl,
+ .open = livedump_open,
+ .release = livedump_release,
+};
+static struct miscdevice livedump_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = DEVICE_NAME,
+ .fops = &livedump_fops,
+};
+
+static int livedump_module_init(void)
+{
+ int ret;
+
+ ret = misc_register(&livedump_misc);
+ if (WARN(ret,
+ "livedump: Failed to register livedump on misc device."
+ ))
+ return ret;
+
+ return 0;
+}
+module_init(livedump_module_init);
+
+static void livedump_module_exit(void)
+{
+ misc_deregister(&livedump_misc);
+}
+module_exit(livedump_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Livedump kernel module");
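
For readers unfamiliar with the LIVEDUMP_IOC() macro in the patch above: it
builds command numbers with _IO(), which encodes a "type" byte (0xff here)
and a sequence number, with no argument size. The patch itself defines no
commands yet, so the command number 1 used when trying this out is purely
hypothetical. A minimal user-space sketch, assuming a Linux system with the
kernel UAPI headers installed:

```c
#include <linux/ioctl.h>

/* Same definition as in kernel/livedump.c above. */
#define LIVEDUMP_IOC(x) _IO(0xff, x)

/* Decode the "type" byte of a command number. */
unsigned int livedump_cmd_type(unsigned int cmd)
{
	return _IOC_TYPE(cmd);
}

/* Decode the sequence number of a command number. */
unsigned int livedump_cmd_nr(unsigned int cmd)
{
	return _IOC_NR(cmd);
}

/* Decode the argument size; _IO() commands carry no argument. */
unsigned int livedump_cmd_size(unsigned int cmd)
{
	return _IOC_SIZE(cmd);
}
```

Decoding LIVEDUMP_IOC(1) this way gives type 0xff, number 1 and size 0.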

#2 YOSHIDA Masanori
May 25th, 2012 - 07:20 am ET
Hi, Peter

Thank you for the quick reply.

Yes, I know that a PF during NMI handling is dangerous, so livedump doesn't
protect pages that can be updated during NMI handling.
Such pages are listed in [3/4] as "sensitive pages".

Currently, I regard the following pages as sensitive pages in [3/4].
- Kernel/Exception/Interrupt stacks
- Page table structure
- All task_struct
- ".data" section of kernel
- All per_cpu areas

However, I can't be sure that covering these pages is enough to avoid a PF
during NMI handling. Do you have any idea how to enumerate sensitive pages
correctly?

Thank you.


On 2012/05/25 18:25, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 18:12 +0900, YOSHIDA Masanori wrote:
>> Live Dump is based on Copy-on-write technique. Basically processing is
>> performed in the following order.
>> (1) Suspends processing of all CPUs.
>> (2) Makes pages (which you want to dump) read-only.
>> (3) Resumes all CPUs
>> (4) On page fault, dumps a page including a fault address.
>
> Suppose a PF is in progress when all this happens, you mark all RO, then
> an NMI happens, from the NMI context we'll generate another PF to update
> a vmap area, this will again PF because you mucked about and marked
> things RO.
>
> You're now at 3 PFs, which is instant reboot.
>
> I don't think this is going to work.

#3 YOSHIDA Masanori
May 25th, 2012 - 08:20 am ET
Hi, Peter,

I agree that a notifier isn't suitable for this purpose.

In the next version, I plan to replace it with a simple
callback as follows.

	if (very_unlikely(under_live_dump))
		...

In addition, since the PF handler is a very hot path,
I will wrap the check in "#ifdef CONFIG_WRPROTECT".
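
Spelled out a little more, the idea could look like the user-space model
below. It is only a sketch: very_unlikely() is not an existing kernel macro
and is defined here via the GCC/Clang builtin __builtin_expect, the
CONFIG_WRPROTECT define stands in for the Kconfig symbol, and fault_entry,
live_dump_fault_hook and sample_hook are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define CONFIG_WRPROTECT 1   /* stand-in for the Kconfig symbol */
#define very_unlikely(x) __builtin_expect(!!(x), 0)   /* hypothetical */

typedef bool (*live_dump_fault_fn)(unsigned long addr);

bool under_live_dump;                     /* set while a dump is active */
live_dump_fault_fn live_dump_fault_hook;  /* installed by live dump */

/* Hypothetical hook: claims faults inside one "protected" address range. */
bool sample_hook(unsigned long addr)
{
	return addr >= 0x1000 && addr < 0x2000;
}

/* Returns true if live dump consumed the fault, false if the normal
 * page fault handling should continue. */
bool fault_entry(unsigned long addr)
{
#ifdef CONFIG_WRPROTECT
	if (very_unlikely(under_live_dump) && live_dump_fault_hook &&
	    live_dump_fault_hook(addr))
		return true;
#endif
	(void)addr;
	return false;
}
```

With the config symbol undefined the check compiles away entirely, and with
it defined the common case costs one predicted-not-taken branch.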

Thank you.


On 2012/05/25 18:19, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 18:12 +0900, YOSHIDA Masanori wrote:
>> This patch adds notifier-call-chain that is called in do_page_fault.
>> Livedump uses this to check if page fault is caused by livedump, and if so,
>> the fault is handled by livedump's handler function. Otherwise, it is
>> handled by the original page fault handler.
>
> No, please no notifiers..

#4 Vivek Goyal
June 04th, 2012 - 05:10 pm ET
On Fri, May 25, 2012 at 06:12:07PM +0900, YOSHIDA Masanori wrote:

[..]
> (4) It allocates about 50% of physical RAM to store dumped pages. Currently
> Live Dump saves all dumped data on memory once, and after that a user
> becomes able to use the dumped data. Live Dump itself has no feature to
> save dumped data onto a disk or any other storage device.

People complain when kdump reserves 128M of memory when system crashes.
I am skeptical that reserving 50% of memory for livedumps is going to fly.

Thanks
Vivek
#5 H. Peter Anvin
June 04th, 2012 - 05:40 pm ET
On 05/25/2012 02:12 AM, YOSHIDA Masanori wrote:
>
> Such a mechanism is useful especially in the case where very important
> systems are consolidated onto a single machine via virtualization.
> Assuming a KVM host runs multiple important VMs on it and one of them
> fails, the other VMs have to keep running. However, at the same time, an
> administrator may want to obtain memory dump of not only the failed guest
> but also the host because possibly the cause of failure is not in the
> guest but in the host or the hardware under it.
>
> Live Dump is based on Copy-on-write technique. Basically processing is
> performed in the following order.
> (1) Suspends processing of all CPUs.
> (2) Makes pages (which you want to dump) read-only.
> (3) Resumes all CPUs
> (4) On page fault, dumps a page including a fault address.
> (5) Finally, dumps the rest of pages that are not updated.
>
> Currently, Live Dump is just a simple prototype and it has many
> limitations. I list the important ones below.
> (1) It write-protects only kernel's straight mapping areas. Therefore
> memory updates from vmap areas and user space don't cause page fault.
> Pages corresponding to these areas are not consistently dumped.
> (2) It supports only x86-64 architecture.
> (3) It can only handle 4K pages. As we know, most pages in kernel space are
> mapped via 2M or 1G large page mapping. Therefore, the current
> implementation of Live Dump splits all large pages into 4K pages before
> setting up write protection.
> (4) It allocates about 50% of physical RAM to store dumped pages. Currently
> Live Dump saves all dumped data on memory once, and after that a user
> becomes able to use the dumped data. Live Dump itself has no feature to
> save dumped data onto a disk or any other storage device.
>

I am very concerned about the impact of this patch versus its value...
losing half the RAM means the value is extremely limited and the other
limitations above indicate that the cost is very very high.

At the same time, the guest can be dumped without any special tricks.

-hpa
