[RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids

November 17th, 2011 - 06:50 am ET by Pavel Emelyanov | Report spam
Gentlemen, please, find some time for this, your ACK/NACK on the API proposal
is required badly.


There's currently a work in progress with checkpoint-restore functionality
in the userspace. Most of the API for doing this kernel already provides, but
sometimes it's not enough. One of the required things is the ability to
create a process with its pids (in different pid namespaces) set to some
given values, rather than generated. Currently kernel doesn't allow for this,
so an API extension is required.

The proposal is to introduce the CLONE_CHILD_USEPIDS flag for clone() syscall
and pass the pids values in the child_tidptr. In order not to introduce the
hole for the pid-reuse attack, using this flag will result in EPERM in case
the pid namespace we're trying to create pid in has at least one pid (except
for the init's one) generated with regular fork()/clone().

Currently Tejun and Oleg are worrying only about the intrusiveness of this
approach, although Oleg agrees, that it solves all the problems it should. The
previous attempts to implement the similar stuff stopped, but no objections
against this were expressed. So the decision of whether it's OK to go this
way or not is required.


The API will be used like in the code below

/* restore new pid namespace with an init in it */
pid = clone(CLONE_NEWPID);
if (pid)
return 0;

/*
* init of a new pid namespace.
* recreate the process tree
*/

restore_children:
while (1) {
pid = next_pid_from_image();
if (!pid)
/* no more children */
break;

pid = clone(CLONE_CHILD_USEPIDS, &pid);
if (pid == 0)
goto restore_children;
}

/*
* the process tree is recreated, can proceed with restoring
* other stuff
*/


Thanks,
Pavel
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
email Follow the discussionReplies 36 repliesReplies Make a reply

Similar topics

Replies

#11 Tejun Heo
November 18th, 2011 - 06:40 pm ET | Report spam
Hello,

On Thu, Nov 17, 2011 at 08:01:03PM +0400, Pavel Emelyanov wrote:
> Yes, personally I'd prefer /proc/set_last_pid (or something similar) which
> simply writes to pid_ns->last_pid. Perhaps it is less convenient from the
> user-space pov (serialization, security) but it is much simpler.

Yes, this is also possible. I have a working prototype of /proc/sys/kernel/ns_last_pid
with the security issue solved, but setting sysctl then cloning seems more obfuscating
to me than just passing an array of pids to clone.



Do you mind sharing the patch? It doesn't have to be perfect. I'm
just curious how it looks. IMHO the suggested pid array passing is
good enough and not too intrusive but, if there's something simpler
from kernel side, given that this is a very specialized interface, I
think we definitely need to consider that.

Thank you.

tejun
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#12 Pavel Emelyanov
November 21st, 2011 - 04:20 am ET | Report spam
On 11/19/2011 03:30 AM, Tejun Heo wrote:
Hello,

On Thu, Nov 17, 2011 at 08:01:03PM +0400, Pavel Emelyanov wrote:
Yes, personally I'd prefer /proc/set_last_pid (or something similar) which
simply writes to pid_ns->last_pid. Perhaps it is less convenient from the
user-space pov (serialization, security) but it is much simpler.



Yes, this is also possible. I have a working prototype of /proc/sys/kernel/ns_last_pid
with the security issue solved, but setting sysctl then cloning seems more obfuscating
to me than just passing an array of pids to clone.



Do you mind sharing the patch?



Sure! First of all, we need to change the ctl_table_root->permission callback to pass
the required operations (MAY_XXX) into it, rather than just getting the mode allowed.
The API change it like in the patch below (plus we need to patch the net/ sysctl's, since
they use this API):

a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -1052,10 +1052,12 @@ struct ctl_table_root {
struct ctl_table_set default_set;
struct ctl_table_set *(*lookup)(struct ctl_table_root *root,
struct nsproxy *namespaces);
- int (*permissions)(struct ctl_table_root *root,
- struct nsproxy *namespaces, struct ctl_table *table);
+ int (*permissions)(struct nsproxy *namespaces,
+ struct ctl_table *table, int op);
};

/* struct ctl_table_header is used to maintain dynamic lists of
a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1706,7 +1706,7 @@ void register_sysctl_root(struct ctl_table_root *root)
* some sysctl variables are readonly even to root.
*/

-static int test_perm(int mode, int op)
+int sysctl_test_perm(int mode, int op)
{
if (!current_euid())
mode >>= 6;
@@ -1722,11 +1722,9 @@ int sysctl_perm(struct ctl_table_root *root, struct ctl_table *table, int op)
int mode;

if (root->permissions)
- mode = root->permissions(root, current->nsproxy, table);
+ return root->permissions(current->nsproxy, table, op);
else
- mode = table->mode;
-
- return test_perm(mode, op);
+ return sysctl_test_perm(table->mode, op);
}

static void sysctl_set_parent(struct ctl_table *parent, struct ctl_table *table)



Then I introduce the kernel.ns_last_pid sysctl that is allows for MAY_OPEN | MAP_WRITE for
the namespace's init only and allows for MAY_WRITE for anyone else. Thus, if we want to
write to this file from non-init task it must have the respective fd inherited from the init
on fork. It works OK for checkpoint/restore.

The patch is:


diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index e9c9adc..3686a07 100644
a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -15,6 +15,7 @@
#include <linux/acct.h>
#include <linux/slab.h>
#include <linux/proc_fs.h>
+#include <linux/sysctl.h>

#define BITS_PER_PAGE (PAGE_SIZE*8)

@@ -191,9 +192,54 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
return;
}

+static int pid_ns_ctl_permissions(struct nsproxy *namespaces,
+ struct ctl_table *table, int op)
+{
+ int mode = 0644;
+
+ if ((op & MAY_OPEN) &&
+ current != namespaces->pid_ns->child_reaper)
+ /*
+ * Writing to this sysctl is allowed only for init
+ * and to whoever it grands the open file
+ */
+ mode &= ~0222;
+
+ return sysctl_test_perm(mode, op);
+}
+
+static struct ctl_table_root pid_ns_root = {
+ .permissions = pid_ns_ctl_permissions,
+};
+
+static int pid_ns_ctl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table tmp = *table;
+ tmp.data = &current->nsproxy->pid_ns->last_pid;
+ return proc_dointvec(&tmp, write, buffer, lenp, ppos);
+}
+
+static struct ctl_table pid_ns_ctl_table[] = {
+ .permissions = pid_ns_ctl_permissions,
+};
+
+static int pid_ns_ctl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table tmp = *table;
+ tmp.data = &current->nsproxy->pid_ns->last_pid;
+ return proc_dointvec(&tmp, write, buffer, lenp, ppos);
+}
+
+static struct ctl_table pid_ns_ctl_table[] = {
+ {
+ .procname = "ns_last_pid",
+ .maxlen = sizeof(int),
+ .proc_handler = pid_ns_ctl_handler,
+ },
+ { }
+};
+
+static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
+
static __init int pid_namespaces_init(void)
{
pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+
+ setup_sysctl_set(&pid_ns_root.default_set, NULL, NULL);
+ register_sysctl_root(&pid_ns_root);
+ __register_sysctl_paths(&pid_ns_root, current->nsproxy,
+ kern_path, pid_ns_ctl_table);
+
return 0;
}


It doesn't have to be perfect. I'm just curious how it looks.
IMHO the suggested pid array passing is good enough and not too intrusive
but, if there's something simpler from kernel side, given that this is a
very specialized interface, I think we definitely need to consider that.



Well, after a bit more thinking I found one more pros for this sysctl - when restoring
a container we'll have the possibility to set the last_pid to what we want to prevent the
pids reuse after the restore.

Thank you.




To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#13 Tejun Heo
November 21st, 2011 - 06:00 pm ET | Report spam
Hello, Pavel.

On Mon, Nov 21, 2011 at 01:15:02PM +0400, Pavel Emelyanov wrote:
Then I introduce the kernel.ns_last_pid sysctl that is allows for MAY_OPEN | MAP_WRITE for
the namespace's init only and allows for MAY_WRITE for anyone else. Thus, if we want to
write to this file from non-init task it must have the respective fd inherited from the init
on fork. It works OK for checkpoint/restore.

The patch is:


diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index e9c9adc..3686a07 100644
a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -15,6 +15,7 @@
#include <linux/acct.h>
#include <linux/slab.h>
#include <linux/proc_fs.h>
+#include <linux/sysctl.h>

#define BITS_PER_PAGE (PAGE_SIZE*8)

@@ -191,9 +192,54 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
return;
}

+static int pid_ns_ctl_permissions(struct nsproxy *namespaces,
+ struct ctl_table *table, int op)
+{
+ int mode = 0644;
+
+ if ((op & MAY_OPEN) &&
+ current != namespaces->pid_ns->child_reaper)
+ /*
+ * Writing to this sysctl is allowed only for init
+ * and to whoever it grands the open file
+ */
+ mode &= ~0222;
+
+ return sysctl_test_perm(mode, op);
+}
+
+static struct ctl_table_root pid_ns_root = {
+ .permissions = pid_ns_ctl_permissions,
+};



Hmmm... I hope this could be prettier. I'm having trouble following
where the MAY_OPEN comes from. Can you please explain? Can't we for
now allow this for root and then later allow CAP_CHECKPOINT that
Cyrill suggested? Or do we want to allow setting pids even w/o CR for
NS creator?

+static int pid_ns_ctl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table tmp = *table;
+ tmp.data = &current->nsproxy->pid_ns->last_pid;
+ return proc_dointvec(&tmp, write, buffer, lenp, ppos);
+}



Probably better to call set_last_pid() on write path instead?

Well, after a bit more thinking I found one more pros for this
sysctl - when restoring a container we'll have the possibility to
set the last_pid to what we want to prevent the pids reuse after the
restore.



Hmmm... I personally like this one better. Restoring multilevel pids
would be more tedious but should still be possible and I really like
that it's staying out of clone path and all modifications are to ns
and pid code. Oleg, what do you think?

Thank you.

tejun
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#14 Pavel Emelyanov
November 22nd, 2011 - 06:20 am ET | Report spam
+static int pid_ns_ctl_permissions(struct nsproxy *namespaces,
+ struct ctl_table *table, int op)
+{
+ int mode = 0644;
+
+ if ((op & MAY_OPEN) &&
+ current != namespaces->pid_ns->child_reaper)
+ /*
+ * Writing to this sysctl is allowed only for init
+ * and to whoever it grands the open file
+ */
+ mode &= ~0222;
+
+ return sysctl_test_perm(mode, op);
+}
+
+static struct ctl_table_root pid_ns_root = {
+ .permissions = pid_ns_ctl_permissions,
+};



Hmmm... I hope this could be prettier. I'm having trouble following
where the MAY_OPEN comes from. Can you please explain?



From this calltrace:

pid_ns_ctl_permissions
sysctl_perm
proc_sys_permission
inode_permission
do_last <<<<< MAY_OPEN appears here
path_openat
do_filp_open
do_sys_open
sys_open


Can't we for now allow this for root and then later allow CAP_CHECKPOINT
that Cyrill suggested? Or do we want to allow setting pids even w/o CR
for NS creator?



I think that systemd guys can play with it. E.g. respawning daemons with predefined
pids sounds like an interesting thing to play with.

+static int pid_ns_ctl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table tmp = *table;
+ tmp.data = &current->nsproxy->pid_ns->last_pid;
+ return proc_dointvec(&tmp, write, buffer, lenp, ppos);
+}



Probably better to call set_last_pid() on write path instead?



Why? The usage of this sysctl is going to be synchronized by external locks,
so why should we care?

Well, after a bit more thinking I found one more pros for this
sysctl - when restoring a container we'll have the possibility to
set the last_pid to what we want to prevent the pids reuse after the
restore.



Hmmm... I personally like this one better. Restoring multilevel pids
would be more tedious but should still be possible and I really like
that it's staying out of clone path and all modifications are to ns
and pid code. Oleg, what do you think?

Thank you.




To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#15 Pedro Alves
November 22nd, 2011 - 07:10 am ET | Report spam
On Tuesday 22 November 2011 11:11:02, Pavel Emelyanov wrote:
> Can't we for now allow this for root and then later allow CAP_CHECKPOINT
> that Cyrill suggested? Or do we want to allow setting pids even w/o CR
> for NS creator?

I think that systemd guys can play with it. E.g. respawning daemons with predefined
pids sounds like an interesting thing to play with.



This whole userspace C/R stuff and being able to set the child's pid has potential
of being very useful for GDB too, allowing a much better reimplementation of its
old checkpointing feature [*], and allowing for a faster reverse debugging
implementation, by being able to do faster rewinding -- restore snapshot and replay
instructions up to N (by single stepping or running to breakpoint), rather than
manually undoing the effects of each instruction, one by one.

IOW, root only would be a shame from GDB's perspective.

[*] GDB has an old checkpointing feature ("checkpoint; info checkpoints" commands)
based on forcing the tracee to fork, and holding on the fork child behind the
scenes as a checkpoint. To restore the debugging state to a previous checkpoint,
gdb swaps the current debuggee for the fork child as transparently for the user
as was possible. Obviously this has a bunch of limitations and downsides like
only working on non-threaded programs, and the inferior's pid changing...

Pedro Alves
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
Help Create a new topicNext page Previous pageReplies Make a reply
Search Make your own search