[RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids

November 17th, 2011 - 06:50 am ET by Pavel Emelyanov | Report spam
Gentlemen, please, find some time for this, your ACK/NACK on the API proposal
is required badly.


There's currently a work in progress with checkpoint-restore functionality
in the userspace. Most of the API for doing this kernel already provides, but
sometimes it's not enough. One of the required things is the ability to
create a process with its pids (in different pid namespaces) set to some
given values, rather than generated. Currently kernel doesn't allow for this,
so an API extension is required.

The proposal is to introduce the CLONE_CHILD_USEPIDS flag for clone() syscall
and pass the pids values in the child_tidptr. In order not to introduce the
hole for the pid-reuse attack, using this flag will result in EPERM in case
the pid namespace we're trying to create pid in has at least one pid (except
for the init's one) generated with regular fork()/clone().

Currently Tejun and Oleg are worrying only about the intrusiveness of this
approach, although Oleg agrees, that it solves all the problems it should. The
previous attempts to implement the similar stuff stopped, but no objections
against this were expressed. So the decision of whether it's OK to go this
way or not is required.


The API will be used like in the code below

/* restore new pid namespace with an init in it */
pid = clone(CLONE_NEWPID);
if (pid)
return 0;

/*
* init of a new pid namespace.
* recreate the process tree
*/

restore_children:
while (1) {
pid = next_pid_from_image();
if (!pid)
/* no more children */
break;

pid = clone(CLONE_CHILD_USEPIDS, &pid);
if (pid == 0)
goto restore_children;
}

/*
* the process tree is recreated, can proceed with restoring
* other stuff
*/


Thanks,
Pavel
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
email Follow the discussionReplies 36 repliesReplies Make a reply

Similar topics

Replies

#21 Pavel Emelyanov
November 22nd, 2011 - 02:40 pm ET | Report spam
On 11/22/2011 08:44 PM, Linus Torvalds wrote:
On Tue, Nov 22, 2011 at 8:30 AM, Pavel Emelyanov wrote:

I think that systemd guys can play with it. E.g. respawning daemons with predefined
pids sounds like an interesting thing to play with.



But wouldn't CAP_CHECKPOINT be enough for systemd?



It would, but what's the point in granting to a systemd (which can be a container's
init by the way) the ability to use the _whole_ checkpoint/restore engine?



Christ, stop making it sound like we would *want* systemd to do even
more odd things.

Quite frankly, any feature that is sold with ".. and systemd can use
this fox Xyz", is a *misfeature* in my opinion. Core infrastructure
like systemd should use a *minimal* interface, not some random
extended features.



Damn, surely it should use a minimal! But our opinion doesn't prevent this daemon
from doing very weird stuff, and (believe it or not) I'm not trying to sell this
feature for systemd. Just trying to minimize the impact from systemd's use of it :(

I'm starting to really dislike this whole feature discussion.

Linus
.




Thanks,
Pavel
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#22 Oleg Nesterov
November 22nd, 2011 - 04:30 pm ET | Report spam
On 11/21, Tejun Heo wrote:

> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + struct ctl_table tmp = *table;
> + tmp.data = &current->nsproxy->pid_ns->last_pid;
> + return proc_dointvec(&tmp, write, buffer, lenp, ppos);
> +}

Probably better to call set_last_pid() on write path instead?



I am not sure... set_last_pid() is "special", it tries to avoid the
races with itself to prevent the pid-reuse. It doesn't hurt, but
perhaps

set_last_pid(pid_ns, pid_ns->last_pid, new_pid);

looks a bit confusing.

Hmm. On the second thought, perhaps this makes sense anyway. Just
to keep the "only set_last_pid() can change ->last_pid" property.

But this is almost cosmetic.

> Well, after a bit more thinking I found one more pros for this
> sysctl - when restoring a container we'll have the possibility to
> set the last_pid to what we want to prevent the pids reuse after the
> restore.

Hmmm... I personally like this one better. Restoring multilevel pid
would be more tedious but should still be possible and I really like
that it's staying out of clone path and all modifications are to ns
and pid code. Oleg, what do you think?



Obviously, I'd prefer this one too ;)

But. Personally I do not like the fact that only init can open this
file for writing... (I guess Pavel already hates me ;)

If we add this sysctl, then I think there should be some way to use
outside of "checkpoint-restore" world. For example, see the comment
from Pedro. This use-case looks unexpected (to me), but reasonable.
Or. Say, set_last_pid can be useful to test the pid-reuse races.

In any case. To me, it is not really good to have /proc/*/set_last_pid
without the ability to use it somehow on the running system.

Oleg.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#23 Tejun Heo
November 23rd, 2011 - 11:30 am ET | Report spam
Hello,

On Wed, Nov 23, 2011 at 04:20:44PM +0000, Pedro Alves wrote:
> Would CAP_CHECKPOINT be a shame too?

I think CAP_CHECKPOINT (or something through some LSM) would be
definitely better.

> I'm reluctant about priviledge
> through fd inheritance mostly because of its unusualness. I don't
> think priv management is a good problem space for small creative
> solutions. We're much better off with mundane mechanisms which people
> are already familiar with and is easy to account for.

fd inheritance wouldn't work for gdb; a user spawned gdb
wouldn't inherit an open fd to kernel.ns_last_pid from anywhere.



I see. So, let's do it for root for now and later add finer grained
CAP as necessary/viable. Pavel, what do you think?

Thanks.

tejun
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#24 Pedro Alves
November 23rd, 2011 - 11:30 am ET | Report spam
Hello Tejun,

On Tuesday 22 November 2011 15:33:26, Tejun Heo wrote:
Hello,

On Tue, Nov 22, 2011 at 12:04:38PM +0000, Pedro Alves wrote:
> This whole userspace C/R stuff and being able to set the child's pid has potential
> of being very useful for GDB too, allowing a much better reimplementation of its
> old checkpointing feature [*], and allowing for a faster reverse debugging
> implementation, by being able to do faster rewinding -- restore snapshot and replay
> instructions up to N (by single stepping or running to breakpoint), rather than
> manually undoing the effects of each instruction, one by one.
>
> IOW, root only would be a shame from GDB's perspective.

Would CAP_CHECKPOINT be a shame too?



I think CAP_CHECKPOINT (or something through some LSM) would be
definitely better.

I'm reluctant about priviledge
through fd inheritance mostly because of its unusualness. I don't
think priv management is a good problem space for small creative
solutions. We're much better off with mundane mechanisms which people
are already familiar with and is easy to account for.



fd inheritance wouldn't work for gdb; a user spawned gdb
wouldn't inherit an open fd to kernel.ns_last_pid from anywhere.

Pedro Alves
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#25 Tejun Heo
November 23rd, 2011 - 12:40 pm ET | Report spam
Hey, Oleg.

On Wed, Nov 23, 2011 at 06:26:36PM +0100, Oleg Nesterov wrote:
On 11/23, Tejun Heo wrote:
>
> I see. So, let's do it for root for now and later add finer grained
> CAP as necessary/viable.

Agreed.

Stupid off-topic question, I am just curious... Why /proc/sys
dislikes chmod ? I do not understand this code, but it seems
that it would be simple to change proc_sys_setattr() to respect
ATTR_MODE and update PROC_I(inode)->sysctl_entry->mode, no?



I have no idea either but suspect it's something historical. It came
from sysctl and sysctl didn't have any way to diddle with permissions,
so when later adding proc interface, it probably just stayed that way.

Thanks.

tejun
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
Help Create a new topicNext page Previous pageReplies Make a reply
Search Make your own search