[RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids

November 17th, 2011 - 06:50 am ET by Pavel Emelyanov | Report spam
Gentlemen, please, find some time for this, your ACK/NACK on the API proposal
is required badly.


There's currently a work in progress with checkpoint-restore functionality
in the userspace. Most of the API for doing this kernel already provides, but
sometimes it's not enough. One of the required things is the ability to
create a process with its pids (in different pid namespaces) set to some
given values, rather than generated. Currently kernel doesn't allow for this,
so an API extension is required.

The proposal is to introduce the CLONE_CHILD_USEPIDS flag for clone() syscall
and pass the pids values in the child_tidptr. In order not to introduce the
hole for the pid-reuse attack, using this flag will result in EPERM in case
the pid namespace we're trying to create pid in has at least one pid (except
for the init's one) generated with regular fork()/clone().

Currently Tejun and Oleg are worrying only about the intrusiveness of this
approach, although Oleg agrees, that it solves all the problems it should. The
previous attempts to implement the similar stuff stopped, but no objections
against this were expressed. So the decision of whether it's OK to go this
way or not is required.


The API will be used like in the code below

/* restore new pid namespace with an init in it */
pid = clone(CLONE_NEWPID);
if (pid)
return 0;

/*
* init of a new pid namespace.
* recreate the process tree
*/

restore_children:
while (1) {
pid = next_pid_from_image();
if (!pid)
/* no more children */
break;

pid = clone(CLONE_CHILD_USEPIDS, &pid);
if (pid == 0)
goto restore_children;
}

/*
* the process tree is recreated, can proceed with restoring
* other stuff
*/


Thanks,
Pavel
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
email Follow the discussionReplies 36 repliesReplies Make a reply

Similar topics

Replies

#31 Oleg Nesterov
November 25th, 2011 - 11:30 am ET | Report spam
On 11/25, Pavel Emelyanov wrote:

The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
pid on the current. The subsequent fork() uses this pid,



Oh. This is subjective, yes, but this doesn't clean to me.

Amd why?? On the running system PR_RESERVE_PID can obviously fail anyway.
It only helps to avoid the race with another fork.

* one more field on struct pid is OK, since it size doesn't change (32 bit level is
anyway not required, it's OK to reduce on down to 16 bits)



Even if sizeof is the same, the new member and the code which plays
with ->flags doesn't make the things better ;)

* yes, we have +1 member on task_struct :(



Yes, and this task_struct->rsv_pid acts as implicit parameter for the
next clone(). Doesn't look very nice to me. Plus the code complications.

Oleg, Tejun, do you agree with such an approach?



If set_last_pid doesn't work, I'd prefer CLONE_CHILD_USEPIDS.

Oleg.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#32 Pavel Emelyanov
November 25th, 2011 - 11:50 am ET | Report spam
On 11/25/2011 08:22 PM, Oleg Nesterov wrote:
On 11/25, Pavel Emelyanov wrote:

The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
pid on the current. The subsequent fork() uses this pid,



Oh. This is subjective, yes, but this doesn't clean to me.

Amd why?? On the running system PR_RESERVE_PID can obviously fail anyway.
It only helps to avoid the race with another fork.



No. It can fail if you try to allocate a pid with given number. The API allows for
pid generation. AFAIU this can help with Pedro's requirements to resurrect task with
the same pid value it used to have before.

* one more field on struct pid is OK, since it size doesn't change (32 bit level is
anyway not required, it's OK to reduce on down to 16 bits)



Even if sizeof is the same, the new member and the code which plays
with ->flags doesn't make the things better ;)

* yes, we have +1 member on task_struct :(



Yes, and this task_struct->rsv_pid acts as implicit parameter for the
next clone(). Doesn't look very nice to me. Plus the code complications.



Well, the last_pid is also an implicit parameter for the next clone() with sysctl
approach :) But the code complication is the problem, yes :(

Oleg, Tejun, do you agree with such an approach?



If set_last_pid doesn't work, I'd prefer CLONE_CHILD_USEPIDS.



OK, thanks.

Oleg.

.




To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#33 Konstantin Khlebnikov
November 27th, 2011 - 04:50 am ET | Report spam
Pavel Emelyanov wrote:
OK, here's another proposal that seem to suit all of us:

1. me wants to clone tasks with pids set
2. Pedro wants to fork task with not changing pids and w/o root perms
3. Oleg and Tejun want to have little intrusion into fork() path

The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
pid on the current. The subsequent fork() uses this pid, this pid survives and keeps
its bit in the pidmap after detach. The 2nd fork() after the 1st task death thus
can reuse the same pid again. This basic thing doesn't require root perms at all
and safe against pid reuse problems. When requesting for pid reservation task may
specify a pid number it wants to have, but this requires root perms (CAP_SYS_ADMIN).

Pedro, I suppose this will work for your checkpoint feature in gdb, am I right?

Few comments about intrusion:

* the common path - if (pid !=&init_struct_pid) - on fork is just modified
* we have -1 argument to copy_process
* one more field on struct pid is OK, since it size doesn't change (32 bit level is
anyway not required, it's OK to reduce on down to 16 bits)
* no clone flags extension
* no new locking - the reserved pid manipulations happen under tasklist_lock and
existing common paths do not require more of it
* yes, we have +1 member on task_struct :(

Current API problems:

* Only one fork() with pid at a time. Next call to PR_RESERVE_PID will kill the
previous reservation (don't know how to fix)
* No way to fork() an init of a pid sub-namespace with desired pid in current
(can be fixed for a flag for PR_RESERVE_PID saying that we need a pid for a
namespace of a next level)

* No way to grab existing pid for reserve (can be fixed, if someone wants this)



We can add flag to sys_wait4(), and stash pid in wait_task_zombie(), right before release_task()
code will looks something like this:

- if (p != NULL)
+ if (p != NULL) {
+ if ((wo->wo_flags & WCATCHPID) && !current->pid_stash) {
+ struct pid *pid = task_pid(p);
+
+ pid->flags |= PID_STASHED;
+ current->pid_stash = get_pid(pid);
+ }
release_task(p);
+ }

And next fork() creates child with the same pid.
So, struct pid will work like boomerang =)
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#34 Oleg Nesterov
November 27th, 2011 - 12:40 pm ET | Report spam
On 11/27, Konstantin Khlebnikov wrote:

We can add flag to sys_wait4(), and stash pid in wait_task_zombie(), right before release_task()
code will looks something like this:

- if (p != NULL)
+ if (p != NULL) {
+ if ((wo->wo_flags & WCATCHPID) && !current->pid_stash) {
+ struct pid *pid = task_pid(p);
+
+ pid->flags |= PID_STASHED;
+ current->pid_stash = get_pid(pid);
+ }
release_task(p);
+ }

And next fork() creates child with the same pid.
So, struct pid will work like boomerang =)



Like PR_RESERVE_PID, this can only help if the tracee (or whatever)
is single-threaded and it is the natural child.

Personally I do not think such a limited interface makes sense.

Oleg.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
#35 Tejun Heo
November 27th, 2011 - 01:50 pm ET | Report spam
Hello, Pavel.

On Fri, Nov 25, 2011 at 02:14:56PM +0400, Pavel Emelyanov wrote:
OK, here's another proposal that seem to suit all of us:

1. me wants to clone tasks with pids set
2. Pedro wants to fork task with not changing pids and w/o root perms
3. Oleg and Tejun want to have little intrusion into fork() path

The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
pid on the current. The subsequent fork() uses this pid, this pid survives and keeps
its bit in the pidmap after detach. The 2nd fork() after the 1st task death thus
can reuse the same pid again. This basic thing doesn't require root perms at all
and safe against pid reuse problems. When requesting for pid reservation task may
specify a pid number it wants to have, but this requires root perms (CAP_SYS_ADMIN).

Pedro, I suppose this will work for your checkpoint feature in gdb, am I right?

Few comments about intrusion:

* the common path - if (pid != &init_struct_pid) - on fork is just modified
* we have -1 argument to copy_process
* one more field on struct pid is OK, since it size doesn't change (32 bit level is
anyway not required, it's OK to reduce on down to 16 bits)
* no clone flags extension
* no new locking - the reserved pid manipulations happen under tasklist_lock and
existing common paths do not require more of it
* yes, we have +1 member on task_struct :(

Current API problems:

* Only one fork() with pid at a time. Next call to PR_RESERVE_PID will kill the
previous reservation (don't know how to fix)
* No way to fork() an init of a pid sub-namespace with desired pid in current
(can be fixed for a flag for PR_RESERVE_PID saying that we need a pid for a
namespace of a next level)
* No way to grab existing pid for reserve (can be fixed, if someone wants this)

Oleg, Tejun, do you agree with such an approach?



Hmmm... Any attempt to reserve PIDs without full control over the
namespace is futile. It can never be complete / reliable. Let's just
forget about it. If anyone, including gdb, wants to have fun with CR,
let them manage namespace too; otherwise, it's never gonna be
reliable.

If you take the above out, setting last_pid is as simple as it gets
and good enough. It's essentially few tens of lines of code to add
userland interface for setting one pid_t value. Let's restrict
manipulation to root for now and see whether finer grained CAP_* makes
sense as we go along.

Thanks.

tejun
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Replies Reply to this message
Help Create a new topicNext page Previous pageReplies Make a reply
Search Make your own search