The Linux kernel: Character devices

11. Character devices

An ASR33 Teletype - origin of the abbreviation tty.

We meet several kinds of objects (character devices, tty drivers, line disciplines). Each registers itself at kernel initialization time (or module insertion time), and can afterwards be found when an open() is done.

11.1 Registration

A character device announces its existence by calling register_chrdev(). The call

register_chrdev(major, name, fops);

stores the given name (a string) and fops (a struct file_operations *) in the entry of the array chrdevs[] indexed by the integer major, the major device number of the device.

(Devices have a number, the device number, a combination of major and minor device number. Traditionally, the major device number gives the kind of device, and the minor device number is some kind of unit number. However, there are no rules - it is best to consider a device number a cookie, without known structure.)
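From user space, the major and minor parts of a device number can be inspected with the glibc macros major(), minor() and makedev() from <sys/sysmacros.h>. The following is a small user-space sketch, not kernel code; the helper name format_dev is made up for this illustration.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysmacros.h>      /* major(), minor(), makedev() (glibc) */

/*
 * Write "major,minor" for the device number dev into buf.
 * format_dev is a hypothetical helper, not a kernel or libc routine.
 */
int format_dev(dev_t dev, char *buf, size_t len)
{
        return snprintf(buf, len, "%u,%u",
                        (unsigned int) major(dev), (unsigned int) minor(dev));
}
```

For example, format_dev(makedev(4, 65), buf, sizeof(buf)) stores the string "4,65" in buf, the pair that section 11.3 gives for /dev/ttyS1.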

This stored entry is used again when the device is opened: The filesystem recognizes that the file that is being opened is a special device file, and invokes init_special_inode(). This routine does

void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
        inode->i_mode = mode;
        if (S_ISCHR(mode)) {
                inode->i_fop = &def_chr_fops;
                inode->i_rdev = to_kdev_t(rdev);
                inode->i_cdev = cdget(rdev);
        } else
                ...
}

Here to_kdev_t() converts the user mode version of the device number to the kernel version of the device number. The routine cdget() returns the struct char_device for this device number. It finds it using a hash table, and if we did not have it already, a new one is allocated. In all cases, the reference count of this struct is increased by one. The struct looks like

struct char_device {
        struct list_head        hash;
        atomic_t                count;
        dev_t                   dev;
        atomic_t                openers;
        struct semaphore        sem;
};

Here hash is a link in the chain of devices with the same hash, count is the number of references - each cdget() increases and each cdput() decreases this by one, and if it becomes zero, the struct is removed from the hash chain and freed. The field dev stores the device number, the only thing we know about this device. The fields openers and sem are unused. Access to the hash table is protected by the cdev_lock spinlock.
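The get/put pattern described here can be modelled in user space. The sketch below is not the kernel code: it is single-threaded (so there is no counterpart of cdev_lock), the table size is arbitrary, and all names ending in _model or starting with model_ are invented for the illustration.

```c
#include <stdlib.h>

#define HASH_SIZE 16            /* arbitrary table size for this model */

struct cdev_model {
        struct cdev_model *next;        /* link in the hash chain */
        int count;                      /* reference count */
        unsigned int dev;               /* device number */
};

static struct cdev_model *table[HASH_SIZE];

/* Find the entry for dev, allocating it on first use; take one reference. */
struct cdev_model *model_cdget(unsigned int dev)
{
        struct cdev_model *p;

        for (p = table[dev % HASH_SIZE]; p; p = p->next)
                if (p->dev == dev)
                        break;
        if (!p) {
                p = calloc(1, sizeof(*p));
                if (!p)
                        return NULL;
                p->dev = dev;
                p->next = table[dev % HASH_SIZE];
                table[dev % HASH_SIZE] = p;
        }
        p->count++;
        return p;
}

/* Drop one reference; unlink and free the entry when none remain. */
void model_cdput(struct cdev_model *p)
{
        struct cdev_model **pp;

        if (--p->count > 0)
                return;
        for (pp = &table[p->dev % HASH_SIZE]; *pp; pp = &(*pp)->next)
                if (*pp == p) {
                        *pp = p->next;
                        break;
                }
        free(p);
}

/* Count the entries currently in the table (for inspection only). */
int model_count_entries(void)
{
        struct cdev_model *p;
        int i, n = 0;

        for (i = 0; i < HASH_SIZE; i++)
                for (p = table[i]; p; p = p->next)
                        n++;
        return n;
}
```

Two model_cdget() calls with the same device number return the same struct with count 2, and the entry disappears from the table only after the second model_cdput().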

Finally the last item from init_special_inode():

struct file_operations def_chr_fops = {
        .open = chrdev_open,
};

That is, we cannot do anything with the character device except opening it, and when we do chrdev_open() is called.

On struct char_device

What is the use of this struct? After removing the unused fields openers and sem we see that we just have a struct in a hash chain and a reference count. It has no function at all, and all related code can be deleted (from 2.5.59).

11.2 Opening

The system call routine sys_open() calls filp_open(), and that calls dentry_open(), which does

f->f_op = fops_get(inode->i_fop);

In other words, the file f_op is a copy of the inode i_fop. (This fops_get() returns its argument, but increments a reference count in case these file operations live in a module.) Finally, dentry_open() calls the inode open routine if there is one:

if (f->f_op && f->f_op->open)
        f->f_op->open(inode,f);

Thus, it is here that chrdev_open() is called.

int chrdev_open(struct inode *inode, struct file *filp) {
        int ret = -ENODEV;

        filp->f_op = get_chrfops(major(inode->i_rdev), minor(inode->i_rdev));
        if (filp->f_op) {
                ret = 0;
                if (filp->f_op->open)
                        ret = filp->f_op->open(inode,filp);
        }
        return ret;
}

And the routine get_chrfops() retrieves the struct file_operations * that was registered:

struct file_operations *get_chrfops(unsigned int major, unsigned int minor) {
        return fops_get(chrdevs[major].fops);
}

(The actual routine checks whether the device has registered already, and if not does a request_module("char-major-N") first, where N is the major number.)

We see that the inode fops remains unchanged, so that its open still points to chrdev_open(), but the file fops is changed and now points to what the device registered.
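The two-stage dispatch (a generic open that looks up the registered operations, installs them on the file, and then calls the driver's own open) can be tried out in a user-space model. Everything in the sketch below is a simplification: the _m suffix marks invented names, the table size and the -1 return values are assumptions, and module loading and reference counting are left out.

```c
#include <stddef.h>

#define MAX_CHRDEV 256          /* table size assumed for this model */

struct file_m;

struct fops_m {
        int (*open)(unsigned int major, struct file_m *f);
};

struct file_m {
        const struct fops_m *f_op;
};

struct chrdev_entry {
        const char *name;
        const struct fops_m *fops;
};

static struct chrdev_entry chrdevs_m[MAX_CHRDEV];

/* Store name and fops in the table slot indexed by major. */
int register_chrdev_m(unsigned int major, const char *name,
                      const struct fops_m *fops)
{
        if (major >= MAX_CHRDEV || chrdevs_m[major].fops)
                return -1;
        chrdevs_m[major].name = name;
        chrdevs_m[major].fops = fops;
        return 0;
}

/* Generic open: replace the file's fops by the registered ones. */
int chrdev_open_m(unsigned int major, struct file_m *f)
{
        f->f_op = (major < MAX_CHRDEV) ? chrdevs_m[major].fops : NULL;
        if (!f->f_op)
                return -1;              /* stands in for -ENODEV */
        if (f->f_op->open)
                return f->f_op->open(major, f);
        return 0;
}

/* The default operations know only how to open. */
const struct fops_m def_chr_fops_m = { .open = chrdev_open_m };

/* Model of the open path: start from the default fops, then dispatch. */
int open_chrdev_m(unsigned int major, struct file_m *f)
{
        f->f_op = &def_chr_fops_m;
        return f->f_op->open(major, f);
}

/* A made-up driver that only counts how often it was opened. */
int demo_opened;

static int demo_open(unsigned int major, struct file_m *f)
{
        (void) major;
        (void) f;
        demo_opened++;
        return 0;
}

const struct fops_m demo_fops = { .open = demo_open };
```

After register_chrdev_m(42, "demo", &demo_fops), a call open_chrdev_m(42, &f) leaves f.f_op pointing at demo_fops, while def_chr_fops_m itself is never modified.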

11.3 The tty driver

Let us focus on /dev/tty1, the first virtual console. Most code lives in drivers/char, in the files tty_io.c and n_tty.c and vt.c.

Registration

A tty driver announces its existence by calling tty_register_driver(). This call does a register_chrdev() (with tty_fops) and hangs the driver in the chain tty_drivers.

That chain is used by get_tty_driver(), a routine that given a device number finds the tty driver that handles the device with that number.

get_tty_driver

This routine is used in two places: in fs/char_dev.c:get_chrfops() and in tty_io.c:init_dev(), called from tty_open. The latter use was expected, but what is this strange first use?

#define is_a_tty_dev(ma)        (ma == TTY_MAJOR || ma == TTYAUX_MAJOR)
#define need_serial(ma,mi) (get_tty_driver(mk_kdev(ma,mi)) == NULL)

static struct file_operations *
get_chrfops(unsigned int major, unsigned int minor) {
        ...
        ret = fops_get(chrdevs[major].fops);
        if (ret && is_a_tty_dev(major) && need_serial(major,minor)) {
                fops_put(ret);
                ret = NULL;
        }
        if (!ret) {
                char name[20];
                sprintf(name, "char-major-%d", major);
                request_module(name);
                ret = fops_get(chrdevs[major].fops);
        }
        return ret;
}

The idea here is that majors 4 and 5 (TTY_MAJOR and TTYAUX_MAJOR) may be served by several modules. Indeed, /dev/tty1 has major,minor 4,1 and is a virtual console, while /dev/ttyS1 has major,minor 4,65 and is a serial line. Thus, in drivers/serial/core.c:uart_register_driver() we see a call of tty_register_driver(), and uart_register_driver() is called, e.g., to register serial8250_reg, defined as

struct uart_driver serial8250_reg = {
        .owner                  = THIS_MODULE,
        .driver_name            = "serial",
        .dev_name               = "ttyS%d",
        .major                  = TTY_MAJOR,
        .minor                  = 64,
        .nr                     = UART_NR,
        .cons                   = SERIAL8250_CONSOLE,
};

while vt.c:vty_init() calls tty_register_driver() to register console_driver with major = TTY_MAJOR and minor_start = 1.
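How one major can be shared becomes clearer with a user-space model of the lookup: each driver claims a range of minors, and the search walks the chain until a range matches. The sketch below is an illustration only. The _m names are invented, and the range sizes (63 consoles, 4 serial lines) are assumptions, since the text only gives the starting minors 1 and 64.

```c
#include <stddef.h>

struct tty_driver_m {
        const char *name;
        unsigned int major;
        unsigned int minor_start;       /* first minor served */
        unsigned int num;               /* number of minors served */
        struct tty_driver_m *next;      /* link in the driver chain */
};

static struct tty_driver_m *tty_drivers_m;

/* Hang the driver in the chain. */
void tty_register_driver_m(struct tty_driver_m *d)
{
        d->next = tty_drivers_m;
        tty_drivers_m = d;
}

/* Find the driver whose major and minor range cover the given device. */
struct tty_driver_m *get_tty_driver_m(unsigned int major, unsigned int minor)
{
        struct tty_driver_m *p;

        for (p = tty_drivers_m; p; p = p->next) {
                if (p->major != major)
                        continue;
                if (minor < p->minor_start)
                        continue;
                if (minor >= p->minor_start + p->num)
                        continue;
                return p;
        }
        return NULL;
}

/* Range sizes below are assumed for the example. */
struct tty_driver_m console_m = { "console", 4, 1, 63, NULL };
struct tty_driver_m serial_m  = { "serial",  4, 64, 4, NULL };
```

With only console_m registered, get_tty_driver_m(4, 65) returns NULL, which is the situation that need_serial() detects; once serial_m is registered as well, the same lookup finds it.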

Opening

As we saw above, opening a character device ends up with calling the open routine from the struct file_operations registered by the device. In the case of a tty, the open routine in tty_fops is tty_open.

The routine tty_open is long and messy, with a lot of special purpose code for controlling ttys, for pseudottys, etc. In the ordinary case the essential part is

tty_open(struct inode *inode, struct file *file) {
        struct tty_struct *tty;
        kdev_t device = inode->i_rdev;

        init_dev(device, &tty);
        file->private_data = tty;
        tty->driver.open(tty,file);
}

Thus, first of all, we create a tty_struct. Next, a pointer to this tty_struct is stored in the private_data field of the file struct, so that we can find it later, for example in tty_read():

tty_read(struct file *file, char *buf, size_t count, ...) {
        struct tty_struct *tty = file->private_data;
        (tty->ldisc.read)(tty,file,buf,count);
}

Finally we call the open routine of the driver. The field tty->driver was set in init_dev():

init_dev(kdev_t device, struct tty_struct **ret_tty) {
        struct tty_driver *driver = get_tty_driver(device);
        struct tty_struct *tty = alloc_tty_struct();

        initialize_tty_struct(tty);
        tty->device = device;
        tty->driver = *driver;
        (tty->ldisc.open)(tty);
        *ret_tty = tty;
}

Note that the entire struct tty_driver is copied in the assignment, so that individual fields can be changed without damaging the struct that was registered. However, this is never done, so having a copy is a waste of memory.

Line disciplines

The line discipline gives the protocol on the serial line. Each line discipline has a number, and the normal one is called N_TTY (0). Line disciplines are registered by tty_register_ldisc(), by storing a struct tty_ldisc in the array ldiscs[] (where the index is the line discipline number).

The normal discipline is registered by console_init(), as first among the registered disciplines:

void __init console_init(void) {
        memset(ldiscs, 0, sizeof(ldiscs));
        tty_register_ldisc(N_TTY, &tty_ldisc_N_TTY);
        ...
}

The call

        initialize_tty_struct(tty);

we saw in init_dev(), does among other things

        tty->ldisc = ldiscs[N_TTY];

Thus, when tty->ldisc.open is called, it is the open field of the struct tty_ldisc_N_TTY. This struct lives in n_tty.c and its open field is n_tty_open.
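The chain from registration to the call of the open routine can be followed in a small user-space model. This is a sketch under stated assumptions: the _m names are invented, the array size is arbitrary, and the model discipline's open routine merely sets a flag.

```c
#include <stddef.h>

#define NR_LDISCS_M 16          /* array size assumed for this model */
#define N_TTY_M     0           /* the normal discipline has number 0 */

struct tty_m;

struct ldisc_m {
        const char *name;
        int (*open)(struct tty_m *tty);
};

struct tty_m {
        struct ldisc_m ldisc;   /* a copy, as in tty->ldisc = ldiscs[N_TTY] */
        int opened;
};

static struct ldisc_m ldiscs_m[NR_LDISCS_M];

/* Store a copy of the discipline in the slot given by its number. */
int register_ldisc_m(int disc, const struct ldisc_m *ld)
{
        if (disc < 0 || disc >= NR_LDISCS_M)
                return -1;
        ldiscs_m[disc] = *ld;
        return 0;
}

/* Every new tty starts out with the normal discipline. */
void initialize_tty_m(struct tty_m *tty)
{
        tty->opened = 0;
        tty->ldisc = ldiscs_m[N_TTY_M];
}

/* Stand-in for n_tty_open: only records that it was called. */
static int n_tty_open_m(struct tty_m *tty)
{
        tty->opened = 1;
        return 0;
}

const struct ldisc_m n_tty_m = { "n_tty", n_tty_open_m };
```

After register_ldisc_m(N_TTY_M, &n_tty_m) and initialize_tty_m(&tty), the call tty.ldisc.open(&tty) lands in n_tty_open_m, just as tty->ldisc.open lands in n_tty_open.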

More opening

After this preparation, tty->driver.open(tty,file) is finally called. Since we had /dev/tty1 in mind, that is, one of the virtual consoles, let us see what routine this is. In vt.c:vty_init() we see

        console_driver.open = con_open;
        ...
        tty_register_driver(&console_driver);

So, our open routine is con_open(), an amusing open routine. It creates a virtual console if there wasn't one. So, if you have 8 virtual consoles but open /dev/tty23 then you have 9.

If you have lots of unused consoles and want to free the memory they take, use the command deallocvt.

Exercise Which keystroke changes to console 23?

Reading

The system call sys_read() is found in fs/read_write.c. It calls vfs_read(), and this calls file->f_op->read(). In our case, this is the read routine of tty_fops, which unsurprisingly is tty_read. And above we saw that this calls tty->ldisc.read, which is the read field of tty_ldisc_N_TTY, called read_chan. The code is in n_tty.c. It downs the semaphore tty->atomic_read, hangs itself in the wait queue tty->read_wait of waiters for input, goes to sleep if no input is available, copies input to the user buffer, ups the semaphore tty->atomic_read and returns. (Reality is much more complicated. Try to read the code.)

So, hopefully, somebody will fill the input buffer. Who?

Keyboard interrupts arrive at input/keyboard/atkbd.c:atkbd_interrupt(). It handles the keyboard protocol and converts scancode to keycode. Then input_report_key() is called, a define for input_event(), and this routine offers the event to all registered handlers.

Now keyboard.c:kbd_init() registers kbd_handler, and the result is that keyboard keystrokes will be handled by keyboard.c:kbd_event(), which calls kbd_keycode(). Here keyboard raw, mediumraw, xlate and unicode modes are handled, as is the magic sysrequest key. Scancodes have already been converted to keycodes; here we convert back (yecch) for raw mode, leave things as they are for mediumraw mode, or further convert keycodes to characters using the keymap (set by the utility loadkeys). Finally we call

        put_queue(vc, byte);

with the resulting bytes. Here vc is the foreground virtual console.

Now

void put_queue(struct vc_data *vc, int ch) {
        struct tty_struct *tty = vc->vc_tty;

        tty_insert_flip_char(tty, ch, 0);
        con_schedule_flip(tty);
}

that is, put_queue() retrieves vc->vc_tty that was set by con_open(), and puts its stuff in the flip buffer. Then the work of transporting this to the read_buffer is scheduled. (In tty raw mode that is a plain copy, but in canonical mode we must react to special characters: the erase character erases, the interrupt character sends an interrupt, etc.) And when the transporting has been done, the bytes are ready to be read by a read() call.
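The difference between a plain copy and canonical processing can be illustrated with a toy version of the copy step. The sketch below is a user-space model, far simpler than n_tty.c. The name canon_copy is invented, and the choice of DEL as erase character and ^C as interrupt character is an assumption (these are common settings, but they are configurable).

```c
#include <stddef.h>

#define ERASE_CHAR 0x7f         /* DEL, assumed erase character */
#define INTR_CHAR  0x03         /* ^C, assumed interrupt character */

/*
 * Copy n bytes from in (the "flip buffer") to out (the "read buffer"),
 * acting on the erase and interrupt characters on the way.
 * Returns the number of bytes stored in out; *intr is set to 1 if the
 * interrupt character was seen, and to 0 otherwise.
 */
size_t canon_copy(const unsigned char *in, size_t n,
                  unsigned char *out, size_t outlen, int *intr)
{
        size_t i, len = 0;

        *intr = 0;
        for (i = 0; i < n; i++) {
                unsigned char c = in[i];

                if (c == ERASE_CHAR) {
                        /* erase within the current line only */
                        if (len > 0 && out[len - 1] != '\n')
                                len--;
                } else if (c == INTR_CHAR) {
                        *intr = 1;      /* a real tty would send SIGINT */
                } else if (len < outlen) {
                        out[len++] = c;
                }
        }
        return len;
}
```

Feeding the six bytes 'a', 'b', 'x', DEL, 'c', newline through canon_copy() leaves the four bytes "abc\n" in the output buffer; in raw mode all six bytes would have been copied unchanged.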

Writing

Here things are entirely analogous. The system call sys_write() calls vfs_write(), and this calls file->f_op->write(). In our case, this is the write routine of tty_fops, which is tty_write. It does do_tty_write(tty->ldisc.write, ...) which downs the semaphore tty->atomic_write, possibly splits up the write into smaller chunks, calls its first argument and ups the semaphore again.
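The splitting of one large write into smaller pieces handed to a callback can be sketched as follows. This is a user-space model, not do_tty_write() itself: the semaphore is left out, and the names chunked_write, write_fn, struct sink and sink_write are invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Type of the routine that does the real writing (the "first argument"). */
typedef long (*write_fn)(void *ctx, const char *buf, size_t count);

/*
 * Hand buf to wr in pieces of at most chunk bytes.
 * Returns the total number of bytes written, or the result of the
 * first call if that call already failed.
 */
long chunked_write(write_fn wr, void *ctx, const char *buf,
                   size_t count, size_t chunk)
{
        long written = 0;

        if (chunk == 0)
                return -1;
        while (count > 0) {
                size_t size = count < chunk ? count : chunk;
                long ret = wr(ctx, buf, size);

                if (ret <= 0)
                        return written ? written : ret;
                written += ret;
                buf += ret;
                count -= (size_t) ret;
        }
        return written;
}

/* A made-up writer that appends to a small buffer and counts its calls. */
struct sink {
        char data[64];
        size_t len;
        int calls;
};

long sink_write(void *ctx, const char *buf, size_t count)
{
        struct sink *s = ctx;

        if (count > sizeof(s->data) - s->len)
                count = sizeof(s->data) - s->len;
        memcpy(s->data + s->len, buf, count);
        s->len += count;
        s->calls++;
        return (long) count;
}
```

Writing the 11 bytes of "hello world" with a chunk size of 4 results in three calls of sink_write(), with 4, 4 and 3 bytes.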

The write routine here is the write field of tty_ldisc_N_TTY, called write_chan. The code is in n_tty.c. It hangs itself in the wait queue tty->write_wait of waiters for room for output, tries to write by calling tty->driver.write, and if that fails to write everything goes to sleep.

Now our driver was console_driver with write routine con_write that calls do_con_write. Here very obscure things are done to handle escape sequences (cursor movement, screen colours, scrolling, etc. etc.), but in the normal case we see

        scr_writew((attr << 8) + byte, screenpos);

that actually writes the character and the (foreground / background / intensity) attributes. All very messy code - not a joy to behold.
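The 16-bit value written here has the attribute byte in its high half and the character in its low half. The sketch below shows that packing, assuming the usual VGA text-mode attribute layout (foreground colour in bits 0-2, intensity in bit 3, background colour in bits 4-6). The helper names make_attr and make_cell are invented, and the console code may use the attribute bits differently in some modes.

```c
#include <stdint.h>

/* Build an attribute byte from 3-bit colours and an intensity flag. */
uint8_t make_attr(unsigned int fg, unsigned int bg, int bright)
{
        return (uint8_t) (((bg & 7) << 4) | (bright ? 8 : 0) | (fg & 7));
}

/* Combine attribute and character as in (attr << 8) + byte. */
uint16_t make_cell(uint8_t attr, uint8_t byte)
{
        return (uint16_t) ((attr << 8) + byte);
}
```

For instance, grey on black (attribute 0x07) with the character 'A' (0x41) gives the cell value 0x0741.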

11.4 Raw devices

Raw devices are character devices that can be bound to block devices. I/O from/to raw devices bypasses the block caches. Whether that is desirable depends on the application. Usually it is undesirable - there are all kinds of issues with raw devices. A main problem is that of coherency - the block device should not also be accessed directly. An annoyance is that I/O buffers must be aligned. Very few standard programs do this. The code for the raw device does set_blocksize(), so that bad things happen if the device was open already and using a different blocksize. Really, if raw is used it must be the only access path to the block device.
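A program that wants to do I/O on a raw device must therefore allocate its buffers with a suitable alignment. One way to do so in user space is posix_memalign(); the sketch below wraps it. The helper names are invented, and the alignment to use (512 in the usage note, a common sector size) is an assumption that depends on the device.

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate size bytes aligned to alignment (a power of two that is a
 * multiple of sizeof(void *)); returns NULL on failure. */
void *alloc_aligned(size_t alignment, size_t size)
{
        void *buf = NULL;

        if (posix_memalign(&buf, alignment, size) != 0)
                return NULL;
        return buf;
}

/* Check whether the address p is a multiple of alignment. */
int is_aligned(const void *p, size_t alignment)
{
        return ((uintptr_t) p % alignment) == 0;
}
```

A buffer obtained with alloc_aligned(512, 4096) can be released with free() as usual.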

private_data

The block device belonging to a raw device is noted down in the private_data field of the file struct.

ioctls

There are two ioctls: RAW_SETBIND and RAW_GETBIND. The former connects a given raw device to a block device specified by major, minor. The latter reports on a connection. The file descriptor needed for the ioctl is that of the control raw device, with minor number zero. Unbinding is done by binding to major,minor = 0,0.

Binding is done by setting the i_mapping field of the raw device inode to the i_mapping field of the block device. After rebinding this will crash certain kernels because the inode for the block device may have gone away.

11.5 The random device

For security purposes Linux has the devices /dev/random and /dev/urandom. The former produces cryptographically strong bits, but may block when no entropy is available. The latter uses bits from the former when available, and a strong random generator otherwise, and does not block.

Exercise Try dd if=/dev/urandom of=/dev/null bs=1024 count=1000 and immediately afterwards dd if=/dev/random of=/dev/null bs=1024 count=1. The former produces (more than) a megabyte of pseudorandom bits in less than a second. Probably this will have exhausted the entropy pool, and the latter will block until some randomness arrives. Move the mouse a little.

Randomness is needed in-kernel, e.g. for TCP sequence numbers (these must be hard for an attacker to predict, to prevent spoofing), and in user space for passwords or secret keys used to protect something, say the key for the .Xauthority file to protect access to the X server. The random character device is a standard part of the kernel, not something one selects with a config option.

The random device is a subdevice of the mem (for memory) device. The character device major 1 has subdevices mem, kmem, null, port, zero, full, random, urandom, kmsg (for minors 1,2,3,4,5,7,8,9,11 - long ago minor 6 was /dev/core, while minor 10 was reserved for /dev/aio but when aio was implemented it was done differently).

Thus, the registration is found in drivers/char/mem.c.

Randomness is stored in the entropy_store, which has an associated variable entropy_count counting available random bits. The routine random_read() sees whether we have some bits, and if so returns them, and otherwise sleeps. The routine urandom_read() just extracts some bits.
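From user space both devices are read like ordinary files. The following sketch fills a buffer from /dev/urandom with plain open() and read() calls; the helper name read_urandom is invented, and short reads are handled by looping.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Fill buf with len bytes from /dev/urandom.
 * Returns 0 on success and -1 on error. */
int read_urandom(void *buf, size_t len)
{
        unsigned char *p = buf;
        int fd = open("/dev/urandom", O_RDONLY);

        if (fd < 0)
                return -1;
        while (len > 0) {
                ssize_t n = read(fd, p, len);

                if (n < 0 && errno == EINTR)
                        continue;       /* interrupted, try again */
                if (n <= 0) {
                        close(fd);
                        return -1;
                }
                p += n;
                len -= (size_t) n;
        }
        close(fd);
        return 0;
}
```

Replacing the path by /dev/random gives the blocking variant: the read() may then sleep until enough entropy has been collected.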

So the question is how to obtain randomness: something nobody can predict even when all running software is known. The random device uses four sources, namely the routines add_X_randomness, for X = keyboard, mouse, disk, interrupt. The keyboard, the mouse, each IRQ, and each disk have an associated structure

struct timer_rand_state {
        __u32           last_time;
        __s32           last_delta,last_delta2;
        int             dont_count_entropy:1;
};

that remembers when we last did something, and the first and second order differences in the sequence of points in time. The routines add_keyboard_randomness() etc. call add_timer_randomness(), and the current time and the value contributed by the routine (keyboard scancode, mouse data, etc.) are mixed into the pool. In order to estimate the amount of entropy added, only the time is used, not the scancode (etc.) data.
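An entropy estimate based on time differences can be sketched in user space as follows. This is a model under stated assumptions, not the code from the random driver: the names are invented, and the details (taking a third difference as well, using the smallest of the three absolute values, and capping the estimate at MAX_BITS) are assumptions of this sketch.

```c
#include <stdint.h>

#define MAX_BITS 11             /* cap on the estimate, assumed here */

struct timer_state_m {
        uint32_t last_time;
        int32_t  last_delta, last_delta2;
};

/* Absolute value that is also safe for the most negative number. */
static uint32_t abs32(int32_t v)
{
        return v < 0 ? 0u - (uint32_t) v : (uint32_t) v;
}

/* Update the state with the event time now and estimate how many
 * bits of entropy this event contributes. */
unsigned int estimate_bits(struct timer_state_m *s, uint32_t now)
{
        int32_t delta  = (int32_t) (now - s->last_time);
        int32_t delta2 = (int32_t) ((uint32_t) delta - (uint32_t) s->last_delta);
        int32_t delta3 = (int32_t) ((uint32_t) delta2 - (uint32_t) s->last_delta2);
        uint32_t m, t;
        unsigned int bits = 0;

        s->last_time = now;
        s->last_delta = delta;
        s->last_delta2 = delta2;

        /* the smallest difference limits what we dare to count */
        m = abs32(delta);
        t = abs32(delta2);
        if (t < m)
                m = t;
        t = abs32(delta3);
        if (t < m)
                m = t;

        /* roughly log2 of half the smallest difference, capped */
        for (m >>= 1; m != 0 && bits < MAX_BITS; m >>= 1)
                bits++;
        return bits;
}
```

Events that arrive at perfectly regular intervals have second differences equal to zero, so after the first event they are credited with no entropy at all, while an event at an irregular time is credited with a few bits.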
