8. The Linux Virtual File System

The Linux kernel implements the concept of a Virtual File System (VFS, originally Virtual Filesystem Switch), so that it is (to a large degree) possible to separate actual "low-level" filesystem code from the rest of the kernel. The API a filesystem must provide is described below.

This API was designed with things closely related to the ext2 filesystem in mind. For very different filesystems, like NFS, there are all kinds of problems.

Four main objects: superblock, dentries, inodes, files

The kernel keeps track of files using in-core inodes ("index nodes"), usually derived by the low-level filesystem from on-disk inodes.

A file may have several names, and there is a layer of dentries ("directory entries") that represent pathnames, speeding up the lookup operation.

Several processes may have the same file open for reading or writing, and file structures contain the required information such as the current file position.

Access to a filesystem starts by mounting it. This operation takes a filesystem type (like ext2, vfat, iso9660, nfs) and a device and produces the in-core superblock that contains the information required for operations on the filesystem; a third ingredient, the mount point, specifies what pathname refers to the root of the filesystem.

Auxiliary objects

We have filesystem types, used to connect the name of the filesystem to the routines for setting it up (at mount time) or tearing it down (at umount time).

A struct vfsmount represents a subtree in the big file hierarchy - basically a pair (device, mountpoint).

A struct nameidata represents the result of a lookup.

A struct address_space gives the mapping between the blocks in a file and blocks on disk. It is needed for I/O.

8.1 Terminology

Various objects play a role here. There are file systems, organized collections of files, usually on some disk partition. And there are filesystem types, abstract descriptions of the way data is organized in a filesystem of that type, like FAT16 or ext2. And there is code, perhaps a module, that implements the handling of file systems of a given type. Sometimes this code is called a low-level filesystem, low-level since it sits below the VFS just like low-level SCSI drivers sit below the higher SCSI layers.

8.2 Filesystem type registration

A module implementing a filesystem type must announce its presence so that it can be used. Its task is (i) to have a name, (ii) to know how it is mounted, (iii) to know how to lookup files, (iv) to know how to find (read, write) file contents.

This announcing is done using the call register_filesystem(), either at kernel initialization time or when the module is inserted. There is a single argument, a struct that contains the name of the filesystem type (so that the kernel knows when to invoke it) and a routine that can produce a superblock.

The struct is of type struct file_system_type. Here is the 2.2.17 version:

struct file_system_type {
    const char *name;
    int fs_flags;
    struct super_block *(*read_super) (struct super_block *, void *, int);
    struct file_system_type *next;
};

The call register_filesystem() hangs this struct in the chain with head file_systems, and unregister_filesystem() removes it again.

Accesses to this chain are protected by the spinlock file_systems_lock. There are no other writers. The main reader is of course the mount() system call (via get_fs_type()). Other readers are get_filesystem_list() used for /proc/filesystems, and the sysfs system call.

The code is in fs/filesystems.c.
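
In outline (a condensed sketch of the 2.6 code, error handling abridged), registration walks the chain to its end under the write lock and hangs the new type there:

        write_lock(&file_systems_lock);
        p = find_filesystem(fs->name);  /* points at the terminating NULL,
                                           or at an entry with this name */
        if (*p)
                res = -EBUSY;           /* name already registered */
        else
                *p = fs;                /* append the new type */
        write_unlock(&file_systems_lock);

For example, a module implementing a filesystem type "tue" announces itself as follows (2.6 style; the fields are discussed in the next section):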

static struct file_system_type tue_fs_type = {
        .owner          = THIS_MODULE,
        .name           = "tue",
        .get_sb         = tue_get_sb,
        .kill_sb        = kill_block_super,
        .fs_flags       = FS_REQUIRES_DEV,
};

static int __init init_tue_fs(void)
{
        return register_filesystem(&tue_fs_type);
}

static void __exit exit_tue_fs(void)
{
        unregister_filesystem(&tue_fs_type);
}

module_init(init_tue_fs);
module_exit(exit_tue_fs);

8.3 Struct file_system_type

struct file_system_type {
        const char *name;
        int fs_flags;
        int (*get_sb) (struct file_system_type *, int,
                const char *, void *, struct vfsmount *);
        void (*kill_sb) (struct super_block *);
        struct module *owner;
        struct file_system_type *next;
        struct list_head fs_supers;
        struct lock_class_key s_lock_key;
        struct lock_class_key s_umount_key;
};

(In 2.4 there was no kill_sb(), and the role of get_sb() was taken by read_super(). The final parameter of get_sb() and the lock_class_key fields are present since 2.6.18.)

Let us look at the fields of the struct file_system_type.

name

Here the filesystem type gives its name ("tue"), so that the kernel can find it when someone does mount -t tue /dev/foo /dir. (The name is the third parameter of the mount system call.) It must be non-NULL. The name string lives in module space. Access must be protected either by a reference to the module, or by the file_systems_lock.

get_sb

At mount time the kernel calls the fstype->get_sb() routine that initializes things and sets up a superblock. It must be non-NULL. Typically this is a 1-line routine that calls one of get_sb_bdev, get_sb_single, get_sb_nodev, get_sb_pseudo.

The routines get_sb_single and get_sb_nodev are almost identical. Both are for virtual filesystems. The former is used when there can be at most one instance of the filesystem. (If an instance exists already, it is reused, though its flags may be changed.)
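
For a disk-based filesystem the routine might look like this (a minimal sketch assuming the 2.6.18 signatures; tue_fill_super, which would read the superblock off the device, is hypothetical):

static int tue_get_sb(struct file_system_type *fs_type, int flags,
                      const char *dev_name, void *data, struct vfsmount *mnt)
{
        /* let the generic code open the block device and call
           tue_fill_super() to initialize the superblock */
        return get_sb_bdev(fs_type, flags, dev_name, data,
                           tue_fill_super, mnt);
}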

kill_sb

At umount time the kernel calls the fstype->kill_sb() routine to clean up. It must be non-NULL. Typically one of kill_block_super, kill_anon_super, kill_litter_super.

The first is normal for filesystems backed by block devices. The second for virtual filesystems, where the information is generated on the fly. The third for in-memory filesystems without backing store - they need an additional dget() when a file is created (so that their dentries always have a nonzero reference count and are not garbage collected), and the d_genocide() that is the difference between kill_anon_super and kill_litter_super does the balancing dput().
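
Indeed, in fs/super.c the third is just the second preceded by the balancing d_genocide():

void kill_litter_super(struct super_block *sb)
{
        if (sb->s_root)
                d_genocide(sb->s_root);
        kill_anon_super(sb);
}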

fs_flags

The fs_flags field of a struct file_system_type is a bitmap, a bitwise OR of several possible flags, most with rather obscure uses. The flags are defined in fs.h. This field was introduced in 2.1.43. The number of flags, and their meanings, vary between versions. In 2.6.19 there are the four flags FS_REQUIRES_DEV, FS_BINARY_MOUNTDATA, FS_REVAL_DOT, FS_RENAME_DOES_D_MOVE.

owner

The owner field of a struct file_system_type points at the module that owns this struct. When doing things that might sleep, we must make sure that the module is not unloaded while we are using its data, and do this with try_inc_mod_count(owner). If this fails then the module was just unloaded. If it succeeds we have incremented a reference count so that the module will not go away before we are done.

This field is NULL for filesystems compiled into the kernel.

Example of the use of owner - sysfs

There exists a strange SYSV system call sysfs that will return (i) a sequence number given a filesystem type, and (ii) a filesystem type given a sequence number, and (iii) the total number of filesystem types registered now. This call is not supported by libc or glibc.

These sequence numbers are rather meaningless, since they may change at any moment. Still, the call allows one to get a snapshot of the list of filesystem types without looking at /proc/filesystems. For example, the program

#include <stdio.h>
#include <stdlib.h>
#include <linux/unistd.h>

/* define the 3-arg version of sysfs() */
static _syscall3(int,sysfs,int,option,unsigned int,fsindex,char *,buf);

/* define the 1-arg version of sysfs() */
static int sysfs1(int i) {
    return sysfs(i,0,NULL);
}

int main(void) {
    int i, tot;
    char buf[100];      /* how long is a filesystem type name?? */

    tot = sysfs1(3);
    if (tot == -1) {
        perror("sysfs(3)");
        exit(1);
    }

    for (i=0; i<tot; i++) {
        if (sysfs(2, i, buf)) {
            perror("sysfs(2)");
            exit(1);
        }
        printf("%2d: %s\n", i, buf);
    }
    return 0;
}

might give output like

 0: ext2
 1: minix
 2: romfs
 3: msdos
 4: vfat
 5: proc
 6: nfs
 7: smbfs
 8: iso9660

The kernel code for copying the names to user space is instructive:

static int fs_name(unsigned int index, char * buf)
{
        struct file_system_type * tmp;
        int len, res;

        read_lock(&file_systems_lock);
        for (tmp = file_systems; tmp; tmp = tmp->next, index--)
                if (index <= 0 && try_inc_mod_count(tmp->owner))
                                break;
        read_unlock(&file_systems_lock);
        if (!tmp)
                return -EINVAL;

        /* OK, we got the reference, so we can safely block */
        len = strlen(tmp->name) + 1;
        res = copy_to_user(buf, tmp->name, len) ? -EFAULT : 0;
        put_filesystem(tmp);
        return res;
}

In order to walk safely along a linked list we need the read lock. The routines that change links (like register_filesystem) need a write lock. Once the filesystem name with the desired index is found, we cannot just copy this name to user space: maybe the page we want to copy to was swapped out, and getting it back in core takes some time, and maybe the module is unloaded just at that point; then, when we read the name, we would reference memory that is no longer present.

The routine try_inc_mod_count() first takes the module unload lock, then checks whether the module is still present; if so, it increments the module's refcount and returns 1 (after releasing the unload lock), otherwise it returns 0. After a successful return of try_inc_mod_count() we own a reference to the module, so that it cannot disappear while we are doing copy_to_user(). The put_filesystem() decreases the module's refcount again.

So this is how the owner field is used: it tells which module must be pinned when we do something with this struct. A module stays as long as its refcount is positive, but can disappear any moment when the refcount becomes zero.

next

In fs/filesystems.c there is a global variable

static struct file_system_type *file_systems;
that is the head of the list of known filesystem types. register_filesystem() adds a filesystem type to the linked list, and unregister_filesystem() removes it again. The field next is the link in this singly linked list. It must be NULL when register_filesystem() is called, and is reset to NULL by unregister_filesystem(). The list is protected by the file_systems_lock.

fs_supers

The fs_supers field of a struct file_system_type is the head of a list of all superblocks of this type. In each superblock the corresponding link is called s_instances. This list is protected by the spinlock sb_lock. It is used in sget() for filesystems like NFS, where we get a filehandle and must check for each superblock of the given type whether it is the right one.
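
The search in sget() is roughly as follows (condensed from fs/super.c; test is the caller-supplied comparison routine, and the retry path is omitted):

        struct super_block *old;
        struct list_head *p;

        spin_lock(&sb_lock);
        list_for_each(p, &type->fs_supers) {
                old = list_entry(p, struct super_block, s_instances);
                if (test(old, data)) {
                        /* found a matching superblock of this type:
                           take a reference and reuse it */
                        if (!grab_super(old))
                                goto retry;
                        return old;
                }
        }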

s_lock_key, s_umount_key

These are fields used when CONFIG_LOCKDEP is defined, and take no space otherwise. Used for lock validation.

8.4 Mounting

The mount system call attaches a filesystem to the big file hierarchy at some indicated point. Ingredients needed: (i) a device that carries the filesystem (disk, partition, floppy, CDROM, SmartMedia card, ...), (ii) a directory where the filesystem on that device must be attached, (iii) a filesystem type.

In many cases it is possible to guess (iii) given the bits on the device, but heuristics fail in rare cases. Moreover, sometimes there is no difference on the device, as for example in the case where a FAT filesystem without long filenames must be mounted. Is it msdos? or vfat? That information is only in the user's head. If it must be used later in an environment that cannot handle long filenames it should be mounted as msdos; if files with long names are going to be copied to it, as vfat.

The kernel does not guess (except perhaps at boot time, when the root device has to be found), and requires the three ingredients. In fact the mount system call has five parameters: there are also mount flags (like "read-only") and options, like for ext2 the choice between errors=continue and errors=remount-ro and errors=panic.
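
From user space these five ingredients are precisely the five parameters of mount(2). A minimal caller (the device name and mount point here are made-up examples):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* device, mount point, type, flags, fs-specific options */
        if (mount("/dev/foo", "/dir", "ext2", MS_RDONLY,
                  "errors=remount-ro") < 0) {
                perror("mount");
                return 1;
        }
        return 0;
}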

The code for sys_mount() is found in fs/namespace.c and fs/super.c. The connection with the filesystem type name is made in do_kern_mount():

        struct file_system_type *type = get_fs_type(fstype);
        struct super_block *sb;

        if (!type)
                return ERR_PTR(-ENODEV);
        sb = type->get_sb(type, flags, name, data);
and this is the only call of the get_sb() routine. (This fragment shows the pre-2.6.18 convention; since 2.6.18 get_sb() also receives the struct vfsmount * and returns an error code rather than the superblock.)

The code for sys_umount() is found in fs/namespace.c and fs/super.c. The counterpart of the just quoted code is the cleanup in deactivate_super():

        fs->kill_sb(s);
and this is the only call of the kill_sb() routine.

8.5 The superblock

The superblock gives global information on a filesystem: the device on which it lives, its block size, its type, the dentry of the root of the filesystem, the methods it has, etc., etc.

struct super_block {
        dev_t s_dev;
        unsigned long s_blocksize;
        struct file_system_type *s_type;
        struct super_operations *s_op;
        struct dentry *s_root;
        ...
}

struct super_operations {
        struct inode *(*alloc_inode)(struct super_block *sb);
        void (*destroy_inode)(struct inode *);
        void (*read_inode) (struct inode *);
        void (*dirty_inode) (struct inode *);
        void (*write_inode) (struct inode *, int);
        void (*put_inode) (struct inode *);
        void (*drop_inode) (struct inode *);
        void (*delete_inode) (struct inode *);
        void (*put_super) (struct super_block *);
        void (*write_super) (struct super_block *);
        int (*sync_fs)(struct super_block *sb, int wait);
        void (*write_super_lockfs) (struct super_block *);
        void (*unlockfs) (struct super_block *);
        int (*statfs) (struct super_block *, struct statfs *);
        int (*remount_fs) (struct super_block *, int *, char *);
        void (*clear_inode) (struct inode *);
        void (*umount_begin) (struct super_block *);
        int (*show_options)(struct seq_file *, struct vfsmount *);
};

This is enough to get started: the dentry of the root directory tells us the inode of this root directory (and in particular its i_ino), and sb->s_op->read_inode(inode) will read this inode from disk. Now inode->i_op->lookup() allows us to find names in the root directory, etc.
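
In code, the first steps look roughly like this (a sketch only; locking and error handling are omitted, and qname is a hypothetical struct qstr holding the name we want):

        struct dentry *root = sb->s_root;
        struct inode *dir = root->d_inode;      /* inode of the root directory */
        struct dentry *child;

        child = d_alloc(root, &qname);          /* negative dentry for the name */
        dir->i_op->lookup(dir, child);          /* the low-level filesystem
                                                   attaches the inode, if any */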

Each superblock is on six lists, with links through the fields s_list, s_dirty, s_io, s_anon, s_files, s_instances, respectively.

The super_blocks list

All superblocks are collected in a list super_blocks with links in the fields s_list. This list is protected by the spinlock sb_lock. The main use is in super.c:get_super() or user_get_super() to find the superblock for a given block device. (Both routines are identical, except that one takes a bdev, the other a dev_t.) This list is also used in various places where all superblocks must be sync'ed or all dirty inodes must be written out.

The fs_supers list

All superblocks of a given type are collected in a list headed by the fs_supers field of the struct filesystem_type, with links in the fields s_instances. Also this list is protected by the spinlock sb_lock. See above.

The file list

All open files belonging to a given superblock are chained in a list headed by the s_files field of the superblock, with links in the fields f_list of the files. These lists are protected by the spinlock files_lock. This list is used for example in fs_may_remount_ro() to check that there are no files currently open for writing. See also below.
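
Condensed from the kernel source (details such as the S_ISREG test omitted), the check walks this list looking for a writer:

        struct list_head *p;

        file_list_lock();
        for (p = sb->s_files.next; p != &sb->s_files; p = p->next) {
                struct file *file = list_entry(p, struct file, f_list);
                if (file->f_mode & FMODE_WRITE)
                        goto too_bad;   /* an open writer: refuse */
        }
        file_list_unlock();
        return 1;                       /* may remount read-only */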

The list of anonymous dentries

Normally, all dentries are connected to root. However, when NFS filehandles are used this need not be the case. Dentries that are roots of subtrees potentially unconnected to root are chained in a list headed by the s_anon field of the superblock, with links in the fields d_hash. These lists are protected by the spinlock dcache_lock. They are grown in dcache.c:d_alloc_anon() and shrunk in super.c:generic_shutdown_super(). See the discussion in Documentation/filesystems/Exporting.

The inode lists s_dirty, s_io

Lists of inodes to be written out. These lists are headed at the s_dirty (resp. s_io) field of the superblock, with links in the fields i_list. These lists are protected by the spinlock inode_lock. See fs/fs-writeback.c.

8.6 Inodes

An (in-core) inode contains the metadata of a file: its serial number, its protection (mode), its owner, its size, the dates of last access, creation and last modification, etc. It also points to the superblock of the filesystem the file is in, the methods for this file, and the dentries (names) for this file.

struct inode {
        unsigned long i_ino;
        umode_t i_mode;
        uid_t i_uid;
        gid_t i_gid;
        kdev_t i_rdev;
        loff_t i_size;
        struct timespec i_atime;
        struct timespec i_ctime;
        struct timespec i_mtime;
        struct super_block *i_sb;
        struct inode_operations *i_op;
        struct address_space *i_mapping;
        struct list_head i_dentry;
        ...
}

In early times, struct inode would end with a union

        union {
                struct minix_inode_info         minix_i;
                struct ext2_inode_info          ext2_i;
                struct ext3_inode_info          ext3_i;
                struct hpfs_inode_info          hpfs_i;
                ...
        } u;
to store the filesystem-type-specific data. One could go from the inode to e.g. struct ext3_inode_info via inode->u.ext3_i. This setup was rather unsatisfactory, since it meant that a core data structure had to know about all possible filesystem types (even possible out-of-tree ones) and reserve enough room for the largest of the struct foofs_inode_info. It also wasted memory.

In Linux 2.5.3 this system was changed: instead of a big struct inode with a filesystem-type-dependent part, we now have big filesystem-type-dependent inodes, each with a VFS part. Thus, struct ext3_inode_info has as its last field struct inode vfs_inode;, and given the VFS inode inode one finds the ext3 information via EXT3_I(inode), defined as container_of(inode, struct ext3_inode_info, vfs_inode). See also the discussion of container_of.
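
The allocation side shows the same idea: the superblock's alloc_inode() method allocates the big ext3 structure and hands the VFS only the embedded part (condensed from fs/ext3/super.c):

static struct inode *ext3_alloc_inode(struct super_block *sb)
{
        struct ext3_inode_info *ei;

        ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
        if (!ei)
                return NULL;
        /* ... initialize the ext3-specific fields ... */
        return &ei->vfs_inode;  /* the VFS sees only this part */
}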

The methods of an inode are given in the struct inode_operations.

struct inode_operations {
        int (*create) (struct inode *, struct dentry *, int);
        struct dentry * (*lookup) (struct inode *, struct dentry *);
        int (*link) (struct dentry *, struct inode *, struct dentry *);
        int (*unlink) (struct inode *, struct dentry *);
        int (*symlink) (struct inode *, struct dentry *, const char *);
        int (*mkdir) (struct inode *, struct dentry *, int);
        int (*rmdir) (struct inode *, struct dentry *);
        int (*mknod) (struct inode *, struct dentry *, int, dev_t);
        int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char *,int);
        int (*follow_link) (struct dentry *, struct nameidata *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *, const void *, size_t, int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
};

Each inode is on four lists, with links through the fields i_hash, i_list, i_dentry, i_devices.

The dentry list

All dentries belonging to this inode (names for this file) are collected in a list headed by the inode field i_dentry with links in the dentry fields d_alias. This list is protected by the spinlock dcache_lock.

The hash list

All inodes live in a hash table, with hash collision chains through the field i_hash of the inode. These lists are protected by the spinlock inode_lock. The appropriate head is found by a hash function; it will be an element of the inode_hashtable[] array when the inode belongs to a superblock, or anon_hash_chain if not.

i_list

Inodes are collected into lists that use the i_list field as link field. The lists are protected by the spinlock inode_lock. An inode is either unused, and then on the chain with head inode_unused, or in use but not dirty, and then on the chain with head inode_in_use, or dirty, and then on one of the per-superblock lists with heads s_dirty or s_io, see above.

i_devices

Inodes belonging to a given block device are collected into a list headed by the bd_inodes field of the block device, with links in the inode i_devices fields. The list is protected by the bdev_lock spinlock. It is used to set the i_bdev field to NULL and to reset i_mapping when the block device goes away.

8.7 Dentries

The dentries encode the filesystem tree structure, the names of the files. Thus, the main parts of a dentry are the inode (if any) that belongs to it, the name (the final part of the pathname), and the parent (the dentry of the containing directory). There are also the superblock, the methods, a list of subdirectories, etc.

struct dentry {
        struct inode *d_inode;
        struct dentry *d_parent;
        struct qstr d_name;
        struct super_block *d_sb;
        struct dentry_operations *d_op;
        struct list_head d_subdirs;
        ...
}

struct dentry_operations {
        int (*d_revalidate)(struct dentry *, int);
        int (*d_hash) (struct dentry *, struct qstr *);
        int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
        int (*d_delete)(struct dentry *);
        void (*d_release)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);
};

Here the strings are given by

struct qstr {
        const unsigned char *name;
        unsigned int len;
        unsigned int hash;
};

Each dentry is on five lists, with links through the fields d_hash, d_lru, d_child, d_subdirs, d_alias.

Naming flaw

Some of these names were badly chosen, and lead to confusion. We should do a global replace changing d_subdirs into d_children and d_child into d_sibling.

Value of a dentry

The pathname represented by a dentry is the concatenation of the pathname of its parent d_parent, a slash character, and its own name d_name.

However, if the dentry is the root of a mounted filesystem (i.e., if dentry->d_covers != dentry), then its pathname is the pathname of the mount point d_covers. Finally, the pathname of the root of the filesystem (with dentry->d_parent == dentry) is "/", and this is also its d_name.

The d_mounts and d_covers fields of a dentry point back to the dentry itself, with two exceptions: the d_covers field of the dentry for the root of a mounted filesystem points back to the dentry for the mount point, and the d_mounts field of the dentry for the mount point points at the dentry for the root of the mounted filesystem.

The d_parent field of a dentry points back to the dentry for the directory in which it lives. It points back to the dentry itself in case of the root of a filesystem.
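
A naive sketch of this recursion (ignoring locking, overmounting and buffer bounds; the real code is d_path() in fs/dcache.c):

static char *path_of(struct dentry *d, char *p)
{
        if (d->d_parent == d)           /* root: caller prints just "/" */
                return p;
        p = path_of(d->d_parent, p);    /* pathname of the parent first */
        *p++ = '/';
        memcpy(p, d->d_name.name, d->d_name.len);
        return p + d->d_name.len;       /* caller null-terminates */
}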

A dentry is called negative if it does not have an associated inode, i.e., if it is a name only.

We see that although a dentry represents a pathname, there may be several dentries for the same pathname, namely when overmounting has taken place. Such dentries have different inodes.

Of course the converse, an inode with several dentries, can also occur.

The above description, with d_mounts and d_covers, was for 2.4. In 2.5 these fields have disappeared, and we only have the integer d_mounted that indicates how many filesystems have been mounted at that point. In case it is nonzero (this is what d_mountpoint() tests), a hash table lookup can find the actual mounted filesystem.
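
The test is trivial (include/linux/dcache.h):

static inline int d_mountpoint(struct dentry *dentry)
{
        return dentry->d_mounted;
}

and lookup_mnt() in fs/namespace.c does the hash table lookup, given the (vfsmount, dentry) pair of the mount point.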

d_hash

Dentries are used to speed up the lookup operation. A hash table dentry_hashtable is used, with an index that is a hash of the name and the parent. The hash collision chain has links through the dentry fields d_hash. This chain is protected by the spinlock dcache_lock.

d_lru

All unused dentries are collected in a list dentry_unused with links in the dentry fields d_lru. This list is protected by the spinlock dcache_lock.

d_child, d_subdirs

All subdirectories of a given directory are collected in a list headed by the dentry field d_subdirs with links in the dentry fields d_child. These lists are protected by the spinlock dcache_lock.

d_alias

All dentries belonging to the same inode are collected in a list headed by the inode field i_dentry with links in the dentry fields d_alias. This list is protected by the spinlock dcache_lock.

8.8 Files

File structures represent open files, that is, an inode together with a current (reading/writing) offset. The offset can be set by the lseek() system call. Note that instead of a pointer to the inode we have a pointer to the dentry - that means that the name used to open a file is known. In particular system calls like getcwd() are possible.

struct file {
        struct dentry *f_dentry;
        struct vfsmount *f_vfsmnt;
        struct file_operations *f_op;
        mode_t f_mode;
        loff_t f_pos;
        struct fown_struct f_owner;
        unsigned int f_uid, f_gid;
        unsigned long f_version;
        ...
}

Here the f_owner field gives the owner to use for async I/O signals.

struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, char *, size_t, loff_t);
        ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
        ssize_t (*aio_write) (struct kiocb *, const char *, size_t, loff_t);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, struct dentry *, int datasync);
        int (*aio_fsync) (struct kiocb *, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
        ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
        ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
};

Each file is in two lists, with links through the fields f_list, f_ep_links.

f_list

The list with links through f_list was discussed above. It is the list of all files belonging to a given superblock. There is a second use: the tty driver collects all files that are opened instances of a tty in a list headed by tty->tty_files with links through the file field f_list. Conversely, these files point back at the tty via their field private_data.

(This field private_data is also used elsewhere. For example, the proc code uses it to attach a struct seq_file to a file.)

The event poll list

All event poll items belonging to a given file are collected in a list with head f_ep_links, protected by the file field f_ep_lock. (For event poll stuff, see epoll_ctl(2).)

8.9 struct vfsmount

A struct vfsmount describes a mount. The definition lives in mount.h:

struct vfsmount {
        struct list_head mnt_hash;
        struct vfsmount *mnt_parent;    /* fs we are mounted on */
        struct dentry *mnt_mountpoint;  /* dentry of mountpoint */
        struct dentry *mnt_root;        /* root of the mounted tree */
        struct super_block *mnt_sb;     /* pointer to superblock */
        struct list_head mnt_mounts;    /* list of children, anchored here */
        struct list_head mnt_child;     /* and going through their mnt_child */
        atomic_t mnt_count;
        int mnt_flags;
        char *mnt_devname;              /* Name of device e.g. /dev/dsk/hda1 */
        struct list_head mnt_list;
};

Long ago (1.3.46) it was introduced as part of the quota code. There was a linked list of struct vfsmounts that contained a device number, device name, mount point name, mount flags, superblock pointer, semaphore, file pointers to quota files and time limits for how long an over-quota situation would be allowed. Nowadays quotas have independent bookkeeping, and a struct vfsmount only describes a mount.

These structs are allocated by alloc_vfsmnt() and released by free_vfsmnt() in namespace.c.

mnt_hash

Vfsmounts live in a hash headed by mount_hashtable[]. The field mnt_hash is the link in the collision chain. This list does not seem to be protected by a lock. They are put into the hash by attach_mnt(), found there by lookup_mnt(), and removed again by detach_mnt(), all from namespace.c.

mnt_parent

Vfsmount for parent.

mnt_mountpoint

Dentry for the mountpoint. The pair (mnt_mountpoint, mnt_parent) (returned by follow_up()) will be the dentry and vfsmount for the parent. Used e.g. in d_path to return the pathname of a dentry.
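
Condensed from fs/namei.c (locking omitted), follow_up() climbs from a mounted tree to the mount it is mounted on:

        struct vfsmount *parent = (*mnt)->mnt_parent;
        struct dentry *mountpoint;

        if (parent == *mnt)
                return 0;                       /* already at the top */
        mntget(parent);
        mountpoint = dget((*mnt)->mnt_mountpoint);
        dput(*dentry);
        *dentry = mountpoint;                   /* step onto the mountpoint */
        mntput(*mnt);
        *mnt = parent;                          /* in the parent mount */
        return 1;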

mnt_root

Dentry for the root of the mounted tree.

mnt_sb

Superblock of the mounted filesystem.

mnt_mounts, mnt_child

The field mnt_mounts of a struct vfsmount is the head of a cyclic list of all submounts (mounts on top of some path relative to the present mount). The remaining links of this cyclic list are stored in the mnt_child fields of its submounting vfsmounts. (And each of these points back at us with its mnt_parent field.) Used in autofs4/expire.c and namespace.c (and nowhere else).

mnt_count

Keeps track of the users of this structure. Incremented by mntget, decremented by mntput. Initially 1. It will be 2 for a mount that may be unmounted. (Autofs also uses this to test whether a tree is busy.)

mnt_flags

The mount flags, like MNT_NODEV, MNT_NOEXEC, MNT_NOSUID. Earlier also MS_RDONLY (that now is stored in sb->s_flags) and MNT_VISIBLE (came in 2.4.0-test5, went in 2.4.5) that told whether this entry should be visible in /proc/mounts.

mnt_devname

Name used in /proc/mounts.

mnt_list

There was a global cyclic list vfsmntlist containing all mounts, used only to create the contents of /proc/mounts. These days we have per-process namespaces, and the global vfsmntlist has been replaced by current->namespace->list. This list is ordered by the order in which the mounts were done, so that one can do the umounts in reverse order. The field mnt_list contains the pointers for this cyclic list.

8.10 fs_struct

A struct fs_struct determines the interpretation of pathnames referred to by a process (and also, somewhat illogically, contains the umask). The typical reference is current->fs. The definition lives in fs_struct.h:

struct fs_struct {
        atomic_t count;
        rwlock_t lock;
        int umask;
        struct dentry * root, * pwd, * altroot;
        struct vfsmount * rootmnt, * pwdmnt, * altrootmnt;
};

The semantics of root and pwd are clear. It remains to discuss altroot.

altroot

In order to support emulation of different operating systems like BSD and SunOS and Solaris, a small wart has been added to the walk_init_root code that finds the root directory for a name lookup.

The altroot field of an fs_struct is usually NULL. It is a function of the personality and the current root, and the sys_personality and sys_chroot system calls call set_fs_altroot().

The effect is determined at kernel compile time. One can define __emul_prefix() in <asm/namei.h> as some pathname, say "usr/gnemul/myOS/". The default is NULL, but some architectures have a definition depending on current->personality. If this prefix is non-NULL, and the corresponding file is found, then set_fs_altroot() will set the altroot and altrootmnt fields of current->fs to dentry and vfsmnt of that file.

A subsequent lookup of a pathname starting with '/' will now first try to use the altroot. If that fails the usual root is used.

8.11 nameidata

A struct nameidata represents the result of a lookup. The definition lives in fs.h:

struct nameidata {
        struct dentry *dentry;
        struct vfsmount *mnt;
        struct qstr last;
        unsigned int flags;
        int last_type;
};
The typical use is:

        struct nameidata nd;
        error = user_path_walk(filename, &nd);
        if (!error)
                path_release(&nd);

where path_release() does

        dput(nd->dentry);
        mntput(nd->mnt);

The core of the routines user_path_walk_link and user_path_walk (which call __user_walk without or with the LOOKUP_FOLLOW flag) is the fragment

        if (path_init(name, flags, nd))
                error = path_walk(name, nd);

So the basic routines handling nameidata are path_init and path_walk. The former finds the start of the walk, the latter does the walking. (However, the former returns 0 in case it did the walking itself already.)

path_init

The routine path_init initialises the four fields dentry, mnt, flags, last_type. The flags field is given as an argument, and dentry and mnt are initialised to those of the current directory or those of the root directory, depending on whether name starts with a '/' or not. It will always return 1, except in a certain obscure case discussed below, where a return of 0 means that the complete lookup was done already. (And this case cannot occur for sys_chroot, which is why the code there need not check the return value.)

A wart

(path_init will always return 1, except when name starts with a '/', in which case it returns whatever walk_init_root returns. walk_init_root will always return 1, except when current->fs->altroot is non-NULL, nd->flags does not contain LOOKUP_NOALT (for sys_chroot it does), and __emul_lookup_dentry succeeds, which it does when its path walk succeeds; in this case no further path_walk is required.)

