Related sites: Algorithms Group

I/O-efficient algorithms (2IL35)

 Introduction    Organisation    Schedule    Literature    Prerequisites    Grading

Instructor: Herman Haverkort (office: HG 7.35)
Course dates: February to June 2012

(Registration for this course is now closed. Students who did not register are welcome to our similar course next year.)

 Update 15 June: added the slides of the last three lectures.
 Update 17 February: included most of the lecture schedule, most of the literature that needs to be studied, the answers to the prerequisites test (use username "io" and password "io" to access the file), and some "small print" at the bottom (policies regarding attendance, grading, deadlines etc.).

## Introduction

### Contents

Traditionally, the running time of an algorithm is analysed in terms of the number of computational steps it performs (comparisons, additions, etc.). However, when working with huge amounts of data that do not fit in main memory, the number of disk accesses is much more important for the running time than the number of computational steps. Even when the data fit in main memory at once, a significant part of the running time of an algorithm that works on large amounts of data is determined by the way in which the algorithm exploits the cache: data that is cached can be accessed much faster than data that resides in main memory. Therefore it is important to design algorithms for large data sets in such a way, that the data needed next is always likely to be data recently brought into cache.

In this course we study so-called I/O-efficient algorithms: algorithms and data structures that are designed in order to keep the number of disk accesses or cache misses as small as possible. We discuss models for the design and analysis of such algorithms, basic algorithms, data structures and paradigms, and applications where I/O-efficiency makes the difference between algorithms that work and algorithms that do not work in practice. Furthermore, we will see I/O-efficiency at work in some simple experiments with huge data sets.

### Goals

The course will make the students familiar with the basic concepts and techniques for the design of I/O-efficient algorithms and data structures. The students will learn to recognize I/O-bottlenecks in algorithms, learn basic techniques to alleviate such bottlenecks, and will get an idea of the orders of magnitude of the parameters involved. The students will practice finding, reading, interpreting and explaining advanced literature on the topic. Furthermore the course offers opportunities for students to practice their presentation skills.

## Organisation

### Educational form

The course is taught in a mixed form. Some lectures are given by the instructor; most lectures are given by the students, who will also run a number of experiments.

The students collaborate in triples to prepare their lectures. Each triple gets assigned a particular topic and has to give a lecture of 2 x 45 minutes. Links to basic literature for the lectures are provided by the instructor. The students are expected to look up additional literature when necessary.

The instructor will advise the students on the preparation of their talks. The students will meet with the instructor at least two weeks before the lecture for advice on its contents, and at least one week before the lecture to discuss (the draft version of) the presentation. Each student is expected to spend about 40 hours on preparing the lecture. For some general advice on how to study the material and on how to prepare your talk, see How to give a seminar talk.

The students will also collaborate in (and also across) triples to study possible solutions for problems where I/O-efficiency is crucial. The problems studied are abstracted versions of fundamental problems occurring in real-life applications. The students will do experiments with different solutions and write short papers about them. Near the end of the course, the students review each other's papers and present their results to each other.

There is a homework assignment at the end of course.

### Registration

Registration for this course is now closed.

## Schedule

All meetings for this course---except one on 5 April---are on Fridays, from 15:45 to 17:30 in the lecture hall of Gemini-Zuid.

datetimeroomtopicliteraturelecturer(s)PC time
Fri10Feb15:45Gem-ZIntroductionMeyer: sections 1 (except 1.5), 2 (intro), 2.1.1, 3.1-3.2; Demaine: section 2; slidesHerman HaverkortN/A
Fri17Feb15:45Gem-ZBasics: stacks, queues, sorting, caching policiessee 1 Feb; in addition Duartes; Sleator, section 4; slides/notesHerman HaverkortN/A
Fri24Febno meeting (carnival vacation)N/A
Fri 2Mar15:45Gem-ZChallenges: graph traversals, permutations, list rankingMeyer: sections 1.5, 4.1, 4.2; Zeh: sections 2, 3; slides/notesHerman Haverkort27/2-4/3 Team ABM
Fri 9Mar15:45Gem-ZBreadth-first searchMeyer: sections 4.2, 4.3; and/or Zeh: section 6; slides directed BFS, slides undirected BFSTeam PQS5/3-11/3 Team ABM
Fri16Mar15:45Gem-ZOrdered lists and search trees; report by team ABM;Meyer: sections 2.1.2, 2.3 (intro), 2.3.1-2.3.4; Demaine: section 4.1 slides/notesHerman Haverkort12/3-18/3 Team ABM
Fri23Mar15:45Gem-ZPersistent B-trees and interval treesArge (2): sections 4, 6; slides persistent B-trees; slides internal interval trees; slides external interval treesTeam BNW19/3-25/3 Team MNW
Fri30Marno meeting (open door day)26/3-1/4 Team LMP
Thu 5Apr15:45Aud 1Priority queues, shortest paths and DFSMeyer: sections 4.4-7; and/or Zeh: sections 6.1, 7; slides DFS; slides SSSPTeam AT2/4-8/4 Team BNW
Fri13Aprno meeting (mid-semester break)9/4-15/4 Team LMP
Fri20Aprno meeting (mid-semester break)16/4-22/4 Team AT
Fri27Apr15:45Gem-ZPriority queues, time-forward processing and plane sweepingMeyer: sections 3.4, 3.5 or Zeh: section 4; Arge (1): sections 1-4; De Berg: section 2.1; slidesTeam LMP23/4-29/4 Team AT
Fri 4May15:45Gem-ZShortest paths on grids and planar graphsZeh: section 8; slidesTeam HLS30/4-6/5 Team GGT
Fri11Mayno meeting (lecturers ill)7/5-13/5 Team GGT
Fri18Mayno meeting (university closed on the day after Ancension Day)14/5-20/5 Team PQS
Fri25May15:45Gem-ZOperations on sparse matricesBender: sections 1-4, 7; slidesTeam GGT21/5-27/5 Team HLS
Fri 1Jun15:45Gem-ZData structures for points in the planeArge (2): section 8.2.1; Beckmann; Kamel; Van den Bercken; slidesTeam MNW28/5-3/6 Team MNW
Fri 8Jun15:45deadline almost-final (reviewable) reports on experiments with sorting & permutation, BFS, DFS, and time-forward processing4/6-10/6 Team BNW
15:45Gem-ZTopological sortingArge (3): sections 1-2; Ajwani; slidesTeam ABM
Fri15Jun15:45deadline almost-final (reviewable) reports on experiments with connected components, matrix operations, search trees, and text search structures11/6-17/6 Team PQS
15:45deadline reviews on experiments with sorting & permutation, BFS, DFS, and time-forward processing
15:45Gem-ZDiscussing experiments with sorting & permutation, BFS, DFS, and time-forward processing Teams ABM, LMP, AT, GGT
Fri22Jun15:45deadline reviews on experiments with connected components, matrix operations, search trees, and text search structures
15:45Gem-ZDiscussing experiments with connected components, matrix operations, search trees, and text search structures Teams HLS, BNW, MNW, PQS

At the end of each meeting, some time may be reserved to discuss the progress or results of early experimentation projects.

Note that attending all meetings is mandatory. In case this is a problem, contact Herman Haverkort in advance.

The teams and their assignments are as follows:

teammemberslectureexperimentation projectcomputer time
ABMAlex, Bauke, MartinTopological sorting (8/6)Sorting & Permutationuntil 18/3
ATAde, TengDepth-first search etc. (5/4)Depth-first search16/4-29/4
BNWBram, Nicolas, WilcoPersistent B-trees etc. (23/3)Matrix operations2/4-8/4; 4/6-10/6
GGTGijs, Gijs, TwanSparse matrices (25/5)Time-forward processing30/4-13/5
HLSHoisun, Luc, StefPlanar graphs (4/5)Connected components21/5-27/5
LMPLourens, Marlous, PieterTime-forward processing etc. (27/4)Breadth-first search26/3-1/4; 9/4-15/4
MNWMarcel, Niels, WouterData structures for points (11/5)Search trees19/3-25/3; 28/5-3/6
PQSPaul, Quirijn, SanderBreadth-first search (9/3)Text search structures14/5-20/5; 11/6-17/6

## Literature

[Ajwani]
D. Ajwani, A. Cosgaya-Lozana, and N. Zeh,
Engineering a topological sorting algorithm for massive graphs,
In International Workshop on Algorithm Engineering and Experiments (ALENEX) 2011, p139-150
Available from publisher.

[Arge (1)]
L. Arge,
The Buffer Tree: a technique for designing batched external data structures,
Algorithmica (2003) 37:1-24
Available from publisher.

[Arge (2)]
L. Arge,
External memory geometric data structures,
Lecture notes, 2005
Available from author's website.

[Arge (3)]
L. Arge and L. Toma,
Simplified external memory algorithms for planar DAGs.
In Proc. 9th Scandinavian Workshop on Algorithm Theory (SWAT) 2004, LNCS 3111:493-503
Available from publisher.

[Beckmann]
N. Beckmann and B. Seeger,
A revised R*-tree in comparison with related index structures,
In Proc. 35th ACM SIGMOD Int. Conf. on Management of Data (SIGMOD) 2009, p799-812
Available from publisher.

[Bender]
M. A. Bender, G. Stølting Brodal, R. Fagerberg, R. Jacob, and E. Vicari,
Optimal sparse matrix dense vector multiplication in the I/O-Model,
Theory of Computing Systems (2010) 47(4):934-962
Available from publisher.

[Van den Bercken]
J. van den Bercken and B. Seeger,
In Proc. 27th Int. Conf. on Very Large Data Bases (VLDB 2001), p461-470
Available from publisher.

[De Berg]
M. de Berg, M. van Kreveld, M. Overmars, and O. Cheong (a.k.a. Schwarzkopf),
Computational geometry: algorithms and applications,
2nd edition, Springer, 2000.

[Demaine]
E. Demaine,
Cache-Oblivious Algorithms and Data Structures,
Lecture notes of EEF Summer School on Massive Data Sets, Aarhus, 2002
Available from the author's website.

[Duartes]
G. Duartes,
What your computer does while you wait,
blog post, 30 Nov 2008;

[Kamel]
I. Kamel and C. Faloutsos,
Hilbert R-tree: An improved R-tree using fractals,
In Proc. of 20th Int. Conf. on Very Large Data Bases (VLDB) 1994, p500–509
Available from publisher.

[Meyer]
U. Meyer, P. Sanders, and J. Sibeyn (eds.),
Algorithms for Memory Hierarchies,
Lecture Notes in Computer Science 2625, Springer, 2003.
The whole book can be downloaded from the publisher's website. For now, the following parts are most relevant:

• Chapter 1: models and lower bounds (includes the permutation lower bound)
• Chapter 2: basic external memory data structures (includes stacks, queues, linked lists, B-trees)
• Chapter 3: a survey of techniques for designing I/O-efficient algorithms (includes scanning, sorting, time-forward processing, deterministic computation of maximal independent sets, list ranking, and Euler tours)
• Chapter 4: Elementary graph algorithms in external memory (includes BFS, DFS, and SSSP in directed and undirected graphs)
• Bibliography and index

[Sleator]
D. D. Sleator and R. E. Tarjan,
Amortized efficiency of list update and paging rules,
Communications of the ACM (1985) 28(2):202-208
Available from publisher.

[Zeh]
N. Zeh,
I/O-Efficient Graph Algorithms,
Lecture notes of EEF Summer School on Massive Data Sets, Aarhus, 2002
Available from author's website.

## Required prior knowledge

Students coming from the TU/e computer science bachelor programme must have completed Algorithms (2IL15) successfully.

If you got your prior education elsewhere, we will need to check if you have the proper prior knowledge to be able to complete the I/O-efficient algorithms course successfully. We will expect that you have experience with the following topics: incremental algorithms, divide and conquer, greedy algorithms, proofs with induction and invariants, O-notation and recurrences, sorting, search trees, depth-first and breadth-first search, minimum spanning trees, and shortest path algorithms. Experience with randomized algorithms is recommended.

To be able to assess whether there are serious deficiencies in prior knowledge, students are asked to make a prerequisites test. Here are the right answers to the test, including recommendations for further reading in case you got them wrong. Use username "io" and password "io" to access the file.

If you have insufficient prior experience with algorithms, you will first need to take Algorithms (2IL15), Advanced Algorithms (2IL45), or Geometric Algorithms (2IL55); to be able to take these courses you might need Data Structures (2IL05) first. (All of these courses are taught in English.)