Related sites: Algorithms Group

I/O-efficient algorithms (2IL35)

 Introduction    Organisation    Schedule    Literature    Prerequisites    Grading

Instructor: Herman Haverkort (office: HG 7.35)
Course dates: February to June 2012

(Registration for this course is now closed. Students who did not register are welcome to our similar course next year.)

 Update 15 June: added the slides of the last three lectures.
 Update 11 June: added the final homework assignment, deadline 8 July.
 Update 17 February: included most of the lecture schedule, most of the literature that needs to be studied, the answers to the prerequisites test (use username "io" and password "io" to access the file), and some "small print" at the bottom (policies regarding attendance, grading, deadlines etc.).

## Introduction

### Contents

Traditionally, the running time of an algorithm is analysed in terms of the number of computational steps it performs (comparisons, additions, etc.). However, when working with huge amounts of data that do not fit in main memory, the number of disk accesses is much more important for the running time than the number of computational steps. Even when the data fit in main memory at once, a significant part of the running time of an algorithm that works on large amounts of data is determined by the way in which the algorithm exploits the cache: data that is cached can be accessed much faster than data that resides in main memory. Therefore it is important to design algorithms for large data sets in such a way, that the data needed next is always likely to be data recently brought into cache.

In this course we study so-called I/O-efficient algorithms: algorithms and data structures that are designed in order to keep the number of disk accesses or cache misses as small as possible. We discuss models for the design and analysis of such algorithms, basic algorithms, data structures and paradigms, and applications where I/O-efficiency makes the difference between algorithms that work and algorithms that do not work in practice. Furthermore, we will see I/O-efficiency at work in some simple experiments with huge data sets.

### Goals

The course will make the students familiar with the basic concepts and techniques for the design of I/O-efficient algorithms and data structures. The students will learn to recognize I/O-bottlenecks in algorithms, learn basic techniques to alleviate such bottlenecks, and will get an idea of the orders of magnitude of the parameters involved. The students will practice finding, reading, interpreting and explaining advanced literature on the topic. Furthermore the course offers opportunities for students to practice their presentation skills.

## Organisation

### Educational form

The course is taught in a mixed form. Some lectures are given by the instructor; most lectures are given by the students, who will also run a number of experiments.

The students collaborate in triples to prepare their lectures. Each triple gets assigned a particular topic and has to give a lecture of 2 x 45 minutes. Links to basic literature for the lectures are provided by the instructor. The students are expected to look up additional literature when necessary.

The instructor will advise the students on the preparation of their talks. The students will meet with the instructor at least two weeks before the lecture for advice on its contents, and at least one week before the lecture to discuss (the draft version of) the presentation. Each student is expected to spend about 40 hours on preparing the lecture. For some general advice on how to study the material and on how to prepare your talk, see How to give a seminar talk.

The students will also collaborate in (and also across) triples to study possible solutions for problems where I/O-efficiency is crucial. The problems studied are abstracted versions of fundamental problems occurring in real-life applications. The students will do experiments with different solutions and write short papers about them. Near the end of the course, the students review each other's papers and present their results to each other.

There is a homework assignment at the end of course.

### Registration

Registration for this course is now closed.

## Schedule

All meetings for this course---except one on 5 April---are on Fridays, from 15:45 to 17:30 in the lecture hall of Gemini-Zuid.

datetimeroomtopicliteraturelecturer(s)PC time
Fri10Feb15:45Gem-ZIntroductionMeyer: sections 1 (except 1.5), 2 (intro), 2.1.1, 3.1-3.2; Demaine: section 2; slidesHerman HaverkortN/A
Sun12Feb23:59deadline for prerequisites test
Fri17Feb15:45Gem-ZBasics: stacks, queues, sorting, caching policiessee 1 Feb; in addition Duartes; Sleator, section 4; slides/notesHerman HaverkortN/A
Fri24Febno meeting (carnival vacation)N/A
Fri 2Mar15:45Gem-ZChallenges: graph traversals, permutations, list rankingMeyer: sections 1.5, 4.1, 4.2; Zeh: sections 2, 3; slides/notesHerman Haverkort27/2-4/3 Team ABM
Fri 9Mar15:45Gem-ZBreadth-first searchMeyer: sections 4.2, 4.3; and/or Zeh: section 6; slides directed BFS, slides undirected BFSTeam PQS5/3-11/3 Team ABM
Fri16Mar15:45Gem-ZOrdered lists and search trees; report by team ABM;Meyer: sections 2.1.2, 2.3 (intro), 2.3.1-2.3.4; Demaine: section 4.1 slides/notesHerman Haverkort12/3-18/3 Team ABM
Fri23Mar15:45Gem-ZPersistent B-trees and interval treesArge (2): sections 4, 6; slides persistent B-trees; slides internal interval trees; slides external interval treesTeam BNW19/3-25/3 Team MNW
Fri30Marno meeting (open door day)26/3-1/4 Team LMP
Thu 5Apr15:45Aud 1Priority queues, shortest paths and DFSMeyer: sections 4.4-7; and/or Zeh: sections 6.1, 7; slides DFS; slides SSSPTeam AT2/4-8/4 Team BNW
Fri13Aprno meeting (mid-semester break)9/4-15/4 Team LMP
Fri20Aprno meeting (mid-semester break)16/4-22/4 Team AT
Fri27Apr15:45Gem-ZPriority queues, time-forward processing and plane sweepingMeyer: sections 3.4, 3.5 or Zeh: section 4; Arge (1): sections 1-4; De Berg: section 2.1; slidesTeam LMP23/4-29/4 Team AT
Fri 4May15:45Gem-ZShortest paths on grids and planar graphsZeh: section 8; slidesTeam HLS30/4-6/5 Team GGT
Fri11Mayno meeting (lecturers ill)7/5-13/5 Team GGT
Fri18Mayno meeting (university closed on the day after Ancension Day)14/5-20/5 Team PQS
Fri25May15:45Gem-ZOperations on sparse matricesBender: sections 1-4, 7; slidesTeam GGT21/5-27/5 Team HLS
Fri 1Jun15:45Gem-ZData structures for points in the planeArge (2): section 8.2.1; Beckmann; Kamel; Van den Bercken; slidesTeam MNW28/5-3/6 Team MNW
Fri 8Jun15:45deadline almost-final (reviewable) reports on experiments with sorting & permutation, BFS, DFS, and time-forward processing4/6-10/6 Team BNW
15:45Gem-ZTopological sortingArge (3): sections 1-2; Ajwani; slidesTeam ABM
Fri15Jun15:45deadline almost-final (reviewable) reports on experiments with connected components, matrix operations, search trees, and text search structures11/6-17/6 Team PQS
15:45deadline reviews on experiments with sorting & permutation, BFS, DFS, and time-forward processing
15:45Gem-ZDiscussing experiments with sorting & permutation, BFS, DFS, and time-forward processing Teams ABM, LMP, AT, GGT
Fri22Jun15:45deadline reviews on experiments with connected components, matrix operations, search trees, and text search structures
15:45Gem-ZDiscussing experiments with connected components, matrix operations, search trees, and text search structures Teams HLS, BNW, MNW, PQS
Sun8Jul23:59deadline homework assignment

At the end of each meeting, some time may be reserved to discuss the progress or results of early experimentation projects.

Note that attending all meetings is mandatory. In case this is a problem, contact Herman Haverkort in advance.

The teams and their assignments are as follows:

teammemberslectureexperimentation projectcomputer time
ABMAlex, Bauke, MartinTopological sorting (8/6)Sorting & Permutationuntil 18/3
ATAde, TengDepth-first search etc. (5/4)Depth-first search16/4-29/4
BNWBram, Nicolas, WilcoPersistent B-trees etc. (23/3)Matrix operations2/4-8/4; 4/6-10/6
GGTGijs, Gijs, TwanSparse matrices (25/5)Time-forward processing30/4-13/5
HLSHoisun, Luc, StefPlanar graphs (4/5)Connected components21/5-27/5
LMPLourens, Marlous, PieterTime-forward processing etc. (27/4)Breadth-first search26/3-1/4; 9/4-15/4
MNWMarcel, Niels, WouterData structures for points (11/5)Search trees19/3-25/3; 28/5-3/6
PQSPaul, Quirijn, SanderBreadth-first search (9/3)Text search structures14/5-20/5; 11/6-17/6

## Literature

[Ajwani]
D. Ajwani, A. Cosgaya-Lozana, and N. Zeh,
Engineering a topological sorting algorithm for massive graphs,
In International Workshop on Algorithm Engineering and Experiments (ALENEX) 2011, p139-150
Available from publisher; local copy.

[Arge (1)]
L. Arge,
The Buffer Tree: a technique for designing batched external data structures,
Algorithmica (2003) 37:1-24
Available from publisher; local copy.

[Arge (2)]
L. Arge,
External memory geometric data structures,
Lecture notes, 2005
Available from author's website; local copy.

[Arge (3)]
L. Arge and L. Toma,
Simplified external memory algorithms for planar DAGs.
In Proc. 9th Scandinavian Workshop on Algorithm Theory (SWAT) 2004, LNCS 3111:493-503
Available from publisher; local copy.

[Beckmann]
N. Beckmann and B. Seeger,
A revised R*-tree in comparison with related index structures,
In Proc. 35th ACM SIGMOD Int. Conf. on Management of Data (SIGMOD) 2009, p799-812
Available from publisher; local copy.

[Bender]
M. A. Bender, G. Stølting Brodal, R. Fagerberg, R. Jacob, and E. Vicari,
Optimal sparse matrix dense vector multiplication in the I/O-Model,
Theory of Computing Systems (2010) 47(4):934-962
Available from publisher; local copy of authors' manuscript.

[Van den Bercken]
J. van den Bercken and B. Seeger,
An Evaluation of Generic Bulk Loading Techniques,
In Proc. 27th Int. Conf. on Very Large Data Bases (VLDB 2001), p461-470
Available from publisher; local copy.

[De Berg]
M. de Berg, M. van Kreveld, M. Overmars, and O. Cheong (a.k.a. Schwarzkopf),
Computational geometry: algorithms and applications,
2nd edition, Springer, 2000.

[Demaine]
E. Demaine,
Cache-Oblivious Algorithms and Data Structures,
Lecture notes of EEF Summer School on Massive Data Sets, Aarhus, 2002
Available from the author's website; local copy.

[Duartes]
G. Duartes,
What your computer does while you wait,
blog post, 30 Nov 2008;

[Kamel]
I. Kamel and C. Faloutsos,
Hilbert R-tree: An improved R-tree using fractals,
In Proc. of 20th Int. Conf. on Very Large Data Bases (VLDB) 1994, p500–509
Available from publisher; local copy.

[Meyer]
U. Meyer, P. Sanders, and J. Sibeyn (eds.),
Algorithms for Memory Hierarchies,
Lecture Notes in Computer Science 2625, Springer, 2003.
The whole book can be downloaded from the publisher's website. For now, the following parts are most relevant:

[Sleator]
D. D. Sleator and R. E. Tarjan,
Amortized efficiency of list update and paging rules,
Communications of the ACM (1985) 28(2):202-208
Available from publisher; local copy.

[Zeh]
N. Zeh,
I/O-Efficient Graph Algorithms,
Lecture notes of EEF Summer School on Massive Data Sets, Aarhus, 2002
Available from author's website; local copy.

## Required prior knowledge

Students coming from the TU/e computer science bachelor programme must have completed Algorithms (2IL15) successfully.

If you got your prior education elsewhere, we will need to check if you have the proper prior knowledge to be able to complete the I/O-efficient algorithms course successfully. We will expect that you have experience with the following topics: incremental algorithms, divide and conquer, greedy algorithms, proofs with induction and invariants, O-notation and recurrences, sorting, search trees, depth-first and breadth-first search, minimum spanning trees, and shortest path algorithms. Experience with randomized algorithms is recommended.

To be able to assess whether there are serious deficiencies in prior knowledge, students are asked to make a prerequisites test. Here are the right answers to the test, including recommendations for further reading in case you got them wrong. Use username "io" and password "io" to access the file.

If you have insufficient prior experience with algorithms, you will first need to take Algorithms (2IL15), Advanced Algorithms (2IL45), or Geometric Algorithms (2IL55); to be able to take these courses you might need Data Structures (2IL05) first. (All of these courses are taught in English.)

## Grading

Your grade will essentially be determined by adding up grades for:
• your presentation
• your experiments (and your report(s) about them)
• your review of another group's report
• your participation in class (which basically boils down to: the number of questions you asked in class)
• your homework assignment

All of the above assignments (presentation, experiments, review, your presence in class, homework) are mandatory. If you skip an assignment, your final grade will be a five or less.

Here is the small print for this course:

• Attending all meetings is mandatory. Absence without approval after prior notice: 1 point off your grade each time.
• The student is responsible for handing in assignments in time; the student is responsible for taking proper precaution against unexpected circumstances such as illness, broken laptops, network failures, other deadlines etc., by not doing things in the very last moment. That means that I will not accept your late assignment if you did not contact me at least three days before the deadline about the delay and with a reasonable suggestion for a solution. Depending on the consequences for the course as intended, handing in late may reduce your grade substantially.
• I will report any suspicions of fraud to the exam committee. If I can prove it (for example: if I find a fragment of more than two sentences of your work in an older source on the web while you did not indicate that it was a quote), your final grade will be a 1, independent of any additional measures taken by the department.