RobIn for IOI I/O
===== === === ===

by Tom Verhoeff (Chair IOI Scientific Committee)

July 2002, Version 0.7 (incomplete draft)
Documents RobIn Version 0.7

Text bracketed by <TBC: ...> is still To Be Completed.


Contents
--------
  Introduction
  RobIn Feature Summary
  Historical Intermezzo
  I/O in (Turbo/Free) Pascal and (Turbo) C/C++
  IOI-Style Meta-Format Restrictions
  User Requirements for RobIn
  How to Use RobIn
  How to Interpret RobIn Messages
  Design and Implementation of RobIn
  Limitations of RobIn
  Other Tools
  Concluding Remarks
  Acronyms
  Sample Programs


Introduction
------------
RobIn is a software package featuring a FreePascal Unit for ROBUST
INPUT from text files with an IOI-style format (IOI = International
Olympiad in Informatics).  RobIn was originally developed for
Turbo Pascal.

In this document, I provide some motivation for RobIn and I explain
its use and design.  Just in case you want to see it immediately,
the main implementation source file is named "RobIn.pas", its usage
is explained in the section "How to Use RobIn".  "Example.pas"
and "ValidEx.pas" are programs that illustrate how to use RobIn.

RobIn simplifies the development (in FreePascal) of

  *  /input validators/, viz. programs that check the validity of the
     input files produced by the organizers, and

  *  /output checkers/, viz. programs that check the format and/or
     correctness of output files produced by the competitors or
     their programs.

The description of an IOI programming task defines the FORMAT of every
input and output file involved.  The IOI organizers are responsible for
producing valid input files.  The competitors may assume that the
input files are valid, that is, their programs may crash on invalid
input without being penalized for it.  Likewise, the competitors are
responsible for delivering programs that produce correct output files.
When grading the programs, format restrictions are enforced
in a more relaxed way for competitor-generated output files than for
organizer-generated input files.

In the IOI competition, the file formats are kept simple because

  *  the IOI focuses on algorithms and not on I/O, which serves
     merely as an interface to the algorithm;

  *  the format must be easy to read and write by (1) ALL of the
     IOI programming languages, and (2) by the competitors
     (the latter mainly through editors, e.g. built into the IDE for
     each programming language).

The competitor programs must read the input files and must write the
output files.  We do not want competitors to waste their time on programming
input parsers, when the task is really about some complex algorithm.
The competitors most likely must also somehow put together various input
files for testing and they must also inspect the resulting output files.
This is often done "by hand" through an editor and dumping to the screen,
or even printing on paper.

The organizers of an IOI have the responsibility to define file formats
and to verify them.  Developing an input validator and an output checker
forces one to think carefully about the format involved.  This turns
out to be a great help in formulating the format definitions.

Development of programs for input validation and output checking
must not be underestimated.  Such programs must be prepared to handle
arbitrary data in these files and they should not crash but instead
report back any problems encountered in a clear fashion.

In this document, we often use the term INPUT FILE for files that are
to be read by competitor programs and that are created by (tools of)
the organizers.  We use the term OUTPUT FILE for files that are to be
created by competitors or their programs, and which are read by (tools
of) the organizers.  Both kind of files are input files for RobIn.


RobIn Feature Summary
----- ------- -------
RobIn reads and converts text file content, while checking

  *  file I/O for errors
  *  library calls for correct usage (parameters, call order)
  *  format definition for adherence to meta-format rules
  *  file content for correct format (incl. range check)
  *  task-specific conditions through AssertRIF and IsAbortedRIF

RobIn reports diagnostic messages in a log file:

  *  Diagnostic messages are numbered and taken from
     message definition files to simplify translation
  *  Level of diagnostic detail can be tuned
  *  Statistics are collected (per file processed and overall)
  *  Precise feedback is provided in terms of
       o  coordinates within the format being processed
       o  coordinates within the file being read


Historical Intermezzo
---------- ----------
Traditionally, the algorithms developed by the IOI competitors are
delivered as executable programs (since IOI2001 as source code) and they
are graded by executing them for a (necessarily limited) set of test cases.
They are not graded by inspecting the algorithm designs or program texts.
In the scientific and engineering world, it has long been known that
execution-based testing is a very poor way of measuring the performance
characterictics of algorithms (characteristics such as correctness,
speed, and memory usage).

Still, there are some (compelling?) reasons for execution-based
testing at the IOI:

  *  the language barrier,

  *  lack of agreement on alternatives, and

  *  the engineering principle that executable products need to be
     demonstrated by execution.

At the IOI, the competitors are high-school students and they are not
expected to be capable of adequate technical communication with
foreign organizers and evaluators, neither written nor oral.  This
is known as the language barrier.  Therefore, the tasks are translated
to the competitor's native language before the competition.  At the
IOI, a translation step in the opposite direction, from competitor
to organizer, is avoided by requiring the delivery of machine-executable
products.  (By the way, I still think the IOI should look for ways of
expanding the evaluation process to include non-execution-based
methods of verification.)

Another reason for execution-based testing at the IOI is that it
apprears hard to agree upon alternatives.  What constitutes a
design of an algorithm?  How should such a design be presented and
judged?  We have no international notation for algorithms, like they
do in mathematics.  (Though we should be able to get a long way;
after all, the delegation leaders have little trouble discussing
the tasks, and possible approaches and obstacles to solutions.)

Finally, a correctness proof alone is not good enough for an executable
product.  Obviously, we need to see the thing "fly" to believe it.
(In my opinion, the IOI hovers dangerously between theoretical computer
science and software engineering.)

For execution, the algorithms somehow need to be provided with various
input parameters, and the resulting output needs to be observed.  Since
IOI'94 in Sweden, such I/O is done through text files.  All input
parameters are offered in a text file, even if this concerns just a
single integer.  All output must be written to a text file.  The text
file interface provides a convenient abstraction from the programming
language used to express the algorithm.  Note that the IOI has always
allowed a choice among multiple programming languages (for lack of
agreement on a single language).

Alternatives to (the direct use of) text files are

   *  task-specific libraries that competitor programs must use,

   *  tasks that ask for parameterized routines instead of programs.

The organizers of an IOI can develop special-purpose libraries
to encapsulate I/O.  Such a library provides a means of I/O on
the language level: variables or parameters with native types
can be directly used; there is no need for formatting and conversions.
This use of libraries is now common practice for reactive programming
tasks.  There is still the issue of restrictions on the use of
the library (its interface definition), in particular, also
restrictions on the calling order of library routines.

Another alternative is to require that the competitors develop a
routine with input and output parameters in the programming language
of their choice.  For evaluation, the routine is embedded into an
evalution framework to produce an executable program.  You might
say that here the competitors develop a library consisting of a single
routine.  This may work better in the case of source code submission
than submission of object files.

The major disadvantage of both alternative approaches is that the
details of the interface now depend on the programming language.
Even with only two olympic programming languages, this is quite
a burden on the organization.  (But this does not stop us from
using it for reactive tasks.)

It seems to me that the current situation at the IOI reflects the
current state of computer science in general quite well.  We have
come a long way, but we are not "there" yet.


I/O in (Turbo/Free) Pascal and (Turbo) C/C++
--- --  ----------  ------ ---  -----  -----
Until IOI2000, the IOI programming languages were Turbo Pascal 7.0 and
Turbo C/C++ 3.0.  Since IOI2001, we have moved to other Pascal and C/C++
compilers, namely

  *  FreePascal (www.freepascal.org)

  *  GCC (Gnu Compiler Collecion: www.gnu.org/software/gcc/gcc.html).

As far as I can tell, this change in compilers has had little effect on
the I/O issues.

Another change since IOI2001 is the introduction of the Linux platform
for grading execution of submitted programs.  Competitors may develop
programs under either Linux or Windows.  One effect of this move is
that example input and output files must be prepared for both platforms,
because of differing end-of-line conventions.  However, the compilers
provide a proper abstraction of these I/O details.  Hence, the
platform has had little influence on the I/O aspects of programming.
Competitors will only notice the difference if they do I/O in a way that
bypasses the abstraction mechanism.

Let me consider some of the subtleties of text I/O in these languages,
and in particular text input.  First, some general remarks about text
files on the Unix/Linux and MS-DOS/Windows platform are necessary.

ASCII Character Set

The character set used for Unix/Linux and MS-DOS/Windows text files is
Extended ASCII (256 8-bit codes).  Of these, the "Standard" ASCII codes
(128 7-bit codes from 0 through 127, also called Low ASCII) include

   *  33 control codes (0 through 31, and 127),
   *  1 blank (SP, space, code 32),
   *  94 printing characters (Latin uppercase and lowercase letters,
      digits, and various symbols, with codes 33 through 126).

Of the control codes, the following are worth mentioning:

   BS   Backspace,            Ctrl-H, ^H, code   8, '\b'
   HT   Horizontal Tab, Tab   Ctrl-I, ^I, code   9, '\t'
   LF   Line Feed, New Line   Ctrl-J, ^J, code  10, '\n'
   VT   Vertical Tab          Ctrl-K, ^K, code  11, '\v'
   FF   Form Feed, New Page   Ctrl-L, ^L, code  12, '\f'
   CR   Carriage Return,      Ctrl-M, ^M, code  13, '\r'
   SUB  Substitute character, Ctrl-Z, ^Z, code  26
   ESC  Escape,               Ctrl-[, ^[, code  27
   DEL  Delete, Rub out,                  code 127

The extended codes (128 through 255, also called High ASCII) cover
adorned letters (with various accents for non-English languages), some
graphics and special symbols.

On the MS-DOS/Windows platform, the two-character combination CR/LF
(codes 13, 10) encodes the transition to a new line (end-of-line in Pascal
parlance, newline in C terminology).  Note that under Unix/Linux, this is
accomplished by a single LF, whereas under the MacOS, a single CR is used.
(On the platform-less web you can often see the disastrous consequences
of these differences.)

Under MS-DOS/Windows, the end of a text file is encoded by SUB (Ctrl-Z,
code 26).  Under Unix/Linux, the end of a text file is not encoded by
a special character, but is determined from its length.

The other control codes, with the possible exception of the Tab
(HT, Ctrl-I, code 9), nowadays have no generally accepted role in
text files.  The Tab is still used, but this is problematic, because
there is no standardized way to define tab settings in a text file.
A common default is to assume tab settings every 8th character position,
but many programmers do not configure their editor that way.

The characters LF (or CR/LF under MS-DOS/Windows), HT (tab), and SP
(blank) are collectively known as /whitespace/.  (Note: In C, also
vertical tab and form feed are treated as whitespace, but we do not.)

Text file I/O in FreePascal

Here are some FreePascal text I/O details (inherited from Turbo Pascal,
hence labeled TP below).  For convenience, we assume the following
variable declarations:

  var f: Text; c, d: Char; i: Integer; s: String;


  TP01. FreePascal does not implement the file window (in ISO Pascal
        denoted by f^) to look one character ahead into the input stream.
        It also lacks the get and put routines.  All I/O must be done
        through the (standard) eof/eoln, read/readln, and write/writeln
        routines.

  TP02. Reading a single character c by read(f,c) does NOT skip leading
        whitespace.  Leading blanks can be skipped, for example, by
        "repeat read(f,c) until c<>' '".  A single leading blank can be
        skipped by "read(f,d,c)" and ignoring d.  If the file window
        would have been implemented (cf. TP01), then leading blanks
        could have been skipped by "while f^=' ' do get(f)".

  TP03. Reading a single character when eoln(f) is true, returns chr(10)
        (=LF) under Unix/Linux, and chr(13) (=CR) under MS-DOS/Windows,
        instead of ' ' as required by the ISO Pascal Standard.
        Furthermore, under MS-DOS/Windows, the next character read after
        CR will be chr(10) (=LF).  Note that a single call readln(f)
        consumes both CR and LF.

  TP04. Reading a single character when eof(f) is true, does NOT raise
        a run-time I/O error (as suggested by the ISO Pascal Standard).
        Under MS-DOS/Windows, it returns chr(26) (^Z).  Under Unix/Linux,
        it is not so clear what happens (I have seen chr(26)
        returned, but also other behavior).

  TP05. Reading an integer by read(f,i) will skip all leading whitespace
        (blank, tab, end-of-line).  The read then consumes ALL subsequent
        non-whitespace characters until another whitespace character is
        encountered.

        A run-time I/O error (106: invalid numeric format) is raised
        when the non-whitespace characters violate the (Pascal) numeric
        format.  Trailing whitespace (incl. end-of-line) is never
        consumed, unless calling readln(f,i), which is equivalent to
        "read(f,i); readln(f)".

  TP06. Reading a variable-length string by read(f,s) does NOT skip
        leading whitespace.  Leading blanks are harder to skip than for
        characters (TP02), because the file window is lacking (TP01).
        If the file window would have been implemented, then leading
        blanks could have been skipped by "while f^=' ' do get(f)".

        It is, of course, possible (but awkward) to read the string with
        leading whitespace and to trim it afterwards:
        "while (s<>'') and (s[1]=' ') do Delete(s,1,1)".
        Note, however, that the string size must be large enough to
        accommodate all such whitespace.

  TP07. The reading of characters for a variable-length string by
        read(f,s) is stopped by the maximum length of the string variable
        s or the first end-of-line encountered.  Note that trailing
        whitespace on a line will be read into the string.

        The maximum length for a variable-length string is 255
        characters.  A trailing end-of-line is never consumed,
        unless calling readln(f,s).

  TP08. Readln(f) consumes upto and including the next end-of-line.
        Under Unix/Linux, this is the next LF.  Under MS-DOS/Windows,
        this is the next CR, and subsequent LF if present.

  TP09. If the file's last line is not terminated by an end-of-line,
        then eof(f) and eoln(f) become true simultaneously, and NOT,
        as the ISO Pascal Standard requires, first eoln(f) and after
        readln(f) then eof(f).

        Under the Pascal Standard, a text file consists of a sequence
        of zero or more /lines/, terminated by /end-of-file/ (recognized
        by function eof), and each line consists of a sequence of zero
        or more /characters/, terminated by /end-of-line/ (recognized by
        function eoln).  Thus, the only case where an end-of-file is
        not preceded by an end-of-line occurs in an empty file.

        Also, according to the Pascal Standard, eoln should not be
        called when eof returns true.  The subtle difference between
        FreePascal and the Pascal Standard is illustrated by the
        two character-and-line counting programs in the appendix
        "Sample Programs".

  TP10. Under Unix/Linux, writeln(f) produces LF.  Under MS-DOS/Windows,
        writeln(f) produces CR followed by LF.

  TP11. Under MS-DOS/Windows, closing an output file appends the ending
        chr(26) (=Ctrl-Z).

Text files and editors

Here are some details concerning the Turbo Pascal Editor (built into
the Turbo Pascal IDE; the built-in Turbo C/C++ editor is very similar:
<TBC: check if this also applies to the FreePascal IDE and RHIDE>

  IDE1. It is NOT possible to add blanks at the end of a line.
        (Hit the END key to find the "real" end of a line.)

  IDE2. When opening a file, any blanks at the end of lines are
        silently discarded (and lost when the file is saved).

  IDE3. The last line (accessible by the cursor) is NOT written
        with a terminating end-of-line when saving the file.

  IDE4. When opening a file with its last line (properly) terminated
        with an end-of-line, it will show up as a trailing empty line
        (accessible by the cursor).

  IDE5. Fixed tab stops are set every 8 positions in columns 1, 9, 17,
        etc.  Hitting the TAB key will insert an "appropriate" number
        of blanks.  To enter a real Tab (TAB character) in the file,
        use the sequence ^P^I (Ctrl-P, Ctrl-I).  Tabs in a file are
        visually not directly discernable from blanks.

  IDE6. Control characters (other than HT, CR/LF) are shown as "funny"
        graphics symbols.

Text file I/O in GCC

Now let me deal with some GCC C/C++ peculiarities (formerly via Turbo
C/C++, hence labeled TC below).  First, we deal with I/O in C.  This is
agreement with the ISO/ANSI C Standard.  For convenience, we assume the
following variable declarations:

  #include <stdio.h>
  FILE *f; char c; int i; char s[256];


  TC01. Formatted input can easily be done through fscanf with a format
        string and a sufficient number of result arguments (or via fgets
        followed by sscanf).  We consider fscanf in the following.

  TC02. Reading a single character by fscanf(f,"%c",&c) does NOT skip
        leading whitespace by itself.  All leading whitespace can be
        skipped by including a blank directly preceding the format
        specifier: fscanf(f," %c",&c).

  TC03. Reading a single character (format "%c") when end-of-line is
        ahead, will return LF (code 10, '\n'), and it skips both CR
        (present under MS-DOS/Windows) and LF.  Note that putting "\n"
        in the format string will skip ALL whitespace at that point.

  TC04. Function fscanf will NOT read the end-of-file character, instead
        it will return a count less than the number of arguments to read,
        indicating premature termination.

  TC05. Reading an integer by fscanf(f,"%d",&i) will skip all leading
        whitespace; the read consumes only characters that agree with
        the decimal integer syntax and stops at the first character that
        does not match.

  TC06. Reading a variable-length string by fscanf(f,"%s",s) will skip
        all leading whitespace.  This is not the case when using gets(s)
        or fgets(f,s,i).  Note that s should be big enough to accommodate
        all characters read AND the terminating null '\0'.  Therefore
        it is recommended to use fscanf(f,"%255s",s) when s can
        accommodate 256 characters (255 non-nulls).

  TC07. Reading a variable-length string is stopped after reading all
        (at least one) non-whitespace characters, by the first
        whitespace character (which is not consumed).  In particular, it
        will never skip trailing end-of-lines, and the returned string
        will never include blanks and will never be empty.

        Again, this is not the case for gets and fgets.  (A subtle
        difference between gets and fgets is that the latter will put
        the terminating newline '\n' into the string returned, whereas
        gets reads the newline and discards it.

  TC08. Reading with fscanf(f,"\n") skips all whitespace at that point,
        including end-of-lines, but also blanks and tabs.  Therefore,
        this is NOT equivalent to readln in Pascal.

  TC09. End-of-lines are faithfully reported as they appear in the file,
        as character '\n'.  The end of the file may, but need not, be
        preceded by an end-of-line.  Using fscanf it is not possible to
        distinghuish between end-of-file, an I/O error, or a format error.

  TC10. Calling fprintf(f,"\n") produces LF under Unix/Linux, and
        CR followed by LF under MS-DOS/Windows.

  TC11. Under MS-DOS/Windows, closing an output file appends the ending
        Ctrl-Z.

Under C++, the default settings for I/O give rise to the following
behavior.  We assume the following declarations

  #include <fstream.h>
  ifstream inp("task.in");
  ofstream out("task.out">;
  char c; int i; char s[256];

  TC12. By default, leading whitespace is always skipped when reading.

  TC13. A character is read by "inp>>c;".  By default, a non-whitespace
        character is returned.

  TC14. An integer is read by "inp>>i;".  Reading stops at the first
        character not in agreement with the syntax of a decimal integer.

  TC15. A variable-length string is read by "inp>>s;".  Reading stops
        at the first whitespace character.  A terminating '\0' is
        always appended to s.  By default, a non-empty string is
        returned.

  TC16. An end-of-line is written by "out<<'\n';" or "out<<endl;".
        The latter also flushes the stream.

  TC17. A stream is automatically closed when it is destroyed.
        It can also be closed explicitly by "out.close();".

Anecdote

Concerning output, here is an anecdote.  In a competition task,
the output format required a text file with an integer N (1<=N<=100)
on the first line, and on the second line a list of N integers in the
range 0..9, separated by single blanks (no blank before the first item).
This format is in agreement with the IOI-style meta-format rules given
below.  There are various ways to produce the output.  Below is some
FreePascal code to put on the dots (...) in the following framework:

  program NumberList;

  var N: 1..100; L: array [1..100] of 0..9;
  ...
  begin
    "produce N and L"
  ; writeln ( N )
    ...
  end.

My preferred solution:

  var i: 1..100; sep: String; { or sep: String[1] if you like }
  ...
  ; s := ''
  ; for i := 1 to N do begin
      write ( s, L[i] )
    ; sep := ' '
    end { for i }
  ; writeln

But you can also get away with

  var i: 1..100;
  ...
  ; for i := 1 to N do begin
      if i > 1 then write(' ')
    ; write ( L[i] )
    end { for i }
  ; writeln

or

  var i: 1..100;
  ...
  ; for i := 1 to N do begin
      write ( L[i] )
    ; if i < N then write(' ')
    end { for i }
  ; writeln

Here are three incorrect fragments (to be enclosed by "var i: 1..100;"
and "; writeln"):

  ; for i := 1 to N do write ( L[i] : 2 )

  ; for i := 1 to N do write ( ' ', L[i] )

  ; for i := 1 to N do write ( L[i], ' ' )

All of them are incorrect because they produce one additional blank.
The first two at the line begin, and the third one at the line end.
An additional weakness of the first fragment, is that changing the range
of the elements in the list L, say from 0..9 to -1..10, will cause
-1 and 10 to miss a preceding blank.  Note also that the output of
the third approach appears to be good when viewed in a Turbo editor
(because that editor discards blanks at the end of a line), or
when dumped to the screen or printed on paper.

At IOI'95 it happened that (at least) one competitor had opted for
the third approach.  This person apparently was bothered by the
trailing blank and changed the finishing writeln into writeln(chr(8)),
that is, included a backspace (BS) "to erase the superfluous blank".
When dumped to the screen, the output does indeed look good.  The IOI'95
evaluation software, however, did not accept this, because it rejected
control codes other than end-of-line and end-of-file.  The IOI'95
evaluation software would have ignored the trailing blank, because
the output format was not strictly enforced (knowing that competitors
make stupid I/O mistakes)!


IOI-Style Meta-Format Restrictions
--------- ----------- ------------
The restrictions on the format of text files given below are inspired
by the language peculariaties explained in the preceding section.
They are intended to make I/O easy for the built-in routines of
the IOI programming languages, and for human use.

Restrictions on output files can be less strict than on input files,
because the competitors (develop the programs that) write them
and the organizers must read them after the competition.

However, it is encouraged that the format of output files adheres
to the same restrictions as input files, so as to simplify the
processing of these output files by competitors as well, in case
they want to apply their own tools to it.

The format of input and output text files is restricted by the
following general /meta-format rules/:

  MF01. An IOI text file consists of a sequence of ONE or more
        NONempty /lines/.  The sequence is terminated by end-of-file
        (Ctrl-Z under MS-DOS/Windows).

  MF02. Each line consists of a sequence of ONE or more /items/.
        This sequence is ALWAYS terminated by an end-of-line
        (LF under Unix/Linux, CR/LF under MS-DOS/Windows),
        in particular also the LAST line.

Note that both MF01 and MF02 state that text files do not contain
empty lines.  In output files, however, empty lines could be tolerated.
MF02 also states that every line, including the last line, is terminated
by an end-of-line marker.  In output files, a missing end-of-line on
the last line could be tolerated.  (Cf. IDE3, IDE4)

  MF03. Each item is one of the following:

          /character item/: A single printable NON-blank character
             (codes 33 through 126).  (Cf. TC02)
             It is discouraged to use characters other than letters
             or digits.
             It is also discouraged to mix upper and lower case letters.

          /integer item/: An integer number in decimal notation,
             possibly negative (preceded by '-'), no leading '+',
             no redundant leading zeroes, at most 32 bits.
             <TBC: Should this be increased to 64 bit?>

             In output files, a leading '+' and redundant leading
             zeroes could be tolerated.

          /string item/: a NON-empty variable-length string of
             at most 255 printable NON-blank characters.  (Cf. TP07, TC07)

             Variable-length here means that the length possibly
             cannot easily be determined from the format description
             or the preceding text in the file.

             It is discouraged to use characters other than letters
             or digits.
             It is also discouraged to mix upper and lower case letters.

Notes:

  *  Text files do not contain control characters, other than those
     needed to encode the end-of-lines and the end-of-file.

  *  An item never contains whitespace, in particular,
     it contains no blanks.

  *  Character and string items should not be abused to encode
     'ordinary' numeric values.

  *  Care should be taken when string items are real words in some
     natural language (e.g. English), because they must not be translated
     and they could well be meaningless to competitors.

  *  There is no real or float number item.  Reals or floats could
     still be treated as a sequence of characters or a string, but the
     appearance of floating-point arithmetic in tasks is frowned upon in
     the IOI.

  *  Fixed-length strings should be treated as variable-length
     strings.  In particular, competitors programs should be
     able to read fixed-length strings with the facilities
     for variable-length strings.

  MF04. Two adjacent items on the same line are either /contiguous/
        (that is, without any extra characters between them), or
        /separated/ from each other by a SINGLE blank.
        The definition of the format (in the Competition Rules
        and/or the Task Description) must clearly state the item
        composition convention.  Additional restrictions
        on item composition are given below.  (Cf. TP02, TP06)

        In output files, extra blanks between items can be tolerated.
        (In input files, extra blanks could be useful to align
        columns and thereby improve human readability, but for the IOI
        we will not do so.)

Note:

  *  Comma-separated values (CSV), and quoted strings are not allowed
     by MF04.

MF05 through MF12 concern restrictions on item composition in input files.

  MF05. A character item must be separated by a single blank from a
        directly preceding integer item on the same line.  (Cf. TP05)

  MF06. It is discouraged that a character item is preceded by a separating
        blank.  (Cf. TP02)

Note:

  *  When adhering to MF05 and MF06, an integer cannot precede a
     character, because in that case these requirements would conflict:
     MF05 requires a separating blank, whereas MF06 forbids it.

  MF07. An integer item is always separated by a single blank from a
        directly preceding integer item on the same line.
        Otherwise, the boundary cannot easily be determined,
        neither in Pascal, nor in C/C++.

  MF08. It is encouraged that an integer item is always separated by
        a single blank from any directly preceding item.
        Reason: Human readability.

        A character or string item contiguous before an integer may be
        confusing, especially if the range of allowed characters includes
        digits, or a plus or minus sign.

Note:

  *  In view of MF05, MF06, and MF08, it is encouraged that character items
     and integer items do not appear on the same line.

  MF09. A variable-length string item always appears at the end of a line.
        (Cf. TP07)

  MF10. A variable-length string item must be separated by a single blank
        from a directly preceding integer item on the same line.
        (Cf. TP05)

  MF11. It is discouraged that a variable-length string item is
        preceded by other items.  (Cf. TP06)

        In view of MF09, a string item preferably occurs on a line by
        itself.

  MF12a There are no blanks at the begin of a line.

        In output files, blanks before the first item on a line can
        be tolerated.

  MF12b There are no blanks at the end of a line.  (Cf. IDE1, IDE2)

        In output files, blanks after the last item on a line can be
        tolerated.  However, these extra blanks are less excusable than
        those under MF04 and MF12a.

  MF13. The format of input files must be such that the type of every
        item and the occurrence of every end-of-line and end-of-file can
        easily be deduced ---before they appear--- from the format
        definition or preceding text (possibly from another file).
        Explicit end-of-line or end-of-file checking should NEVER
        be necessary, not matter which IOI programming language one
        uses.  Below are some examples to illustrate this rule.
        (TC

        This requirement does not apply to output files (produced
        by competitors or their programs).  It may be an unnecessary
        burden to require that the number of items is first written to
        the output file.

  MF14. There are no tabs and other control codes (besides LF under
        Unix/Linux, or CR/LF and Ctrl-Z under MS-DOS/Windows)
        in text files.

Notes:

  *  Format definitions provided by the IOI organizers must respect the
     meta-format rules stated above, and all input files must be in
     /strict/ agreement with the promised input format.  When checking
     output files produced by the competitor programs in order to
     determine scores, it is strongly recommended that the format is
     not enforced strictly, but rather in a /relaxed/ manner.

  *  The Competition Rules for all IOIs up to IOI2000 have formulated
     some input format restrictions.  However, those formulations have
     mostly been LESS restrictive than the meta-format rules that I have
     given above.  And for one aspect MORE restricive: there has been
     a time when non-whitespace characters were resticted to letters
     and digits.  It is good practice to pay extra attention to the use
     of characters other than letters and digits, and avoid those if
     that is easy to do.

  *  It is easy for programs to read files whose format agrees with the
     meta-format rules formulated above.

     In Pascal programs that use read(...) for character, integer, and
     string items, and that do readln at every line end, no further
     attention needs to be paid to the line structure.  Of course,
     the read(x) for the last item on a line can be combined with the
     subsequent readln, yielding readln(x).  A readln occurring directly
     before the read of a character or string item must NOT be omitted,
     because whitespace is NOT skipped.  Other occurrences of readln,
     i.e. before integer and at the end of the file, can be omitted,
     because whitespace before integers is skipped.

     In C programs that use scanf for input and that precede every
     character ('%c') and string ('%s') in the format string by a blank,
     no further attention needs to be paid to the line structure.

     In C++ programs that use the default settings for input streams
     (skipping whitespace before all items), no explicit attention needs
     to be paid to the line structure.

Summary of Meta-Format Rules for Item Composition

Below are two tables that express which items may follow which
items on the same line, and what composition is allowed.

For IOI organizer-generated (input) files:

   First||   Second Item
   Item || Int | Chr | Str
   -----++-----+-----+-----
    None|| C S | C S | C S
   -----++-----+-----+-----
    Int ||  S  | (S) |  - 
   -----++-----+-----+-----
    Chr ||  S  | C(S)|  -
   -----++-----+-----+-----
    Str ||  -  |  -  |  -

For IOI competitor-generated (output) files:

   First||   Second Item
   Item || Int | Chr | Str
   -----++-----+-----+-----
    None|| C S | C S | C S
   -----++-----+-----+-----
    Int ||  S  |(C*)S|(C*)S
   -----++-----+-----+-----
    Chr || C S | C S |(C)S
   -----++-----+-----+-----
    Str ||  S  |  S  |  S

Notes:
  S  Separated (by one blank, or multiple blanks when relaxed)
  C  Contiguous (without intervening whitespace)
  -  Forbidden
  *  The range of characters allowed contiguously after an integer
     should NOT include digits.
 ( ) Allowed but discouraged

Examples for Meta-Format Rules

<TBC: Illustrate various of the meta-format rules by some pathological
examples.>

Here are some examples to illustrate rule MF13.  The following format
definitions violate rule MF13.

  1. Input file ... contains at least one and at most 20 lines.
     Each line contains three nonnegative integers less than 5000.

  2. The first and only line of input file ... contains at least
     one and at most 1000 characters that are uppercase letters.
     There are no blanks between the characters.

  3. The first line of input file ... contains at most 100 integers
     I with 10 <= I <= 99, followed by a single character '*'.

The first and second format require end-of-file and end-of-line
checking, respectively, to be read.  In C/C++, the second format
can be handled without explicit end-of-line checking, because C/C++
does not have the 255 character limit on variable-length strings.
When using the so-called "extended syntax", Turbo Pascal can also
read null-terminated strings (but this is not Standard Pascal,
and I don't find it required IOI knowledge).

The third format has the problem that it is not known what type
of item to read after the first item (which is always an integer).
Thus, it requires looking ahead into the input stream to detect
the terminating 'A'.  If that look-ahead character turns out to
be digit than it must be combined with the following digits to
make up another integer.  In Pascal, such a look-ahead character
cannot be pushed back into the input stream.  In C/C++, this can
be accomplished (though it is not recommended), but I would not
consider that required IOI knowledge.  The total length of the
input line can exceed 255.  Hence, reading it into a string is
not an option for Turbo Pascal (in my opinion).

Here are similar format definitions that do respect rule MF13.

  1. Input file ... contains at least one and at most 21 lines.
     Each line, except the last, contains three nonnegative integers
     less than 5000.  The last line of the file contains a single -1.

  2. The first line of input file ... contains at least 1 and
     at most 1000 characters that are uppercase letters, followed
     by a single '*'.  There are no blanks between the characters.

  3. The first line of input file ... contains at most 100 integers
     I with 10 <= I <= 99, followed by a single 0.

Although these formats are acceptable, their "readability" (by
a program) can be further improved as follows:

  1. The first line of input file ... contains one integer N with
     1 <= N <= 20.  The following N lines contain three nonnegative
     integers less than 5000.

  2. The first line of input file ... contains one integer N with
     1 <= N <= 1000.  The second line contains N characters that
     are uppercase letters.  There are not blanks between the characters.

  3. The first line of input file ... contains one integer N with
     0 <= N <= 100.  If N is positive, then there is also a second
     line, containing exactly N integers I with 10 <= I <= 99.

For the third format, now the question is raised whether the case
with N = 0 really needs to be included at all.  Furthermore, one
could argue that the human writability of the formats has deteriorated,
since the number of items needs to be counted explicitly.


User Requirements for RobIn
---- ------------ --- -----
Here is a list of requirements that I had in mind when designing RobIn:

  UR01  RobIn shall NOT CRASH on errors, but shall report them.

  UR02  RobIn shall be EASY TO USE for developing programs that check
        text files for adherence to IOI-style formats.

  UR03  RobIn shall be FLEXIBLE enough to express all reasonable
        IOI-style formats that adhere to the meta-format rules.

  UR04  RobIn shall ENFORCE the IOI-style meta-format rules.

  UR05  RobIn shall have a RELAXED mode and a STRICT mode for checking
        a format.

  UR06  RobIn shall be usable for developing both INPUT VALIDATORS
        that report to the organizers, and OUTPUT CHECKERS that report
        to the contestants and their coaches.
        <TBC: the latter is not yet treated in enough detail>

  UR07  RobIn shall NUMBER all error messages uniquely and take
        them from a text file to simplify translation.

  UR08  RobIn shall support TASK-SPECIFIC checking and error messages.

As a development requirement, I order the importance of some software
qualities as follows:

  1. First comes correctness

  2. Second comes usability (simplicity, generality)

  3. Third comes maintainability

  4. Fourth comes portability (though I have not given it much thought)

  5. Only in fifth place comes performance (speed, memory)


How to Use RobIn
--- -- --- -----
RobIn is a FreePascal Unit that provides routines for reading text
files and extracting values from those files, while at the same time
checking

  *  that the format (especially, order and composition of items, and
     the line structure) adheres to the meta-format rules, and

   * that the file content adheres to the format (including range
     restrictions).

RobIn is "robust" and does not crash on "bad" text files, but instead
provides detailed diagnostic messages.  The messages are numbered and
taken from message definition files to simplify translation and use by
non-English organizers, coaches, and competitors.

RobIn is intended to be used by organizers to construct tools that,
in turn, are used by organizers (to validate input files) and by
competitors and coaches (to check the format/correctness of output
files).  This section is addressed to the organizer who wishes
to construct a tool using RobIn.  Coaches and competitors may want
to skip to the section titled "How to Interpret RobIn Messages".

Make sure that you have RobIn.ppu (obtained by compiling RobIn.pas
with TP).  Simply include "uses RobIn;" in your FreePascal program and
away you go.  Well, you need to know somewhat more.

RobIn provides the following routines (here shown without parameters):

   BeginLogging     To open the log file and load the messages
   EndLogging       To close the log file and unload the messages

   OpenRIF          To open the text file to be read and checked
   CloseRIF         To terminate reading and close the text file

   ReadCharRIF      To read a character item
   ReadIntegerRIF   To read an integer item
   ReadStringRIF    To read a variable-length string item
   ReadlnRIF        To read an end-of-line
   IgnoreLineRIF    To skip the remainder of the line

   AssertRIF        To do a task-specific check
   IsAbortedRIF     To find out whether a format error occurred

   EofRIF           To detect end-of-file (ignoring whitespace)
   EolnRIF          To detect end-of-line (ignoring blanks)
   NextCharRIF      To inspect next non-whitespace character

   SetIndentation   To set the amount of required indentation (default: 0)
   IntToStr         To help construct composite identifiers for items

The acronym RIF stands for "RobIn File".  The routines ReadCharRIF,
ReadIntegerRIF, and ReadStringRIF are collectively also denoted
by Read<Item>RIF.  The routines EofRIF, EolnRIF, and NextCharRIF
are for checking output files only.

Some restrictions apply to the order in which RobIn routines
are called.  Violating these order restrictions will not crash
RobIn, but may result in early termination, and an error message
that may not explicitly mention the out-of-order calling.

The order restrictions on calls to BeginLogging, EndLogging, and
xxxRIF can be captured by the following regular expression:

  (
    1. BeginLogging
    2. ( "format described by program fragment using xxxRIF" )*
    3. EndLogging
  )*

where ( ... )* means "zero or more times the enclosed".

A /format/ is described by writing a Pascal program fragment containing
calls to OpenRIF, CloseRIF, Read<Item>RIF, and ReadlnRIF.  Typically,
the order of these calls should adhere to the following regular-expression
when checking input files:

    1. OpenRIF
    2. (
         1. ( Read<Item>RIF )+
         2. ReadlnRIF
       )+
    3. CloseRIF

where ( ... )+ means "one or more times the enclosed".
Instead of 2.1 and 2.2, it is also possible to have

         1. ( Read<Item>RIF )*
         2. IgnoreLineRIF

Anywhere between OpenRIF and the corresponding CloseRIF, the
format-defining program fragment can call AssertRIF to check a
task-specific condition.  The program can call IsAbortedRIF any time
after OpenRIF --- also after CloseRIF --- to find out whether reading
of the file content had been aborted (due to a fatal error).  Note that
checking of the format itself against the meta-format rules is always
continued until CloseRIF.

No order restrictions apply to IntToStr.  It is just a "loose" function,
typically used to construct a parameter for Read<Item>RIF when reading
a sequence of items.

As an example, consider the following format definition:

  The first line of file 'example.in' contains two integers N and K
  with  2 <= N <= 9  and  1 <= K <= N.  The second line contains
  N lowerercase letters ('a'..'z', contiguous).
  The following K lines each contain one nonempty string of at most
  20 uppercase letters ('A'..'Z').

This format is captured by the following program fragment (assuming
appropriate constant definitions and variable declarations; the full
program is Example.pas):

     OpenRIF ( inFile, 'example.in', ... )

     { read first line }
   ; N := ReadIntegerRIF ( inFile, Separated, 'N', 2, 9 )
   ; K := ReadIntegerRIF ( inFile, Separated, 'K', 1, N )
   ; ReadlnRIF ( inFile )

     { read second line }
   ; for i := 1 to N do begin
       L[i] := ReadCharRIF ( inFile, Contiguous,
                             'L[' + IntToStr(i,1) + ']',
                             UpperCaseChars )
     end { for i }
   ; ReadlnRIF ( inFile )

     { read following K lines }
   ; for j := 1 to K do begin
       S[j] := ReadStringRIF ( inFile, Contiguous,
                               'S[' + IntToStr(j,1) + ']',
                               LowerCaseChars, 1, 20 )
     ; ReadlnRIF ( inFile )
     end { for j }

   ; CloseRIF ( inFile )

Note that Separated, Contiguous, UpperCaseChars, and LowerCaseChars
are predefined in the RobIn interface section.

Here is a corresponding fragment of FreePascal program that reads the
input data from the input file (again assuming appropriate variable
declarations).  Note the great similarity with the fragment above.
Fragments in C and C++ are given further down.

     Assign ( inFile, 'example.in' )
   ; Reset ( inFile )

     { read first line }
   ; Read ( inFile, N )
   ; Read ( inFile, K )
   ; Readln ( inFile )

     { read second line }
   ; for i := 1 to N do begin
       Read ( inFile, L[i] )
     end { for i }
   ; Readln ( inFile )

     { read following K lines }
   ; for j := 1 to K do begin
       Read ( inFile, S[j] )
     ; Readln ( inFile )
     end { for j }

   ; Close ( inFile )

or somewhat more compactly, as a competitor might code it:

     Assign ( inFile, 'example.in' )
   ; Reset ( inFile )
   ; Readln ( inFile, N, K )
   ; for i := 1 to N do Read ( inFile, L[i] )
   ; Readln ( inFile )
   ; for j := 1 to K do Readln ( inFile, S[j] )
   ; Close ( inFile )

Here is a C version (one of many possibilities) to read the format
(assuming appropriate #include and variable declarations):

     inFile = fopen ( "example.in", "r" );
     fscanf ( inFile, "%d%d\n", &N, &K ); /* \n needed (why?) */
     for ( i=1; i<=N; ++i ) {
       fscanf ( inFile, "%c", &L[i] );
     }
     fscanf ( inFile, "\n" ); /* not needed (homework: why?) */
     for ( j=1; j<=K; ++j ) {
       fscanf ( inFile, "%s\n", &S[j] ); /* \n not needed (why?) */
     }
     fclose ( inFile );

And here is a C++ version to read the format (assuming appropriate
#include and variable declarations):

     ifstream inFile ( "example.in" );
     inFile >> N >> K;
     for ( i=1; i<=N; ++i ) {
       inFile >> L[i];
     }
     for ( j=1; j<=K; ++j ) {
       inFile >> S[j];
     }

<TBC: explain parameters of Read<Item>RIF routines: rif, comp, itemId,
and various bounds>

To simplify reading a list of separated items contained on a single line,
RobIn automatically adjusts the itemComp parameter to Contiguous
when reading the first item on a line (more accurately, to the pair
(minimum: indentation, extra: itemComp.extra)).  Thus, the format

  The first line of ... contains one integer N with 1 <= N <= 20.
  The second line contains N integers X with 0 <= X <= 999.

is correctly captured by (assuming appropriate variable declarations)

    N := ReadIntegerRIF ( inFile, Separated, 'N', 1, 20 )
  ; ReadlnRIF ( inFile )
  ; for i := 1 to N do begin
      X[i] := ReadIntegerRIF ( inFile, Separated, 'X', 0, 999 )
    end { for i }
  ; ReadlnRIF ( inFile )

It is recommended that the format program captures bounds as close as
possible to the way they are given in the task description.  Consider
the following two equivalent format definitions:

  The first line of ... contains two positive integers N and K at
  most 100 and with K <= N.

  The first line of ... contains two integers N and K with
  1 <= K <= N <= 100.

Here are two corresponding format-defining program fragments:

    N := ReadIntRIF ( inFile, 'N', 1, 100 )
  ; K := ReadIntRIF ( inFile, 'K', 1, 100 )
    { check K < N, if not report "K > N" }
  ; AssertRIF ( inFile, K <= N, AssertError, {M#}900,
                IntToStr ( K, 1 ) + ' > ' + IntToStr ( N, 1 ) )

and

    N := ReadIntRIF ( inFile, 'N', 1, 100 )
  ; K := ReadIntRIF ( inFile, 'K', 1, N )

These fragments are equivalent, except for the kind of diagnostics they
produce, and how the competitor can interpret the diagnostic, in view of
the given the format definition.  Consider the input file with first line
"10 20".  The first program will give an AssertError of the form "K > N
(20 > 10)", and the second program will give a FormatError of the form
"Integer not in range (20 not in [1..10])".

Use of AssertRIF, IsAbortedRIF, SetIndentation

<TBC:>

Checking Output Files

Note: There may be good reasons to allow formats in output files
that are forbidden for input files.  For example, if a program
must be written that enumerates all solutions of some kind of
puzzle, and it cannot be easily known in advance how many solutions
there are, then it would be cumbersome to require that the
first line in the output file states the number of solutions and
that each next line states a solutions.  Instead, it is better
to use the format where the program outputs each solution on a single
line in the output file.  However, if there are no solutions at
all, then an empty file would not be a good idea.

In output files, empty lines should never be significant.

Typical use of EofRIF, assuming (as is reasonable) that the output file
is not be empty:

  repeat
    { read one line }
  until EofRIF ( ... )

Note: It is not allowed to call EofRIF when not at the start of
a format line.  The reason for this is that it is unclear how to
handle the situation where the remainder of the line contains
some blanks but no further items; that line must still be finished
by a ReadlnRIF, but EofRIF will skip both Blanks and Eolns to see
if there is an Eof or some printing character ahead.

Typical use of EolnRIF when reading a line, which should not be empty:

  repeat
    { read one item }
  until EolnRIF ( ... )

NextCharRIF is useful when the type of the next item is not known
in advance.  For example, in an output format defined as follows:

    The first line of the the output file consists of N integers
    in [0 .. 999], each integer is optionally followed by an
    asterix ('*').  Each integer must be preceded by at least one
    blank and at most four blanks.  Each asterix must NOT be
    preceded by any blanks, but should be contiguous to the
    preceding integer.

In this case, NextCharRIF can be used to look ahead at
the first character of the next item.  Any intervening end-of-lines
are skipped and flagged as errors or warnings (depending on
whether the checking level is strict or relaxed respectively).
Here is a program fragment to define and read this format:

    SeparatedPlus.extra := 4
  ; for i := 1 to N do begin
      a [ i ] := ReadIntegerRIF ( inFile, SeparatedPlus, 'a[i]', 0, 999 )
    ; if not EolnRIF ( inFile ) then begin
        { test on not EolnRIF is needed for last item on line only }
        b [ i ] := ( NextCharRIF ( inFile ) = '*' )
      ; if b [ i ] then begin
          c := ReadCharRIF ( inFile, Contiguous, 'c', [ '*' ] )  { ignored }
        end { if }
      end { if }
    end { for i }
  ; ReadlnRIF ( inFile )

This is similar to the situation where the number of items on the line
is not known in advance.  Assuming that there is at least one integer
on the output line, the format can be checked by:

    SeparatedPlus.extra := 4
  ; i := 0
  ; repeat
      i := i + 1
    ; a [ i ] := ReadIntegerRIF ( inFile, SeparatedPlus, 'a[i]', 0, 999 )
    ; if not EolnRIF ( inFile ) then begin
        b [ i ] := ( NextCharRIF ( inFile ) = '*' )
      ; if b [ i ] then begin
          c := ReadCharRIF ( inFile, Contiguous, 'c', [ '*' ] )  { ignored }
        end { if }
      end { if }
  ; until EolnRIF ( inFile )
  ; ReadlnRIF ( inFile )

<TBC: Maybe further examples, motivation, ...>

The Structure of a Message Definition File

Each diagnostic message is identified by a number in the range 0..900.
The numbers in the range 0..899 are reserved for use within RobIn.
The numbers in the range 900..999 are available for task-specific
messages issued via AssertRIF by the using program and loaded
by BeginLogging.

The association between a message number and its text is recorded
in a Message Definition File (typical extension .mdf).  In a message
definition file, empty lines and lines starting with '#' are ignored.
These can be used for documentation.  Other lines should be of the
form

  <message number> <non-empty string of at most 50 characters>

Leading and trailing blanks are ignored.  The format of the message
definition file is checked when it is loaded (though in a way that
differs from the checking that RobIn does for RobIn files :-) and
warnings are generated when the message definition file is badly
formatted.

Diagnostic Categories

Every message is issued within a category.  Usually --- but not
necessarily always --- a message occurs in one category only.
However, some messages are more general and make (different) sense
in different categories (e.g. "Empty string").

The following table summarizes the "philosphy" behind the various
Error dianostics.  As with all error reporting, your mileage may vary,
since RobIn cannot know what the user's intentions were.  Compare this
to the art of interpreting compiler error messages.

  IOError         --> need to correct problem with user file in OS
                      its existence, its location or name, its permissions
  InternalError   --> need to correct mistake inside RobIn
                      report this to me, with all evidence you can supply
  ExternalError   --> need to correct mistake in RobIn's runtime environment
                      or configuration, e.g. missing message definition file
  UsageError      --> need to correct mistake in program using RobIn
  MetaFormatError --> need to correct format definition
  FormatError     --> need to correct file content (general isolated issue)
  AssertError     --> need to correct file content (task-specific issue)

<TBC: Include figure which shows the various interfaces to RobIn
and which errors relate to which interface.>

The distinction between UsageError and MetaFormatError is not always
as clearcut as one might hope.  Generally speaking, the format definition
need not change when correcting a UsageError.  Nevertheless, the format
may still turn out to be the real source of the problem.  Consider the
following format definition:

  The first line of input file ... contains two integers N and K with
  1 <= N <= 100 and 3 <= K <= N.

The straightforward format-defining program fragment is

    N := ReadIntegerRIF ( inFile, Separated, 'N', 1, 100 )
  ; K := ReadIntegerRIF ( inFile, Separated, 'K', 3, N )

However, this may give rise to a UsageError "Empty integer range"
on the second ReadIntegerRIF call.  This is, for instance, the case when
reading was aborted (because file does not exist), in which case only the
meta-format is checked.  The first ReadIntegerRIF then returns the
lower bound 1, making the second range empty.  This is clearly an
inconsistent read request.  What is the matter here?  Well, 1 is not
really the lower bound on N, instead the actual consistent minimum
value for N is 3, because of the constraint  3 <= K <= N .  It
is advisable to change the format definition in such a case:

  The first line of input file ... contains two integers N and K with
  3 <= N <= 100 and 3 <= K <= N.

The corresponding program-defining program fragment is:

    N := ReadIntegerRIF ( inFile, Separated, 'N', 3, 100 )
  ; K := ReadIntegerRIF ( inFile, Separated, 'K', 3, N )

This fragment will return N=3, when reading was somehow aborted.
Consequently, the range for K in the next call will no longer be
empty.  Hence it will not produce a UsageError about the range for K.
It will, however, produce a UsageWarning, because the range
consists of a single element only (which would generally be
suspicious).

Warnings

Warnings point to a "suspicious" situation or a discouraged practice.
Sometimes they can easily be explained and subsequently be ignored.
In other cases, it may turn out to be a "hard to catch" mistake.
Compare this again to compiler warnings.

<TBC: explain selection and limiting to tune diagnostic output
  BeginLogging: parameters diagSelection, reportingLimit>

<TBC: concurrent reading/checking of multiple files can be done>

    Exception handling:

      *  We distinguish the follow exceptions:
           -  meta-format warnings, in case the format involves a
              discouraged practice
           -  meta-format errors, in case the format violates one
              of the IOI format restrictions
           -  format warnings, in case the content of the file being
              read violates the format
           -  format errors

      *  In principle, after the first error has occurred, it no
         longer makes sense to continue reading the file.
         However meta-format checking proceeds even if a Format Error
         has occurred.
>

<TBC:>


How to Interpret RobIn Messages
--- -- --------- ----- --------
<TBC: what does competitor have to know about RobIn to understand
 its diagnostic output?>

<TBC: file being read and (intended) format begin processed need not by in sync,
 especially if eoln instead of blank or vice versa.

 coordinates:
   lineNr, columnNr are coordinates of nextChar within file,
   fmtLineNr, itemNr are coordinates of current item within format;
   nextChar looks AHEAD!

 current item may not have been completely handled by Read<Item>RIF,
 because it may still be followed by AssertRIF for semantic checking

 Integer item values are reported in decimal
 Character item values are reported as 'c', '''', or chr(#)
 String item values are reported as "c...c" (embedded '"' appear singly!)

 consequences of limiting diagnostics>


Design and Implementation of RobIn
------ --- -------------- -- -----
The implementation of RobIn resides in the file "RobIn.pas" and
contains comments as documentation.  Here are some very global
additional remarks.

Expressing the Format

I have chosen to express the format as a (Pascal) program fragment
that makes calls to the routines reading the elements in an order
appropriate for the format.  An alternative could have been to
introduce a new notation for expressing formats and to store the
format description in a separate file.  I think that my approach
is more flexible and easier to use directly (unless you hate
Pascal).

Type RobInFile

The file being checked is read via a variable rif: RobInFile, in
particular via the text file variable rif.f.  The variable rif
contains other fields besides f to capture the current state
of the reading and checking process, including some statistics.

The sequence of characters that "passes through" rif.nextChar is
exactly the contents of the file, with two exceptions.  The first
exception is that, under Relaxed Checking, every TAB characters is
replaced by a blank.  The second exception is that, under Relaxed
Checking again, if the last line in the file is not terminated by
an explicit end-of-line, then and EolnChar is inserted in the sequence.

The window rif.nextChar acts as a look-ahead character.  It is needed,
because Turbo/FreePascal lacks a (Standard Pascal) file window to look
ahead into the input stream.  The procedure StepNextChar advances the
rif.nextChar "window" one position.

The look-ahead character is always valid after opening a RobInFile.
Procedure StepNextChar gets the next character into nextChar.
When eof is encountered, nextChar is set to EofChar.
When eoln is encountered, nextChar is set to EolnChar.

RobIn parses items by reading characters.  The end of an integer item
is detected by hitting the end-of-line or by reading a non-digit.
The latter cannot be pushed back into the input stream and is stored
as look-ahead character in nextChar.

The next higher level of abstraction views the file as a sequence of
/lines/, where each line consists of a sequence of /items/.  Each item is
a represented in the text file by a non-empty sequence of non-whitespace
characters.  These items are numbered per line starting at 1.  To get
appropriate diagnostic messages, it is desirable that when AssertRIF is
called after some Read<Item>RIF, this AssertRIF will still refer back to
the last item read.  Thus, the item number and other information about
the last item read, is kept around in the RobInFile descriptor rif.
Once attention is turned to the next item (either via Read<Item>RIF,
or indirectly via EofRIF, EolnRIF, or NextCharRIF), is the item number
incremented.

The implementation of EofRIF, EolnRIF, and NextCharRIF is somewhat
awkward.  These functions should look ahead in the file, WITHOUT advancing
the nextChar window, seeing through blanks and, for EofRIF and NextCharRIF,
also through end-of-lines.  The actual processing of the ignored blanks
and end-of-lines can only be done when it is known what type of item to
read next and how this item is composed to its predecessor.  Because
I could not find a convenient way to accomplish this kind of look ahead,
I have chosen to advance nextChar, and to process the skipped blanks
and end-of-lines as best as possible.  When not possible, the reporting
is delayed and relevant state information is maintained in rif
(in particular, the fields blankCount, eolnCount, excessiveBlanksReported).
Because of this implementation choice, some diagnostics are reported
a little later in the stream then might have been desirable.
The file coordinates in the diagnostic message do not point to the
precise location in the file where the situation first arose.

As an example, consider the occurence of the query EolnRIF(rif) in the
format reading program fragment (for checking an output file which has
a variable number of items on a line).  If rif.nextChar equals a blank,
then this should be the separating blank before the next item on the line.
But it could well be an (erroneous) trailing blank at the end of the line.
Which of these possibilities is actually the case, can only be determined
after finding either the end-of-line after a couple of blanks, or
a non-white character (the start of the next item).  The subsequent
Read<Item>RIF will determine what composition applies.  Only then
will it be clear whether the skipped blanks are in agreement with
the format.

Diagnostic messages

Diagnostic output is done through theLogFileP, which is a pointer
to a text file.  I use such a pointer for two reasons.  First of all,
that way the using program can "own" the file variable and still
tell only once to RobIn which file to use.  Second, it simplifies
switching on-the-fly between log files without having to open and
close them.  You can just move the pointers around.  For instance,
to get diagnostic output temporarily to standard output, simply
insert:

    theLogFileP := @ output
    { diagnostic output now goes to output }
    ...
    theLogFileP := @ logFile
    { diagnostic output now goes to logFile
      (logFile should be open for writing) }

Similarly, the selection/suppression of diagnostic messages can be changed
on-the-fly by assigning to theDiagSel.

All diagnostic messages produced by RobIn are preceded by "RobIn: ".
Subsequent lines are indented by two blanks (implementation
string constant Indent).

Diagnostic messages are formulated to state what is actually the
case, not what should have been the case.  For example, "Empty
string" (when some empty string was encountered where it should
not have been), rather than "Empty string not allowed".

Some diagnostic messages are followed by a /parenthetic note/, which
is NOT taken from a message file, but which is constructed depending
on the current situation.  For example, if a value is out of bounds,
the value and bounds are shown.  I have strived to use as little
English as possible in the parenthetic remarks, and preferably only
symbolic information, to avoid the need for translation.  Note that
only a limited set of diagnostic messages is relevant to the
competitors.

<TBC: There should be a complete list showing which notes can occur. >

Message numbers in the range 1 .. 255 are reserved for Turbo Pascal
run-time error messages.  Message numbers in the range 256 .. 899
are general messages of RobIn.

Type Compatibility Issue for Read<Item>RIF

The routines Read<Item>RIF have been implemented as functions.
They "should" have been implemented as procedures (like the predefined
read in Pascal), because they have (major) side-effects.  Here is a
motivation for my choice.

In case of read procedures, the result (the value read) would have to
be communicated via a var-parameter.  These var-parameters need to
have the "widest" type for the kind of item read, that is, Char for
characters, LongInt for integers, and String for variable-length
strings.  Turbo Pascal insists that the actual var-parameter has
exactly the same type as the formal var-parameter (and not a subtype).
There is no automatic type conversion when assigning the result value
in the procedure to the var-parameter.  (Note: The predefined read
procedures form an exception to this rule!)  Consequently, if you
have "var N: 1..100" and want to read a value using a
"procedure ReadInt ( var i: LongInt )", then "ReadInt ( N )"
is rejected, and you have to resort to the use of an auxiliary
variable: "var ri: LongInt; ... ReadInt ( ri ) ; N := ri".

On the other hand, for functions, the result type only needs to be
assign compatible --- not identical --- when used in an assignment
statement, and the value is automatically converted.  Thus,
"var N: 1..100; ... function ReadInt: LongInt; ... N := ReadInt"
is correct usage.

To simplify the use of RobIn (avoiding auxiliary variables), I have
sacrificed elegance (no side-effects in functions).

In What Order to Check Diagnostic Conditions?

The order in which routines check conditions to generate diagnostic
output poses some dilemmas.  Often, it does not make sense to continue
processing if a UsageError condition occurs, such as an empty
character range in ReadStringRIF.  In fact, it may be dangerous
or senseless to test for a UsageWarning condition if an error
is present.  For example, in ReadStringRIF, if range contains
a single character and minLength = maxLength, then I find that
a good reason to give a UsageWarning (because at this point,
the input file can contain only a single value).  As a parenthetic
remark, I produce that single value.  But what if minLength and
maxLength are negative?  In that case, the UsageError should take
precedence, because it violates the precondition (and no string
value in the file can be acceptable).  But, always first checking
for UsageError conditions and, if none occurred, then checking
for UsageWarning conditions, reduces the amount of diagnostic
information in the log file.  It may be very helpful in locating
an error to receive warnings as well, because they can point
to overlooked situations.

As a result, there are places where I first check for UsageWarning
conditions, and also places where I first check for UsageError
conditions.  It all depends on the context.  If a UsageWarning
can safely and sensibly be issued before the UsageErrors, then
I do so.

Turbo Pascal Extensions Used in RobIn

Here is a (complete?) list of Turbo extensions to Standard Pascal
that were used in the implementation of RobIn.

  LongInt
  Inc, Dec
  UpCase
  String, + on strings, Val, Str, Delete
  Exit, RunError
  Assign, Close
  {$I+/-}, IOResult
  else in a case statement

Testing RobIn

<TBC: RobIn was applied to some of the IOI2000 input files
at the moment of writing this document>

< STILL INCOMPLETE >


Limitations of RobIn
----------- -- -----
RobIn has (of course) some limitations.  Here is a brief discussion
of some of the known limitations.

Diagnostics for Out-of-Order Calls

<TBC: following needs to be updated, since I improved RobIn>
The diagnostics for out-of-order calls to xxxRIF are implicit, rather
than explicit.  In particular, the following restrictions should be
checked explicitly but are not (all calls concern the SAME variable
rif: RobInFile):

  Routine        Condition

  OpenRIF        # OpenRIF = # CloseRIF      (rif not open)
  CloseRIF       # OpenRIF = # CloseRIF + 1  (rif not closed)
  Read<Item>RIF  same as CloseRIF
  ReadlnRIF      same as CloseRIF
  AssertedRIF    same as CloseRIF
  IsAbortedRIF   # OpenRIF > 0               (rif was opened)

Note that "rif is open" does not mean the same thing as "rif.f is
open", because a call to OpenRIF may fail to open rif.f.  In that
case, format checking is aborted, but meta-format checking continues,
and a call to CloseRIF is still needed to terminate the checking,
before opening another file.

Another thing that cannot be diagnosed by RobIn is a missing
CloseRIF call.  If the using program does not make the call,
then RobIn will never know.

Similar limitations apply to the pair BeginLogging and EndLogging.

The reason that such protection is "hard" to implement in Pascal
is that no variables (except file variables) are automatically
initialized to some known value.  Therefore, RobIn (which does not
"own" the RobInFile variable that acts as parameter to xxxRIF) must
always rely on the using program to take care of initialization,
for example by calling an appropriate initialization routine
(in our case, OpenRIF).  (If only Pascal had had a single
auto-initialized type for variables; one bit would have been
enough. :-)

Concerning the second limitation (missing call to CloseRIF),
the thing is that Turbo Pascal has no way to guarantee that
control goes back to RobIn when the program exits.  EndLogging
does a check, but the call to EndLogging may also be missing.
(Delphi does have have that facility.)

< STILL INCOMPLETE >


Other Tools
----- -----

CheckMsg

CheckMsg.pas is a program to help check to check the structure of
message files used by RobIn, especially task-specific message files.
CheckMsg checks the files named by each of its command-line parameters.
If no command-line parameters are given, then it asks for a file name
and checks it.

Application Framework

<TBC: now the general code for each validator/checker is
simply copied.  That code either handles all command-line parameters
and interprets wildcards, or it generates the file names of all files
to be tested internally (based on the root of the file name
and appending numbers for each case).  For each file it calls
a routine that handles that file (either validates the input format,
analyzes the data, checks the output format, or checks the output
correctness.  The task-specific code in the current "framework"
is kept to a minimum, the remainder is put into a task-specific
unit.  See example "ValidEx.pas".  Actually, this file name business
can be handled better in a shell script.>

Normalizer

The program normal.pas normalizes text files.  Normalization involves:

      *  removing whitespace (blanks, tabs) at line begin
      *  replacing multiple whitespace between items by a single blank
      *  removing whitespace at line end
      *  removing invalid characters, outside chr(32) .. chr(126)
      *  removing empty lines (incl. lines with only whitespace)
      *  ensuring an explicit end-of-line on the last line

During normalization, changes are reported on standard output.


Concluding Remarks
---------- -------
Here are some open ends.

Porting to Linux and/or C/C++.

The initial TurboPascal version of RobIn has been adapted to FreePascal.
This version works well under Linux.  I will not be working on a C/C++
version of RobIn.  But if you want to give it a try, look under "Design
and Implementation of RobIn" for a list of Turbo extensions to Standard
Pascal that were used.

ReadRealRIF or ReadFloatRIF

I will not put in a ReadRealRIF/ReadFloatRIF.  For one thing, the use
of floating point arithmetic at the IOI is controversial (for several
reasons, such as rounding, numerical stability, etc., ask Gyula Horvath).
Furthermore, the format for a real/float item must be chosen such
that it is common to all IOI programming languages.  Its definition
is more complicated than that of the items supported now.

<TBC: application to IOI2000 input validators>


Acronyms
--------
BS      Backspace (ASCII character 8)
CR      Carriage Return (ASCII character 13)
FF      Form Feed (ASCII character 12)
HT      (Horizontal) Tab (ASCII character 9)
IDE     Integrated Development Environment
I/O     Input/Output
IOI     International Olympiad in Informatics
LF      Line Feed (ASCII character 10)
OS      Operating System
RobIn   Robust Input module (FreePascal Unit)
TC      Turbo C/C++ by Borland (version 3.0)
TP      Turbo Pascal by Borland (version 7.0)
VT      Vertical Tab (ASCII character 11)


Sample Programs
------ --------
This appendix contains some programs to illustrate certain details
that I did not want to include in the main text.  The layout for
these programs has been kept compact to improve the overview, at
the expense of their maintainability.

End-of-Line and Turbo Pascal

The following two programs count the number of characters and lines
in standard input.  Under the ISO Standard Pascal definition, these
programs are equivalent, but not under Turbo Pascal.  Try the following
two possible inputs, each consisting of two lines (^Z denotes Ctrl-Z):

abc
def
^Z

abc
def^Z

Program TextCounter1 (correctly) reports 6 characters on 2 lines
in both cases.  Program TextCounter2 reports the same for the first
input, but 6 characters and 1 line for the second input, because the
outer loop terminates early.  This can be compensated by including
the line

; if eoln then Inc(lines)

after the loop in TextCounter2.  Note, however, that the resulting
program is no longer correct ISO Pascal (because eoln is inspected
when eof is true)!

program TextCounter1;

var chars, lines: longint; c: char;

begin
  writeln('Count # characters and lines in standard input.')
; writeln('Standard implementation.')
; chars := 0 ; lines := 0
; while not eof do begin
    while not eoln do begin read(c) ; Inc(chars) end
  ; readln ; Inc(lines)
  end
; writeln('# chars = ', chars:6)
; writeln('# lines = ', lines:6)
end.

program TextCounter2;

var chars, lines: longint; c: char;

begin
  writeln('Count # characters and lines in standard input.')
; writeln('Alternative implementation.')
; chars := 0 ; lines := 0
; while not eof do begin
    if eoln
    then begin readln  ; Inc(lines) end
    else begin read(c) ; Inc(chars) end
  end
  { N.B. Turbo Pascal needs "; if eoln then Inc(lines)" here,
    but this would be wrong in Standard Pascal }
; writeln('# chars = ', chars:6)
; writeln('# lines = ', lines:6)
end.

~~~~
NOT YET INCORPORATED ABOVE

  Item composition mistakes are
      *  multiple consecutive blanks
      *  blanks at begin of line or end of line
      *  empty lines
      *  missing blanks between separated items
      *  missing end-of-lines
      *  end-of-lines instead of blanks
      *  blanks instead of end-of-lines
      *  junk at end-of-file

Existence of the file to be checked and limitations on its size
are typically handled prior to reading the file with RobIn.

In implementation of RobIn:

Currently, RobIn does not check that when an integer item is preceded
contiguously by a character item on the same line, the range for the
character item does not include potentially "confusion" characters
(such as digits, or '+', or '-').

Some meta-format checks have been disabled, because they seem too
pedantic:

  *  Character follows integer on same format line
  *  Integer follows character on same format line
  *  Integer is contiguous to preceding item
     (a.k.a. No blank prescribed before integer)
  *  String not at begin of format line