Here is Graham Nelson's original post detailing his algorithm for syntax
colouring Inform.  At the end of the file you will find my notes on the
changes I have made.

John G. Wood 

------------------------------------------------------------------------
From: Graham Nelson <graham@gnelson.demon.co.uk>
Newsgroups: rec.arts.int-fiction
Subject: How to syntax-colour Inform
Date: Wed, 17 Dec 1997 18:46:34 +0000 (GMT)
Organization: none
Message-ID: <ant171834bbaM+4%@gnelson.demon.co.uk>


[This is going to be a new section in the Inform Technical Manual,
which seems as good a place to keep it as any, but in the mean time
it's been requested several times on the newsgroup, hence this
posting.  Comments welcome -- GN.]


How to syntax-colour Inform source code
---------------------------------------

"Syntax colouring" is an automatic process which some text editors apply
to the text being edited: the characters are displayed just as they are,
but with artificial colours added according to what the text editor thinks
they mean.  The editor is in the position of someone going through a book
colouring all the verbs in red and all the nouns in green: it can only do
so if it understands how to tell a verb or a noun from other words.
Many good text editors have been programmed to syntax colour for languages
such as C, and a few will allow users to reprogram them to other languages.

One such is the popular Acorn RISC OS text editor "Zap", for which the
author has written an extension mode called "ZapInform".  ZapInform
contributes colouring rules for the Inform language.  Over a dozen
people have now asked me how it works, and since the original is written
in ARM assembly (a language rather less widely spoken than Middle Egyptian)
it seems worth documenting the main algorithm.

(ZapInform does a number of other useful things, including pasting in
template objects and rooms when commanded from a mouse-accessed menu:
for instance, you can create a simple game with two or three mouse
clicks and a few object names typed in to a dialogue box, then click
to save and compile the result.  See the ZapInform manual for details.)


(a)  State values

ZapInform associates a 32-bit number called the "state" with every
character position.

The "state" is as follows.  11 of the upper 16 bits hold flags, the
rest being unused:

   32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17
                                                comment
                                             single-quoted text
                                          double-quoted text
                                       statement
                                    after marker
                                 highlight flag
                              highlight all flag
                           colour backtrack
                        after-restart-flag
                     wait-direct (waiting for a directive)
                  dont-know-flag

These flags make up the "outer state" while the lower 16 bits hold
a number pompously called the "inner state":

             0    after WS (WS = white space or start of line or comma)
             1    after WS then "-"
             2    after WS then "-" and ">" [terminal]
             3    after WS then "*" [terminal]

          0xFF    after junk
   0x100*N + S    after WS then an Inform identifier N characters long
                  itself in state S:
                  0x101 w    0x202 wi   0x303 wit   0x404 with
                  0x111 h    0x212 ha   0x313 has
                  0x121 c    0x222 cl   0x323 cla   0x424 clas   0x525 class
 same + 0x8000    when complete [terminal]
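
As a rough illustration, the layout above can be written down as a set
of constants.  This is a sketch in Python rather than ZapInform's ARM
assembly: every name in it is invented for the purpose, and the bit
values assume that the chart numbers bits from 1, so that "comment"
(bit 17) is 0x00010000.

```python
# Outer-state flags, assuming the chart numbers bits from 1 (bit 17 = 1 << 16).
# All names here are illustrative; none of them comes from ZapInform itself.
COMMENT          = 1 << 16
SINGLE_QUOTE     = 1 << 17
DOUBLE_QUOTE     = 1 << 18
STATEMENT        = 1 << 19
AFTER_MARKER     = 1 << 20
HIGHLIGHT        = 1 << 21
HIGHLIGHT_ALL    = 1 << 22
COLOUR_BACKTRACK = 1 << 23
AFTER_RESTART    = 1 << 24
WAIT_DIRECT      = 1 << 25
DONT_KNOW        = 1 << 26

INNER_MASK = 0xFFFF   # lower 16 bits: the inner state
TERMINAL   = 0x8000   # added to an identifier state when it is complete


def split_state(state):
    """Return (outer flags, inner state) for a 32-bit state value."""
    return state & 0xFFFF0000, state & INNER_MASK


def describe_inner(inner):
    """Return (identifier length so far, keyword-tracking part S)."""
    return (inner & 0x7F00) >> 8, inner & 0xFF
```

With this numbering WAIT_DIRECT comes out as 0x02000000, and
describe_inner(0x8525) reports a completed five-letter identifier in
keyword state 0x25 ("class").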
   
In practice it would be madness to try to actually store the state
of every character position in memory (it would occupy four times as
much space as the file itself).  Instead, ZapInform caches just one
state value, the one most recently calculated, and uses a process
called "scanning" to determine new states.  That is, given that we
know the state at character X and want to know the state at character
Y, we can find out by scanning each character between X and Y,
altering the state according to each one.

It might possibly save some time to cache more state values than
this (say, the state values at the start of every screen-visible
line of text, or some such) but the complexity of doing this doesn't
seem worthwhile on my implementation.  Scanning is a quick process
because the Zap text editor stores the entire file in almost contiguous
memory, easy to run through, and the state value can be kept in a
single CPU register while this is done.
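
The single-value cache might be sketched as below.  This is Python
rather than assembly, the class and method names are hypothetical, and
what happens when an earlier position is requested is an assumption
(the sketch simply starts again from the top of the file).  The
scanning function is passed in as a parameter so that the cache stays
independent of it.

```python
class StateCache:
    """Remember one (position, state) pair and scan forward from it."""

    def __init__(self, text, scan, initial_state):
        self.text = text              # the whole file, as one string
        self.scan = scan              # scan(state, ch) -> new state
        self.initial = initial_state
        self.pos = 0                  # number of characters already scanned
        self.state = initial_state    # state after those characters

    def state_at(self, pos):
        """Return the state after the first `pos` characters."""
        if pos < self.pos:
            # Assumption: for an earlier position, rescan from the start.
            self.pos, self.state = 0, self.initial
        for ch in self.text[self.pos:pos]:
            self.state = self.scan(self.state, ch)
        self.pos = pos
        return self.state


def toy_scan(state, ch):
    """Stand-in scanning function for the demonstration: counts new-lines."""
    return state + 1 if ch == "\n" else state


cache = StateCache("a\nb\nc", toy_scan, 0)
print(cache.state_at(4), cache.state_at(2), cache.state_at(5))
```

Moving forwards costs only the characters between the old and new
positions; moving backwards is the expensive case.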


(b)  Scanning text

Let us number the characters in a file 1, 2, 3, ...

The state before character 1 is always 0x02000000: that is, inner
state zero and outer state with only the waiting-for-directive flag set.
(One can think of this as the state of an imaginary "character 0".)
The state at character N+1 is then a function of the state at
character N and what character is actually there.  Thus,

       State(0) = 0x02000000

and for all N >= 0,

       State(N+1) = Scanning_function(State(N), Character(N+1))

And here is what the scanning function does:

    1.  Is the comment bit set?
           Is the character a new-line?
              If so, clear the comment bit.
              Stop.

    2.  Is the double-quote bit set?
           Is the character a double-quote?
              If so, clear the double-quote bit.
              Stop.

    3.  Is the single-quote bit set?
           Is the character a single-quote?
              If so, clear the single-quote bit.
              Stop.

    4.  Is the character a single quote?
           If so, set the single-quote bit and stop.

    5.  Is the character a double quote?
           If so, set the double-quote bit and stop.

    6.  Is the character an exclamation mark?
           If so, set the comment bit and stop.

    7.  Is the statement bit set?
           If so:
              Is the character "]"?
                 If so:
                    Clear the statement bit.
                    Stop.

              If the after-restart bit is clear, stop.

              Run the inner finite state machine.

              If it results in a keyword terminal (that is, a terminal
              which has inner state 0x100 or above):
                 Set colour-backtrack (and record the backtrack colour
                 as "function" colour).
                 Clear after-restart.

              Stop.

           If not:
              Is the character "["?
                 If so:
                    Set the statement bit.
                    If the after-marker bit is set, set after-restart.
                    Stop.

              Run the inner finite state machine.

              If it results in a terminal:
                 Is the inner state 2 [after "->"] or 3 [after "*"]?
                    If so:
                       Set after-marker.
                       Set colour-backtrack (and record the backtrack
                       colour as "directive" colour).
                       Zero the inner state.
                 [If not, the terminal must be from a keyword.]
                 Is the inner state 0x404 [after "with"]?
                    If so:
                       Set colour-backtrack (and record the backtrack
                       colour as "directive" colour).
                       Set after-marker.
                       Set highlight.
                       Clear highlight-all.
                 Is the inner state 0x313 ["has"] or 0x525 ["class"]?
                    If so:
                       Set colour-backtrack (and record the backtrack
                       colour as "directive" colour).
                       Set after-marker.
                       Clear highlight.
                       Set highlight-all.
                 If the inner state isn't one of these: [so that recent
                 text has formed some alphanumeric token which might or
                 might not be a reserved word of some kind]
                    If waiting for directive is set:
                          Set colour-backtrack (and record the backtrack
                          colour as "directive" colour)
                    If not, but highlight-all is set:
                          Set colour-backtrack (and record the backtrack
                          colour as "property" colour)
                    If not, but highlight is set:
                          Clear highlight.
                          Set colour-backtrack (and record the backtrack
                          colour as "property" colour).

                 Is the character ";"?
                    If so:
                       Set wait-direct.
                       Clear after-marker.
                       Clear after-restart.
                       Clear highlight.
                       Clear highlight-all.
                 Is the character ","?
                    If so:
                       Set after-marker.
                       Set highlight.

              Stop.
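
Steps 1 to 6 translate almost line for line into code.  The sketch
below is Python with invented names; the flag values assume the bit
chart in (a) numbers bits from 1, and step 7 is left to a caller-supplied
function (here a do-nothing default) rather than reproduced.

```python
COMMENT      = 1 << 16   # assumed values: chart bits numbered from 1
SINGLE_QUOTE = 1 << 17
DOUBLE_QUOTE = 1 << 18


def scan(state, ch, step7=lambda state, ch: state):
    """Steps 1 to 6 of the scanning function; step 7 is left to `step7`."""
    if state & COMMENT:                 # step 1: comments end at a new-line
        if ch == "\n":
            state &= ~COMMENT
        return state
    if state & DOUBLE_QUOTE:            # step 2
        if ch == '"':
            state &= ~DOUBLE_QUOTE
        return state
    if state & SINGLE_QUOTE:            # step 3
        if ch == "'":
            state &= ~SINGLE_QUOTE
        return state
    if ch == "'":                       # step 4
        return state | SINGLE_QUOTE
    if ch == '"':                       # step 5
        return state | DOUBLE_QUOTE
    if ch == "!":                       # step 6
        return state | COMMENT
    return step7(state, ch)             # step 7: statements and directives


def scan_all(text, state=0x02000000):
    """Fold `scan` over a string, starting from the state before character 1."""
    for ch in text:
        state = scan(state, ch)
    return state
```

The order of the tests is what makes an exclamation mark inside quoted
text harmless: scan_all('"it\'s ! here') ends with only the double-quote
bit added to the starting state.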

The "inner finite state machine" adjusts only the inner state, and
always preserves the outer state.  It not only changes an old inner
state to a new inner state, but sometimes returns a "terminal" flag
to signal that something interesting has been found.

          State      Condition      Go to state     Return terminal-flag?
          0          if "-"         1
                     if "*"         3               yes
                     if space, "#",
                        newline     0
                     if "_"         0x100
                     if "w"         0x101
                     if "h"         0x111
                     if "c"         0x121
                     other letters  0x100
                     otherwise      0xFF
          1          if ">"         2               yes
                     otherwise      0xFF
          2          always         0
          3          always         0
          0xFF       if space,
                        newline     0
                     otherwise      0xFF         

          all 0x100+ states:
                     if not alphanumeric, add
                        0x8000 to the state         yes
          then for the following states:
          0x101      if "i"         0x202
                     otherwise      0x200
          0x202      if "t"         0x303
                     otherwise      0x300
          0x303      if "h"         0x404
                     otherwise      0x400
          0x111      if "a"         0x212
                     otherwise      0x200
          0x212      if "s"         0x313
                     otherwise      0x300
          0x121      if "l"         0x222
                     otherwise      0x200
          0x222      if "a"         0x323
                     otherwise      0x300
          0x323      if "s"         0x424
                     otherwise      0x400
          0x424      if "s"         0x525
                     otherwise      0x500
          but for all other 0x100+ states:
                     if alphanumeric, add
                        0x100 to the state

          0x8000+    always         0

(Note that if your text editor stores tabs as characters in their own
right (usually 0x09) rather than rows of spaces, tab should be included
with space and newline in the above.)

Briefly, the finite state machine can be left running until it returns
a terminal, which means it has found "->", "*" or a completed Inform
identifier: and it detects "with", "has" and "class" as special keywords
amongst these identifiers.
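
For readers who prefer code to tables, here is a literal transcription
of the table into Python.  It is a sketch only: the names are invented,
"_" is assumed to count as alphanumeric, tab is treated as space, and
the "always 0" rows are followed exactly as written, so the character
that arrives in state 2, 3 or 0x8000+ is consumed without being examined.

```python
TERMINAL = 0x8000
SPACE = " \n\t"   # tab included, as the note above suggests

# (state, expected character) -> next state, for "with", "has" and "class".
KEYWORD_STEPS = {
    (0x101, "i"): 0x202, (0x202, "t"): 0x303, (0x303, "h"): 0x404,
    (0x111, "a"): 0x212, (0x212, "s"): 0x313,
    (0x121, "l"): 0x222, (0x222, "a"): 0x323,
    (0x323, "s"): 0x424, (0x424, "s"): 0x525,
}
PARTIAL_KEYWORDS = {state for state, _ in KEYWORD_STEPS}


def is_alnum(ch):
    # Assumption: "_" counts as alphanumeric, since identifiers may contain it.
    return ch.isascii() and (ch.isalnum() or ch == "_")


def run_inner(inner, ch):
    """Apply one row of the table; return (new inner state, terminal flag)."""
    if inner in (2, 3) or inner >= TERMINAL:
        return 0, False
    if inner == 0:
        if ch == "-":
            return 1, False
        if ch == "*":
            return 3, True
        if ch in SPACE or ch == "#":
            return 0, False
        if ch == "_":
            return 0x100, False
        if ch in "whc":
            return {"w": 0x101, "h": 0x111, "c": 0x121}[ch], False
        if ch.isascii() and ch.isalpha():
            return 0x100, False
        return 0xFF, False
    if inner == 1:
        return (2, True) if ch == ">" else (0xFF, False)
    if inner == 0xFF:
        return (0 if ch in SPACE else 0xFF), False
    # Everything else is an identifier state, 0x100 or above.
    if not is_alnum(ch):
        return inner + TERMINAL, True
    if inner in PARTIAL_KEYWORDS:
        # Continue the keyword, or fall back to a plain identifier one
        # character longer (for example 0x202 then "d" gives 0x300).
        return KEYWORD_STEPS.get((inner, ch), (inner & 0x7F00) + 0x100), False
    return inner + 0x100, False


def feed(text, inner=0):
    """Run the machine over `text`; return the last (state, terminal) pair."""
    terminal = False
    for ch in text:
        inner, terminal = run_inner(inner, ch)
    return inner, terminal
```

Here feed("with ") returns (0x8404, True) and feed("Object ") returns
(0x8600, True): the first is recognised as a keyword, the second only
as a completed six-letter identifier.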


(c)  Initial colouring

ZapInform colours one line of visible text at a time.  For instance, it
might be faced with this:

     Object -> bottle "~Heinz~ bottle"

And it outputs an array of colours, one for each character position in
the line, which the text editor can then use in actually displaying the
text.

It works out the state before the first character of the line (the "O"),
then scans through the line.  For each character, it determines the
initial colour as a function of the state at that character:

  If single-quote or double-quote is set, then quoted text colour.
  If comment is set, then comment colour.
  If statement is set:
     Use code colour
        unless the character is "[" or "]", in which case use
           function colour,
        or is a single or double quote, in which case use quoted text
           colour.
  If not:
     Use foreground colour
        unless the character is "," or ";" or "*" or ">", in which
           case use directive colour,
        or the character is "[" or "]", in which case use
           function colour,
        or is a single or double quote, in which case use quoted text
           colour.
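
These rules amount to a small pure function of the state and the
character.  The Python sketch below uses invented names, assumes the
flag values implied by numbering the chart in (a) from 1, takes "the
state at that character" to mean the state after scanning it, and
borrows the one-letter colour codes from the example in (e).

```python
COMMENT      = 1 << 16   # assumed values: chart bits numbered from 1
SINGLE_QUOTE = 1 << 17
DOUBLE_QUOTE = 1 << 18
STATEMENT    = 1 << 19


def initial_colour(state, ch):
    """Return the initial colour code for `ch`, given the state there.

    Codes: Q quoted text, C comment, S code, F foreground,
    D directive, f function.
    """
    if state & (SINGLE_QUOTE | DOUBLE_QUOTE):
        return "Q"
    if state & COMMENT:
        return "C"
    # The two remaining cases share their exceptions for quotes and brackets.
    if ch in "'\"":
        return "Q"       # a closing quote: the quote bit is already clear
    if ch in "[]":
        return "f"
    if state & STATEMENT:
        return "S"
    return "D" if ch in ",;*>" else "F"
```

Note that ";" is directive colour only outside a statement: inside one
it falls through to code colour.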
  
However, the scanning algorithm sometimes signals that a block of
text must be "backtracked" through and recoloured.  For instance,
this happens if the white space after the sequence "c", "l", "a",
"s" and "s" is detected when in a context where the keyword "class"
is legal.  The scanning algorithm does this by setting the "colour
backtrack" bit in the outer state.  Note that the number of characters
we need to recolour backwards from the current position has been
recorded in bits 9 to 16 of the inner state (which has been counting
up lengths of identifiers), while the scanning algorithm has also
recorded the colour to be used.  For instance, in

     Object -> bottle "~Heinz~ bottle"
           ^  ^      ^

backtracks of size 6, 2 and 6 are called for at the three marked
spaces.  Note that a backtrack never crosses a new-line.
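
A backtrack itself is only a slice assignment on the line's colour
array.  In this Python sketch the function names are hypothetical, the
length is passed in explicitly, and reading the identifier length out
of the inner state with the terminal bit 0x8000 masked off is an
assumption about how the count is recovered.

```python
def identifier_length(inner):
    """Length counted in the inner state, ignoring the terminal bit (assumed)."""
    return (inner & 0x7F00) >> 8


def apply_backtrack(colours, end, length, colour):
    """Recolour the `length` characters just before index `end` of one line."""
    start = max(0, end - length)   # never go back past the start of the line
    colours[start:end] = colour * (end - start)
    return colours


line = "Object -> bottle "
colours = list("F" * len(line))
apply_backtrack(colours, 6, identifier_length(0x8600), "D")  # space after "Object"
apply_backtrack(colours, 9, 2, "D")                          # space after "->"
print("".join(colours))
```

Because the array covers a single line, clamping at index 0 is enough
to keep a backtrack from crossing a new-line.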

ZapInform uses the following chart of colours:

    name                   default actual colour

    foreground             navy blue
    quoted text            grey
    comment                light green
    directive              black
    property               red
    function               red
    code                   navy blue
    codealpha              dark green
    assembly               gold
    escape character       red

but note that at this stage, we've only used the following:

    function colour        [ and ] as function brackets, plus function names
    comment colour         comments
    directive colour       initial directive keywords, plus "*",
                           "->", "with", "has" and "class" when used
                           in a directive context
    quoted text colour     singly- or doubly-quoted text
    foreground colour      code in directives
    code colour            code in statements
    property colour        property, attribute and class names when
                           used within "with", "has" and "class"

For instance,

     Object -> bottle "~Heinz~ bottle"

would give us the array

     DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQQ

(F being foreground colour; it doesn't really matter what colour
values the spaces have).


(d)  Colour refinement


The next operation is "colour refinement", which includes a number
of things.

Firstly, any characters with colour Q (quoted-text) which have special
meanings are given "escape-character colour" instead.  This applies
to "~", "^" and "\", and to "@" together with any second "@" and run
of digits which follows it.
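
One way to sketch this step is with a regular expression run over the
line, touching only characters already coloured Q.  This is Python, the
names are invented, and the pattern is one reading of the rule above:
an "@", an optional second "@", then any run of digits.

```python
import re

# "~", "^" or "\", or "@" with an optional second "@" and following digits.
ESCAPE = re.compile(r"[~^\\]|@@?[0-9]*")


def refine_escapes(text, colours):
    """Return `colours` with escape characters in quoted text marked E."""
    out = list(colours)
    for match in ESCAPE.finditer(text):
        for i in range(match.start(), match.end()):
            if out[i] == "Q":      # leave anything outside quoted text alone
                out[i] = "E"
    return "".join(out)


text = 'Object -> bottle "bottle marked ~DRINK ME~"'
initial = "D" * 10 + "F" * 7 + "Q" * 26
print(refine_escapes(text, initial))
```

Applied to the bottle line of the example in (e), only the two "~"
characters change colour.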

Next we look for identifiers.  An identifier for these purposes includes
a number, for it is just a sequence of:

     "_" or "$" or "#" or "0" to "9" or "a" to "z" or "A" to "Z".

The initial colouring of an identifier tells us its context.  We're
only interested in those in foreground colour (these must be used
in the body of a directive) or code colour (used in statements).

If an identifier is in code colour, then:

    If it follows an "@", recolour the "@" and the identifier in
       assembly-language colour.
    Otherwise, unless it is one of the following:

      "box"  "break"  "child"  "children"  "continue"  "default"
      "do"  "elder"  "eldest"  "else"  "false"  "font"  "for"  "give"
      "has"  "hasnt"  "if"  "in"  "indirect"  "inversion"  "jump"
      "metaclass"  "move"  "new_line"  "nothing"  "notin"  "objectloop"
      "ofclass"  "or"  "parent"  "print"  "print_ret"  "provides"  "quit"
      "random"  "read"  "remove"  "restore"  "return"  "rfalse"  "rtrue"
      "save"  "sibling"  "spaces"  "string"  "style"  "switch"  "to"
      "true"  "until"  "while"  "younger"  "youngest"

    we recolour the identifier to "codealpha colour".

On the other hand, if an identifier is in foreground colour, then we
check it to see if it's one of the following interesting keywords:

      "first"  "last"  "meta"  "only"  "private"  "replace"  "reverse"
      "string"  "table"

If it is, we recolour it in directive colour.
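
The identifier pass can be sketched in the same style.  This is Python
with invented names; the word lists are copied from the two lists above,
an identifier's context is taken from the colour of its first character,
and comparing keywords without regard to case is an assumption.

```python
import re

IDENTIFIER = re.compile(r"[_$#0-9A-Za-z]+")

STATEMENT_WORDS = set("""
    box break child children continue default do elder eldest else false
    font for give has hasnt if in indirect inversion jump metaclass move
    new_line nothing notin objectloop ofclass or parent print print_ret
    provides quit random read remove restore return rfalse rtrue save
    sibling spaces string style switch to true until while younger youngest
""".split())

DIRECTIVE_WORDS = set(
    "first last meta only private replace reverse string table".split())


def refine_identifiers(text, colours):
    """Recolour identifiers: I codealpha, A assembly, D directive."""
    out = list(colours)
    for match in IDENTIFIER.finditer(text):
        start, end = match.span()
        word = match.group().lower()   # assumption: case is not significant
        if out[start] == "S":                          # code colour
            if start > 0 and text[start - 1] == "@":
                out[start - 1:end] = "A" * (end - start + 1)
            elif word not in STATEMENT_WORDS:
                out[start:end] = "I" * (end - start)
        elif out[start] == "F" and word in DIRECTIVE_WORDS:
            out[start:end] = "D" * (end - start)       # foreground colour
    return "".join(out)


print(refine_identifiers("  @set_cursor 1 1;", "S" * 18))
print(refine_identifiers('Extend "examine" first', "D" * 7 + "Q" * 9 + "F" * 6))
```

Since digits are identifier characters here, the two 1s after
@set_cursor come out in codealpha colour along with variable names.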

Thus, after colour refinement we arrive at the final colour scheme:


    function colour        [ and ] as function brackets, plus function names
    comment colour         comments
    quoted text colour     singly- or doubly-quoted text
    directive colour       initial directive keywords, plus "*",
                              "->", "with", "has" and "class" when used
                              in a directive context, plus any of the
                              reserved directive keywords listed above
    property colour        property, attribute and class names when
                              used within "with", "has" and "class"
    foreground colour      everything else in directives
    code colour            operators, numerals, brackets and statement
                              keywords such as "if" or "else" occurring
                              inside routines
    codealpha colour       variable and constant names occurring inside
                              routines
    assembly colour        @ plus assembly language opcodes
    escape char colour     special or escape characters in quoted text


(e)  An example

Consider the following example stretch of code (which is not meant to
be functional or interesting, just colourful):

   ! Here's the bottle:
   
   Object -> bottle "bottle marked ~DRINK ME~"
     with name "bottle" "jar" "flask",
          initial "There is an empty bottle here.",
          before
          [; LetGo:                      ! For dealing with water
                if (noun in bottle)
                    "You're holding that already (in the bottle).";
          ],
     has  container;
   
   [ ReadableSpell i j k;
     if (scope_stage==1)
     {   if (action_to_be==##Examine) rfalse;
         rtrue;
     }
     @set_cursor 1 1;
   ];
   
   Extend "examine" first
                   * scope=ReadableSpell            -> Examine;

Here are the initial colourings:

   ! Here's the bottle:
   CCCCCCCCCCCCCCCCCCCC
   
   Object -> bottle "bottle marked ~DRINK ME~"
   DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQQQQQQQQQQQQ
     with name "bottle" "jar" "flask",
   FFDDDDDPPPPPQQQQQQQQFQQQQQFQQQQQQQD
          initial "There is an empty bottle here.",
   FFFFFFFPPPPPPPPQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQD
          before
   FFFFFFFPPPPPP
          [; LetGo:                      ! For dealing with water
   FFFFFFFfSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSCCCCCCCCCCCCCCCCCCCCCCCC
                if (noun in bottle)
   SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
                    "You're holding that already (in the bottle).";
   SSSSSSSSSSSSSSSSSQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQS
          ],
   SSSSSSSfD
     has  container;
   FFDDDDDPPPPPPPPPD
   
   [ ReadableSpell i j k;
   fffffffffffffffSSSSSSS
     if (scope_stage==1)
   SSSSSSSSSSSSSSSSSSSSS
     {   if (action_to_be==##Examine) rfalse;
   SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
         rtrue;
   SSSSSSSSSSSS
     }
   SSS
     @set_cursor 1 1;
   SSSSSSSSSSSSSSSSSS
   ];
   fD
   
   Extend "examine" first
   DDDDDDDQQQQQQQQQFFFFFF
                   * scope=ReadableSpell            -> Examine;
   FFFFFFFFFFFFFFFFDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDDDFFFFFFFD

(Here F=foreground, D=directive, f=function, S=code (S for
"statement"), C=comment, P=property, Q=quoted text.)  And here is
the refinement:

   ! Here's the bottle:
   CCCCCCCCCCCCCCCCCCCC
   
   Object -> bottle "bottle marked ~DRINK ME~"
   DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQEQQQQQQQQEQ
     with name "bottle" "jar" "flask",
   FFDDDDDPPPPPQQQQQQQQFQQQQQFQQQQQQQD
          initial "There is an empty bottle here.",
   FFFFFFFPPPPPPPPQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQD
          before
   FFFFFFFPPPPPP
          [; LetGo:                      ! For dealing with water
   FFFFFFFfSSIIIIISSSSSSSSSSSSSSSSSSSSSSSCCCCCCCCCCCCCCCCCCCCCCCC
                if (noun in bottle)
   SSSSSSSSSSSSSSSSSIIIISSSSIIIIIIS
                    "You're holding that already (in the bottle).";
   SSSSSSSSSSSSSSSSSQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQS
          ],
   SSSSSSSfD
     has  container;
   FFDDDDDPPPPPPPPPD
   
   [ ReadableSpell i j k;
   fffffffffffffffSSSSSSS
     if (scope_stage==1)
   SSSSSSIIIIIIIIIIISSIS
     {   if (action_to_be==##Examine) rfalse;
   SSSSSSSSSSIIIIIIIIIIIISSIIIIIIIIISSSSSSSSS
         rtrue;
   SSSSSSSSSSSS
     }
   SSS
     @set_cursor 1 1;
   SSAAAAAAAAAAASISIS
   ];
   fD
   
   Extend "examine" first
   DDDDDDDQQQQQQQQQFDDDDD
                   * scope=ReadableSpell            -> Examine;
   FFFFFFFFFFFFFFFFDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDDDFFFFFFFD

(where E = escape characters, A = assembly and I = "codealpha", that
is, identifiers cited in statement code).



-- 
Graham Nelson | graham@gnelson.demon.co.uk | Oxford, United Kingdom

------------------------------------------------------------------------
There are two changes I had to make to Graham's algorithm to get it to
work - I believe these may be bugs in the algorithm.  In addition, I
made a couple of other changes.

AFTER-MARKER PROBLEM:
=====================
About 20 lines into step 7 (statement bit not set):
>           If not:
>              Is the character "["?
>                 If so:
>                    Set the statement bit.
>                    If the after-marker bit is set, set after-restart.
>                    Stop.

I had to set after-restart if after-marker was NOT set instead.

WAIT-DIRECT PROBLEM:
====================
The wait-direct flag is never cleared in the algorithm.  I decided to
clear it after "->", "*", "with", "has", "class" and any keyword
terminal.

With these two changes, I seem to be getting results that closely match
the screenshots on Graham's website.

INNER STATE MACHINE CHANGES:
============================
I have changed the action of the inner state machine somewhat.  Initial
whitespace is handled in the outer loop; the RunInnerState() method
loops in many cases until it identifies a terminal.  There may be other
small changes I have introduced but failed to document; I will go back
and check at some point.

IDENTIFYING NUMBERS:
====================
Numeric constants such as 6789, $ABCD, or $$1011011 are identified in
the Colour Refinement phase and marked as "number colour" or "code
number colour", depending on the initial colour.  By default, these are
set the same as foreground and code alpha colour respectively, but in
bold.
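
A minimal sketch of the number test, in Python rather than the code I
actually use: the function names and the one-letter codes N and n are
made up for illustration, and only the three forms listed above
(decimal, "$" hexadecimal and "$$" binary) are recognised.

```python
import re

# Decimal, $hexadecimal or $$binary, matching the three examples above.
NUMBER = re.compile(r"\$\$[01]+|\$[0-9A-Fa-f]+|[0-9]+")


def is_number(token):
    """True if the whole token is a numeric constant."""
    return NUMBER.fullmatch(token) is not None


def number_colour(token, initial):
    """Pick a colour code for a token given its initial colour.

    "N" (number) and "n" (code number) are hypothetical codes; anything
    that is not a number keeps its initial colour.
    """
    if not is_number(token):
        return initial
    return {"F": "N", "S": "n"}.get(initial, initial)
```

So number_colour("$ABCD", "S") gives the code number colour, while an
ordinary identifier such as "bottle" is returned unchanged.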

John

------------------------------------------------------------------------

