
Adding Syntax Highlighting for a new Language

bobbylight edited this page Jan 10, 2022 · 7 revisions

Disclaimer: This document was accurate as of the most recent release, 2.0.7. It will be revised to reflect the state of things in the upcoming 2.5.0 release (master in the repository) shortly.


Overview

In RSyntaxTextArea, syntax highlighting for a language is done via an implementation of the TokenMaker interface. A TokenMaker handles syntax highlighting only; it does not handle code folding or any other feature. The interface has several methods to implement, but the most important one is:

public Token getTokenList(Segment text, int initialTokenType, int startOffset)

Syntax highlighting is done on a per-line basis. When the user edits code on a single line, that line is re-parsed by the current TokenMaker. This is done by calling the getTokenList() method. This method is responsible for parsing a line of text and returning the "tokens" for a programming language for that line. The TokenMaker is given the following parameters:

  • The text content of the line. For performance reasons, this may be the actual content of the backing Document, and should not be modified.
  • The "initial token type." This is the token type that the previous line ended with, and what the first token of this line should be, if the previous line ended with an unterminated token such as a multi-line comment. If the previous line did not end with an unterminated token, this value will be TokenTypes.NULL.
  • The starting offset of the line in the document.

What is returned is the first Token of the parsed line. Since each Token has a reference to the "next" Token on its line, this is effectively a linked list of all Tokens in the parsed line.
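
To make the linked-list contract concrete, here is a minimal stand-alone sketch. It deliberately does not use RSTA's classes: SimpleToken, TokenListDemo, and the WORD/SPACE type codes are hypothetical stand-ins invented for illustration, and the "tokenizer" only separates words from runs of spaces. What it shows is the shape of the result: one call per line, the first token handed back, and the caller following the "next" references.

```java
import java.util.ArrayList;
import java.util.List;

class TokenListDemo {
    // Invented type codes for this sketch only.
    static final int WORD = 1;
    static final int SPACE = 2;

    /** Splits one line into alternating word/space tokens and returns the first one. */
    static SimpleToken tokenize(String line, int startOffset) {
        SimpleToken first = null;
        SimpleToken last = null;
        int i = 0;
        while (i < line.length()) {
            boolean space = line.charAt(i) == ' ';
            int j = i;
            while (j < line.length() && (line.charAt(j) == ' ') == space) {
                j++;
            }
            SimpleToken t = new SimpleToken(space ? SPACE : WORD,
                    line.substring(i, j), startOffset + i);
            if (first == null) {
                first = t;
            } else {
                last.next = t;   // link the new token onto the end of the list
            }
            last = t;
            i = j;
        }
        return first;   // null for an empty line in this sketch
    }

    /** Walks the list from the first token, the way a caller would. */
    static List<String> lexemes(SimpleToken first) {
        List<String> out = new ArrayList<>();
        for (SimpleToken t = first; t != null; t = t.next) {
            out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lexemes(tokenize("int x", 100)));
    }
}

/** Hypothetical stand-in for a token; not RSTA's Token class. */
class SimpleToken {
    final int type;      // made-up type code
    final String text;   // the characters this token covers
    final int offset;    // offset of the token in the document
    SimpleToken next;    // next token on the same line, or null

    SimpleToken(int type, String text, int offset) {
        this.type = type;
        this.text = text;
        this.offset = offset;
    }
}
```

Note that the sketch returns null for an empty line purely for brevity; a real TokenMaker always returns a token, as described in the sections below.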

Hand-Made TokenMakers

The most straightforward approach to add syntax highlighting for a new language is to create a TokenMaker implementation by hand. To do this, you will probably want to start by subclassing AbstractTokenMaker, as it implements all of the "boilerplate" methods of the interface for you.

Most programming languages have the concept of "keywords"; many also have a standard library of functions or constants, or some other set of identifiers that should be highlighted as special tokens. These should be added to the protected wordsToHighlight field of your AbstractTokenMaker implementation. This field is a TokenMap, which is a mapping of strings to token types. This field is created in your getWordsToHighlight() method override. Below is an example of creating a token map that identifies a few keywords and functions of C:

@Override
public TokenMap getWordsToHighlight() {
   TokenMap tokenMap = new TokenMap();
  
   tokenMap.put("case",  Token.RESERVED_WORD);
   tokenMap.put("for",   Token.RESERVED_WORD);
   tokenMap.put("if",    Token.RESERVED_WORD);
   tokenMap.put("while", Token.RESERVED_WORD);
  
   tokenMap.put("printf", Token.FUNCTION);
   tokenMap.put("scanf",  Token.FUNCTION);
   tokenMap.put("fopen",  Token.FUNCTION);
   
   return tokenMap;
}

Next, you should add an override to the addToken() method, so that it actually uses your token map. This addToken() method is the last thing that gets called before a Token is actually added to the list of Tokens to return, so we use it as an opportunity to examine what was identified as a single "token" by the getTokenList() method (more about that below), and decide to highlight it differently if it's in wordsToHighlight. An example of a typical implementation of this method would be:

@Override
public void addToken(Segment segment, int start, int end, int tokenType, int startOffset) {
   // This assumes all keywords, etc. were parsed as "identifiers."
   if (tokenType==Token.IDENTIFIER) {
      int value = wordsToHighlight.get(segment, start, end);
      if (value != -1) {
         tokenType = value;
      }
   }
   super.addToken(segment, start, end, tokenType, startOffset);
}
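
The lookup-and-reclassify step can also be tried out without RSTA on the classpath. The sketch below is only an illustration of the idea: a plain HashMap stands in for TokenMap, the integer type codes are invented, and ReclassifyDemo and resolveType() are hypothetical names. It builds a String for each lookup to keep things short, which a real implementation may well avoid. The end index is inclusive, matching the addToken() example above.

```java
import java.util.HashMap;
import java.util.Map;
import javax.swing.text.Segment;

class ReclassifyDemo {
    // Invented type codes for this sketch; RSTA defines its own constants.
    static final int IDENTIFIER = 1;
    static final int RESERVED_WORD = 2;
    static final int FUNCTION = 3;

    // Stand-in for the wordsToHighlight token map.
    static final Map<String, Integer> WORDS = new HashMap<>();
    static {
        WORDS.put("if", RESERVED_WORD);
        WORDS.put("while", RESERVED_WORD);
        WORDS.put("printf", FUNCTION);
    }

    /**
     * Returns the type to use for the characters segment.array[start..end]
     * (end inclusive). Only identifiers are candidates for promotion.
     */
    static int resolveType(Segment segment, int start, int end, int tokenType) {
        if (tokenType != IDENTIFIER) {
            return tokenType;
        }
        String word = new String(segment.array, start, end - start + 1);
        return WORDS.getOrDefault(word, tokenType);
    }

    public static void main(String[] args) {
        char[] line = "while (x) printf".toCharArray();
        Segment seg = new Segment(line, 0, line.length);
        System.out.println(resolveType(seg, 0, 4, IDENTIFIER));    // "while"
        System.out.println(resolveType(seg, 7, 7, IDENTIFIER));    // "x"
        System.out.println(resolveType(seg, 10, 15, IDENTIFIER));  // "printf"
    }
}
```

The design point is that the scanner itself only needs to recognize "an identifier"; deciding whether that identifier is a keyword or a function happens in one place, at the moment the token is added.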

Now, all that's left is to actually parse the text! Recall that parsing is done on a per-line basis. You will need to implement the getTokenList() method to take a line of text from the Document, identify all of the Tokens that correspond to that line of text, and call one of the addToken() overloads for each. By far, this is the most time-consuming and error-prone part. It is best to start simple, for example, build something that identifies whitespace and numbers, and marks "everything else" as identifiers, and extend from there. Examples help as well; see UnixShellTokenMaker and WindowsBatchTokenMaker.

Below is an example of a very simple getTokenList() implementation. It colorizes numbers, strings, line comments, and anything identified in wordsToHighlight:

/**
 * Returns a list of tokens representing the given text.
 *
 * @param text The text to break into tokens.
 * @param startTokenType The token type with which to start tokenizing.
 * @param startOffset The offset at which the line of tokens begins.
 * @return A linked list of tokens representing <code>text</code>.
 */
public Token getTokenList(Segment text, int startTokenType, int startOffset) {

   resetTokenList();

   char[] array = text.array;
   int offset = text.offset;
   int count = text.count;
   int end = offset + count;

   // Token starting offsets are always of the form:
   // 'startOffset + (currentTokenStart-offset)', but since startOffset and
   // offset are constant, tokens' starting positions become:
   // 'newStartOffset+currentTokenStart'.
   int newStartOffset = startOffset - offset;

   currentTokenStart = offset;
   currentTokenType  = startTokenType;

   for (int i=offset; i<end; i++) {

      char c = array[i];

      switch (currentTokenType) {

         case Token.NULL:

            currentTokenStart = i;   // Starting a new token here.

            switch (c) {

               case ' ':
               case '\t':
                  currentTokenType = Token.WHITESPACE;
                  break;

               case '"':
                  currentTokenType = Token.LITERAL_STRING_DOUBLE_QUOTE;
                  break;

               case '#':
                  currentTokenType = Token.COMMENT_EOL;
                  break;

               default:
                  if (RSyntaxUtilities.isDigit(c)) {
                     currentTokenType = Token.LITERAL_NUMBER_DECIMAL_INT;
                     break;
                  }
                  else if (RSyntaxUtilities.isLetter(c) || c=='/' || c=='_') {
                     currentTokenType = Token.IDENTIFIER;
                     break;
                  }
                  
                  // Anything not currently handled - mark as an identifier
                  currentTokenType = Token.IDENTIFIER;
                  break;

            } // End of switch (c).

            break;

         case Token.WHITESPACE:

            switch (c) {

               case ' ':
               case '\t':
                  break;   // Still whitespace.

               case '"':
                  addToken(text, currentTokenStart,i-1, Token.WHITESPACE, newStartOffset+currentTokenStart);
                  currentTokenStart = i;
                  currentTokenType = Token.LITERAL_STRING_DOUBLE_QUOTE;
                  break;

               case '#':
                  addToken(text, currentTokenStart,i-1, Token.WHITESPACE, newStartOffset+currentTokenStart);
                  currentTokenStart = i;
                  currentTokenType = Token.COMMENT_EOL;
                  break;

               default:   // Add the whitespace token and start anew.

                  addToken(text, currentTokenStart,i-1, Token.WHITESPACE, newStartOffset+currentTokenStart);
                  currentTokenStart = i;

                  if (RSyntaxUtilities.isDigit(c)) {
                     currentTokenType = Token.LITERAL_NUMBER_DECIMAL_INT;
                     break;
                  }
                  else if (RSyntaxUtilities.isLetter(c) || c=='/' || c=='_') {
                     currentTokenType = Token.IDENTIFIER;
                     break;
                  }

                  // Anything not currently handled - mark as identifier
                  currentTokenType = Token.IDENTIFIER;

            } // End of switch (c).

            break;

         default: // Should never happen
         case Token.IDENTIFIER:

            switch (c) {

               case ' ':
               case '\t':
                  addToken(text, currentTokenStart,i-1, Token.IDENTIFIER, newStartOffset+currentTokenStart);
                  currentTokenStart = i;
                  currentTokenType = Token.WHITESPACE;
                  break;

               case '"':
                  addToken(text, currentTokenStart,i-1, Token.IDENTIFIER, newStartOffset+currentTokenStart);
                  currentTokenStart = i;
                  currentTokenType = Token.LITERAL_STRING_DOUBLE_QUOTE;
                  break;

               default:
                  if (RSyntaxUtilities.isLetterOrDigit(c) || c=='/' || c=='_') {
                     break;   // Still an identifier of some type.
                  }
                  // Otherwise, we're still an identifier (?).

            } // End of switch (c).

            break;

         case Token.LITERAL_NUMBER_DECIMAL_INT:

            switch (c) {

               case ' ':
               case '\t':
                  addToken(text, currentTokenStart,i-1, Token.LITERAL_NUMBER_DECIMAL_INT, newStartOffset+currentTokenStart);
                  currentTokenStart = i;
                  currentTokenType = Token.WHITESPACE;
                  break;

               case '"':
                  addToken(text, currentTokenStart,i-1, Token.LITERAL_NUMBER_DECIMAL_INT, newStartOffset+currentTokenStart);
                  currentTokenStart = i;
                  currentTokenType = Token.LITERAL_STRING_DOUBLE_QUOTE;
                  break;

               default:

                  if (RSyntaxUtilities.isDigit(c)) {
                     break;   // Still a literal number.
                  }

                  // Otherwise, remember this was a number and start over.
                  addToken(text, currentTokenStart,i-1, Token.LITERAL_NUMBER_DECIMAL_INT, newStartOffset+currentTokenStart);
                  i--;
                  currentTokenType = Token.NULL;

            } // End of switch (c).

            break;

         case Token.COMMENT_EOL:
            i = end - 1;
            addToken(text, currentTokenStart,i, currentTokenType, newStartOffset+currentTokenStart);
            // We need to set token type to null so at the bottom we don't add one more token.
            currentTokenType = Token.NULL;
            break;

         case Token.LITERAL_STRING_DOUBLE_QUOTE:
            if (c=='"') {
               addToken(text, currentTokenStart,i, Token.LITERAL_STRING_DOUBLE_QUOTE, newStartOffset+currentTokenStart);
               currentTokenType = Token.NULL;
            }
            break;

      } // End of switch (currentTokenType).

   } // End of for (int i=offset; i<end; i++).

   switch (currentTokenType) {

      // Remember what token type to begin the next line with.
      case Token.LITERAL_STRING_DOUBLE_QUOTE:
         addToken(text, currentTokenStart,end-1, currentTokenType, newStartOffset+currentTokenStart);
         break;

      // The line ended cleanly, so only the null marker token is needed.
      case Token.NULL:
         addNullToken();
         break;

      // All other token types don't continue to the next line...
      default:
         addToken(text, currentTokenStart,end-1, currentTokenType, newStartOffset+currentTokenStart);
         addNullToken();

   }

   // Return the first token in our linked list.
   return firstToken;

}
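
The offset arithmetic near the top of getTokenList() is easy to get wrong, so here is a small stand-alone sketch of just that calculation. OffsetDemo and toDocumentOffset() are hypothetical names, and the ten filler characters in front of the line are an assumption made purely to show that an index into the Segment's array and an offset into the document need not be the same number.

```java
import javax.swing.text.Segment;

class OffsetDemo {
    /**
     * Converts an index into text.array into a document offset, using the
     * same 'startOffset - offset' shift as the getTokenList() example.
     */
    static int toDocumentOffset(Segment text, int startOffset, int arrayIndex) {
        int newStartOffset = startOffset - text.offset;
        return newStartOffset + arrayIndex;
    }

    public static void main(String[] args) {
        // Ten filler characters, then the line "second 42".
        char[] backing = "..........second 42".toCharArray();
        Segment line = new Segment(backing, 10, 9);

        // Assume this line begins at document offset 6.
        int startOffset = 6;

        // "42" starts at array index 17, which is 7 characters into the line.
        System.out.println(toDocumentOffset(line, startOffset, 17));   // prints 13
    }
}
```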

One important thing to note is the final switch statement, which adds an "ending" token to the token list. RSTA uses the last token on the "previous" line to determine what the starting state will be when parsing the "current" line. The previous line's state ends up as the value of the startTokenType argument. It is important to note that, if the current line does not end with an incomplete, multi-line token (such as an unterminated multi-line comment, or an unterminated string in languages that support multi-line strings), you should always add a "null" token to the end of the token list by calling addNullToken(). This simply adds an empty, marker token that tells RSTA to start in the TokenTypes.NULL state when parsing the next line.
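
The way state is handed from one line to the next can be sketched in isolation. The code below is a simplified illustration, not RSTA code: LineStateDemo, endState(), and the two state constants are hypothetical, and it only tracks C-style /* ... */ comments (it ignores strings and end-of-line comments). The value returned for one line plays the role of startTokenType for the following line.

```java
class LineStateDemo {
    static final int NULL_STATE = 0;         // invented: nothing pending
    static final int IN_BLOCK_COMMENT = 1;   // invented: inside /* ... */

    /** Returns the state a line ends in, given the state it starts in. */
    static int endState(String line, int startState) {
        int state = startState;
        int i = 0;
        while (i < line.length()) {
            if (state == IN_BLOCK_COMMENT) {
                int close = line.indexOf("*/", i);
                if (close < 0) {
                    return IN_BLOCK_COMMENT;   // still unterminated at end of line
                }
                state = NULL_STATE;
                i = close + 2;
            } else {
                int open = line.indexOf("/*", i);
                if (open < 0) {
                    return NULL_STATE;         // line ends cleanly
                }
                state = IN_BLOCK_COMMENT;
                i = open + 2;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        String[] lines = { "int a; /* start", "still comment", "end */ int b;" };
        int state = NULL_STATE;
        for (String line : lines) {
            int start = state;
            state = endState(line, start);   // becomes the next line's start state
            System.out.println(start + " -> " + state + " : " + line);
        }
    }
}
```

In TokenMaker terms, returning NULL_STATE here corresponds to calling addNullToken() at the end of the line, and returning IN_BLOCK_COMMENT corresponds to leaving the unterminated token as the last one in the list.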

JFlex-Based TokenMakers

As you can see, hand-made TokenMakers are relatively straightforward, but they get very long, and very complex, very fast. With complex languages, it is easy to introduce very subtle bugs, and maintenance becomes tough. A solution that alleviates both of these issues is to use JFlex to generate a TokenMaker. This is how almost all of RSTA's built-in TokenMakers were created. An understanding of JFlex's lexical specification is required to create TokenMakers with JFlex.

RSTA provides two base classes, AbstractJFlexTokenMaker and AbstractJFlexCTokenMaker, that your lexical specification can extend. The latter is for languages that have syntax and structure similar to C and Java; it adds auto-indentation logic when curly braces and parens are typed, and in the future it may provide more features specific to such languages. The former is for all other languages.

Whereas your typical JFlex lexer is designed to return a single token with each call to yylex(), JFlex lexers for use in RSTA must be designed to return a linked list of Tokens (recall that RSTA scans for tokens one line at a time). TODO: Expand this section.

This is best demonstrated by an example. Either take a look at one of the .flex files for built-in languages, or check out the simple example below. It colorizes numbers, strings, end-of-line comments and multi-line comments, and a few keywords (also available here). Note that this example only works with jflex-1.4.1 and under; it won't work on more recent versions because we are hacking the generated lexer code.

import java.io.*;   
import javax.swing.text.Segment;   
   
import org.fife.ui.rsyntaxtextarea.*;   
   
%%   
   
%public   
%class ExampleTokenMaker   
%extends AbstractJFlexCTokenMaker   
%unicode   
%type org.fife.ui.rsyntaxtextarea.Token   
   
/**   
 * A simple TokenMaker example.   
 */   
%{   
   
   /**   
    * Constructor.  This must be here because JFlex does not generate a   
    * no-parameter constructor.   
    */   
   public ExampleTokenMaker() {   
   }   
   
   /**   
    * Adds the token specified to the current linked list of tokens.   
    *   
    * @param tokenType The token's type.   
    * @see #addToken(int, int, int)   
    */   
   private void addHyperlinkToken(int start, int end, int tokenType) {   
      int so = start + offsetShift;   
      addToken(zzBuffer, start,end, tokenType, so, true);   
   }   
   
   /**   
    * Adds the token specified to the current linked list of tokens.   
    *   
    * @param tokenType The token's type.   
    */   
   private void addToken(int tokenType) {   
      addToken(zzStartRead, zzMarkedPos-1, tokenType);   
   }   
   
   /**   
    * Adds the token specified to the current linked list of tokens.   
    *   
    * @param tokenType The token's type.   
    * @see #addHyperlinkToken(int, int, int)   
    */   
   private void addToken(int start, int end, int tokenType) {   
      int so = start + offsetShift;   
      addToken(zzBuffer, start,end, tokenType, so, false);   
   }   
   
   /**   
    * Adds the token specified to the current linked list of tokens.   
    *   
    * @param array The character array.   
    * @param start The starting offset in the array.   
    * @param end The ending offset in the array.   
    * @param tokenType The token's type.   
    * @param startOffset The offset in the document at which this token   
    *        occurs.   
    * @param hyperlink Whether this token is a hyperlink.   
    */   
   public void addToken(char[] array, int start, int end, int tokenType,   
                  int startOffset, boolean hyperlink) {   
      super.addToken(array, start,end, tokenType, startOffset, hyperlink);   
      zzStartRead = zzMarkedPos;   
   }   
   
   /**   
    * Returns the text to place at the beginning and end of a   
    * line to "comment" it in this programming language.   
    *   
    * @return The start and end strings to add to a line to "comment"   
    *         it out.   
    */   
   public String[] getLineCommentStartAndEnd() {   
      return new String[] { "//", null };   
   }   
   
   /**   
    * Returns the first token in the linked list of tokens generated   
    * from <code>text</code>.  This method must be implemented by   
    * subclasses so they can correctly implement syntax highlighting.   
    *   
    * @param text The text from which to get tokens.   
    * @param initialTokenType The token type we should start with.   
    * @param startOffset The offset into the document at which   
    *        <code>text</code> starts.   
    * @return The first <code>Token</code> in a linked list representing   
    *         the syntax highlighted text.   
    */   
   public Token getTokenList(Segment text, int initialTokenType, int startOffset) {   
   
      resetTokenList();   
      this.offsetShift = -text.offset + startOffset;   
   
      // Start off in the proper state.   
      int state = Token.NULL;   
      switch (initialTokenType) {   
         case Token.COMMENT_MULTILINE:   
            state = MLC;   
            start = text.offset;   
            break;   
   
         /* No documentation comments */   
         default:   
            state = Token.NULL;   
      }   
   
      s = text;   
      try {   
         yyreset(zzReader);   
         yybegin(state);   
         return yylex();   
      } catch (IOException ioe) {   
         ioe.printStackTrace();   
         return new TokenImpl();   
      }   
   
   }   
   
   /**   
    * Refills the input buffer.   
    *   
    * @return      <code>true</code> if EOF was reached, otherwise   
    *              <code>false</code>.   
    */   
   private boolean zzRefill() {   
      return zzCurrentPos>=s.offset+s.count;   
   }   
   
   /**   
    * Resets the scanner to read from a new input stream.   
    * Does not close the old reader.   
    *   
    * All internal variables are reset, the old input stream    
    * <b>cannot</b> be reused (internal buffer is discarded and lost).   
    * Lexical state is set to <tt>YY_INITIAL</tt>.   
    *   
    * @param reader   the new input stream    
    */   
   public final void yyreset(Reader reader) {   
      // 's' has been updated.   
      zzBuffer = s.array;   
      /*   
       * We replaced the line below with the two below it because zzRefill   
       * no longer "refills" the buffer (since the way we do it, it's always   
       * "full" the first time through, since it points to the segment's   
       * array).  So, we assign zzEndRead here.   
       */   
      //zzStartRead = zzEndRead = s.offset;   
      zzStartRead = s.offset;   
      zzEndRead = zzStartRead + s.count - 1;   
      zzCurrentPos = zzMarkedPos = zzPushbackPos = s.offset;   
      zzLexicalState = YYINITIAL;   
      zzReader = reader;   
      zzAtBOL  = true;   
      zzAtEOF  = false;   
   }   
   
%}   
   
Letter                     = [A-Za-z]   
Digit                     = ([0-9])   
AnyCharacterButApostropheOrBackSlash   = ([^\\'])   
AnyCharacterButDoubleQuoteOrBackSlash   = ([^\\\"\n])   
NonSeparator                  = ([^\t\f\r\n\ \(\)\{\}\[\]\;\,\.\=\>\<\!\~\?\:\+\-\*\/\&\|\^\%\"\']|"#"|"\\")   
IdentifierStart               = ({Letter}|"_")   
IdentifierPart                  = ({IdentifierStart}|{Digit})   
WhiteSpace            = ([ \t\f]+)   
   
CharLiteral               = ([\']({AnyCharacterButApostropheOrBackSlash})[\'])   
UnclosedCharLiteral         = ([\'][^\'\n]*)   
ErrorCharLiteral         = ({UnclosedCharLiteral}[\'])   
StringLiteral            = ([\"]({AnyCharacterButDoubleQuoteOrBackSlash})*[\"])   
UnclosedStringLiteral      = ([\"]([\\].|[^\\\"])*[^\"]?)   
ErrorStringLiteral         = ({UnclosedStringLiteral}[\"])   
   
MLCBegin               = "/*"   
MLCEnd               = "*/"   
LineCommentBegin         = "//"   
   
IntegerLiteral         = ({Digit}+)   
ErrorNumberFormat         = (({IntegerLiteral}){NonSeparator}+)   
   
Separator               = ([\(\)\{\}\[\]])   
Separator2            = ([\;,.])   
   
Identifier            = ({IdentifierStart}{IdentifierPart}*)   
   
%state MLC   
   
%%   
   
<YYINITIAL> {   
   
   /* Keywords */   
   "do" |   
   "for" |   
   "if" |   
   "while"      { addToken(Token.RESERVED_WORD); }   
   
   /* Data types */   
   "byte" |   
   "char" |   
   "double" |   
   "float" |   
   "int"      { addToken(Token.DATA_TYPE); }   
   
   /* Functions */   
   "fopen" |   
   "fread" |   
   "printf" |   
   "scanf"      { addToken(Token.FUNCTION); }   
   
   {Identifier}            { addToken(Token.IDENTIFIER); }   
   
   {WhiteSpace}            { addToken(Token.WHITESPACE); }   
   
   /* String/Character literals. */   
   {CharLiteral}            { addToken(Token.LITERAL_CHAR); }   
   {UnclosedCharLiteral}      { addToken(Token.ERROR_CHAR); addNullToken(); return firstToken; }   
   {ErrorCharLiteral}         { addToken(Token.ERROR_CHAR); }   
   {StringLiteral}            { addToken(Token.LITERAL_STRING_DOUBLE_QUOTE); }   
   {UnclosedStringLiteral}      { addToken(Token.ERROR_STRING_DOUBLE); addNullToken(); return firstToken; }   
   {ErrorStringLiteral}      { addToken(Token.ERROR_STRING_DOUBLE); }   
   
   /* Comment literals. */   
   {MLCBegin}               { start = zzMarkedPos-2; yybegin(MLC); }   
   {LineCommentBegin}.*      { addToken(Token.COMMENT_EOL); addNullToken(); return firstToken; }   
   
   /* Separators. */   
   {Separator}               { addToken(Token.SEPARATOR); }   
   {Separator2}            { addToken(Token.IDENTIFIER); }   
   
   /* Operators. */   
   "!" | "%" | "%=" | "&" | "&&" | "*" | "*=" | "+" | "++" | "+=" | "," | "-" | "--" | "-=" |   
   "/" | "/=" | ":" | "<" | "<<" | "<<=" | "=" | "==" | ">" | ">>" | ">>=" | "?" | "^" | "|" |   
   "||" | "~"      { addToken(Token.OPERATOR); }   
   
   /* Numbers */   
   {IntegerLiteral}         { addToken(Token.LITERAL_NUMBER_DECIMAL_INT); }   
   {ErrorNumberFormat}         { addToken(Token.ERROR_NUMBER_FORMAT); }   
   
   /* Ended with a line not in a string or comment. */   
   \n |   
   <<EOF>>                  { addNullToken(); return firstToken; }   
   
   /* Catch any other (unhandled) characters. */   
   .                     { addToken(Token.IDENTIFIER); }   
   
}   
   
<MLC> {   
   [^\n*]+            {}   
   {MLCEnd}         { yybegin(YYINITIAL); addToken(start,zzStartRead+2-1, Token.COMMENT_MULTILINE); }   
   "*"               {}   
   \n |   
   <<EOF>>            { addToken(start,zzStartRead-1, Token.COMMENT_MULTILINE); return firstToken; }   
}   

Once you run this file through JFlex, you'll have to make manual changes to it. Once I stop being lazy, there will be a JFlex skeleton file that handles these changes for you, so the generated .java file will work out-of-the-box, but for now you have to do the following things manually:

  • There are two zzRefill() and yyreset() methods with the same signatures in the generated file. You need to delete the second of each definition (the ones generated by the lexer).
  • Change the declaration/definition of zzBuffer to NOT be initialized. This is a needless memory allocation for us since we will be pointing the array somewhere else anyway.

Using TokenMakerMaker to generate a TokenMaker

What if you want to add syntax highlighting for a new language, but don't want the hassle of hand-writing a scanner, or learning JFlex? TokenMakerMaker is a program that allows you to generate TokenMakers from a graphical user interface. You enter what constitutes keywords, comments, etc., and it generates the .flex file that you can run through JFlex to create the TokenMaker .java source file. It also generates the .java file for you automatically, along with the .class file if it is configured to point to a JDK.

TokenMakerMaker is good at creating TokenMakers for languages with C- or Java-like syntax. For languages with markup-style syntax, you'll have to create your TokenMaker via one of the first two techniques above.

Currently, TokenMakerMaker is not available as a ready-to-use download. You have to grab it by cloning it from GitHub. You'll also need to grab the RSyntaxTextArea project, as TMM is dependent on it.

Once you've grabbed TokenMakerMaker, open its readme.txt file. It's short and sweet but contains good information about the application and how to build/run it (which is really just a simple Ant command). When you start TMM, the first thing to do is open the Options dialog via "File" -> "Options...". In the "General" section, you can set the location of a JDK for TMM to use. If TMM was started with a JDK (as opposed to a JRE), it will be pre-populated with that JDK. This is the JDK that the application will use to generate the TokenMaker class file you're working on for testing purposes. If a JDK is not specified, TMM will generate the Java source for your TokenMaker class, but not the actual class file for it, and you will not be able to test it out in the application.

Using TokenMakerMaker couldn't be easier. You specify what constitutes comments (EOL comments, multi-line, documentation), keywords, data types, functions, etc. Clicking the "Generate..." button will then create:

  • A .flex file for your TokenMaker, that can be run through JFlex.
  • A .java file for your TokenMaker, the end result of running the .flex file above through JFlex.

Unlike going the manual route, this Java file does not require any manual edits to compile; this is taken care of for you by the tool.

If a JDK was configured in the Options dialog, TMM also creates a compiled .class file for your TokenMaker. In that case, a popup window will appear as well, with an RSyntaxTextArea demoing your newly-created TokenMaker class. When you are satisfied with it, you can simply copy the .flex and/or .java files into your project (if you did not configure TMM to generate them there directly).

Plugging your TokenMaker into an application

Once your TokenMaker class is included in your Java project, all you have to do to use it is register it with the current TokenMakerFactory, then tell your RSyntaxTextArea instance to use it. A TokenMakerFactory is a factory that maps syntax style identifiers (Strings, typically of the form "text/foobar") to TokenMakers for those styles. Here is a simple example:

AbstractTokenMakerFactory atmf = (AbstractTokenMakerFactory)TokenMakerFactory.getDefaultInstance();
atmf.putMapping("text/myLanguage", "fully.qualified.classNameOfYourSyntaxStyle");
textArea.setSyntaxEditingStyle("text/myLanguage");
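
If the idea of a factory keyed by style strings is unfamiliar, the following stand-alone sketch shows the general pattern. It is not RSTA's implementation: StyleRegistryDemo, Highlighter, and PlainHighlighter are hypothetical names, and instantiating the class reflectively on demand is an assumption made for the sake of the example. It only illustrates why the mapping takes a class name as a String rather than an instance.

```java
import java.util.HashMap;
import java.util.Map;

class StyleRegistryDemo {
    /** Hypothetical stand-in for a TokenMaker; unrelated to RSTA's interface. */
    interface Highlighter {
        String name();
    }

    /** A trivial implementation with a public no-argument constructor. */
    public static class PlainHighlighter implements Highlighter {
        public String name() {
            return "plain";
        }
    }

    // Style key (e.g. "text/myLanguage") -> fully qualified class name.
    private final Map<String, String> classNamesByStyle = new HashMap<>();

    void putMapping(String styleKey, String className) {
        classNamesByStyle.put(styleKey, className);
    }

    /** Creates a new instance for the style key, or returns null if it is unknown. */
    Highlighter create(String styleKey) {
        String className = classNamesByStyle.get(styleKey);
        if (className == null) {
            return null;
        }
        try {
            return (Highlighter) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        StyleRegistryDemo registry = new StyleRegistryDemo();
        registry.putMapping("text/plainDemo", PlainHighlighter.class.getName());
        System.out.println(registry.create("text/plainDemo").name());
    }
}
```

One consequence of this pattern is that a mistyped class name is only discovered when the style is first used, so it is worth checking the string passed to putMapping() carefully.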