Parse MATLAB classification tree and Generate corresponding JSON

MATLAB Classification Learner is a powerful tool for building and testing classification model. However, the generated classification tree is not very user-friendly to use outside of MATLAB.

Tip: Don’t care about details? Here is the final tool: MATLAB Classification Tree Parser.

Update: After I finished everything (the tool and this post), I somehow found that recently a MathWorks staff released a helper in File Exchange to export MATLAB models to PMML. What can I say? ¯\_(ツ)_/¯.

Background

An example output1 (i.e. view(ctree)) of MATLAB classification tree looks like:

Decision tree for classification
1  if x3<2.45 then node 2 elseif x3>=2.45 then node 3 else setosa
2  class = setosa
3  if x4<1.75 then node 4 elseif x4>=1.75 then node 5 else versicolor
4  if x3<4.95 then node 6 elseif x3>=4.95 then node 7 else versicolor
5  class = virginica
6  if x4<1.65 then node 8 elseif x4>=1.65 then node 9 else versicolor
7  class = virginica
8  class = versicolor
9  class = virginica

This is quite difficult to use in other environments. Say if I want to make a classifier in C++, I need to convert the above tree output into C++ code by hand, which is a very tedious work and prone to error especially when the tree is dense.

Though MATLAB allows you to visually inspect tree in GUI via view(ctree, 'mode', 'graph'), it is still annoying to convert it manually.

Also in one of my projects, I’ve implemented a decision tree classifier in C++ and it allows me to dynamically load decision trees from JSON files. Since these trees have to be updated periodically, I have to manually write the JSON frequently for more than a year. It would definitely be nice if I can make a tool to automatically generate those JSON files.

In the beginning, I looked into MATLAB toolbox’s source code but found nothing useful about the view method in either class ClassificationTree or CompactClassificationTree. Thus I need to “decipher” the output by myself.

Though it is possible to parse the tree output with regular expressions, I’d prefer to solve it in a similar way of building compilers. Because we’re dealing with a tree, and a parser in compiler also build a tree (AST, or Abstract Syntax Tree).

Define the Grammar

I wrote the grammar in ANTLR 4. ANTLR can also generates lexer and parser for my later building the tool.

Here is the context-free grammar for MATLAB classification tree:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
// MatlabCTree.g4
grammar MatlabCTree;

start           : 'Decision tree for classification'? stmtList;

stmtList        : stmt | stmtList stmt;
stmt            : INTEGER (ifStmt | classExpr);

ifStmt          : ifClause elseifClause? elseClause?;
ifClause        : 'if' condExpr 'then' nodeExpr;
elseifClause    : 'elseif' condExpr 'then' nodeExpr;
elseClause      : 'else' nodeExpr;

labelExpr       : INTEGER | IDENTIFIER;
classExpr       : 'class' '=' labelExpr;
nodeExpr        : 'node' INTEGER | labelExpr;

constExpr       : INTEGER | RATIONAL;
condExpr        : IDENTIFIER '<'  constExpr
                | IDENTIFIER '>'  constExpr
                | IDENTIFIER '<=' constExpr
                | IDENTIFIER '>=' constExpr
                | IDENTIFIER '==' constExpr
                | IDENTIFIER '~=' constExpr;

INTEGER         : '0' | [1-9] [0-9]*;
IDENTIFIER      : [a-zA-Z] [a-zA-Z0-9_]*;

// Adapted from https://git.io/JU4M9
RATIONAL        : NUMBER (E SIGN? NUMBER)?;
fragment NUMBER : INTEGER? '.' [0-9]+;
fragment E      : 'E' | 'e';
fragment SIGN   : '+' | '-';

WHITESPACE      : [ \t\r\n]+ -> skip;

Some key points in the above grammar:

  • Line 3: The prefix Decision tree for classification is optional.
  • Line 6: The following lines are either a branch or a leaf starting with a node number.
  • Line 8: At least from MATLAB’s example1 and my own experience, the output of tree branch has a fixed pattern of
    if feature<a then node x elseif feature>=a then node y else some_label
    

    However I’m not sure if the elseif clause and else clause would be present in all scenarios. So I decided to leave them as optional.

  • Line 13: Depending on how Learner is configured, the label could be either a string (MATLAB’s example) or an integer (my case).
  • Line 18-23: Only 6 operators are supported: <, >, <=, >=, == and ~=.
  • Line 25: Preceding 0s are not allowed in integers.
  • Line 26: This follows the variable naming convention in MATLAB.

Now we have the grammar, we can test MATLAB’s example classification with ANTLR’s built-in GUI tester (click the figure to enlarge):

ANTLR parsing tree for MATLAB's example classification tree

The above parsing tree looks like just what I want! It’s time to build the tool.

Build the Tool

Initially I was thinking to build a tool in Java or C#, but later I realized why not make a web-based tool as ANTLR supports outputting lexers and parsers in JavaScript. A online tool also makes everyone’s life easier: no need to worry in setting up environments and doing (potentially annoying) compiling job.

Also this tool will purely run in the browser, that is 100% of the code running on user’s device, there is no need to set up a server nor any interaction with remote server at all.

If you’re not familiar with using ANTLR on JavaScript target, here is the official documentation.

I used the below command to generate lexer and parser. Note that I’ll use visitor instead of listener in the final product.

java -jar antlr-4.8-complete.jar -Dlanguage=JavaScript -visitor -no-listener MatlabCTree.g4

Note: I’ll only cover the main points in the following sections. If you want to dive into details, source code is available in section Source code.

Traverse AST (Abstract Syntax Tree)

This part is pretty straightforward, just traverse the generated AST and store all classification tree nodes into an array.

Since it’s possible that node numbers are not in a increasing order, for example:

Decision tree for classification
1  if x<0 then node 2 elseif x>=0 then node 3 else foo
3  class = foo
2  class = bar

Note that the above order is 1, 3, 2 while we expect 1, 2, 3.

In order to detect this, we need to setup a counter, increase the counter every time we visit stmt and check if the node number matches the counter:

visitStmt(ctx) {
  // stmt: INTEGER (ifStmt | classExpr);
  let nodeNumber = parseInt(ctx.children[0].getText());

  // Check if the counter matches node number in stmt
  if (++this.nodeNumber !== nodeNumber) {
    let symbol = ctx.children[0].symbol;
    this.errors.push({
      line: symbol.line,
      message: `mismatched node number ${nodeNumber} ` +
        `expecting ${this.nodeNumber}`
    });
  }

  let result = this.visitChildren(ctx);
  this.jsonNodes.push(result[1]);

  // Actually it doesn't matter whether result is returned
  return result;
}

Because the generated tree from fitctree is a binary tree2, we just need to check if the conditions in if and elseif clauses are complement. else clause can be safely ignored.

Here are the examples that conditions are not complement:

# Feature name doesn't match (x and y)
1  if x<0.5 then node 2 elseif y>=0.5 then node 3 else foo

# Comparison value doesn't match (0.5 and -0.5)
1  if x<0.5 then node 2 elseif x>=-0.5 then node 3 else foo

# Operators are not complement (< and >, which supposed to be < and >=)
1  if x<0.5 then node 2 elseif x>0.5 then node 3 else foo

To handle this:

visitIfStmt(ctx) {
  // ifStmt: ifClause elseifClause? elseClause?;
  let result = this.visitChildren(ctx);

  let featureName = result[0].condition.identifer;
  let operator = result[0].condition.operator;
  let compareValue = result[0].condition.value;
  let trueBranch = result[0].branch;
  let falseBranch = null;

  // 2 or more clauses?
  if (result.length > 1) {
    // elseifClause
    if (result[1].hasOwnProperty("condition")) {
      let elseifFN = result[1].condition.identifer;
      let elseifOP = result[1].condition.operator;
      let elseifCV = result[1].condition.value;

      // Does conditions match?
      if (elseifFN !== featureName ||
        elseifOP !== this.complementOperators[operator] ||
        elseifCV !== compareValue) {

        let elseIfExpr = ctx.children[1].children[1];
        this.errors.push({
          line: elseIfExpr.start.line,
          message: `mismatched elseif clause '${elseIfExpr.getText()}' ` +
            `expecting '${featureName}` + 
            `${this.complementOperators[operator]}` + 
            `${compareValue}'`
        });
      } else {
        falseBranch = result[1].branch;
      }
    } else {
      // elseClause
      falseBranch = result[1];
    }
  }

  // We ignored the elseClause here if there are 3 clauses present.
  // This should be safe since the generated classification tree from
  // MATLAB's fitctree function is supposed to be a binary tree.

  return { featureName, operator, compareValue, trueBranch, falseBranch };
}

Generate JSON

The output JSON follows the below format:

{
    "featureName": "{feature name}",
    "operator": "{operator for comparison}",
    "compareValue": {the value to be compared (in double)},
    "trueBranch": {true branch, if `fn op val` holds},
    "falseBranch": {false branch, otherwise}
}

trueBranch and falseBranch are either a string representing the predicted label, or an above object indicating a child tree node.

My classifier actually accepts a compact version of it. The key names are simplified into fn, op, val, tb, and fb for saving spaces.

Since we have already gathered all the nodes in the classification tree via AST visitor, we just need to perform a simple DFS (Depth-First Search):

dfs(node, index) {
  this.visitedNodes.push(index);

  return { 
    featureName: node.featureName,
    operator:  node.operator,
    compareValue: node.compareValue,
    trueBranch: this.visitBranch(node.trueBranch),
    falseBranch: this.visitBranch(node.falseBranch)
  };
}

Even though it is guaranteed to be a tree, it doesn’t prevent us to add support for forest:

// Perform DFS on all nodes so we can also handle forests
this.jsonNodes.forEach((node, index) => {
  // Only take care not yet visited ifStmt nodes
  if (typeof node !== "string" && !this.visitedNodes.includes(index)) {
    jsonOutput.push(this.dfs(node, index));
  }
});

We also need to handle a boundary condition where the tree consists of a label node only, for example:

Decision tree for classification
1  class = setosa

To handle this:

this.jsonNodes.forEach((node, index) => {
  // Boundary condition, the tree only has a label.
  if (typeof node === "string" && !this.visitedNodes.includes(index)) {
    jsonOutput.push(node);
  }
});

One more thing need to take care is branch target’s node number in if and elseif, since it might exceed the number of nodes. For example:

Decision tree for classification
1  if x3<2.45 then node 2 elseif x3>=2.45 then node 3 else setosa

Apparently, node 2 and 3 are missing in the input. So we need to detect this and report error:

if (branch.target > this.jsonNodes.length) {
  this.errors.push({
    line: branch.line,
    message: `invalid node index ${branch.target} ` +
      `expecting no greater than ${this.jsonNodes.length}`
  });
  return null;
}

Wrapper class

Now we can make a wrapper class MatlabCTreeConverter to call lexer and parser, and invoke AST visitor and JSON generator.

Here is the way to handle syntax errors, syntaxErrors is for later error reporting:

let syntaxErrors = [];

// Adapted from https://stackoverflow.com/a/60841205
parser.removeErrorListeners();
parser.addErrorListener({
  syntaxError: (recognizer, offendingSymbol, line, column, msg, err) => {
    syntaxErrors.push({ line: line, message: msg });
  }
});

Since our starting rule happened to be start, we can build AST by:

let ast = parser.start();

webpack everything

Setting up webpack is quite annoying and very painful. Fortunately, I found a working configuration file for webpack3. Hat tip to @nicolasdao!

Note: Since anyway the core code will be publicly accessible, uglifyjs-webpack-plugin is not used.

Once you got everything in that gist done, remember to add the following entry to webpack.config.js’s module.exports:

node: { module: 'empty' net: 'empty', fs: 'empty' }

This is VERY important (noted in ANTLR’s official documentation), otherwise ANTLR will try to import fs module which is not available in browsers.

User Interface

Big bruh moment. Needless to say how tedious is UI design and implementation. Anyway here is how it looks like:

Overview of MATLAB Classification Tree Parser

If any error is detected in the input, it will report which line goes wrong:

Error tooltip

In case you’re curious, I’m using CodeMirror for the input and output editors, and Viz.js for drawing the classification tree.

Finally the link to the tool: MATLAB Classification Tree Parser.

Source Code

Grammar definition has been covered here. Simply copy the code there and save it as MatlabCTree.g4.

Full JavaScript project for webpack: matlab-ctree-parser.zip. SHA1: 212f9647d35ad83f67a01a4bc825d72ac2ad21df

References