Parse MATLAB classification tree and Generate corresponding JSON
MATLAB Classification Learner is a powerful tool for building and testing classification model. However, the generated classification tree is not very user-friendly to use outside of MATLAB.
Tip: Don’t care about details? Here is the final tool: MATLAB Classification Tree Parser.
Update: After I finished everything (the tool and this post), I somehow found that recently a MathWorks staff released a helper in File Exchange to export MATLAB models to PMML. What can I say? ¯\_(ツ)_/¯.
Background
An example output1 (i.e. view(ctree)
) of MATLAB classification tree looks like:
Decision tree for classification
1 if x3<2.45 then node 2 elseif x3>=2.45 then node 3 else setosa
2 class = setosa
3 if x4<1.75 then node 4 elseif x4>=1.75 then node 5 else versicolor
4 if x3<4.95 then node 6 elseif x3>=4.95 then node 7 else versicolor
5 class = virginica
6 if x4<1.65 then node 8 elseif x4>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
This is quite difficult to use in other environments. Say if I want to make a classifier in C++, I need to convert the above tree output into C++ code by hand, which is a very tedious work and prone to error especially when the tree is dense.
Though MATLAB allows you to visually inspect tree in GUI via view(ctree, 'mode', 'graph')
, it is still annoying to convert it manually.
Also in one of my projects, I’ve implemented a decision tree classifier in C++ and it allows me to dynamically load decision trees from JSON files. Since these trees have to be updated periodically, I have to manually write the JSON frequently for more than a year. It would definitely be nice if I can make a tool to automatically generate those JSON files.
In the beginning, I looked into MATLAB toolbox’s source code but found nothing useful about the view
method in either
class ClassificationTree
or CompactClassificationTree
. Thus I need to “decipher” the output by myself.
Though it is possible to parse the tree output with regular expressions, I’d prefer to solve it in a similar way of building compilers. Because we’re dealing with a tree, and a parser in compiler also build a tree (AST, or Abstract Syntax Tree).
Define the Grammar
I wrote the grammar in ANTLR 4. ANTLR can also generates lexer and parser for my later building the tool.
Here is the context-free grammar for MATLAB classification tree:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
// MatlabCTree.g4
grammar MatlabCTree;
start : 'Decision tree for classification'? stmtList;
stmtList : stmt | stmtList stmt;
stmt : INTEGER (ifStmt | classExpr);
ifStmt : ifClause elseifClause? elseClause?;
ifClause : 'if' condExpr 'then' nodeExpr;
elseifClause : 'elseif' condExpr 'then' nodeExpr;
elseClause : 'else' nodeExpr;
labelExpr : INTEGER | IDENTIFIER;
classExpr : 'class' '=' labelExpr;
nodeExpr : 'node' INTEGER | labelExpr;
constExpr : INTEGER | RATIONAL;
condExpr : IDENTIFIER '<' constExpr
| IDENTIFIER '>' constExpr
| IDENTIFIER '<=' constExpr
| IDENTIFIER '>=' constExpr
| IDENTIFIER '==' constExpr
| IDENTIFIER '~=' constExpr;
INTEGER : '0' | [1-9] [0-9]*;
IDENTIFIER : [a-zA-Z] [a-zA-Z0-9_]*;
// Adapted from https://git.io/JU4M9
RATIONAL : NUMBER (E SIGN? NUMBER)?;
fragment NUMBER : INTEGER? '.' [0-9]+;
fragment E : 'E' | 'e';
fragment SIGN : '+' | '-';
WHITESPACE : [ \t\r\n]+ -> skip;
Some key points in the above grammar:
- Line 4: The prefix
Decision tree for classification
is optional. - Line 7: The following lines are either a branch or a leaf starting with a node number.
- Line 9: At least from MATLAB’s example1 and my own experience, the output of tree branch has
a fixed pattern of
if feature<a then node x elseif feature>=a then node y else some_label
However I’m not sure if the
elseif
clause andelse
clause would be present in all scenarios. So I decided to leave them as optional. - Line 14: Depending on how Learner is configured, the label could be either a string (MATLAB’s example) or an integer (my case).
- Line 19-24: Only 6 operators are supported:
<
,>
,<=
,>=
,==
and~=
. - Line 26: Preceding
0
s are not allowed in integers. - Line 27: This follows the variable naming convention in MATLAB.
Now we have the grammar, we can test MATLAB’s example classification with ANTLR’s built-in GUI tester (click the figure to enlarge):
The above parsing tree looks like just what I want! It’s time to build the tool.
Build the Tool
Initially I was thinking to build a tool in Java or C#, but later I realized why not make a web-based tool as ANTLR supports outputting lexers and parsers in JavaScript. A online tool also makes everyone’s life easier: no need to worry in setting up environments and doing (potentially annoying) compiling job.
Also this tool will purely run in the browser, that is 100% of the code running on user’s device, there is no need to set up a server nor any interaction with remote server at all.
If you’re not familiar with using ANTLR on JavaScript target, here is the official documentation.
I used the below command to generate lexer and parser. Note that I’ll use visitor
instead of listener
in the final product.
java -jar antlr-4.8-complete.jar -Dlanguage=JavaScript -visitor -no-listener MatlabCTree.g4
Note: I’ll only cover the main points in the following sections. If you want to dive into details, source code is available in section Source code.
Traverse AST (Abstract Syntax Tree)
This part is pretty straightforward, just traverse the generated AST and store all classification tree nodes into an array.
Since it’s possible that node numbers are not in a increasing order, for example:
Decision tree for classification
1 if x<0 then node 2 elseif x>=0 then node 3 else foo
3 class = foo
2 class = bar
Note that the above order is 1
, 3
, 2
while we expect 1
, 2
, 3
.
In order to detect this, we need to setup a counter, increase the counter every time we visit stmt
and
check if the node number matches the counter:
visitStmt(ctx) {
// stmt: INTEGER (ifStmt | classExpr);
let nodeNumber = parseInt(ctx.children[0].getText());
// Check if the counter matches node number in stmt
if (++this.nodeNumber !== nodeNumber) {
let symbol = ctx.children[0].symbol;
this.errors.push({
line: symbol.line,
message: `mismatched node number ${nodeNumber} ` +
`expecting ${this.nodeNumber}`
});
}
let result = this.visitChildren(ctx);
this.jsonNodes.push(result[1]);
// Actually it doesn't matter whether result is returned
return result;
}
Because the generated tree from fitctree
is a binary tree2, we just need to check if the conditions in
if
and elseif
clauses are complement. else
clause can be safely ignored.
Here are the examples that conditions are not complement:
# Feature name doesn't match (x and y)
1 if x<0.5 then node 2 elseif y>=0.5 then node 3 else foo
# Comparison value doesn't match (0.5 and -0.5)
1 if x<0.5 then node 2 elseif x>=-0.5 then node 3 else foo
# Operators are not complement (< and >, which supposed to be < and >=)
1 if x<0.5 then node 2 elseif x>0.5 then node 3 else foo
To handle this:
visitIfStmt(ctx) {
// ifStmt: ifClause elseifClause? elseClause?;
let result = this.visitChildren(ctx);
let featureName = result[0].condition.identifer;
let operator = result[0].condition.operator;
let compareValue = result[0].condition.value;
let trueBranch = result[0].branch;
let falseBranch = null;
// 2 or more clauses?
if (result.length > 1) {
// elseifClause
if (result[1].hasOwnProperty("condition")) {
let elseifFN = result[1].condition.identifer;
let elseifOP = result[1].condition.operator;
let elseifCV = result[1].condition.value;
// Does conditions match?
if (elseifFN !== featureName ||
elseifOP !== this.complementOperators[operator] ||
elseifCV !== compareValue) {
let elseIfExpr = ctx.children[1].children[1];
this.errors.push({
line: elseIfExpr.start.line,
message: `mismatched elseif clause '${elseIfExpr.getText()}' ` +
`expecting '${featureName}` +
`${this.complementOperators[operator]}` +
`${compareValue}'`
});
} else {
falseBranch = result[1].branch;
}
} else {
// elseClause
falseBranch = result[1];
}
}
// We ignored the elseClause here if there are 3 clauses present.
// This should be safe since the generated classification tree from
// MATLAB's fitctree function is supposed to be a binary tree.
return { featureName, operator, compareValue, trueBranch, falseBranch };
}
Generate JSON
The output JSON follows the below format:
{
"featureName": "{feature name}",
"operator": "{operator for comparison}",
"compareValue": {the value to be compared (in double)},
"trueBranch": {true branch, if `fn op val` holds},
"falseBranch": {false branch, otherwise}
}
trueBranch
and falseBranch
are either a string representing the predicted label, or an above object indicating a child tree node.
My classifier actually accepts a compact version of it. The key names are simplified into fn
, op
, val
, tb
, and fb
for saving spaces.
Since we have already gathered all the nodes in the classification tree via AST visitor, we just need to perform a simple DFS (Depth-First Search):
dfs(node, index) {
this.visitedNodes.push(index);
return {
featureName: node.featureName,
operator: node.operator,
compareValue: node.compareValue,
trueBranch: this.visitBranch(node.trueBranch),
falseBranch: this.visitBranch(node.falseBranch)
};
}
Even though it is guaranteed to be a tree, it doesn’t prevent us to add support for forest:
// Perform DFS on all nodes so we can also handle forests
this.jsonNodes.forEach((node, index) => {
// Only take care not yet visited ifStmt nodes
if (typeof node !== "string" && !this.visitedNodes.includes(index)) {
jsonOutput.push(this.dfs(node, index));
}
});
We also need to handle a boundary condition where the tree consists of a label node only, for example:
Decision tree for classification
1 class = setosa
To handle this:
this.jsonNodes.forEach((node, index) => {
// Boundary condition, the tree only has a label.
if (typeof node === "string" && !this.visitedNodes.includes(index)) {
jsonOutput.push(node);
}
});
One more thing need to take care is branch target’s node number in if
and elseif
, since it might exceed the number of nodes. For example:
Decision tree for classification
1 if x3<2.45 then node 2 elseif x3>=2.45 then node 3 else setosa
Apparently, node 2
and 3
are missing in the input. So we need to detect this and report error:
if (branch.target > this.jsonNodes.length) {
this.errors.push({
line: branch.line,
message: `invalid node index ${branch.target} ` +
`expecting no greater than ${this.jsonNodes.length}`
});
return null;
}
Wrapper class
Now we can make a wrapper class MatlabCTreeConverter
to call lexer and parser,
and invoke AST visitor and JSON generator.
Here is the way to handle syntax errors, syntaxErrors
is for later error reporting:
let syntaxErrors = [];
// Adapted from https://stackoverflow.com/a/60841205
parser.removeErrorListeners();
parser.addErrorListener({
syntaxError: (recognizer, offendingSymbol, line, column, msg, err) => {
syntaxErrors.push({ line: line, message: msg });
}
});
Since our starting rule happened to be start
, we can build AST by:
let ast = parser.start();
webpack
everything
Setting up webpack
is quite annoying and very painful. Fortunately, I found a working configuration file for webpack
3.
Hat tip to @nicolasdao!
Note: Since anyway the core code will be publicly accessible, uglifyjs-webpack-plugin
is not used.
Once you got everything in that gist done,
remember to add the following entry to webpack.config.js
’s module.exports
:
node: { module: 'empty' net: 'empty', fs: 'empty' }
This is VERY important (noted in ANTLR’s official documentation),
otherwise ANTLR will try to import fs
module which is not available in browsers.
User Interface
Big bruh moment. Needless to say how tedious is UI design and implementation. Anyway here is how it looks like:
If any error is detected in the input, it will report which line goes wrong:
In case you’re curious, I’m using CodeMirror for the input and output editors, and Viz.js for drawing the classification tree.
Finally the link to the tool: MATLAB Classification Tree Parser.
Source Code
Grammar definition has been covered here. Simply copy the code there and save it as MatlabCTree.g4
.
Full JavaScript project for webpack
: matlab-ctree-parser.zip. SHA1: 212f9647d35ad83f67a01a4bc825d72ac2ad21df
References
-
MathWorks Help Center, View Decision Tree - MATLAB & Simulink. ↩ ↩2
-
MathWorks Help Center, Fit binary decision tree for multiclass classification - MATLAB fitctree. ↩
-
Nicolas Dao, Basic damn Webpack config for simple transpilation ES6 to ES5. ↩
Comments Go back to the top