Today, we continue our journey into Ruby Templating. With the lexer in place, let’s move on to the next step: The parser.
Last time, we looked at string interpolation and subsequently, dived into creating our own templating language. We started by implementing a lexer that reads a template and converts it into a stream of tokens. Today, we’ll implement the accompanying parser. We will also dip our toes into a bit of language theory.
Here we go!
Abstract Syntax Trees
Let’s look back to our simple example template for Welcome to {{name}}
. After using the lexer to tokenize the string, we get a list of tokens like this.
1Magicbars::Lexer.tokenize("Welcome to {{name}}")
2# => [[:CONTENT, "Welcome to "], [:OPEN_EXPRESSION], [:IDENTIFIER, "name"], [:CLOSE]]
Ultimately, we want to evaluate the template and replace the expression with real values. To make things a bit more challenging, we also want to evaluate complex block expressions, allowing for repetition and conditionals.
To do this, we have to generate an abstract syntax tree (AST) that describes the logical structure of the template. The tree consists of nodes that may reference other nodes or store additional data from the tokens.
For our simple example, the desired abstract syntax tree looks like this:
Defining a Grammar
To define the grammar, let's start with the theoretical basis of a language. Like other programming languages, our templating language is a context-free language and therefore can be described by a context-free grammar. (Don’t let the mathematical notations in the detailed Wikipedia descriptions scare you away. The concept is pretty straight forward, and there are more developer-friendly ways to notate a grammar.)
A context-free grammar is a set of rules that describe how all possible strings of a language are constructed. Let’s look at the grammar for our templating language in EBNF notation:
1template = statements;
2statements = { statement };
3statement = CONTENT | expression | block_expression;
4expression = OPEN_EXPRESSION, IDENTIFIER, arguments, CLOSE;
5block_expression = OPEN_BLOCK, IDENTIFIER, arguments, CLOSE, statements, [ OPEN_INVERSE, CLOSE, statements ], OPEN_END_BLOCK, IDENTIFIER, CLOSE;
6arguments = { IDENTIFIER };
Each assignment defines a rule. The rule’s name is on the left and a bunch of other rules (lower case) or tokens (upper case) from our lexer are on the right. Rules and tokens can be concatenated using commas ,
or alternated using the pipe |
symbol. Rules and tokens inside of curly braces { ... }
might be repeated several times. When they are inside of brackets [ ... ]
, they are considered optional.
The above grammar is a concise way to describe that a template consists of statements. A statement is either a CONTENT
token, an expression, or a block expression. An expression is an OPEN_EXPRESSION
token, followed by an IDENTIFIER
token, followed by arguments, followed by a CLOSE
token. And a block expression is the perfect example of why it’s better to use a notation like the one above instead of trying to describe it with a natural language.
There are tools that automatically generate parsers from grammar definitions like the one above. But in true Ruby Magic tradition, let's have some fun and build the parser ourselves, hopefully learning a thing or two in the process.
Building the Parser
With the language theory aside, let’s jump into actually building the parser. Let’s start with an even more minimal, but still valid, template: Welcome to Ruby Magic
. This template doesn’t have any expressions and the list of tokens consists of just one element. Here’s what it looks like:
1[[:CONTENT, "Welcome to Ruby Magic"]]
First, we set up our parser class. It looks like this:
1module Magicbars
2 class Parser
3 def self.parse(tokens)
4 new(tokens).parse
5 end
6
7 attr_reader :tokens
8
9 def initialize(tokens)
10 @tokens = tokens
11 end
12
13 def parse
14 # Parsing starts here
15 end
16 end
17end
The class takes an array of tokens and stores it. It only has one public method called parse
that converts the tokens into an AST.
Looking back at our grammar, the top-most rule is template
. That implies that parse
, at the start of the parsing process, will return a Template
node.
Nodes are simple classes with no behavior of their own. They just connect other nodes or store some values from the tokens. Here’s what the Template
node looks like:
1module Magicbars
2 module Nodes
3 class Template
4 attr_reader :statements
5
6 def initialize(statements)
7 @statements = statements
8 end
9 end
10 end
11end
To make our example work, we also need a Content
node. It just stores the text content ("Welcome to Ruby Magic"
) from the token.
1module Magicbars
2 module Nodes
3 class Content
4 attr_reader :content
5
6 def initialize(content)
7 @content = content
8 end
9 end
10 end
11end
Next, let’s implement the parse method to create an instance of Template
and an instance of Content
and connect them up correctly.
1def parse
2 Magicbars::Nodes::Template.new(parse_content)
3end
4
5def parse_content
6 return unless tokens[0][0] == :CONTENT
7
8 Magicbars::Nodes::Content.new(tokens[0][1])
9end
When we run the parser, we get the correct result:
1Magicbars::Parser.parse(tokens)
2# => #<Magicbars::Nodes::Template:0x00007fe90e939410 @statements=#<Magicbars::Nodes::Content:0x00007fe90e939578 @content="Welcome to Ruby Magic">>
Admittedly, this only works for our simple example that only has one content node. Let’s switch to a more complex example that actually includes an expression: Welcome to {{name}}
.
1Magicbars::Lexer.tokenize("Welcome to {{name}}")
2# => [[:CONTENT, "Welcome to "], [:OPEN_EXPRESSION], [:IDENTIFIER, "name"], [:CLOSE]]
For this, we need an Expression
node and an Identifier
node. The Expression
node stores the identifier as well as any arguments (which, according to the grammar, are an array of zero or more Identifier
nodes). As with the other nodes, there’s not much to see here.
1module Magicbars
2 module Nodes
3 class Expression
4 attr_reader :identifier, :arguments
5
6 def initialize(identifier, arguments)
7 @identifier = identifier
8 @arguments = arguments
9 end
10 end
11 end
12end
13
14module Magicbars
15 module Nodes
16 class Identifier
17 attr_reader :value
18
19 def initialize(value)
20 @value = value.to_sym
21 end
22 end
23 end
24end
With the new nodes in place, let’s modify the parse
method to handle both regular content as well as expressions. We do that by introducing a parse_statements
method that just keeps on calling parse_statement
as long as it returns a value.
1def parse
2 Magicbars::Nodes::Template.new(parse_statements)
3end
4
5def parse_statements
6 results = []
7
8 while result = parse_statement
9 results << result
10 end
11
12 results
13end
parse_statement
itself first calls parse_content
and if that doesn’t return any value, it calls parse_expression
.
1def parse_statement
2 parse_content || parse_expression
3end
Have you noticed that the parse_statement
method is starting to look very similar to the statement
rule in the grammar? This is where taking the time to explicitly write up the grammar beforehand helps a lot to ensure that we’re on the right path.
Next, let’s modify the parse_content
method so that it doesn’t only look at the first token. We do this by introducing an additional @position
instance variable in the initializer and use it to fetch the current token.
1attr_reader :tokens, :position
2
3def initialize(tokens)
4 @tokens = tokens
5 @position = 0
6end
7
8# ...
9
10def parse_content
11 return unless token = tokens[position]
12 return unless token[0] == :CONTENT
13
14 @position += 1
15
16 Magicbars::Nodes::Content.new(token[1])
17end
The parse_content
method now looks at the current token and checks its type. If it’s a CONTENT
token, it increments the position (because the current token was successfully parsed) and uses the token’s content to create the Content
node. If there is no current token (because we’re at the end of the tokens) or the type doesn’t match, the method exits early and returns nil
.
With the improved parse_content
method in place, let’s tackle the new parse_expression
method.
1def parse_expression
2 return unless token = tokens[position]
3 return unless token[0] == :OPEN_EXPRESSION
4
5 @position += 1
6
7 identifier = parse_identifier
8 arguments = parse_arguments
9
10 if !tokens[position] || tokens[position][0] != :CLOSE
11 raise "Unexpected token #{tokens[position][0]}. Expected :CLOSE."
12 end
13
14 @position += 1
15
16 Magicbars::Nodes::Expression.new(identifier, arguments)
17end
First, we check that there is a current token and that its type is OPEN_EXPRESSION
. If that’s the case, we advance to the next token and parse the identifier as well as the arguments by calling parse_identifier
and parse_arguments
, respectively. Both methods will return the respective nodes and advance the current token. When that’s done, we ensure that the current token exists and is a :CLOSE
token. If it is not, we raise an error. Otherwise, we advance the position one last time, before returning the newly created Expression
node.
At this point, we see some patterns emerge. We’re advancing to the next token several times and we’re also checking that there is a current token and its type. Because the code for that is a bit cumbersome, let’s introduce two helper methods.
1def expect(*expected_tokens)
2 upcoming = tokens[position, expected_tokens.size]
3
4 if upcoming.map(&:first) == expected_tokens
5 advance(expected_tokens.size)
6 upcoming
7 end
8end
9
10def advance(offset = 1)
11 @position += offset
12end
The expect
method takes a variable number of token types and checks them against the next tokens in the token stream. If they all match, it advances past the matching tokens and returns them. The advance
method just increments the @position
instance variable by the given offset.
For cases where there's no flexibility regarding the next expected token, we also introduce a method that raises a nice error message when the token doesn’t match.
1def need(*required_tokens)
2 upcoming = tokens[position, required_tokens.size]
3 expect(*required_tokens) or raise "Unexpected tokens. Expected #{required_tokens.inspect} but got #{upcoming.inspect}"
4end
By using these helper methods, parse_content
and parse_expression
are now cleaner and more readable.
1def parse_content
2 if content = expect(:CONTENT)
3 Magicbars::Nodes::Content.new(content[0][1])
4 end
5end
6
7def parse_expression
8 return unless expect(:OPEN_EXPRESSION)
9
10 identifier = parse_identifier
11 arguments = parse_arguments
12
13 need(:CLOSE)
14
15 Magicbars::Nodes::Expression.new(identifier, arguments)
16end
Finally, let’s also look at parse_identifier
and parse_arguments
. Thanks to the helper methods, the parse_identifier
method is as simple as the parse_content
method. The only difference is that it returns another node type.
1def parse_identifier
2 if identifier = expect(:IDENTIFIER)
3 Magicbars::Nodes::Identifier.new(identifier[0][1])
4 end
5end
When implementing the parse_arguments
method, we noticed that it’s almost identical to the parse_statements
method. The only difference is that it calls parse_identifier
instead of parse_statement
. We can get rid of the duplicated logic by introducing another helper method.
1def repeat(method)
2 results = []
3
4 while result = send(method)
5 results << result
6 end
7
8 results
9end
The repeat
method uses send
to call the given method name until it no longer returns a node. Once that happens, the collected results (or just an empty array) are returned. With this helper in place, both parse_statements
and parse_arguments
become one-line methods.
1def parse_statements
2 repeat(:parse_statement)
3end
4
5def parse_arguments
6 repeat(:parse_identifier)
7end
With all these changes in place, let’s try and parse the token stream:
1Magicbars::Parser.parse(tokens)
2# => #<Magicbars::Nodes::Template:0x00007f91a602f910
3# @statements=
4# [#<Magicbars::Nodes::Content:0x00007f91a58802c8 @content="Welcome to ">,
5# #<Magicbars::Nodes::Expression:0x00007f91a602fcd0
6# @arguments=[],
7# @identifier=
8# #<Magicbars::Nodes::Identifier:0x00007f91a5880138 @value=:name> >
It’s a bit hard to read but it’s, in fact, the correct abstract syntax tree. The Template
node has a Content
and an Expression
statement. The Content
node’s value is "Welcome to "
and the Expression
node’s identifier is the Identifier
node with :name
as its value.
Parsing Block Expressions
To complete our parser implementation, we still have to implement parsing of block expressions. As a reminder, here’s the template we want to parse:
1Welcome to {{name}}!
2
3{{#if subscribed}}
4 Thank you for subscribing to our mailing list.
5{{else}}
6 Please sign up for our mailing list to be notified about new articles!
7{{/if}}
8
9Your friends at {{company_name}}
To do this, let’s first introduce a BlockExpression
node. While this node stores a bit more data, it doesn’t do anything else and therefore isn’t very exciting.
1module Magicbars
2 module Nodes
3 class BlockExpression
4 attr_reader :identifier, :arguments, :statements, :inverse_statements
5
6 def initialize(identifier, arguments, statements, inverse_statements)
7 @identifier = identifier
8 @arguments = arguments
9 @statements = statements
10 @inverse_statements = inverse_statements
11 end
12 end
13 end
14end
Like the Expression
node, it stores the identifier as well as any arguments. Additionally, it also stores the statements of the block and of the inverse block.
Looking back at the grammar, we notice that to parse block expressions, we have to amend the parse_statements
method with a call to parse_block_expression
. It now looks just like the rule in the grammar.
1def parse_statement
2 parse_content || parse_expression || parse_block_expression
3end
The parse_block_expression
method itself is a bit more complex. But thanks to our helper methods, it’s still quite readable.
1def parse_block_expression
2 return unless expect(:OPEN_BLOCK)
3
4 identifier = parse_identifier
5 arguments = parse_arguments
6
7 need(:CLOSE)
8
9 statements = parse_statements
10
11 if expect(:OPEN_INVERSE, :CLOSE)
12 inverse_statements = parse_statements
13 end
14
15 need(:OPEN_END_BLOCK)
16
17 if identifier.value != parse_identifier.value
18 raise("Error. Identifier in closing expression does not match identifier in opening expression")
19 end
20
21 need(:CLOSE)
22
23 Magicbars::Nodes::BlockExpression.new(identifier, arguments, statements, inverse_statements)
24end
The first part is very similar to the parse_expression
method. It parses the opening block expression with the identifier and the arguments. Afterward, it calls parse_statements
to parse the inside of the block.
Once that’s done, we check for an {{else}}
expression, identified by an OPEN_INVERSE
token followed by a CLOSE
token. If both tokens are found, we call parse_statements
again to parse the inverse block. Otherwise, we just skip that part entirely.
As a final thing, we ensure that there is an end block expression using the same identifier as the open block expression. If the identifiers don’t match, we raise an error. Otherwise, we create a new BlockExpression
node and return it.
Calling the parser with the tokens of the advanced block expression template will return the AST for the template. I’ll not include the example output here, as it’s barely readable. Instead, here’s a visual representation of the generated AST.
Because we’re calling parse_statements
inside of parse_block_expression
, both the block and the inverse block may include more expressions, block expressions, as well as regular content.
The Journey Continues…
We made decent progress with our journey towards implementing our own templating language. After a short dip into language theory, we defined a grammar for our templating language and used it to implement a parser for it from scratch.
With both the lexer and the parser in place, we’re only missing an interpreter to generate the interpolated string from our template. We’ll cover this part in an upcoming edition of RubyMagic. Subscribe to the Ruby Magic mailinglist, to get alerted when it comes out.