Yesterday I posted about automatically generating an LL predictive parsing table (http://ruby-china.org/topics/19459), and it seems like a good opportunity to walk through how to build a regular expression engine of your own. I plan to split this into several installments, combining theory with practice, and implement an engine step by step in Ruby.
Today I finished the LL predictive parsing table generator; the code is at https://github.com/rasefon/LLTableGenerator. It is getting late and I need to sleep, so I will try to finish the first installment within the next couple of days.
Regular expressions are a powerful tool for string parsing, and their syntax can itself be described by a context-free grammar. At the core there are only three operations: concatenation (e1 e2), repetition (e*), and alternation (e1|e2); constructs such as [a-z], \s and \t are just shorthand built on top of them. As an example, here is a BNF grammar for a simplified regular expression language:

Rule:     Expr | ^Expr | Expr$
Expr:     Expr '|' Cat_Expr | Cat_Expr
Cat_Expr: Cat_Expr Factor | Factor
Factor:   Term * | Term + | Term ? | Term
Term:     [ String ] | [^ String ] | . | Char | '(' Expr ')'
String:   # set of ASCII characters
Char:     # single ASCII character

From this grammar we can build an in-memory NFA model of a regular expression, using Thompson's construction. Before doing that, let's look at how to parse input according to such a grammar. Regular expressions are usually short, so a recursive descent parser is normally good enough. Recursive descent works like this: given, for example, the grammar

T:  F T1
T1: '*' F T1 | epsilon
F:  '(' T ')' | id

the parser looks roughly as follows:
def lookahead
  # Peek at the next token in the lexeme buffer without consuming it.
end

def next_token
  # Consume and return the next token from the lexeme buffer.
end

def T
  F()
  T1()
  # do something
end

def F
  if "(" == lookahead()
    next_token()   # consume '('
    T()
    if ")" == next_token()
      # do something
    else
      # Error!
    end
  elsif id_token?(lookahead())   # is the next token an identifier?
    next_token()                 # consume the id
    # do something
  else
    # Error!
  end
end

def T1
  if "*" == lookahead()
    next_token()   # consume '*'
    F()
    T1()
    # do something
  end
  # else: T1 -> epsilon, consume nothing
end
# Start parsing...
T()
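As a quick sanity check, here is a hand trace (not program output) of how these procedures call each other on the token stream id * ( id ):

T()
  F()      lookahead = id    -> consume id
  T1()     lookahead = '*'   -> consume '*'
    F()    lookahead = '('   -> consume '(', parse T(), then expect ')'
      T()
        F()     lookahead = id    -> consume id
        T1()    lookahead = ')'   -> epsilon, consume nothing
      ')' matched
    T1()   lookahead = end   -> epsilon, consume nothing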
At each "# do something" hook you can insert the code that builds the NFA model. Note that all the grammars here are right-recursive: a left-recursive grammar cannot be handled by a top-down parser, so left recursion has to be eliminated first (for example, E: E '+' T | T is rewritten as E: T E1 together with E1: '+' T E1 | epsilon).

Recursive descent has the advantage of being simple, but a general recursive descent parser may need to backtrack, which is inefficient, and hand-written parsers are not easy to extend or change. So we will first build an automatic LL parser generator, much like yacc does for LALR grammars. Once we can generate an LL predictive parsing table, we can parse without backtracking. Let's look at how to construct such a table.

To build the predictive table we first need to read a grammar description file. The format is my own; here is an example:
``
token: tPlus, tMul, tLp, tRp, tEnd
token: tId
# nil is predefined keyword.
``
$Start: E
E: T,E1
E1: tPlus,T,E1
E1: nil
T: F,T1
T1: tMul,F,T1
T1: nil
F: F1
F1: tLp,E,tRp
F1: tId
The `` lines act like %% in yacc: the first section defines the token set, the second defines the grammar rules (nil is a predefined keyword meaning an empty production). Here is a simple routine that parses this file and builds the intermediate grammar data structures:
require 'set'

$start_lside_rule = ""
$token_list = Set.new
$gram_list = Hash.new
$first_set = Hash.new
$follow_set = Hash.new
# For ll table construction
$nonterm_token_list = [] # [E, F, T, ...]
$single_gram_list = [] # [{T->[E,tRp]}, ...]
$input_tokens = ["$"]
$ll_table = nil
def construct_table_model(rule_file_name)
lines = IO.readlines(rule_file_name)
lines = lines.map { |l| l.chomp }
token_def_phase = false
rule_def_phase = false
# temporarily record left side tokens and right tokens as string.
lines.each do |line|
# skip comment
next if "#" == line[0]
# Start token define phase.
if "``" == line
if !token_def_phase
token_def_phase = true
else
token_def_phase = false
rule_def_phase = true
end
next
end
if token_def_phase
line.split(':')[1].split(',').each { |t| $token_list.add(t.strip) }
elsif rule_def_phase
rule = line.split(':').map { |item| item.strip }
if "$Start" == rule[0]
$start_lside_rule = rule[1]
else
rhs = rule[1].split(',').map { |r| r.strip }
if $gram_list.has_key?(rule[0])
# merge right side rule
$gram_list[rule[0]] = $gram_list[rule[0]] << rhs
else
$gram_list[rule[0]] = [] << rhs
end
end
end
end
$input_tokens.concat($token_list.to_a)
end
This method fills the global $token_list set, the $input_tokens array of all possible input tokens (the token list plus the end marker "$"), and the $gram_list hash that represents the grammar.
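For the sample grammar file shown earlier, these globals should end up roughly as follows (a hand-checked sketch of their contents, not captured program output):

$token_list        # => #<Set: {"tPlus", "tMul", "tLp", "tRp", "tEnd", "tId"}>
$input_tokens      # => ["$", "tPlus", "tMul", "tLp", "tRp", "tEnd", "tId"]
$start_lside_rule  # => "E"
$gram_list["E"]    # => [["T", "E1"]]
$gram_list["E1"]   # => [["tPlus", "T", "E1"], ["nil"]]
$gram_list["F"]    # => [["F1"]]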
The LL predictive parsing table is built from two sets: the FIRST sets and the FOLLOW sets. FIRST(A) is the set of terminals that can begin a string derived from A (plus nil if A can derive the empty string); FOLLOW(A) is the set of terminals that can appear immediately after A in some derivation from the start symbol. The construction algorithms come first; how the two sets are used to fill the table is explained afterwards. The FIRST sets:
def construct_first_set()
#first to add all the terminal and 'nil' into the FIRST set.
$gram_list.each do |lhs, rhs_arr|
unless $first_set.has_key?(lhs)
$first_set[lhs] = Set.new
end
rhs_arr.each do |rhs|
if $token_list.include?(rhs.first) or "nil" == rhs.first
$first_set[lhs].add(rhs.first)
end
end
end
# loop computing first set until no new item is added.
changed = true
while (changed)
changed = false
$gram_list.each do |lhs, rhs_arr|
rhs_arr.each do |rhs|
# If the first item of the rhs token is terminal or 'nil', just skip it because the token had already been
# added in previous loop.
next if $token_list.include?(rhs.first) or "nil" == rhs.first
count = 0
        rhs.each do |rhs_term|
          # A terminal contributes itself to FIRST(lhs) and ends the scan.
          if $token_list.include?(rhs_term)
            unless $first_set[lhs].include?(rhs_term)
              $first_set[lhs].add(rhs_term)
              changed = true
            end
            break
          end
          # Merge the FIRST set of rhs_term (except 'nil') into FIRST(lhs).
          $first_set[rhs_term].each do |token|
            next if "nil" == token
            unless $first_set[lhs].include?(token)
              $first_set[lhs].add(token)
              changed = true
            end
          end
          # Only move on to the next term if this one can derive 'nil'.
          break unless $first_set[rhs_term].include?("nil")
          count += 1
        end
        # If every rhs term can derive 'nil', add 'nil' to FIRST(lhs) as well.
        if rhs.size == count
          unless $first_set[lhs].include?("nil")
            $first_set[lhs].add("nil")
            changed = true
          end
        end
end
end
end
end
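For the sample grammar above, the fixed point this loop reaches should be the following FIRST sets (worked out by hand, not program output):

E  -> {tLp, tId}        E1 -> {tPlus, nil}
T  -> {tLp, tId}        T1 -> {tMul, nil}
F  -> {tLp, tId}        F1 -> {tLp, tId}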
The FIRST sets are stored in the global $first_set. The FOLLOW sets are built with the same iterate-until-nothing-changes scheme:
def construct_follow_set
# Initialize FOLLOW set and add '$' into the FOLLOW set of start tokens.
$gram_list.each_key do |lhs|
$follow_set[lhs] = Set.new
$follow_set[lhs].add("$") if lhs == $start_lside_rule
end
changed = true
while (changed)
changed = false
$gram_list.each do |lhs, rhs_arr|
rhs_arr.each do |rhs|
rhs.each_index do |i|
# skip the nil and terminal token.
next if 'nil' == rhs[i] or $token_list.include?(rhs[i])
# case 1, the current rhs term is the last one.
if (i+1) == rhs.size
# Union the FOLLOW set of the rhs into the current rhs term.
$follow_set[lhs].each do |token|
unless $follow_set[rhs[i]].include?(token)
changed = true
$follow_set[rhs[i]].add(token)
end
end
          else
            # Union FIRST of the tokens that follow rhs[i] into FOLLOW(rhs[i]),
            # continuing past tokens that can derive 'nil'.
            all_nil = true
            j = i + 1
            while (j < rhs.size)
              if "nil" == rhs[j]
                # An explicit 'nil' derives the empty string; skip it.
                j += 1
                next
              end
              if $token_list.include?(rhs[j])
                # A terminal follows: add it and stop.
                unless $follow_set[rhs[i]].include?(rhs[j])
                  changed = true
                  $follow_set[rhs[i]].add(rhs[j])
                end
                all_nil = false
                break
              end
              # Merge FIRST(rhs[j]) (except 'nil') into FOLLOW(rhs[i]).
              $first_set[rhs[j]].each do |token|
                if !$follow_set[rhs[i]].include?(token) and "nil" != token
                  changed = true
                  $follow_set[rhs[i]].add(token)
                end
              end
              # Stop unless rhs[j] can derive 'nil'.
              unless $first_set[rhs[j]].include?("nil")
                all_nil = false
                break
              end
              j += 1
            end
            # If everything after rhs[i] can derive 'nil', union FOLLOW(lhs)
            # into FOLLOW(rhs[i]) as well.
            if all_nil
              $follow_set[lhs].each do |token|
                unless $follow_set[rhs[i]].include?(token)
                  changed = true
                  $follow_set[rhs[i]].add(token)
                end
              end
            end
          end
end
end
end
end
end
The results are stored in $follow_set.
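For the sample grammar, the FOLLOW sets should come out as follows (again worked out by hand, not program output):

E  -> {$, tRp}                 E1 -> {$, tRp}
T  -> {tPlus, $, tRp}          T1 -> {tPlus, $, tRp}
F  -> {tMul, tPlus, $, tRp}    F1 -> {tMul, tPlus, $, tRp}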