算法如何自制正则表达式引擎 (一)

rasefon · 2014年05月22日 · 最后由 love93hate 回复于 2018年08月27日 · 6894 次阅读

昨天发了个贴，讲 LL 预测分析表的自动生成（http://ruby-china.org/topics/19459），正好借机说一说如何自己做一个正则表达式的引擎。打算分以下几个篇章，理论结合实践，我会一步步用 ruby 来实现一个引擎：

LL 预测分析表的生成
NFA 构造
NFA 转 DFA
DFA 最小化

今天把 LL 预测分析表完成了，代码放在：https://github.com/rasefon/LLTableGenerator 由于时间太晚要睡觉了，我会争取这两天把第一篇完成。

分割线，今天先更新一部分，具体算法的详细解析放在下回。

def next_token
   # Get next token from lexema buffer.
end

def T
   F()
   T1()
   #do something
end

def F
   if "(" == next_token()
      T()
      if ")" == next_token()
         # do something
      else
         # Error!
      end
   elsif id() == next_token()
      # do something
   else
      # Error!
   end
end

def T1
   if "*" == next_token()
      F()
      T1()
      # do something
   else
      # Error!
   end
end

# Start parsing...
T()

在"do something'处可以插入用于构造 NFA 模型的代码。注意此处文法都是右递归的，左递归的文法无法用于自顶向下的文法解析算法，所以要先把文法从左递归转成右递归。递归下降算法优点是简单，缺点是有回溯，效率低，且不容易扩展变更。所以我们可以先做一个 LL 语法解析器的自动生成器，就像 yacc 那样（基于 LALR 算法）。这样生成一个 LL 预测分析表以后，我们就可以进行无回溯的文法解析。下面说下如何构造 LL 预测分析表. 构造预测表，首先我们需要读取语法描述文件，语法描述文件是我自定义的，这里给出一个例子：

``
token: tPlus, tMul, tLp, tRp, tEnd
token: tId
# nil is predefined keyword.
``
$Start: E
E:    T,E1
E1:   tPlus,T,E1
E1:   nil
T:    F,T1
T1:   tMul,F,T1
T1:   nil
F:    F1
F1:   tLp,E,tRp
F1:   tId

''如同 yacc 里的%% 第一部分是定义 token 集，第二部分是定义文法。在此给出解析此文件，并生成临时文法规则数据结构的简单算法：

$start_lside_rule = ""
$token_list = Set.new
$gram_list = Hash.new
$first_set = Hash.new
$follow_set = Hash.new
# For ll table construction
$nonterm_token_list = [] # [E, F, T, ...]
$single_gram_list = [] # [{T->[E,tRp]}, ...]
$input_tokens = ["$"]
$ll_table = nil

def construct_table_model(rule_file_name)
   lines = IO.readlines(rule_file_name) 
   lines = lines.map { |l| l.chomp }
   token_def_phase = false
   rule_def_phase = false
   # temporarily record left side tokens and right tokens as string.
   lines.each do |line|
      # skip comment
      next if "#" == line[0]
      # Start token define phase.
      if "``" == line
         if !token_def_phase
            token_def_phase = true
         else
            token_def_phase = false
            rule_def_phase = true
         end
         next
      end

      if token_def_phase
         line.split(':')[1].split(',').each { |t| $token_list.add(t.strip) }
      elsif rule_def_phase
         rule = line.split(':').map { |item| item.strip }
         if "$Start" == rule[0]
            $start_lside_rule = rule[1]
         else
            rhs = rule[1].split(',').map { |r| r.strip }
            if $gram_list.has_key?(rule[0])
               # merge right side rule
               $gram_list[rule[0]] = $gram_list[rule[0]] << rhs
            else
               $gram_list[rule[0]] = [] << rhs
            end
         end
      end
   end
   $input_tokens.concat($token_list.to_a)
end

这个方法会生成一个全局的$token_list 集合，一个$input_tokens 的所有可能的输入 tokens 数组（即 token_list 加上终结符"$"），一个表示语法的$gram_list.

LL 预测分析表需要两个集合，一个是 First 集合，一个是 Follow 集合，首先给出构造这两者的算法，然后再说明这两个集合的用处，以及具体的构造步骤： First 集合：

def construct_first_set()
   #first to add all the terminal and 'nil' into the FIRST set.
   $gram_list.each do |lhs, rhs_arr|
      unless $first_set.has_key?(lhs)
         $first_set[lhs] = Set.new
      end

      rhs_arr.each do |rhs|
         if $token_list.include?(rhs.first) or "nil" == rhs.first
            $first_set[lhs].add(rhs.first)
         end
      end
   end
   # loop computing first set until no new item is added.
   changed = true
   while (changed)
      changed = false
      $gram_list.each do |lhs, rhs_arr|
         rhs_arr.each do |rhs|
            # If the first item of the rhs token is terminal or 'nil', just skip it because the token had already been
            # added in previous loop.
            next if $token_list.include?(rhs.first) or "nil" == rhs.first

            count = 0
            rhs.each do |rhs_term|
               # If 'nil' is contained in FIRST set of rhs_term, skip to the next term.
               if $first_set[rhs_term].include?("nil")
                  count += 1
                  next
               else
                  # merge the FIRST SET of rhs token with the lhs if necessary.
                  $first_set[rhs_term].each do |token|
                     unless $first_set[lhs].include?(token)
                        $first_set[lhs].add(token)
                        changed = true
                     end
                  end
                  break
               end
            end
            # check if all the rhs tokens produce nil, if so add nil to the lhs FIRST set.
            if rhs.size == count
               $first_set[lhs].add("nil")
            end
         end
      end
   end
end

First 集合的结果会放在全局变量$first_set 中，同样的对于 follow set，构造算法如下：

def construct_follow_set
   # Initialize FOLLOW set and add '$' into the FOLLOW set of start tokens.
   $gram_list.each_key do |lhs|
      $follow_set[lhs] = Set.new
      $follow_set[lhs].add("$") if lhs == $start_lside_rule
   end

   changed = true
   while (changed)
      changed = false
      $gram_list.each do |lhs, rhs_arr|
         rhs_arr.each do |rhs|
            rhs.each_index do |i|
               # skip the nil and terminal token.
               next if 'nil' == rhs[i] or $token_list.include?(rhs[i])

               # case 1, the current rhs term is the last one. 
               if (i+1) == rhs.size 
                  # Union the FOLLOW set of the rhs into the current rhs term.
                  $follow_set[lhs].each do |token|
                     unless $follow_set[rhs[i]].include?(token)
                        changed = true
                        $follow_set[rhs[i]].add(token)
                     end
                  end
               else
                  # First union the FIRST set of the follow rhs token into the current rhs token.
                  if "nil" != rhs[i+1]
                     if $token_list.include?(rhs[i+1])
                        unless $follow_set[rhs[i]].include?(rhs[i+1])
                           changed = true
                           $follow_set[rhs[i]].add(rhs[i+1])
                        end
                     else
                        $first_set[rhs[i+1]].each do |token|
                           if !$follow_set[rhs[i]].include?(token) and "nil" != token
                              changed = true
                              $follow_set[rhs[i]].add(token)
                           end
                        end
                     end
                  end
                  # If the follow rhs tokens produces a 'nil' chain, union the FOLLOW set of lhs into the current rhs token.
                  all_nil = true
                  j = i + 1
                  while (j < rhs.size)
                     # If there is any terminal, break.
                     if "nil" == rhs[j] or $token_list.include?(rhs[j])
                        all_nil = false
                        break
                     end

                     unless $first_set[rhs[j]].include?("nil")
                        all_nil = false
                        break
                     end
                     j += 1
                  end
                  if all_nil
                     $follow_set[lhs].each do |token|
                        unless $follow_set[rhs[i]].include?(token)
                           changed = true
                           $follow_set[rhs[i]].add(token)
                        end
                     end
                  end
               end
            end
         end
      end
   end
end

结果放在$follow_set.

3 个赞

LL (1) 预测分析表生成器。

无引用文章

yukihiro_matz #1 2014年05月22日

bhuztez #2 2014年05月23日

剩下的部分在这里

1 个赞

rasefon #3 2014年05月23日

#2 楼 @bhuztez 解析正则表达式文法本身的这块，你用的什么方法？

bhuztez #4 2014年05月23日

#3 楼 @rasefon 我觉得你这样搞方向反了，先写 Earley 再降级回来就容易多了，也就不存在怎么解析正则表达式的问题了

rasefon #5 2014年05月23日

#4 楼 @bhuztez 我的想法是这样的：抽象出正则表达式本身的 BNF 范式，然后用 LL 分析它，以此构造出正则表达式本身的 bnf 范式解析的引擎，以此来构造 nfa，dfa 等，因为正则式的特点就是词法分析异常简单，只有两种情况：1.通常字符一个一个进来，2.“和转义字符。

bhuztez #6 2014年05月23日

#5 楼 @rasefon 我觉得从 Earley 降级下来比较好，毕竟 Earley 简单啊

rasefon #7 2014年05月23日

#6 楼 @bhuztez LL 分析表生成器我已经写好了，难道再去写个 Earley……

bhuztez #8 2014年05月23日

#7 楼 @rasefon 我的思路是这样的，先写 Earley，再预先计算 parsing 中可能出现的各种情况，也就是把 Earley 编译成 Table，其中一些特殊情况可以降级为 LL Table，LR Table?

你这里计算 First 和 Follow 的方式，和把 Earley 编译成 Table 时用到的并没有本质区别

rasefon #9 2014年05月23日

#8 楼 @bhuztez 我记得 Earley 是属于比较通用的 parsing 算法，cf 的文法应该都可以解析，缺点是慢，但是没想过可不可以转成 LLTable 或者 LR。我觉得 LL 其实还不算复杂，LR（1）和 LALR（1）要更复杂，但是这两个效率高，就是适用范围没 Earley 广，我先去看看 Earley 的细节。

luikore #10 2014年05月23日

#6 楼 @bhuztez 你还可以先写个 RCG 的 parser, 然后降级到解析 CF 文法的 parser 呢，还是一步一步来吧

bhuztez #11 2014年05月23日

#10 楼 @luikore 正则表达式这里有个鸡蛋问题啊，正则表达式本身不能被正则表达式解析啊

rasefon #12 2014年05月23日

#11 楼 @bhuztez 正则式本身是可以被其他文法解析的。

luikore #13 2014年05月23日

#11 楼 @bhuztez 那正则也是 LL(1) 就足够表示的文法啊

bhuztez #14 2014年05月23日

#13 楼 @luikore Earley 比较简单啊

rasefon #15 2014年05月23日

我比较好奇，bhuztez 到底是干嘛的？

skandhas #16 2014年05月23日

#15 楼 @rasefon 专业歪楼的。嘿嘿

1 个赞

luikore #17 2014年05月23日

#14 楼 @bhuztez 把 right recursion 和 nullable 等各种问题修正后，就远比 LL 复杂了

rasefon #18 2014年05月26日

更新了一部分。

love93hate #19 2018年08月27日

机器学习交易——如何使用回归预测股票价格？最近翻译了一篇文章，本人对机器学习应用在量化投资上很感兴趣，希望可以和社区大神一起交流学习。

需要登录后方可回复, 如果你还没有账号请注册新账号

算法 如何自制正则表达式引擎 (一)

算法 如何自制正则表达式引擎 (一)

分割线，今天先更新一部分，具体算法的详细解析放在下回。

算法如何自制正则表达式引擎 (一)

算法如何自制正则表达式引擎 (一)