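June 20, 2013

Building a Virtual-Machine-Based Regular Expression Engine in PHP, Part 1

Hello, this is Kubota.

Do you use regular expressions? Whatever language you work in, not just PHP, I doubt there is a programmer who has never relied on them. Yet I suspect many people don't know how regular expressions are actually implemented.

In this article I'd like to explain one way of implementing a regular expression engine, namely running it on a virtual machine, and actually build a simple engine along the way.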

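How regular expression engines are implemented

There are several ways to implement a regular expression engine, and one of them is to run the matching process on a virtual machine. PCRE, the regular expression engine used by PHP, takes this approach.

The virtual machine approach is better known as a way of implementing programming languages than as a way of implementing regular expressions. Take CRuby, the most widely used Ruby implementation, from version 1.9 onward: Ruby code is first parsed, then compiled into an internal representation that an internal virtual machine called YARV can execute, and finally run by that virtual machine.

The same approach can in fact be applied to regular expressions. In this series of articles I'll explain how a virtual-machine-based regular expression engine works and implement a simple one.

Regular expression engines are normally implemented in C or C++, but I'm a hopeless slacker who can't properly read or write C, so here I'll implement one in everybody's favorite language, PHP.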

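Implementing a regular expression engine with a virtual machine

For the engine we're going to build, I'll mostly follow Regular Expression Matching: the Virtual Machine Approach. It describes how to implement a regular expression engine on a virtual machine in an easy-to-follow way. It's in English, but the vocabulary is plain, so even skimming it will leave you feeling like you more or less get it.

Here is the overview of the virtual machine, quoted from that article.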


To start, we'll define a regular expression virtual machine (think Java VM). The VM executes one or more threads, each running a regular expression program, which is just a list of regular expression instructions. Each thread maintains two registers while it runs: a program counter (PC) and a string pointer (SP). 
 
 The regular expression instructions are: 
 
    char c     If the character SP points at is not c, stop this thread: it failed.
               Otherwise, advance SP to the next character and advance PC to the next instruction. 
    match      Stop this thread: it found a match. 
    jmp x      Jump to (set the PC to point at) the instruction at x. 
    split x, y Split execution: continue at both x and y. Create a new thread with SP 
               copied from the current thread. One thread continues with PC x. 
               The other continues with PC y. (Like a simultaneous jump to both locations.)

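Looking at this, you can see that the virtual machine that executes regular expressions is surprisingly simple. It needs only two registers, PC and SP, and only four instructions: match, char, split, and jmp.

Compared with Zend Engine, the virtual machine inside PHP, which has around 150 instructions, I think you can see how ridiculously simple this is.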
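
To make the quoted description concrete, here is a minimal sketch of such a machine in PHP. It is not the implementation this series builds: the function name `runThreadSketch` and the array-based instruction format are assumptions made up for illustration, and `split` is simulated by recursive backtracking (try x first, then y) rather than by threads that really run side by side. The sample program is a hand-assembled one for `a+b`. The type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical instruction format (an assumption of this sketch):
//   ['char', 'a'], ['match'], ['jmp', x], ['split', x, y]
function runThreadSketch(array $program, string $input, int $pc, int $sp): bool
{
    while (true) {
        $inst = $program[$pc];
        switch ($inst[0]) {
            case 'char':
                // Fail if the input is exhausted or the character differs.
                if ($sp >= strlen($input) || $input[$sp] !== $inst[1]) {
                    return false;
                }
                $pc++;
                $sp++;
                break;
            case 'match':
                return true;
            case 'jmp':
                $pc = $inst[1];
                break;
            case 'split':
                // Run the first branch as a "new thread" with a copy of SP.
                if (runThreadSketch($program, $input, $inst[1], $sp)) {
                    return true;
                }
                // The first branch failed, so continue with the second one.
                $pc = $inst[2];
                break;
            default:
                throw new InvalidArgumentException('unknown instruction: ' . $inst[0]);
        }
    }
}

// Hand-assembled program for a+b.
$program = [
    ['char', 'a'],    // 0
    ['split', 0, 2],  // 1: loop back for another "a", or move on
    ['char', 'b'],    // 2
    ['match'],        // 3
];

var_dump(runThreadSketch($program, 'aab', 0, 0)); // bool(true)
var_dump(runThreadSketch($program, 'b', 0, 0));   // bool(false)
```

Since `match` stops the thread as soon as it is reached, this sketch reports success when a prefix of the input matches, starting at position 0. It also does nothing to guard against programs that loop without consuming input, so treat it as a reading aid only.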

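Implementation steps

The implementation will proceed in the following order.

1. Build the regular expression parser
2. Build the virtual machine
3. Build the compiler

In this article we start by building the regular expression parser. After that we build the virtual machine that performs the matching, and finally the compiler that translates a regular expression into virtual machine instructions.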
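
As a rough preview of what step 3 has to produce, the sketch below translates a tiny hand-built syntax tree into instructions, using the code-generation patterns described in the referenced article (for example, `e+` becomes the code for `e` followed by a `split` that jumps back to it). Everything here is an assumption for illustration: the array-shaped tree, the instruction format, and the function names `compileNodeSketch` and `compileRegexSketch` are hypothetical, and this is not the compiler built in this series. The type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical tree shape (not the RegexSyntaxNode tree from this article):
//   ['char', 'a'], ['cat', $l, $r], ['alt', $l, $r],
//   ['opt', $e], ['star', $e], ['plus', $e]
// $base is the absolute address where the emitted code will start.
function compileNodeSketch(array $node, int $base): array
{
    switch ($node[0]) {
        case 'char':
            return [['char', $node[1]]];
        case 'cat':
            $left = compileNodeSketch($node[1], $base);
            $right = compileNodeSketch($node[2], $base + count($left));
            return array_merge($left, $right);
        case 'alt':
            // split L1, L2 / L1: left / jmp L3 / L2: right / L3:
            $left = compileNodeSketch($node[1], $base + 1);
            $l2 = $base + 1 + count($left) + 1;
            $right = compileNodeSketch($node[2], $l2);
            $l3 = $l2 + count($right);
            return array_merge([['split', $base + 1, $l2]], $left, [['jmp', $l3]], $right);
        case 'opt':
            // split L1, L2 / L1: body / L2:
            $body = compileNodeSketch($node[1], $base + 1);
            return array_merge([['split', $base + 1, $base + 1 + count($body)]], $body);
        case 'star':
            // L1: split L2, L3 / L2: body / jmp L1 / L3:
            $body = compileNodeSketch($node[1], $base + 1);
            $l3 = $base + 1 + count($body) + 1;
            return array_merge([['split', $base + 1, $l3]], $body, [['jmp', $base]]);
        case 'plus':
            // L1: body / split L1, L3 / L3:
            $body = compileNodeSketch($node[1], $base);
            $l3 = $base + count($body) + 1;
            return array_merge($body, [['split', $base, $l3]]);
    }
    throw new InvalidArgumentException('unknown node type: ' . $node[0]);
}

function compileRegexSketch(array $tree): array
{
    // A complete program ends with a match instruction.
    return array_merge(compileNodeSketch($tree, 0), [['match']]);
}

// a+b as a hand-built tree.
$compiled = compileRegexSketch(['cat', ['plus', ['char', 'a']], ['char', 'b']]);
// $compiled is [['char', 'a'], ['split', 0, 2], ['char', 'b'], ['match']]
```

Passing the start address down as `$base` lets every `split` and `jmp` carry an absolute target, so no separate label-resolution pass is needed in a sketch this small.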

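Building the regular expression parser

The first step in implementing the engine is to build a parser for the regular expression grammar.

Here is a brief outline of the grammar we'll support. Since it's for explanation purposes, I've kept it to a simplified grammar.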


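* hoge|fuga  Alternation with "|" is supported
* a(ho|ge)b  Grouping with parentheses is supported
* a+b*c?     Repetition operators such as "+", "*", and "?" are supported

We'll build the parser for this grammar with PHPPEG. PHPPEG is a parser combinator library based on PEG, and it makes building a parser easy.

Building the parser isn't really the heart of the matter, so I'll skip the explanation here. See the PHPPEG documentation for how to use it.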


<?php
include_once __DIR__ . '/../vendor/autoload.php';
class RegexSyntaxParser implements PEG_IParser
{ 
    protected $regexParser;
    function __construct()
    {
        /*
         * regex <- split*
         * split <- operations ("|" operations)*
         * operations <- operation*
         * operation <- target suffixOperator?
         * target <- charClass / group / singleCharacter
         * suffixOperator <- "*" / "+" / "?"
         * group <- "(" split ")"
         * charClass <- "[" (!"]" .)+ "]"
         * singleCharacter <- ![+*?|[)] .
         */
        $singleCharacter = self::objectize('singleCharacter', 
            PEG::second(PEG::not(PEG::choice('*', '+', '?', '|', '[', ')')), PEG::anything())
        );
        $charClass = self::objectize('charClass', PEG::second(
            '[', 
            PEG::many1(PEG::second(PEG::not(']'), PEG::anything())),
            ']'
        ));
        $group = self::objectize('group', PEG::memo(PEG::second(
            '(', PEG::ref($split), ')'
        )));
        $suffixOperator = self::objectize('suffixOperator', PEG::choice('*', '+', '?'));
        $target = PEG::choice(
            $charClass, $group, $singleCharacter
        );
        $operation = self::objectize('operation', 
            PEG::seq($target, PEG::optional($suffixOperator))
        );
        $operations = self::objectize('operations', PEG::many($operation));
        $split = self::objectize('split', 
            PEG::choice(PEG::listof($operations, '|'), '')
        );
        $this->regexParser = self::objectize('regex', PEG::many($split));
    }
    /**
     * @return PEG_IParser
     */
    function getParser() 
    {
        return $this->regexParser;
    }
    /**
     * @param PEG_IContext $context
     */
    function parse(PEG_IContext $context)
    {
        return $this->regexParser->parse($context);
    }
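    /**
     * Wrap a parser so that its result becomes a named RegexSyntaxNode.
     *
     * @param string $name
     * @param PEG_IParser $parser
     * @return PEG_IParser
     */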
    protected static function objectize($name, PEG_IParser $parser)
    {
        return PEG::hook(function($result) use($name) {
            return new RegexSyntaxNode($name, $result);
        }, $parser);
    }
}
class RegexSyntaxNode
{
    protected $name, $content;
    function __construct($name, $content) 
    {
        $this->name = $name;
        $this->content = $content;
    }
    function __toString()
    {
        $result = '';
        $result .= $this->name . " {\n";
        $result .= $this->dump($this->content);
        $result .= "}";
        return $result;
    }
    protected function dump($content)
    {
        $result = '';
        if (is_array($content)) {
            foreach ($content as $i => $element) {
                $result .= $this->dump($element);
            }
        } elseif ($content instanceof self) {
            $result .= self::indent($content->__toString()) . "\n";
        } else {
            $result .= self::indent(var_export($content, true)) . "\n";
        }
        return $result;
    }
    static function indent($str) {
        // Try "\r\n" first so a CRLF pair is not split into two line breaks.
        $lines = preg_split("/\r\n|\r|\n/", $str);
        foreach ($lines as $i => $line) {
            $lines[$i] = '  ' . $line;
        }
        // Separator first: the legacy (array, separator) order was removed in PHP 8.
        return implode("\n", $lines);
    }
}

This code and the project are published on GitHub, so take a look there if you'd like to actually run it.

Let's feed some regular expressions to this parser.


<?php
include_once __DIR__ . '/../src/PHPRegex.php';
$parser = new RegexSyntaxParser();
echo 'a => ' . $parser->parse(PEG::context('a')) . "\n\n";
echo 'a|b =>' .  $parser->parse(PEG::context('a|b')) . "\n\n"; 
echo 'a(bc) => ' . $parser->parse(PEG::context('a(bc)')) . "\n\n";
echo 'a+b*c? => ' . $parser->parse(PEG::context('a+b*c?')) . "\n\n";

This produces the following output. You can see that the regular expressions are parsed properly and a syntax tree is built for each.


a => regex {
  split {
    operations {
      operation {
        singleCharacter {
          'a'
        }
        false
      }
    }
  }
}
a|b =>regex {
  split {
    operations {
      operation {
        singleCharacter {
          'a'
        }
        false
      }
    }
    operations {
      operation {
        singleCharacter {
          'b'
        }
        false
      }
    }
  }
}
a(bc) => regex {
  split {
    operations {
      operation {
        singleCharacter {
          'a'
        }
        false
      }
      operation {
        group {
          split {
            operations {
              operation {
                singleCharacter {
                  'b'
                }
                false
              }
              operation {
                singleCharacter {
                  'c'
                }
                false
              }
            }
          }
        }
        false
      }
    }
  }
}
a+b*c? => regex {
  split {
    operations {
      operation {
        singleCharacter {
          'a'
        }
        suffixOperator {
          '+'
        }
      }
      operation {
        singleCharacter {
          'b'
        }
        suffixOperator {
          '*'
        }
      }
      operation {
        singleCharacter {
          'c'
        }
        suffixOperator {
          '?'
        }
      }
    }
  }
}