pytho****@googl*****
June 4, 2011 (Sat) 20:08:27 JST
Revision: 90fd33e289a8 Author: Akihiro Uchida <uchid****@ike-d*****> Date: Sun May 22 06:08:07 2011 Log: translate howto/regex.rst http://code.google.com/p/python-doc-ja/source/detail?r=90fd33e289a8 Modified: /howto/regex.rst ======================================= --- /howto/regex.rst Fri May 20 04:17:35 2011 +++ /howto/regex.rst Sun May 22 06:08:07 2011 @@ -1,8 +1,13 @@ +.. + **************************** + Regular Expression HOWTO + **************************** + .. _regex-howto: -*********************************** - Regular Expression HOWTO (英語) -*********************************** +****************** + 正規表現 HOWTO +****************** :Author: A.M. Kuchling :Release: 0.05 @@ -14,315 +19,640 @@ Unicode (at least a reference) -.. topic:: Abstract - - This document is an introductory tutorial to using regular expressions in Python - with the :mod:`re` module. It provides a gentler introduction than the - corresponding section in the Library Reference. - - -Introduction -============ - -The :mod:`re` module was added in Python 1.5, and provides Perl-style regular -expression patterns. Earlier versions of Python came with the :mod:`regex` -module, which provided Emacs-style patterns. The :mod:`regex` module was -removed completely in Python 2.5. - -Regular expressions (called REs, or regexes, or regex patterns) are essentially -a tiny, highly specialized programming language embedded inside Python and made -available through the :mod:`re` module. Using this little language, you specify -the rules for the set of possible strings that you want to match; this set might -contain English sentences, or e-mail addresses, or TeX commands, or anything you -like. You can then ask questions such as "Does this string match the pattern?", -or "Is there a match for the pattern anywhere in this string?". You can also -use REs to modify a string or to split it apart in various ways. 
- -Regular expression patterns are compiled into a series of bytecodes which are -then executed by a matching engine written in C. For advanced use, it may be -necessary to pay careful attention to how the engine will execute a given RE, -and write the RE in a certain way in order to produce bytecode that runs faster. -Optimization isn't covered in this document, because it requires that you have a -good understanding of the matching engine's internals. - -The regular expression language is relatively small and restricted, so not all -possible string processing tasks can be done using regular expressions. There -are also tasks that *can* be done with regular expressions, but the expressions -turn out to be very complicated. In these cases, you may be better off writing -Python code to do the processing; while Python code will be slower than an -elaborate regular expression, it will also probably be more understandable. - - -Simple Patterns -=============== - -We'll start by learning about the simplest possible regular expressions. Since -regular expressions are used to operate on strings, we'll begin with the most -common task: matching characters. - -For a detailed explanation of the computer science underlying regular -expressions (deterministic and non-deterministic finite automata), you can refer -to almost any textbook on writing compilers. - - -Matching Characters -------------------- - -Most letters and characters will simply match themselves. For example, the -regular expression ``test`` will match the string ``test`` exactly. (You can -enable a case-insensitive mode that would let this RE match ``Test`` or ``TEST`` -as well; more about this later.) - -There are exceptions to this rule; some characters are special -:dfn:`metacharacters`, and don't match themselves. Instead, they signal that -some out-of-the-ordinary thing should be matched, or they affect other portions -of the RE by repeating them or changing their meaning. 
Much of this document is -devoted to discussing various metacharacters and what they do. - -Here's a complete list of the metacharacters; their meanings will be discussed -in the rest of this HOWTO. :: + +.. + .. topic:: Abstract + + This document is an introductory tutorial to using regular expressions in Python + with the :mod:`re` module. It provides a gentler introduction than the + corresponding section in the Library Reference. + +.. topic:: 概要 + + このドキュメントは :mod:`re` モジュールを使って Python で正規表現を扱う ための + 導入のチュートリアルです。 + ライブラリレファレンスの正規表現の節よりもやさしい入門ドキュメントを用意 しています。 + +.. + Introduction + ============ + +入門 +==== + +.. + The :mod:`re` module was added in Python 1.5, and provides Perl-style regular + expression patterns. Earlier versions of Python came with the :mod:`regex` + module, which provided Emacs-style patterns. The :mod:`regex` module was + removed completely in Python 2.5. + +:mod:`re` モジュール Python 1.5 で追加され、Perl スタイルの正規表現パターン を提供します。 +それ以前の Python では :mod:`regex` モジュールが Emacs スタイルのパターンを 提供していました。 +:mod:`regex` モジュールは Python 2.5 で完全に削除されました。 + +.. + Regular expressions (called REs, or regexes, or regex patterns) are essentially + a tiny, highly specialized programming language embedded inside Python and made + available through the :mod:`re` module. Using this little language, you specify + the rules for the set of possible strings that you want to match; this set might + contain English sentences, or e-mail addresses, or TeX commands, or anything you + like. You can then ask questions such as "Does this string match the pattern?", + or "Is there a match for the pattern anywhere in this string?". You can also + use REs to modify a string or to split it apart in various ways. 
+ +正規表現 regular expressions (REs や regexes または regex patterns と呼ばれ ます) は +本質的に小さく、Python 内部に埋め込まれた高度に特化したプログラミング言語で +:mod:`re` モジュールから利用可能です。 +この小さな言語を利用することで、マッチさせたい文字列に適合するような文字列 の集合を +指定することができます; +この集合は英文や e-mail アドレスや TeX コマンドなど、どんなものでも構いませ ん。 +「この文字列は指定したパターンにマッチしますか?」 +「このパターンはこの文字列のどの部分にマッチするのですか?」といったことを +問い合わせることができます。 +正規表現を使って文字列を変更したりいろいろな方法で別々の部分に分割したりす ることもできます。 + +.. + Regular expression patterns are compiled into a series of bytecodes which are + then executed by a matching engine written in C. For advanced use, it may be + necessary to pay careful attention to how the engine will execute a given RE, + and write the RE in a certain way in order to produce bytecode that runs faster. + Optimization isn't covered in this document, because it requires that you have a + good understanding of the matching engine's internals. + +正規表現パターンは一連のバイトコードとしてコンパイルされ、 +C で書かれたマッチングエンジンによって実行されます。 +より進んだ利用法では、エンジンがどう与えられた正規表現を実行するかに注意す ることが +必要になり、高速に実行できるバイトコードを生成するように正規表現を書くこと になります。 +このドキュメントでは最適化までは扱いません、それにはマッチングエンジンの内 部に対する十分な理解が必要だからです。 + +.. + The regular expression language is relatively small and restricted, so not all + possible string processing tasks can be done using regular expressions. There + are also tasks that *can* be done with regular expressions, but the expressions + turn out to be very complicated. In these cases, you may be better off writing + Python code to do the processing; while Python code will be slower than an + elaborate regular expression, it will also probably be more understandable. + +正規表現言語は相対的に小さく、制限されています、 +そのため正規表現を使ってあらゆる文字列処理作業を行なえるわけではありませ ん。 +正規表現を使って行うことのできる作業もあります、 +ただ表現はとても複雑なものになります。 +それらの場合では、Python コードを書いた方がいいでしょう; +Python コードは念入りに作られた正規表現より遅くなりますが、 +おそらくより読み易いでしょう。 + +.. + Simple Patterns + =============== + +単純なパターン +============== + +.. + We'll start by learning about the simplest possible regular expressions. 
Since + regular expressions are used to operate on strings, we'll begin with the most + common task: matching characters. + +まずはできるだけ簡単な正規表現を学ぶことから始めてみましょう。 +正規表現は文字列の操作に使われるので、ますは最も一般的な作業である文字のマ ッチングをしてみます。 + +.. + For a detailed explanation of the computer science underlying regular + expressions (deterministic and non-deterministic finite automata), you can refer + to almost any textbook on writing compilers. + +正規表現の基礎を成す計算機科学 (決定、非決定有限オートマトン) の詳細な説明 については, +コンパイラ作成に関するテキストブックをどれでもいいので参照して下さい。 + +.. + Matching Characters + ------------------- + +文字のマッチング +---------------- + +.. + Most letters and characters will simply match themselves. For example, the + regular expression ``test`` will match the string ``test`` exactly. (You can + enable a case-insensitive mode that would let this RE match ``Test`` or ``TEST`` + as well; more about this later.) + +多くの活字や文字は単純にそれ自身とマッチします。例えば、 ``test`` という正 規表現は文字列 ``test`` に厳密にマッチします。 +(大文字小文字を区別しないモードでその正規表現が ``Test`` や ``TEST`` にも同 様にマッチすることもできます; 詳しくは後述します。) + +.. + There are exceptions to this rule; some characters are special + :dfn:`metacharacters`, and don't match themselves. Instead, they signal that + some out-of-the-ordinary thing should be matched, or they affect other portions + of the RE by repeating them or changing their meaning. Much of this document is + devoted to discussing various metacharacters and what they do. + +この規則には例外が存在します; いくつかの文字は特別な :dfn:`特殊文字 (metacharacters)` で、それら自身にマッチしません。 +代わりに通常のマッチするものとは違うという合図を出したり、正規表現の一部に 対して繰り返したり、意味を変えたりして影響を与えます。 +このドキュメントの中の多くは様々な特殊文字とそれが何をするかについて論じる ことになります。 + +.. + Here's a complete list of the metacharacters; their meanings will be discussed + in the rest of this HOWTO. :: + +ここに特殊文字の完全な一覧があります; これらの意味はこの HOWTO の残りの部分 で説明します:: . ^ $ * + ? { [ ] \ | ( ) -The first metacharacters we'll look at are ``[`` and ``]``. They're used for -specifying a character class, which is a set of characters that you wish to -match. 
Characters can be listed individually, or a range of characters can be -indicated by giving two characters and separating them by a ``'-'``. For -example, ``[abc]`` will match any of the characters ``a``, ``b``, or ``c``; this -is the same as ``[a-c]``, which uses a range to express the same set of -characters. If you wanted to match only lowercase letters, your RE would be -``[a-z]``. - -Metacharacters are not active inside classes. For example, ``[akm$]`` will -match any of the characters ``'a'``, ``'k'``, ``'m'``, or ``'$'``; ``'$'`` is -usually a metacharacter, but inside a character class it's stripped of its -special nature. - -You can match the characters not listed within the class by :dfn:`complementing` -the set. This is indicated by including a ``'^'`` as the first character of the -class; ``'^'`` outside a character class will simply match the ``'^'`` -character. For example, ``[^5]`` will match any character except ``'5'``. - -Perhaps the most important metacharacter is the backslash, ``\``. As in Python -string literals, the backslash can be followed by various characters to signal -various special sequences. It's also used to escape all the metacharacters so -you can still match them in patterns; for example, if you need to match a ``[`` -or ``\``, you can precede them with a backslash to remove their special -meaning: ``\[`` or ``\\``. - -Some of the special sequences beginning with ``'\'`` represent predefined sets -of characters that are often useful, such as the set of digits, the set of -letters, or the set of anything that isn't whitespace. The following predefined -special sequences are available: +.. + The first metacharacters we'll look at are ``[`` and ``]``. They're used for + specifying a character class, which is a set of characters that you wish to + match. Characters can be listed individually, or a range of characters can be + indicated by giving two characters and separating them by a ``'-'``. 
For + example, ``[abc]`` will match any of the characters ``a``, ``b``, or ``c``; this + is the same as ``[a-c]``, which uses a range to express the same set of + characters. If you wanted to match only lowercase letters, your RE would be + ``[a-z]``. + +最初に扱う特殊文字は ``[`` と ``]`` です。 +これらは文字クラスを指定します、文字クラスはマッチしたい文字の集合です。 +文字は個別にリストにしても構いませんし、二つの文字を ``'-'`` でつなげて文字 を範囲で与えてもかまいません。 +たとえば ``[abc]`` は ``a``, ``b``, または ``c`` のどの文字列にもマッチしま す; +これは ``[a-c]`` で同じ文字集合を範囲で表現しても全く同じです。 +小文字のアルファベットのみにマッチしたい場合、 ``[a-z]`` の正規表現をつかう ことになるでしょう。 + +.. + Metacharacters are not active inside classes. For example, ``[akm$]`` will + match any of the characters ``'a'``, ``'k'``, ``'m'``, or ``'$'``; ``'$'`` is + usually a metacharacter, but inside a character class it's stripped of its + special nature. + +特殊文字は文字クラスの内部では有効になりません。 +例えば、 ``[akm$]`` は ``'a'``, ``'k'``, ``'m'`` または ``'$'`` にマッチ します; +``'$'`` は通常は特殊文字ですが、文字クラス内部では特殊な性質は取り除かれま す。 + +.. + You can match the characters not listed within the class by :dfn:`complementing` + the set. This is indicated by including a ``'^'`` as the first character of the + class; ``'^'`` outside a character class will simply match the ``'^'`` + character. For example, ``[^5]`` will match any character except ``'5'``. + +文字クラス内のリストにない文字に対しても :dfn:`補集合` を使ってマッチするこ とができます。 +補集合はクラスの最初の文字として ``'^'`` を含めることで表すことができます; +文字クラスの外側の ``'^'`` は単に ``'^'`` 文字にマッチします。 +例えば、 ``[^5]`` は ``'5'`` を除く任意の文字にマッチします。 + +.. + Perhaps the most important metacharacter is the backslash, ``\``. As in Python + string literals, the backslash can be followed by various characters to signal + various special sequences. It's also used to escape all the metacharacters so + you can still match them in patterns; for example, if you need to match a ``[`` + or ``\``, you can precede them with a backslash to remove their special + meaning: ``\[`` or ``\\``. 
+ +おそらく最も重要な特殊文字はバックスラッシュ ``\`` でしょう。 +Python の文字列リテラルのようにバックスラッシュに続けていろいろな文字を入力 することでいろいろな特殊シーケンスの合図を送ることができます。 +また、バックスラッシュはすべての特殊文字をエスケープするのにも利用されま す、 +つまり、特殊文字をマッチさせることができます; +例えば、 ``[`` または ``\`` にマッチさせたい場合、それらをバックスラッシュ に続けることで特殊な意味を除きます: ``\[`` または ``\\`` 。 + +.. + Some of the special sequences beginning with ``'\'`` represent predefined sets + of characters that are often useful, such as the set of digits, the set of + letters, or the set of anything that isn't whitespace. The following predefined + special sequences are available: + +いくつかの ``'\'`` で始まる特殊シーケンスはあらかじめ定義された文字集合を表 していて、 +しばしば便利に使うことができます、例えば、10進数の集合、文字の集合、空白以 外の任意の文字の集合。 +以下のあらかじめ定義された特殊シーケンスが利用可能です。 + +.. + ``\d`` + Matches any decimal digit; this is equivalent to the class ``[0-9]``. + + ``\D`` + Matches any non-digit character; this is equivalent to the class ``[^0-9]``. + + ``\s`` + Matches any whitespace character; this is equivalent to the class ``[ + \t\n\r\f\v]``. + + ``\S`` + Matches any non-whitespace character; this is equivalent to the class ``[^ + \t\n\r\f\v]``. + + ``\w`` + Matches any alphanumeric character; this is equivalent to the class + ``[a-zA-Z0-9_]``. + + ``\W`` + Matches any non-alphanumeric character; this is equivalent to the class + ``[^a-zA-Z0-9_]``. ``\d`` - Matches any decimal digit; this is equivalent to the class ``[0-9]``. + 任意の十進数とマッチします;これは集合 ``[0-9]`` と同じ意味です。 ``\D`` - Matches any non-digit character; this is equivalent to the class ``[^0-9]``. + 任意の非数字文字とマッチします;これは集合 ``[^0-9]`` と同じ意味です。 ``\s`` - Matches any whitespace character; this is equivalent to the class ``[ - \t\n\r\f\v]``. + 任意の空白文字とマッチします;これは集合 ``[\t\n\r\f\v]`` と同じ意味で す。 ``\S`` - Matches any non-whitespace character; this is equivalent to the class ``[^ - \t\n\r\f\v]``. + 任意の非空白文字とマッチします;これは集合 ``[^\t\n\r\f\v]`` と同じ意味 です。 ``\w`` - Matches any alphanumeric character; this is equivalent to the class - ``[a-zA-Z0-9_]``. 
+ 任意の英数文字および下線とマッチします;これは、集合 ``[a-zA-Z0-9_]`` と 同じ意味です。 ``\W`` - Matches any non-alphanumeric character; this is equivalent to the class - ``[^a-zA-Z0-9_]``. - -These sequences can be included inside a character class. For example, -``[\s,.]`` is a character class that will match any whitespace character, or -``','`` or ``'.'``. - -The final metacharacter in this section is ``.``. It matches anything except a -newline character, and there's an alternate mode (``re.DOTALL``) where it will -match even a newline. ``'.'`` is often used where you want to match "any -character". - - -Repeating Things ----------------- - -Being able to match varying sets of characters is the first thing regular -expressions can do that isn't already possible with the methods available on -strings. However, if that was the only additional capability of regexes, they -wouldn't be much of an advance. Another capability is that you can specify that -portions of the RE must be repeated a certain number of times. - -The first metacharacter for repeating things that we'll look at is ``*``. ``*`` -doesn't match the literal character ``*``; instead, it specifies that the -previous character can be matched zero or more times, instead of exactly once. - -For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``), -``caaat`` (3 ``a`` characters), and so forth. The RE engine has various -internal limitations stemming from the size of C's ``int`` type that will -prevent it from matching over 2 billion ``a`` characters; you probably don't -have enough memory to construct a string that large, so you shouldn't run into -that limit. - -Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching -engine will try to repeat it as many times as possible. If later portions of the -pattern don't match, the matching engine will then back up and try again with -few repetitions. - -A step-by-step example will make this more obvious. 
Let's consider the -expression ``a[bcd]*b``. This matches the letter ``'a'``, zero or more letters -from the class ``[bcd]``, and finally ends with a ``'b'``. Now imagine matching -this RE against the string ``abcbd``. - -+------+-----------+---------------------------------+ -| Step | Matched | Explanation | -+======+===========+=================================+ -| 1 | ``a`` | The ``a`` in the RE matches. | -+------+-----------+---------------------------------+ -| 2 | ``abcbd`` | The engine matches ``[bcd]*``, | -| | | going as far as it can, which | -| | | is to the end of the string. | -+------+-----------+---------------------------------+ -| 3 | *Failure* | The engine tries to match | -| | | ``b``, but the current position | -| | | is at the end of the string, so | -| | | it fails. | -+------+-----------+---------------------------------+ -| 4 | ``abcb`` | Back up, so that ``[bcd]*`` | -| | | matches one less character. | -+------+-----------+---------------------------------+ -| 5 | *Failure* | Try ``b`` again, but the | -| | | current position is at the last | -| | | character, which is a ``'d'``. | -+------+-----------+---------------------------------+ -| 6 | ``abc`` | Back up again, so that | -| | | ``[bcd]*`` is only matching | -| | | ``bc``. | -+------+-----------+---------------------------------+ -| 6 | ``abcb`` | Try ``b`` again. This time | -| | | the character at the | -| | | current position is ``'b'``, so | -| | | it succeeds. | -+------+-----------+---------------------------------+ - -The end of the RE has now been reached, and it has matched ``abcb``. This -demonstrates how the matching engine goes as far as it can at first, and if no -match is found it will then progressively back up and retry the rest of the RE -again and again. It will back up until it has tried zero matches for -``[bcd]*``, and if that subsequently fails, the engine will conclude that the -string doesn't match the RE at all. 
- -Another repeating metacharacter is ``+``, which matches one or more times. Pay -careful attention to the difference between ``*`` and ``+``; ``*`` matches -*zero* or more times, so whatever's being repeated may not be present at all, -while ``+`` requires at least *one* occurrence. To use a similar example, -``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match -``ct``. - -There are two more repeating qualifiers. The question mark character, ``?``, -matches either once or zero times; you can think of it as marking something as -being optional. For example, ``home-?brew`` matches either ``homebrew`` or -``home-brew``. - -The most complicated repeated qualifier is ``{m,n}``, where *m* and *n* are -decimal integers. This qualifier means there must be at least *m* repetitions, -and at most *n*. For example, ``a/{1,3}b`` will match ``a/b``, ``a//b``, and -``a///b``. It won't match ``ab``, which has no slashes, or ``a////b``, which -has four. - -You can omit either *m* or *n*; in that case, a reasonable value is assumed for -the missing value. Omitting *m* is interpreted as a lower limit of 0, while -omitting *n* results in an upper bound of infinity --- actually, the upper bound -is the 2-billion limit mentioned earlier, but that might as well be infinity. - -Readers of a reductionist bent may notice that the three other qualifiers can -all be expressed using this notation. ``{0,}`` is the same as ``*``, ``{1,}`` -is equivalent to ``+``, and ``{0,1}`` is the same as ``?``. It's better to use -``*``, ``+``, or ``?`` when you can, simply because they're shorter and easier -to read. - - -Using Regular Expressions -========================= - -Now that we've looked at some simple regular expressions, how do we actually use -them in Python? The :mod:`re` module provides an interface to the regular -expression engine, allowing you to compile REs into objects and then perform -matches with them. 
- - -Compiling Regular Expressions ------------------------------ - -Regular expressions are compiled into :class:`RegexObject` instances, which have -methods for various operations such as searching for pattern matches or -performing string substitutions. :: + 任意の非英数文字とマッチします;これは集合 ``[^a-zA-Z0-9_]`` と同じ意味 です。 + +.. + These sequences can be included inside a character class. For example, + ``[\s,.]`` is a character class that will match any whitespace character, or + ``','`` or ``'.'``. + +これらのシーケンスは文字クラス内に含めることができます。 +例えば、 ``[\s,.]`` は空白文字や ``','`` または ``'.'`` にマッチする文字ク ラスです。 + +.. + The final metacharacter in this section is ``.``. It matches anything except a + newline character, and there's an alternate mode (``re.DOTALL``) where it will + match even a newline. ``'.'`` is often used where you want to match "any + character". + +この節での最後の特殊文字は ``.`` です。 +これは改行文字を除く任意の文字にマッチします、 +さらに改行文字に対してもマッチさせる代替モード (``re.DOTALL``) があります。 +``'.'`` は「任意の文字」にマッチさせたい場合に利用されます。 + +.. + Repeating Things + ---------------- + +繰り返し +-------- + +.. + Being able to match varying sets of characters is the first thing regular + expressions can do that isn't already possible with the methods available on + strings. However, if that was the only additional capability of regexes, they + wouldn't be much of an advance. Another capability is that you can specify that + portions of the RE must be repeated a certain number of times. + +さまざまな文字集合をマッチさせることは正規表現で最初にできるようになること で、 +これは文字列に対するメソッドですぐにできることではありません。 +しかし、正規表現がより力を発揮する場面がこれだけだとすると、正規表現はあま り先進的とはいえません。 +正規表現の力をもう一つの能力は、正規表現の一部が何度も繰り返されるようもの を指定できることです。 + +.. + The first metacharacter for repeating things that we'll look at is ``*``. ``*`` + doesn't match the literal character ``*``; instead, it specifies that the + previous character can be matched zero or more times, instead of exactly once. + +最初にとりあげる繰り返しのための最初の特殊文字は ``*`` です。 +``*`` は文字リテラル ``*`` とはマッチしません; +その代わりに前の文字が厳密に1回ではなく、0回以上繰り返されるパターンを指定 します。 + +.. 
+ For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``), + ``caaat`` (3 ``a`` characters), and so forth. The RE engine has various + internal limitations stemming from the size of C's ``int`` type that will + prevent it from matching over 2 billion ``a`` characters; you probably don't + have enough memory to construct a string that large, so you shouldn't run into + that limit. + +例えば、 ``ca*t`` は ``ct`` (``a`` が0文字)、 ``cat`` (``a`` が1文字)、 +``caaat`` (``a`` 3文字)、続々。 +正規表現エンジンには C の ``int`` 型のサイズのために +20億文字の ``a`` とのマッチングができないなど多くの内部制限があります; +おそらくそれほど大きい文字列を構築するほどの十分なメモリはないので、 +その制限に達することはありません。 + +.. + Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching + engine will try to repeat it as many times as possible. If later portions of the + pattern don't match, the matching engine will then back up and try again with + few repetitions. + + +``*`` のような繰り返しは :dfn:`貪欲 (greedy)` です; +正規表現を繰り返したいとき、マッチングエンジンは可能な限り何度も繰り返そう と試みます。 +パターンの後ろの部分にマッチしない場合、マッチングエンジンは戻って少ない繰 り返しを再び試みます。 + +.. + A step-by-step example will make this more obvious. Let's consider the + expression ``a[bcd]*b``. This matches the letter ``'a'``, zero or more letters + from the class ``[bcd]``, and finally ends with a ``'b'``. Now imagine matching + this RE against the string ``abcbd``. + +例をステップ、ステップで進めていくとより明確にわかります。 +正規表現 ``a[bcd]*b`` を考えましょう。 +この表現は文字 ``'a'`` と文字クラス ``[bcd]`` の0回以上の文字と最後の ``'b'`` にマッチします。 +この正規表現が文字列 ``abcbd`` に対してマッチする作業を想像してみましょう。 + +.. + +------+-----------+---------------------------------+ + | Step | Matched | Explanation | + +======+===========+=================================+ + | 1 | ``a`` | The ``a`` in the RE matches. | + +------+-----------+---------------------------------+ + | 2 | ``abcbd`` | The engine matches ``[bcd]*``, | + | | | going as far as it can, which | + | | | is to the end of the string. 
| + +------+-----------+---------------------------------+ + | 3 | *Failure* | The engine tries to match | + | | | ``b``, but the current position | + | | | is at the end of the string, so | + | | | it fails. | + +------+-----------+---------------------------------+ + | 4 | ``abcb`` | Back up, so that ``[bcd]*`` | + | | | matches one less character. | + +------+-----------+---------------------------------+ + | 5 | *Failure* | Try ``b`` again, but the | + | | | current position is at the last | + | | | character, which is a ``'d'``. | + +------+-----------+---------------------------------+ + | 6 | ``abc`` | Back up again, so that | + | | | ``[bcd]*`` is only matching | + | | | ``bc``. | + +------+-----------+---------------------------------+ + | 6 | ``abcb`` | Try ``b`` again. This time | + | | | the character at the | + | | | current position is ``'b'``, so | + | | | it succeeds. | + +------+-----------+---------------------------------+ + ++----------+------------------+----------------------------------+ +| ステップ | マッチした文字列 | 説明 | ++==========+==================+==================================+ +| 1 | ``a`` | ``a`` が正規表現にマッチ。 | ++----------+------------------+----------------------------------+ +| 2 | ``abcbd`` | 正規表現エンジンが `[bcd]*`` で | +| | | 文字列の最後まで可能な限り進む。 | ++----------+------------------+----------------------------------+ +| 3 | *失敗* | エンジンが ``b`` とのマッチを | +| | | 試みるが、現在の位置が | +| | | 文字列の最後なので、失敗する。 | ++----------+------------------+----------------------------------+ +| 4 | ``abcb`` | 戻って ``[bcd]*`` は一文字少なく | +| | | マッチ。 | ++----------+------------------+----------------------------------+ +| 5 | *失敗* | 再び ``b`` へのマッチを | +| | | 試みるが、現在の文字は | +| | | 最後の文字 ``'d'`` 。 | ++----------+------------------+----------------------------------+ +| 6 | ``abc`` | 再び戻る, ``[bcd]*`` は ``bc`` | +| | | のみにマッチ。 | ++----------+------------------+----------------------------------+ +| 7 | ``abcb`` | 再び ``b`` を試みる。 | +| | | 今回の現在位置の文字は | +| | | ``'b'`` なので成功。 | 
++----------+------------------+----------------------------------+ + +.. + The end of the RE has now been reached, and it has matched ``abcb``. This + demonstrates how the matching engine goes as far as it can at first, and if no + match is found it will then progressively back up and retry the rest of the RE + again and again. It will back up until it has tried zero matches for + ``[bcd]*``, and if that subsequently fails, the engine will conclude that the + string doesn't match the RE at all. + +正規表現の終端に達して、 ``abcd`` にマッチしました。 +この例はマッチングエンジンが最初に到達できるところまで進みマッチしなかった 場合、 +逐次戻って再度残りの正規表現とのマッチを次々と試みること様子を示していま す。 +エンジンは ``[bcd]*`` とマッチしなくなるまで戻ります、 +さらに続く正規表現とのマッチに失敗した場合にエンジンは +正規表現と文字列が完全にマッチしないと結論づけることになります。 + +.. + Another repeating metacharacter is ``+``, which matches one or more times. Pay + careful attention to the difference between ``*`` and ``+``; ``*`` matches + *zero* or more times, so whatever's being repeated may not be present at all, + while ``+`` requires at least *one* occurrence. To use a similar example, + ``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match + ``ct``. + +別の繰り返しの特殊文字は ``+`` です、この特殊文字は1回以上の繰り返しにマッ チします。 +``*`` と ``+`` に違いに対しては十分注意して下さい; +``*`` は *0回* 以上の繰り返しにマッチします、つまり繰り返す部分が全くなくて も問題ありません、 +一方で ``+`` は少なくとも *1回* は表われる必要があります。 +同様の例を使うと +``ca+t`` は ``cat`` (``a`` 1文字), ``caaat`` (``a`` 3文字), とマッチし、 +``ct`` とはマッチしません。 + +.. + There are two more repeating qualifiers. The question mark character, ``?``, + matches either once or zero times; you can think of it as marking something as + being optional. For example, ``home-?brew`` matches either ``homebrew`` or + ``home-brew``. + +2回以上の繰り返しを制限する修飾子も存在します。 +クエスチョンマーク ``?`` は0か1回のどちらかにマッチします; +これはオプションであることを示していると考えることもできます。 +例えば、 ``home-?brew`` は ``homebrew`` と ``home-brew`` のどちらにもマッ チします。 + +.. + The most complicated repeated qualifier is ``{m,n}``, where *m* and *n* are + decimal integers. 
This qualifier means there must be at least *m* repetitions, + and at most *n*. For example, ``a/{1,3}b`` will match ``a/b``, ``a//b``, and + ``a///b``. It won't match ``ab``, which has no slashes, or ``a////b``, which + has four. + +より複雑に繰り返しを制限するのは ``{m,n}`` です、ここで *m* と *n* は10進数 の整数です。 +この修飾子は最低 *m* 回、最大で *n* 回の繰り返すことを意味しています。 +例えば、 ``a/{1,3}b`` は ``a/b`` と ``a//b`` そして ``a///b`` にマッチしま す。 +これはスラッシュの無い ``ab`` や4つのスラッシュを持つ ``a////b`` とはマッチ しません。 + +.. + You can omit either *m* or *n*; in that case, a reasonable value is assumed for + the missing value. Omitting *m* is interpreted as a lower limit of 0, while + omitting *n* results in an upper bound of infinity --- actually, the upper bound + is the 2-billion limit mentioned earlier, but that might as well be infinity. + +*m* か *n* のどちらかは省略することができます; +そうした場合省略された値はもっともらしい値と仮定されます。 +*m* の省略は下限 0 と解釈され、 *n* の省略は無限の上限として解釈されます +--- 実際には上限は前に述べたように20億ですが、無限大とみなしてもいいでしょ う。 + +.. + Readers of a reductionist bent may notice that the three other qualifiers can + all be expressed using this notation. ``{0,}`` is the same as ``*``, ``{1,}`` + is equivalent to ``+``, and ``{0,1}`` is the same as ``?``. It's better to use + ``*``, ``+``, or ``?`` when you can, simply because they're shorter and easier + to read. + +還元主義的素養のある読者は、3つの修飾子がこの表記で表現できることに気づくで しょう。 +``{0,}`` は ``*`` と同じで ``{1,}`` は ``+`` と、そして ``{0,1}`` は ``?`` と同じです。 +利用できる場合には ``*``, ``+`` または ``?`` を利用した方が賢明です、 +そうすることで単純に、短く読み易くすることができます。 + +.. + Using Regular Expressions + ========================= + +正規表現を使う +============== + +.. + Now that we've looked at some simple regular expressions, how do we actually use + them in Python? The :mod:`re` module provides an interface to the regular + expression engine, allowing you to compile REs into objects and then perform + matches with them. + +これまででいくつかの単純な正規表現に触れてきました、 +実際に Python ではこれらをどう使えばいいのでしょう? +:mod:`re` モジュールは正規表現エンジンに対するインターフェースを提供してい て、 +それらを使うことで正規表現をオブジェクトにコンパイルし、マッチを実行するこ とができます。 + +.. 
+ Compiling Regular Expressions + ----------------------------- + +正規表現をコンパイルする +------------------------ + +.. + Regular expressions are compiled into pattern objects, which have + methods for various operations such as searching for pattern matches or + performing string substitutions. :: + +正規表現はパターンオブジェクトにコンパイルされます、 +パターンオブジェクトは多くの操作、 +パターンマッチの検索や文字列の置換の実行などのメソッドを持っています:: >>> import re >>> p = re.compile('ab*') >>> print p - <re.RegexObject instance at 80b4150> - -:func:`re.compile` also accepts an optional *flags* argument, used to enable -various special features and syntax variations. We'll go over the available -settings later, but for now a single example will do:: + <_sre.SRE_Pattern object at 80b4150> + +.. + :func:`re.compile` also accepts an optional *flags* argument, used to enable + various special features and syntax variations. we'll go over the available + settings later, but for now a single example will do:: + +:func:`re.compile` はいくつかの *flags* 引数を受け付けることができます、 +この引数はさまざまな特別な機能を有効にしたり、構文を変化させたりします。 +利用できる設定に何があるかは後に飛ばすことにして、簡単な例をやることにしま しょう:: >>> p = re.compile('ab*', re.IGNORECASE) -The RE is passed to :func:`re.compile` as a string. REs are handled as strings -because regular expressions aren't part of the core Python language, and no -special syntax was created for expressing them. (There are applications that -don't need REs at all, so there's no need to bloat the language specification by -including them.) Instead, the :mod:`re` module is simply a C extension module -included with Python, just like the :mod:`socket` or :mod:`zlib` modules. - -Putting REs in strings keeps the Python language simpler, but has one -disadvantage which is the topic of the next section. - - -The Backslash Plague --------------------- - -As stated earlier, regular expressions use the backslash character (``'\'``) to -indicate special forms or to allow special characters to be used without -invoking their special meaning. 
This conflicts with Python's usage of the same
-character for the same purpose in string literals.
-
-Let's say you want to write a RE that matches the string ``\section``, which
-might be found in a LaTeX file. To figure out what to write in the program
-code, start with the desired string to be matched. Next, you must escape any
-backslashes and other metacharacters by preceding them with a backslash,
-resulting in the string ``\\section``. The resulting string that must be passed
-to :func:`re.compile` must be ``\\section``. However, to express this as a
-Python string literal, both backslashes must be escaped *again*.
-
-+-------------------+------------------------------------------+
-| Characters        | Stage                                    |
-+===================+==========================================+
-| ``\section``      | Text string to be matched                |
-+-------------------+------------------------------------------+
-| ``\\section``     | Escaped backslash for :func:`re.compile` |
-+-------------------+------------------------------------------+
-| ``"\\\\section"`` | Escaped backslashes for a string literal |
-+-------------------+------------------------------------------+
-
-In short, to match a literal backslash, one has to write ``'\\\\'`` as the RE
-string, because the regular expression must be ``\\``, and each backslash must
-be expressed as ``\\`` inside a regular Python string literal. In REs that
-feature backslashes repeatedly, this leads to lots of repeated backslashes and
-makes the resulting strings difficult to understand.
-
-The solution is to use Python's raw string notation for regular expressions;
-backslashes are not handled in any special way in a string literal prefixed with
-``'r'``, so ``r"\n"`` is a two-character string containing ``'\'`` and ``'n'``,
-while ``"\n"`` is a one-character string containing a newline. Regular
-expressions will often be written in Python code using this raw string notation.
+..
+   The RE is passed to :func:`re.compile` as a string.
REs are handled as strings
+   because regular expressions aren't part of the core Python language, and no
+   special syntax was created for expressing them. (There are applications that
+   don't need REs at all, so there's no need to bloat the language specification by
+   including them.) Instead, the :mod:`re` module is simply a C extension module
+   included with Python, just like the :mod:`socket` or :mod:`zlib` modules.
+
+正規表現は文字列として :func:`re.compile` に渡されます。
+正規表現は文字列として扱われますが、それは正規表現が Python 言語のコアシステムに含まれないためです、
+そのため正規表現を表わす特殊な構文はありません。
+(正規表現を全く必要としないアプリケーションも存在します、
+そのためそれらを含めて言語仕様を無駄に大きくする必要はありません)
+その代わり、 :mod:`re` モジュールは :mod:`socket` や :mod:`zlib` モジュールのような
+通常の C 拡張モジュールとして Python に含まれています。
+
+..
+   Putting REs in strings keeps the Python language simpler, but has one
+   disadvantage which is the topic of the next section.
+
+正規表現を文字列としておくことで Python 言語はより簡素に保たれていますが、
+そのため1つの欠点があります、これについては次の節で話題とします。
+
+..
+   The Backslash Plague
+   --------------------
+
+バックスラッシュ感染症
+----------------------
+
+..
+   As stated earlier, regular expressions use the backslash character (``'\'``) to
+   indicate special forms or to allow special characters to be used without
+   invoking their special meaning. This conflicts with Python's usage of the same
+   character for the same purpose in string literals.
+
+先に述べたように、正規表現は特別な形式や特殊な文字の特別な意味を意味を除くことを示すために
+バックスラッシュ文字 (``'\'``) を利用します。
+これは Python が文字列リテラルに対して、同じ文字を同じ目的で使うことと衝突します。
+
+..
+   Let's say you want to write a RE that matches the string ``\section``, which
+   might be found in a LaTeX file. To figure out what to write in the program
+   code, start with the desired string to be matched. Next, you must escape any
+   backslashes and other metacharacters by preceding them with a backslash,
+   resulting in the string ``\\section``. The resulting string that must be passed
+   to :func:`re.compile` must be ``\\section``. However, to express this as a
+   Python string literal, both backslashes must be escaped *again*.
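
As a rough illustration of the escaping stages described above, the following sketch builds the pattern with an ordinary string literal and checks it against the text ``\section``. It assumes Python 3 (where ``print`` is a function, unlike the Python 2 sessions shown in this HOWTO), and the variable names ``target`` and ``pattern_text`` are invented for this example.

```python
import re

# Stage 1: the text we want to match, a backslash followed by "section".
# In an ordinary string literal the single backslash is written as "\\".
target = "\\section"

# Stage 2: the regular expression itself must be \\section, so that the
# backslash loses its special meaning inside the RE.
# Stage 3: written as an ordinary string literal, both of those
# backslashes are doubled again, giving four in the source code.
pattern_text = "\\\\section"

p = re.compile(pattern_text)
m = p.match(target)

print(len(target))        # 8 characters: \section
print(len(pattern_text))  # 9 characters: \\section
print(m.group())          # \section
```

Printing the lengths is just a quick way to see how many characters each stage really contains once Python has processed the string literal.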
+
+``\section`` という文字列 (これは LaTeX ファイルでみかけます)
+にマッチする正規表現を書きたいとします。
+どんなプログラムを書くか考え、マッチして欲しい文字列をはじめに考えます。
+次に、バックスラッシュや他の特殊文字をバックスラッシュに続けて書くことでエスケープしなければいけません、
+その結果 ``\\section`` のような文字列となります。
+こうしてできた :func:`re.compile` に渡す文字列は ``\\section`` でなければいけません。
+しかし、これを Python の文字列リテラルとして扱うにはこの二つのバックスラッシュを *再び*
+エスケープする必要があります。
+
+..
+   +-------------------+------------------------------------------+
+   | Characters        | Stage                                    |
+   +===================+==========================================+
+   | ``\section``      | Text string to be matched                |
+   +-------------------+------------------------------------------+
+   | ``\\section``     | Escaped backslash for :func:`re.compile` |
+   +-------------------+------------------------------------------+
+   | ``"\\\\section"`` | Escaped backslashes for a string literal |
+   +-------------------+------------------------------------------+
+
++-------------------+-------------------------------------------------------+
+| 文字              | 段階                                                  |
++===================+=======================================================+
+| ``\section``      | マッチさせるテキスト                                  |
++-------------------+-------------------------------------------------------+
+| ``\\section``     | :func:`re.compile` のためのバックスラッシュエスケープ |
++-------------------+-------------------------------------------------------+
+| ``"\\\\section"`` | 文字列リテラルのためのバックスラッシュエスケープ      |
++-------------------+-------------------------------------------------------+
+
+..
+   In short, to match a literal backslash, one has to write ``'\\\\'`` as the RE
+   string, because the regular expression must be ``\\``, and each backslash must
+   be expressed as ``\\`` inside a regular Python string literal. In REs that
+   feature backslashes repeatedly, this leads to lots of repeated backslashes and
+   makes the resulting strings difficult to understand.
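
The point about ``'\\\\'`` can be tried out directly. This is only a small sketch, assuming Python 3; the sample text ``a\b\c`` and the names ``backslash`` and ``text`` are made up for illustration.

```python
import re

# The regular expression \\ matches one literal backslash.  Written as an
# ordinary (non-raw) string literal, each of its two backslashes has to be
# doubled, which gives four backslashes in the source code.
backslash = re.compile('\\\\')

# Hypothetical sample text: this literal spells  a\b\c  (5 characters).
text = 'a\\b\\c'

print(len(backslash.pattern))  # 2: the RE really is just \\
print(backslash.split(text))   # ['a', 'b', 'c']
```

Even in this tiny case, four backslashes are needed in the source to express a two-character RE, which is the readability problem the paragraph describes.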
+
+要点だけをいえば、リテラルとしてのバックスラッシュにマッチさせるために、
+正規表現文字列として ``'\\\\'`` 書かなければいけません、
+なぜなら正規表現は ``\\`` であり、通常の Python の文字列リテラルとしては
+それぞれのバックスラッシュは ``\\`` で表現しなければいけないからです。
+正規表現に関してこのバックスラッシュの繰り返しの機能は、
+たくさんのバックスラッシュの繰り返しを生むことになり、
+その結果として作られる文字列は理解することが難しくなります。
+
+..
+   The solution is to use Python's raw string notation for regular expressions;
+   backslashes are not handled in any special way in a string literal prefixed with
+   ``'r'``, so ``r"\n"`` is a two-character string containing ``'\'`` and ``'n'``,
+   while ``"\n"`` is a one-character string containing a newline. Regular
+   expressions will often be written in Python code using this raw string notation.
+
+この問題の解決策としては正規表現に対しては Python の raw string 記法を使うことです;
+``'r'`` を文字列リテラルの先頭に書くことでバックスラッシュは特別扱いされなくなります、
+つまり ``"\n"`` は改行を含む1つの文字からなる文字列であるのに対して、
+``r"\n"`` は2つの文字 ``'\'`` と ``'n'`` を含む文字列となります。
+多くの場合 Python コードの中の正規表現はこの raw string 記法を使って書かれます。
+
+
+..
+   +-------------------+------------------+
+   | Regular String    | Raw string       |
+   +===================+==================+
+   | ``"ab*"``         | ``r"ab*"``       |
+   +-------------------+------------------+
+   | ``"\\\\section"`` | ``r"\\section"`` |
+   +-------------------+------------------+
+   | ``"\\w+\\s+\\1"`` | ``r"\w+\s+\1"``  |
+   +-------------------+------------------+
 
 +-------------------+------------------+
-| Regular String    | Raw string       |
+| 通常の文字列      | Raw string       |
 +===================+==================+
 | ``"ab*"``         | ``r"ab*"``       |
 +-------------------+------------------+
@@ -331,47 +661,93 @@
 | ``"\\w+\\s+\\1"`` | ``r"\w+\s+\1"``  |
 +-------------------+------------------+
 
-
-Performing Matches
-------------------
-
-Once you have an object representing a compiled regular expression, what do you
-do with it? :class:`RegexObject` instances have several methods and attributes.
-Only the most significant ones will be covered here; consult the :mod:`re` docs
-for a complete listing.
+..
+   Performing Matches
+   ------------------
+
+マッチの実行
+------------
+
+..
+   Once you have an object representing a compiled regular expression, what do you
+   do with it? Pattern objects have several methods and attributes.
+   Only the most significant ones will be covered here; consult the :mod:`re` docs
+   for a complete listing.
+
+一旦コンパイルした正規表現を表現するオブジェクトを作成したら、次に何をしますか?
+パターンオブジェクトはいくつかのメソッドや属性を持っています。
+ここでは、その中でも最も重要なものについて扱います;
+完全なリストは :mod:`re` ドキュメントを参照して下さい。
+
+..
+   +------------------+-----------------------------------------------+
+   | Method/Attribute | Purpose                                       |
+   +==================+===============================================+
+   | ``match()``      | Determine if the RE matches at the beginning  |
+   |                  | of the string.                                |
+   +------------------+-----------------------------------------------+
+   | ``search()``     | Scan through a string, looking for any        |
+   |                  | location where this RE matches.               |
+   +------------------+-----------------------------------------------+
+   | ``findall()``    | Find all substrings where the RE matches, and |
+   |                  | returns them as a list.                       |
+   +------------------+-----------------------------------------------+
+   | ``finditer()``   | Find all substrings where the RE matches, and |
+   |                  | returns them as an :term:`iterator`.          |
+   +------------------+-----------------------------------------------+
 
 +------------------+-----------------------------------------------+
-| Method/Attribute | Purpose                                       |
+| メソッド/属性    | 目的                                          |
 +==================+===============================================+
-| ``match()``      | Determine if the RE matches at the beginning  |
-|                  | of the string.                                |
+| ``match()``      | 文字列の先頭で正規表現と                      |
+|                  | マッチするか判定します                        |
 +------------------+-----------------------------------------------+

***The diff for this file has been truncated for email.***
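
The four pattern-object methods named in the table above might be exercised roughly as follows. This is a sketch only, assuming Python 3; the pattern ``[a-z]+`` and the sample strings are chosen for illustration and do not come from the portion of the diff shown here.

```python
import re

p = re.compile('[a-z]+')

# match() only succeeds if the RE matches at the very start of the string.
at_start = p.match('::: message')   # no match: the string starts with ':'
print(at_start)                     # None

# search() scans through the string for the first location that matches.
found = p.search('::: message')
print(found.group(), found.span())  # message (4, 11)

# findall() returns every non-overlapping match as a list of strings.
words = p.findall('12 drummers drumming, 11 pipers piping')
print(words)                        # ['drummers', 'drumming', 'pipers', 'piping']

# finditer() yields match objects one at a time instead of building a list.
spans = [m.span() for m in p.finditer('12 drummers drumming')]
print(spans)                        # [(3, 11), (12, 20)]
```

``match()`` and ``search()`` return ``None`` when nothing matches, so real code would normally test the result before calling ``group()`` on it.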