Taku Kudo
taku****@chase*****
Wed Jun 6 00:49:51 JST 2012
This is Kudo.

A little later than planned, I have released the CRF-trained model for mecab-ipadic.
http://code.google.com/p/mecab/downloads/detail?name=mecab-ipadic-2.7.0-20070801.model.bz2

With the model file you can:

1. Automatically estimate the costs of words in a user dictionary
   http://mecab.googlecode.com/svn/trunk/mecab/doc/dic.html
2. Retrain the model, or adapt it to a domain, using a small amount of dictionary and training data
   http://mecab.googlecode.com/svn/trunk/mecab/doc/learn.html#retrain

Below is a concrete example of 2. Because of a character-encoding constraint in the current training data, please use EUC-JP for everything. The current model file is also EUC-JP. (I would really like to move to UTF-8 at some point.) $WORK is the current working directory.

1. Preparing mecab-ipadic and the model

% cd $WORK
% bzip2 -d mecab-ipadic-2.7.0-20070801.model.bz2
% tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
% ls
mecab-ipadic-2.7.0-20070801 mecab-ipadic-2.7.0-20070801.model

2. Creating the training data (file name: train)

Create the training data in the same format as MeCab's output, as shown below. Here we add the sentence-final particle 「なう」 and the auxiliary verb 「まーす」.

京都	名詞,固有名詞,地域,一般,*,*,京都,キョウト,キョート
なう	助詞,終助詞,*,*,*,*,なう,ナウ,ナウ
EOS
ラーメン	名詞,一般,*,*,*,*,ラーメン,ラーメン,ラーメン
なう	助詞,終助詞,*,*,*,*,なう,ナウ,ナウ
EOS
行っ	動詞,自立,*,*,五段・カ行促音便,連用タ接続,行く,イッ,イッ
て	助詞,接続助詞,*,*,*,*,て,テ,テ
き	動詞,非自立,*,*,カ変・クル,連用形,くる,キ,キ
まーす	助動詞,*,*,*,特殊・マス,基本形,まーす,マース,マース
EOS

3. Adding to the dictionary (mecab-ipadic-2.7.0-20070801/add.csv)

Write the new vocabulary in a new csv file under mecab-ipadic-2.7.0-20070801.

なう,0,0,0,助詞,終助詞,*,*,*,*,なう,ナウ,ナウ
まーす,0,0,0,助動詞,*,*,*,特殊・マス,基本形,まーす,マース,マース

4. Running the training

First, compile the dictionary with the new vocabulary added using mecab-dict-index, then train using the new dictionary and the new corpus.

% /usr/local/libexec/mecab/mecab-dict-index -f euc-jp -t euc-jp -d mecab-ipadic-2.7.0-20070801 -o mecab-ipadic-2.7.0-20070801
reading mecab-ipadic-2.7.0-20070801/unk.def ... 40
emitting double-array: 100% |###########################################|
mecab-ipadic-2.7.0-20070801/model.def is not found. skipped.
reading mecab-ipadic-2.7.0-20070801/Noun.adjv.csv ... 3328
reading mecab-ipadic-2.7.0-20070801/Verb.csv ... 130750
reading mecab-ipadic-2.7.0-20070801/Noun.demonst.csv ... 120
reading mecab-ipadic-2.7.0-20070801/Suffix.csv ... 1393
reading mecab-ipadic-2.7.0-20070801/Noun.others.csv ... 151
reading mecab-ipadic-2.7.0-20070801/Adj.csv ... 27210
reading mecab-ipadic-2.7.0-20070801/Conjunction.csv ... 171
reading mecab-ipadic-2.7.0-20070801/Noun.name.csv ... 34202
reading mecab-ipadic-2.7.0-20070801/Postp.csv ... 146
reading mecab-ipadic-2.7.0-20070801/Interjection.csv ... 252
reading mecab-ipadic-2.7.0-20070801/Adverb.csv ... 3032
reading mecab-ipadic-2.7.0-20070801/Adnominal.csv ... 135
reading mecab-ipadic-2.7.0-20070801/Noun.nai.csv ... 42
reading mecab-ipadic-2.7.0-20070801/Noun.csv ... 60477
reading mecab-ipadic-2.7.0-20070801/Prefix.csv ... 221
reading mecab-ipadic-2.7.0-20070801/Noun.verbal.csv ... 12146
reading mecab-ipadic-2.7.0-20070801/Postp-col.csv ... 91
reading mecab-ipadic-2.7.0-20070801/Noun.place.csv ... 72999
reading mecab-ipadic-2.7.0-20070801/Symbol.csv ... 208
reading mecab-ipadic-2.7.0-20070801/add.csv ... 2
reading mecab-ipadic-2.7.0-20070801/Others.csv ... 2
reading mecab-ipadic-2.7.0-20070801/Noun.org.csv ... 16668
reading mecab-ipadic-2.7.0-20070801/Filler.csv ... 19
reading mecab-ipadic-2.7.0-20070801/Noun.adverbal.csv ... 795
reading mecab-ipadic-2.7.0-20070801/Noun.number.csv ... 42
reading mecab-ipadic-2.7.0-20070801/Auxil.csv ... 199
reading mecab-ipadic-2.7.0-20070801/Noun.proper.csv ... 27327
emitting double-array: 100% |###########################################|
reading mecab-ipadic-2.7.0-20070801/matrix.def ... 1316x1316
emitting matrix : 100% |###########################################|
done!

% /usr/local/libexec/mecab/mecab-cost-train -M mecab-ipadic-2.7.0-20070801.model -d mecab-ipadic-2.7.0-20070801 train new_model
Using previous model: mecab-ipadic-2.7.0-20070801.model
--cost --freq and --eta options are overwritten.
reading corpus ...
Number of sentences: 3
Number of features: 1029250
eta: 0.00005
freq: 1
eval-size: 8
unk-eval-size: 4
threads: 1
charset: euc-jp
C(sigma^2): 1.00000
iter=0 err=0.00000 F=1.00000 target=0.68291 diff=1.00000
iter=1 err=0.00000 F=1.00000 target=0.52948 diff=0.22467
iter=2 err=0.00000 F=1.00000 target=0.34616 diff=0.34623
iter=3 err=0.00000 F=1.00000 target=0.39982 diff=0.15501
iter=4 err=0.00000 F=1.00000 target=0.18924 diff=0.52668
iter=5 err=0.00000 F=1.00000 target=0.18608 diff=0.01672
iter=6 err=0.00000 F=1.00000 target=0.18260 diff=0.01866
iter=7 err=0.00000 F=1.00000 target=0.18253 diff=0.00039
iter=8 err=0.00000 F=1.00000 target=0.18253 diff=0.00003
iter=9 err=0.00000 F=1.00000 target=0.18252 diff=0.00001
iter=10 err=0.00000 F=1.00000 target=0.18252 diff=0.00000
Done! writing model file ...

5. Creating the analysis dictionary

Use the new model to build a new dictionary and connection matrix. The dictionary is built in the new_dic directory.

% /usr/local/libexec/mecab/mecab-dict-gen -d mecab-ipadic-2.7.0-20070801 -o new_dic -m new_model
new_model is not a binary model. reopen it as text mode...
reading mecab-ipadic-2.7.0-20070801/unk.def ... 40
reading mecab-ipadic-2.7.0-20070801/Noun.adjv.csv ... 3328
reading mecab-ipadic-2.7.0-20070801/Verb.csv ... 130750
reading mecab-ipadic-2.7.0-20070801/Noun.demonst.csv ... 120
reading mecab-ipadic-2.7.0-20070801/Suffix.csv ... 1393
reading mecab-ipadic-2.7.0-20070801/Noun.others.csv ... 151
reading mecab-ipadic-2.7.0-20070801/Adj.csv ... 27210
reading mecab-ipadic-2.7.0-20070801/Conjunction.csv ... 171
reading mecab-ipadic-2.7.0-20070801/Noun.name.csv ... 34202
reading mecab-ipadic-2.7.0-20070801/Postp.csv ... 146
reading mecab-ipadic-2.7.0-20070801/Interjection.csv ... 252
reading mecab-ipadic-2.7.0-20070801/Adverb.csv ... 3032
reading mecab-ipadic-2.7.0-20070801/Adnominal.csv ... 135
reading mecab-ipadic-2.7.0-20070801/Noun.nai.csv ... 42
reading mecab-ipadic-2.7.0-20070801/Noun.csv ... 60477
reading mecab-ipadic-2.7.0-20070801/Prefix.csv ... 221
reading mecab-ipadic-2.7.0-20070801/Noun.verbal.csv ... 12146
reading mecab-ipadic-2.7.0-20070801/Postp-col.csv ... 91
reading mecab-ipadic-2.7.0-20070801/Noun.place.csv ... 72999
reading mecab-ipadic-2.7.0-20070801/Symbol.csv ... 208
reading mecab-ipadic-2.7.0-20070801/add.csv ... 2
reading mecab-ipadic-2.7.0-20070801/Others.csv ... 2
reading mecab-ipadic-2.7.0-20070801/Noun.org.csv ... 16668
reading mecab-ipadic-2.7.0-20070801/Filler.csv ... 19
reading mecab-ipadic-2.7.0-20070801/Noun.adverbal.csv ... 795
reading mecab-ipadic-2.7.0-20070801/Noun.number.csv ... 42
reading mecab-ipadic-2.7.0-20070801/Auxil.csv ... 199
reading mecab-ipadic-2.7.0-20070801/Noun.proper.csv ... 27327
emitting new_dic/left-id.def/ new_dic/right-id.def
emitting new_dic/unk.def ... 40
emitting new_dic/Noun.adjv.csv ... 3328
emitting new_dic/Verb.csv ... 130750
emitting new_dic/Noun.demonst.csv ... 120
emitting new_dic/Suffix.csv ... 1393
emitting new_dic/Noun.others.csv ... 151
emitting new_dic/Adj.csv ... 27210
emitting new_dic/Conjunction.csv ... 171
emitting new_dic/Noun.name.csv ... 34202
emitting new_dic/Postp.csv ... 146
emitting new_dic/Interjection.csv ... 252
emitting new_dic/Adverb.csv ... 3032
emitting new_dic/Adnominal.csv ... 135
emitting new_dic/Noun.nai.csv ... 42
emitting new_dic/Noun.csv ... 60477
emitting new_dic/Prefix.csv ... 221
emitting new_dic/Noun.verbal.csv ... 12146
emitting new_dic/Postp-col.csv ... 91
emitting new_dic/Noun.place.csv ... 72999
emitting new_dic/Symbol.csv ... 208
emitting new_dic/add.csv ... 2
emitting new_dic/Others.csv ... 2
emitting new_dic/Noun.org.csv ... 16668
emitting new_dic/Filler.csv ... 19
emitting new_dic/Noun.adverbal.csv ... 795
emitting new_dic/Noun.number.csv ... 42
emitting new_dic/Auxil.csv ... 199
emitting new_dic/Noun.proper.csv ... 27327
emitting matrix : 100% |###########################################|
copying mecab-ipadic-2.7.0-20070801/char.def to new_dic/char.def
copying mecab-ipadic-2.7.0-20070801/rewrite.def to new_dic/rewrite.def
copying mecab-ipadic-2.7.0-20070801/dicrc to new_dic/dicrc
copying mecab-ipadic-2.7.0-20070801/feature.def to new_dic/feature.def
copying new_model to new_dic/model.def
done!

6. Compiling the new dictionary

% /usr/local/libexec/mecab/mecab-dict-index -f euc-jp -t utf8 -d new_dic -o new_dic
new_dic/pos-id.def is not found. minimum setting is used
reading new_dic/unk.def ... 40
emitting double-array: 100% |###########################################|
new_dic/pos-id.def is not found. minimum setting is used
reading new_dic/Noun.adjv.csv ... 3328
reading new_dic/Verb.csv ... 130750
reading new_dic/Noun.demonst.csv ... 120
reading new_dic/Suffix.csv ... 1393
reading new_dic/Noun.others.csv ... 151
reading new_dic/Adj.csv ... 27210
reading new_dic/Conjunction.csv ... 171
reading new_dic/Noun.name.csv ... 34202
reading new_dic/Postp.csv ... 146
reading new_dic/Interjection.csv ... 252
reading new_dic/Adverb.csv ... 3032
reading new_dic/Adnominal.csv ... 135
reading new_dic/Noun.nai.csv ... 42
reading new_dic/Noun.csv ... 60477
reading new_dic/Prefix.csv ... 221
reading new_dic/Noun.verbal.csv ... 12146
reading new_dic/Postp-col.csv ... 91
reading new_dic/Noun.place.csv ... 72999
reading new_dic/Symbol.csv ... 208
reading new_dic/add.csv ... 2
reading new_dic/Others.csv ... 2
reading new_dic/Noun.org.csv ... 16668
reading new_dic/Filler.csv ... 19
reading new_dic/Noun.adverbal.csv ... 795
reading new_dic/Noun.number.csv ... 42
reading new_dic/Auxil.csv ... 199
reading new_dic/Noun.proper.csv ... 27327
emitting double-array: 100% |###########################################|
reading new_dic/matrix.def ... 1318x1318
emitting matrix : 100% |###########################################|
done!

7. Analysis

% echo 六本木なう | mecab -d new_dic
六本木	名詞,固有名詞,地域,一般,*,*,六本木,ロッポンギ,ロッポンギ
なう	助詞,終助詞,*,*,*,*,なう,ナウ,ナウ
EOS

% echo そんなこと知ってまーす | mecab -d new_dic
そんな	連体詞,*,*,*,*,*,そんな,ソンナ,ソンナ
こと	名詞,非自立,一般,*,*,*,こと,コト,コト
知っ	動詞,自立,*,*,五段・ラ行,連用タ接続,知る,シッ,シッ
て	助詞,接続助詞,*,*,*,*,て,テ,テ
まーす	助動詞,*,*,*,特殊・マス,基本形,まーす,マース,マース
EOS

Because both 「なう」 and 「まーす」 are function words, they are lexicalized, and the size of the connection matrix has grown from 1316 to 1318.

% head -2 mecab-ipadic-2.7.0-20070801/matrix.def
1316 1316
0 0 -434
% head -2 new_dic/matrix.def
1318 1318
0 0 -260
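
The change in connection-matrix size can also be checked with a small script instead of reading the head output by eye. The following is a minimal shell sketch, assuming each dictionary directory contains a matrix.def whose first line holds the matrix dimensions, as in the output shown above. The function names matrix_size and compare_matrix_size are hypothetical helpers introduced here for illustration; they are not part of MeCab.

```shell
#!/bin/sh
# Hypothetical helper (not part of MeCab): print the first line of
# <dict_dir>/matrix.def, which holds the two dimensions of the matrix.
matrix_size() {
    head -n 1 "$1/matrix.def"
}

# Hypothetical helper: compare the matrix dimensions of two dictionary
# directories and report whether they differ.
compare_matrix_size() {
    old=$(matrix_size "$1")
    new=$(matrix_size "$2")
    if [ "$old" = "$new" ]; then
        echo "unchanged: $old"
    else
        echo "changed: $old -> $new"
    fi
}
```

For the dictionaries built in this walkthrough, running compare_matrix_size mecab-ipadic-2.7.0-20070801 new_dic would be expected to report a change from 1316 1316 to 1318 1318.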