Rev. | Time | Author | Message |
---|---|---|---|
r25 | 2012-01-01 17:41:53 | linuxchecker | updated for lucene-3.5 |
r24 | 2011-07-12 10:06:42 | tullio | introduced cache in loader mode introduced new distance f... |
r23 | 2011-06-16 20:25:54 | linuxchecker | improved startup shell |
r22 | 2011-05-30 16:51:57 | linuxchecker | improved wikipedia_parse |
r21 | 2011-05-24 20:31:21 | linuxchecker | improved wikipedia contents extraction |
r20 | 2011-05-23 19:07:03 | linuxchecker | fixed directory check bug in indexer |
r19 | 2011-05-20 21:29:35 | linuxchecker | homework to tullio |
r18 | 2011-05-19 20:18:51 | linuxchecker | implemented realy normarization version |
r17 | 2011-05-16 20:04:00 | linuxchecker | supported loader command |
r16 | 2011-05-16 15:56:24 | linuxchecker | added calc module |
* Required materials
  + Ruby
    http://www.ruby-lang.org/ja/
  + Nokogiri
    http://nokogiri.org/
  + Main source code
    $ svn co https://svn.sourceforge.jp/svnroot/nls/
    $ ls -l nls/
  + Wikipedia data (for Japanese)
    http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
    $ bunzip2 jawiki-latest-pages-articles.xml.bz2
    $ ./nls/bin/wikipedia_parse.rb jawiki-latest-pages-articles.xml
    $ ls -l data/
  + Lucene
    http://ftp.jaist.ac.jp/pub/apache/lucene/java/3.5.0/lucene-3.5.0.tgz
    $ tar xvf lucene-3.5.0.tgz
    $ cp -v lucene-3.5.0/lucene-core-3.5.0.jar nls/lib/
* Building
  $ cd nls
  $ make clean; make; make install
* Indexing
  $ ./nls/bin/nld.sh indexer data
  $ ls -l nls/bin/index/
* Searching
  $ ./nls/bin/nld.sh searcher wikipedia
* For Japanese
  + Igo
    http://iij.dl.sourceforge.jp/igo/46696/igo-0.4.2.jar
    http://jaist.dl.sourceforge.jp/igo/46696/igo-0.4.2-src.tar.gz
  + Igo-analyzer
    http://jaist.dl.sourceforge.jp/igo/46413/igo-analyzer-0.0.1.jar
    http://jaist.dl.sourceforge.jp/igo/46413/igo-analyzer-0.0.1-src.tar.gz
    $ mv igo-analyzer-0.0.1.jar igo-0.4.2.jar nls/lib/
  + MeCab for making dictionary
    http://sourceforge.net/projects/mecab/files/mecab/0.98/mecab-0.98.tar.gz/download
    $ tar xvf mecab-0.98.tar.gz
    $ cd mecab-0.98
    $ ./configure; make; make install
  + NAIST Japanese Dictionary
    http://iij.dl.sourceforge.jp/naist-jdic/48487/mecab-naist-jdic-0.6.3-20100801.tar.gz
    $ tar xvf mecab-naist-jdic-0.6.3-20100801.tar.gz
    $ cd mecab-naist-jdic-0.6.3-20100801
    $ grep -v -E '^\"' naist-jdic.csv > naist-jdic.tmp; mv naist-jdic.tmp naist-jdic.csv
    $ make clean; ./configure; make; cd ..
    $ java -cp ./nls/lib/igo-0.4.2.jar net.reduls.igo.bin.BuildDic ipadic mecab-naist-jdic-0.6.3-20100801 EUC-JP
    $ ls -ltr ipadic/
    $ java -Dfile.encoding=UTF-8 -cp ./nls/lib/igo-0.4.2.jar net.reduls.igo.bin.Igo ipadic
    すもももももももものうち
    すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ,,
    も 助詞,係助詞,*,*,*,*,も,モ,モ,,
    もも 名詞,一般,*,*,*,*,もも,モモ,モモ,,
    も 助詞,係助詞,*,*,*,*,も,モ,モ,,
    もも 名詞,一般,*,*,*,*,もも,モモ,モモ,,
    の 助詞,連体化,*,*,*,*,の,ノ,ノ,,
    うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ,,
    EOS
* OPTIONAL
  + jUnit
    $ wget --no-check-certificate https://github.com/downloads/KentBeck/junit/junit-4.9b2.jar
    $ mv -v junit-4.9b2.jar nls/lib/

> wget http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
--2011-04-17 17:10:47-- http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
Resolving dumps.wikimedia.org... 208.80.152.185
Connecting to dumps.wikimedia.org|208.80.152.185|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1362281344 (1.3G) [application/x-bzip]
Saving to: `jawiki-latest-pages-articles.xml.bz2'
100%[=============================================================================================================>] 1,362,281,344 1.03M/s in 21m 26s
2011-04-17 17:32:14 (1.01 MB/s) - `jawiki-latest-pages-articles.xml.bz2' saved [1362281344/1362281344]

> time bunzip2 jawiki-latest-pages-articles.xml.bz2
777.349u 18.024s 14:00.50 94.6% 36+1096k 9667+42845io 1pf+0w

time ./nls/bin/wikipedia_parse.rb jawiki-latest-pages-articles.xml
1455.073u 25529.807s 7:33:16.24 99.2% 5+-469k 31835+44883io 20pf+0w

$ time ./nls/bin/wikipedia_parse.rb jawiki-latest-pages-articles.xml
real 125m23.183s
user 51m10.832s
sys 3m9.064s

doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
> time ./nls/bin/nld.sh indexer data
adding ../../data/1571104.txt
Optimizing...
7255930 total milliseconds
6495.171u 381.982s 2:00:56.62 94.7% 88+1041k 1112532+85991io 7pf+0w

$ ls -l nls/bin/index/
-rw-r--r-- 1 tetsato tetsato   40262986 2011-04-25 20:11 _76.fdt
-rw-r--r-- 1 tetsato tetsato    8000004 2011-04-25 20:11 _76.fdx
-rw-r--r-- 1 tetsato tetsato         32 2011-04-25 20:11 _76.fnm
-rw-r--r-- 1 tetsato tetsato  431667786 2011-04-25 20:15 _76.frq
-rw-r--r-- 1 tetsato tetsato    3000004 2011-04-25 20:15 _76.nrm
-rw-r--r-- 1 tetsato tetsato 1235412936 2011-04-25 20:15 _76.prx
-rw-r--r-- 1 tetsato tetsato     794095 2011-04-25 20:15 _76.tii
-rw-r--r-- 1 tetsato tetsato   60288601 2011-04-25 20:15 _76.tis
-rw-r--r-- 1 tetsato tetsato         20 2011-04-25 20:15 segments.gen
-rw-r--r-- 1 tetsato tetsato        273 2011-04-25 20:15 segments_1

doc.add(new Field("path", f.getPath(), Field.Store.NO, Field.Index.NOT_ANALYZED));
writer=org.apache.lucene.index.IndexWriter@1aa57fb
adding ../../data/267365.txt
Optimizing...
22829755 total milliseconds
real 380m31.397s
user 127m44.471s
sys 7m45.093s

* Loader log to image map
  grep , tmp1.log | awk '
    BEGIN { i = 1; j = 0; FS = "," }
    {
      # store field 3 symmetrically, walking down the lower triangle
      val[i,j] = $3; val[j,i] = $3
      i += 1
      if (i == 20) { j += 1; i = j + 1 }
    }
    END {
      # print the 20x20 matrix as CSV, each row prefixed by its index
      for (k = 0; k < 20; ++k) {
        printf("%d,", k)
        for (l = 0; l < 20; ++l) {
          printf("%.3f", val[k,l])
          if (l != 19) printf(",")
        }
        printf("\n")
      }
    }'
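
The index walk in the awk one-liner above is easy to misread, so here is a minimal Python sketch of the same idea, not the project's own tool. It assumes, as the awk script does, that every log line containing a comma carries one pairwise value in its third comma-separated field, and that the values arrive in the order (1,0), (2,0), ..., (19,0), (2,1), and so on. The loader log format itself is not shown on this page, so the sample lines are hypothetical, as are the function names `read_values`, `pairs_to_matrix` and `format_rows`.

```python
def read_values(lines):
    """Return the third comma-separated field of every line that contains
    a comma, as floats (assumes that field is always numeric)."""
    return [float(line.split(",")[2]) for line in lines if "," in line]


def pairs_to_matrix(values, n=20):
    """Fill an n x n symmetric matrix from values given in the order
    (1,0), (2,0), ..., (n-1,0), (2,1), ..., (n-1,n-2).
    The diagonal stays 0.0; surplus values are ignored."""
    matrix = [[0.0] * n for _ in range(n)]
    cells = ((i, j) for j in range(n) for i in range(j + 1, n))
    for (i, j), value in zip(cells, values):
        matrix[i][j] = value
        matrix[j][i] = value
    return matrix


def format_rows(matrix):
    """Render each row as 'index,v0,v1,...' with three decimals."""
    return [
        f"{k}," + ",".join(f"{v:.3f}" for v in row)
        for k, row in enumerate(matrix)
    ]


if __name__ == "__main__":
    # Hypothetical log lines; only the third field is used.
    sample = ["a,b,0.1", "a,c,0.2", "b,c,0.3"]
    for row in format_rows(pairs_to_matrix(read_values(sample), n=3)):
        print(row)
```

With `n=20` the matrix has the same shape as the awk output, one CSV row per document, which can then be fed to whatever plotting tool draws the image map.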