nls: Repository summary


Recent Commits

Rev. Time Author Message
r25 2012-01-01 17:41:53 linuxchecker updated for lucene-3.5
r24 2011-07-12 10:06:42 tullio introduced cache in loader mode introduced new distance f...
r23 2011-06-16 20:25:54 linuxchecker improved startup shell
r22 2011-05-30 16:51:57 linuxchecker improved wikipedia_parse
r21 2011-05-24 20:31:21 linuxchecker improved wikipedia contents extraction
r20 2011-05-23 19:07:03 linuxchecker fixed directory check bug in indexer
r19 2011-05-20 21:29:35 linuxchecker homework to tullio
r18 2011-05-19 20:18:51 linuxchecker implemented realy normarization version
r17 2011-05-16 20:04:00 linuxchecker supported loader command
r16 2011-05-16 15:56:24 linuxchecker added calc module

README.txt

* Required materials
 + Ruby
  http://www.ruby-lang.org/ja/

 + Nokogiri
  http://nokogiri.org/

 + Main source code
  $ svn checkout https://svn.sourceforge.jp/svnroot/nls/
  $ ls -l nls/

 + Wikipedia data (for Japanese)
  http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2

  $ bunzip2 jawiki-latest-pages-articles.xml.bz2
  $ ./nls/bin/wikipedia_parse.rb jawiki-latest-pages-articles.xml
  $ ls -l data/

 + Lucene
  http://ftp.jaist.ac.jp/pub/apache/lucene/java/3.5.0/lucene-3.5.0.tgz

  $ tar xvf lucene-3.5.0.tgz
  $ cp -v lucene-3.5.0/lucene-core-3.5.0.jar nls/lib/

* Building
  $ cd nls
  $ make clean; make; make install

* Indexing
  $ ./nls/bin/nld.sh indexer data
  $ ls -l nls/bin/index/

* Searching
  $ ./nls/bin/nld.sh searcher wikipedia

* For Japanese
 + Igo
  http://iij.dl.sourceforge.jp/igo/46696/igo-0.4.2.jar
  http://jaist.dl.sourceforge.jp/igo/46696/igo-0.4.2-src.tar.gz
 + Igo-analyzer
  http://jaist.dl.sourceforge.jp/igo/46413/igo-analyzer-0.0.1.jar
  http://jaist.dl.sourceforge.jp/igo/46413/igo-analyzer-0.0.1-src.tar.gz

 $ mv igo-analyzer-0.0.1.jar igo-0.4.2.jar nls/lib/

 + MeCab for making dictionary
  http://sourceforge.net/projects/mecab/files/mecab/0.98/mecab-0.98.tar.gz/download
  $ tar xvf mecab-0.98.tar.gz
  $ cd mecab-0.98
  $ ./configure; make; make install

 + NAIST Japanese Dictionary
  http://iij.dl.sourceforge.jp/naist-jdic/48487/mecab-naist-jdic-0.6.3-20100801.tar.gz
  $ tar xvf mecab-naist-jdic-0.6.3-20100801.tar.gz
  $ cd mecab-naist-jdic-0.6.3-20100801
  $ grep -v -E '^\"' naist-jdic.csv  > naist-jdic.tmp; mv naist-jdic.tmp naist-jdic.csv
  $ make clean; ./configure; make; cd ..
  $ java -cp ./nls/lib/igo-0.4.2.jar net.reduls.igo.bin.BuildDic ipadic mecab-naist-jdic-0.6.3-20100801 EUC-JP
  $ ls -ltr ipadic/
  $ java -Dfile.encoding=UTF-8 -cp ./nls/lib/igo-0.4.2.jar net.reduls.igo.bin.Igo ipadic
すもももももももものうち
すもも     名詞,一般,*,*,*,*,すもも,スモモ,スモモ,,
も       助詞,係助詞,*,*,*,*,も,モ,モ,,
もも      名詞,一般,*,*,*,*,もも,モモ,モモ,,
も       助詞,係助詞,*,*,*,*,も,モ,モ,,
もも      名詞,一般,*,*,*,*,もも,モモ,モモ,,
の       助詞,連体化,*,*,*,*,の,ノ,ノ,,
うち      名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ,,
EOS
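
The output above has one token per line, surface form first, with EOS closing the sentence. As a quick check of the segmentation, the surface forms can be joined into one space-separated line. This is only a sketch: it assumes whitespace separates the surface form from the feature string, and the function name `surfaces` is made up for this example.

```shell
#!/bin/sh
# Sketch: join the surface forms of Igo output into one line per sentence.
# Assumes "surface<whitespace>features" per token and "EOS" after each sentence.
surfaces() {
  awk '
    $1 == "EOS" { print line; line = ""; next }   # sentence boundary
    NF >= 2     { line = (line == "" ? $1 : line " " $1) }
  '
}

# Usage, assuming the dictionary was built into ipadic/ as above:
#   echo "すもももももももものうち" | java -Dfile.encoding=UTF-8 \
#     -cp ./nls/lib/igo-0.4.2.jar net.reduls.igo.bin.Igo ipadic | surfaces
```

Lines with a single field (such as an echoed input sentence) are skipped, so only token lines contribute.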

* Optional
 + JUnit
  $ wget --no-check-certificate https://github.com/downloads/KentBeck/junit/junit-4.9b2.jar
  $ mv -v junit-4.9b2.jar nls/lib/


> wget http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
--2011-04-17 17:10:47--  http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
Resolving dumps.wikimedia.org... 208.80.152.185
Connecting to dumps.wikimedia.org|208.80.152.185|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1362281344 (1.3G) [application/x-bzip]
Saving to: `jawiki-latest-pages-articles.xml.bz2'

100%[=============================================================================================================>] 1,362,281,344 1.03M/s   in 21m 26s

2011-04-17 17:32:14 (1.01 MB/s) - `jawiki-latest-pages-articles.xml.bz2' saved [1362281344/1362281344]

> time bunzip2 jawiki-latest-pages-articles.xml.bz2
777.349u 18.024s 14:00.50 94.6% 36+1096k 9667+42845io 1pf+0w
> time ./nls/bin/wikipedia_parse.rb jawiki-latest-pages-articles.xml
1455.073u 25529.807s 7:33:16.24 99.2% 5+-469k 31835+44883io 20pf+0w

$ time ./nls/bin/wikipedia_parse.rb jawiki-latest-pages-articles.xml

real    125m23.183s
user    51m10.832s
sys     3m9.064s


doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
> time ./nls/bin/nld.sh indexer data
adding ../../data/1571104.txt
Optimizing...
7255930 total milliseconds
6495.171u 381.982s 2:00:56.62 94.7%     88+1041k 1112532+85991io 7pf+0w

$ ls -l nls/bin/index/
-rw-r--r-- 1 tetsato tetsato   40262986 2011-04-25 20:11 _76.fdt
-rw-r--r-- 1 tetsato tetsato    8000004 2011-04-25 20:11 _76.fdx
-rw-r--r-- 1 tetsato tetsato         32 2011-04-25 20:11 _76.fnm
-rw-r--r-- 1 tetsato tetsato  431667786 2011-04-25 20:15 _76.frq
-rw-r--r-- 1 tetsato tetsato    3000004 2011-04-25 20:15 _76.nrm
-rw-r--r-- 1 tetsato tetsato 1235412936 2011-04-25 20:15 _76.prx
-rw-r--r-- 1 tetsato tetsato     794095 2011-04-25 20:15 _76.tii
-rw-r--r-- 1 tetsato tetsato   60288601 2011-04-25 20:15 _76.tis
-rw-r--r-- 1 tetsato tetsato         20 2011-04-25 20:15 segments.gen
-rw-r--r-- 1 tetsato tetsato        273 2011-04-25 20:15 segments_1

doc.add(new Field("path", f.getPath(), Field.Store.NO, Field.Index.NOT_ANALYZED));

writer=org.apache.lucene.index.IndexWriter@1aa57fb
adding ../../data/267365.txt
Optimizing...
22829755 total milliseconds

real    380m31.397s
user    127m44.471s
sys     7m45.093s

* Converting the loader log to an image map (CSV matrix)
 grep , tmp1.log |awk 'BEGIN{i=1;j=0;FS=","}{ val[i,j]=$3; val[j,i]=$3; i+=1; if(i==20){j+=1; i=j+1;} }END{for(k=0;k<20;++k){ printf("%d,",k); for(l=0;l<20;++l) {printf("%.3f",val[k,l]); if(l!=19) printf(",");}; printf("\n");}}'
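
The one-liner above fills a symmetric 20x20 matrix from the third comma-separated field of each log line, walking the lower triangle column by column, and prints it as CSV with the row index first. Diagonal cells are never set, so they print as 0.000. The same logic with the size as a parameter, as a sketch: the function name `log_to_matrix`, the variable `n` and the sample input are illustrative only, and the first two fields of each log line are assumed to be labels that the script ignores.

```shell
#!/bin/sh
# Sketch: rebuild a symmetric n x n matrix from loader log lines on stdin.
# Assumes each matching line is "a,b,value" and that pairs arrive in the
# order (1,0),(2,0),...,(n-1,0),(2,1),... as in the one-liner above.
log_to_matrix() {
  n="$1"
  grep , | awk -v n="$n" 'BEGIN { i = 1; j = 0; FS = "," }
  {
    val[i, j] = $3; val[j, i] = $3      # mirror across the diagonal
    i += 1
    if (i == n) { j += 1; i = j + 1 }   # next column of the lower triangle
  }
  END {
    for (k = 0; k < n; ++k) {
      printf("%d,", k)
      for (l = 0; l < n; ++l) {
        printf("%.3f", val[k, l])
        if (l != n - 1) printf(",")
      }
      printf("\n")
    }
  }'
}

# Usage with three hypothetical values (n = 3):
printf 'a,b,0.5\na,c,0.25\nb,c,0.75\n' | log_to_matrix 3
# 0,0.000,0.500,0.250
# 1,0.500,0.000,0.750
# 2,0.250,0.750,0.000
```

For the log in this README the call would be `log_to_matrix 20 < tmp1.log`.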