Forums: users (Thread #4850)

tokenize_headers (2004-03-24 16:17 by a39 #8858)

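As currently specified, only the case of a single To: or Cc: header is handled, isn't it?

RFC 2822 defines both as appearing at most once, but many implementations do not follow that in practice, and messages with multiple To: or Cc: headers do circulate. I think it would be better to handle those as well.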
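
To illustrate the request, here is a minimal sketch that keeps every occurrence of a header field instead of only one. `collect_headers` is a hypothetical helper written for this post, not bsfilter's actual get_headers, and the addresses are made up.

```ruby
# Hypothetical helper (not bsfilter's get_headers): gather every occurrence
# of a header field, so repeated To:/Cc: lines are all kept.
def collect_headers(lines)
  headers = Hash.new { |hash, key| hash[key] = [] }
  current = nil
  lines.each do |line|
    line = line.chomp
    if line =~ /\A(\S+?):\s*(.*)/
      current = $1.downcase
      headers[current] << $2                     # new field: start a new entry
    elsif line =~ /\A\s+\S/ && current
      headers[current][-1] += " " + line.strip   # folded continuation line
    elsif line.empty?
      break                                      # blank line ends the headers
    end
  end
  headers
end

mail = [
  "To: alice@example.com",
  "Cc: bob@example.com",
  "To: carol@example.com,",
  "  dave@example.com",
  "",
  "body",
]
headers = collect_headers(mail)
puts headers["to"].join(", ")
```

Keeping one array entry per occurrence leaves it up to the caller whether to join the values or weigh them separately.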

RE: tokenize_headers (2004-03-25 00:12 by nabeken #8869)

Done; this is now handled.
Reply to #8858

RE: tokenize_headers (2004-03-25 00:27 by a39 #8871)

Thank you.
But... isn't the Received: parsing now behaving differently from what was intended? ;-)
Reply to #8869

RE: tokenize_headers (2004-03-25 00:44 by nabeken #8872)

Fixed.
Reply to #8871

RE: tokenize_headers (2004-03-25 01:14 by a39 #8880)

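Sorry for raising these one at a time, but could multiple Received: headers be handled as well?
The reason is that in spam it is not unusual for the very last Received: to be forged.

One option would be to add processing that judges after excluding the self-evident Received: headers whose delivery path is known, such as those inside an organization. However, I expect that once a reasonable amount of clean mail has been learned, the self-evident Received: headers and the others will be sifted apart and that will be reflected in the cost.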
Reply to #8872

RE: tokenize_headers (2004-03-27 12:50 by a39 #8911)

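For the time being I tried a quick hack, even though I don't really know Ruby.
It now picks up every Received: header. Maybe I can evaluate its effectiveness over the weekend?
I also found some minor bugs and places that do not follow the RFC, so I will report them when I find the time.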
Reply to #8880

RE: tokenize_headers (2004-03-27 14:04 by nabeken #8912)

There is some history behind this:
http://nabeken.tdiary.net/20030518.html#p03
Reply to #8911

RE: tokenize_headers (2004-03-27 21:01 by a39 #8919)

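Yes, I am aware of that.
But Received: can easily be forged (and is being forged), and I think it is now exposed to exactly the same kind of influence as word salad.

So ideally, the best thing would be to judge on what remains after removing information about indirect delivery paths inside ISPs and the like, and about delivery paths inside servers such as ML servers.
# Though many MLs throw away the Received: headers between the poster and the ML server...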
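
A rough sketch of what excluding known internal hops could look like. Everything here is an assumption for illustration: `INTERNAL_HOSTS`, `hop_hosts` and `internal_hop?` are hypothetical names, the host names are invented, and a hop is treated as internal only when both its "from" and "by" hosts are on the list.

```ruby
# Hypothetical list of hosts on our own delivery path (made-up names).
INTERNAL_HOSTS = ["mx.example.jp", "ml.example.jp"].freeze

# Extract the "from" and "by" host names of one unfolded Received: value.
def hop_hosts(received)
  from = received[/\Afrom\s+([\w.\-]+)/i, 1]
  by   = received[/\bby\s+([\w.\-]+)/i, 1]
  [from, by].map { |host| host && host.downcase }
end

# A hop counts as internal only if both ends are known internal hosts.
def internal_hop?(received)
  from, by = hop_hosts(received)
  !from.nil? && !by.nil? &&
    INTERNAL_HOSTS.include?(from) && INTERNAL_HOSTS.include?(by)
end

received = [
  "from mx.example.jp (mx.example.jp [192.0.2.10]) by ml.example.jp (Postfix) with ESMTP",
  "from mail.example.net (unknown [203.0.113.5]) by mx.example.jp (Postfix) with SMTP",
]
kept = received.reject { |value| internal_hop?(value) }
puts kept
```

The hop where an outside host hands the message to our own MX is kept on purpose, since that line was written by a server we control.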
Reply to #8912

patch (2004-04-01 00:58 by a39 #8987)

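Below is my half-baked patch that makes every Received: header subject to learning.

It may be none of my business, but while I was at it I also revised the regular expressions for the Envelope Sender and the Envelope Recipient. (They only check that the text is made up of a particular character set; they do not describe the structure of an address as a regular expression.)
I also revised the regular expression for the MTA queue ID.

For Received:, all it does is cut the MTA queue ID and the timestamp and then join the continuation lines. I don't really understand the cases in which the original "envelope-from" gets recorded, so that part is left half-finished.

This is an ad hoc patch written by feel by someone who does not understand (or try to understand) Ruby at all, so it is rather embarrassing, but I am posting it for information sharing and feedback. ;-)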

*** bsfilter.1.55~ Thu Mar 25 00:42:51 2004
--- bsfilter.1.55 Thu Apr 1 00:48:34 2004
***************
*** 819,829 ****
        str = str.chomp
        if (str =~ /\A(\S+?):\s*(.*)/)
          current = $1.downcase
!         if (current == "received")
!           headers[current] = $2.sub(/[\r\n]*\z/, '')
!         else
!           headers[current] = (headers[current] || "") + " " + $2.sub(/[\r\n]*\z/, '')
!         end
        elsif (str =~ /\Afrom\s+(\S+)/i)
          headers["ufrom"] = $1
        elsif (str =~ /\A\r*\z/)
--- 819,825 ----
        str = str.chomp
        if (str =~ /\A(\S+?):\s*(.*)/)
          current = $1.downcase
!         headers[current] = (headers[current] || "") + " " + $2.sub(/[\r\n]*\z/, '')
        elsif (str =~ /\Afrom\s+(\S+)/i)
          headers["ufrom"] = $1
        elsif (str =~ /\A\r*\z/)
***************
*** 831,837 ****
        elsif (! current)
          break
        else
!         headers[current] += str.sub(/[\r\n]*\z/, '').sub(/\A\s*/, '')
        end
      end
      if ((headers["content-type"] =~ /\bboundary=\s*"(.*?)"/i) ||
--- 827,837 ----
        elsif (! current)
          break
        else
!         if (current == "received")
!           headers[current] += " " + str.sub(/;\s.*[\r\n]*\z/, '').sub(/\A\s*/, '')
!         else
!           headers[current] += " " + str.sub(/[\r\n]*\z/, '').sub(/\A\s*/, '')
!         end
        end
      end
      if ((headers["content-type"] =~ /\bboundary=\s*"(.*?)"/i) ||
***************
*** 979,1000 ****
      head_db = TokenDB::new(lang)
      reg_token = Regexp::compile("\\b\\d[\\d\\.]+\\d\\b|[\\w#{$mark_in_token}]+")

-     if (headers["received"])
-       str = headers["received"]
-       str =~ /envelope\-from\s+([\w@\.\-]+)/
-       efrom = $1
-       str =~ /for\s+<([\w@\.\-]+)>/
-       foraddress = $1
-       str.sub!(/(\bid|;).*/im, '')
-       str.sub!(/\(qmail[^\)]*\)/, '')
-       str += " " + efrom if efrom
-       str += " " + foraddress if foraddress
-       headers["received"] = str
-     end
      headers.each do |header, content|
        case header
        when "ufrom", "from", "to", "cc", "subject", "reply-to", "return-path",
             "content-transfer-encoding", "content-type", "content-disposition", "charset", "received"
          if (lang == "ja")
            content.gsub!(/=\?utf\-8\?([bq])\?(\S*)\?=/i) do |s|
              b_or_q = $1
--- 979,998 ----
      head_db = TokenDB::new(lang)
      reg_token = Regexp::compile("\\b\\d[\\d\\.]+\\d\\b|[\\w#{$mark_in_token}]+")

      headers.each do |header, content|
        case header
        when "ufrom", "from", "to", "cc", "subject", "reply-to", "return-path",
             "content-transfer-encoding", "content-type", "content-disposition", "charset", "received"
+         if (header == "received")
+           content =~ /envelope-from\s+([\w@!#$\%&'*+\-\/=?^_`{|}~\.]+)/
+           efrom = $1
+           content =~ /for\s+<([\w@!#$\%&'*+\-\/=?^_`{|}~\.]+)>/
+           foraddress = $1
+           content.gsub!(/(\bid\s+\w+)/im, '')
+           content.gsub!(/\(qmail[^\)]*\)/, '')
+           content += " " + efrom if efrom
+           content += " " + foraddress if foraddress
+         end
          if (lang == "ja")
            content.gsub!(/=\?utf\-8\?([bq])\?(\S*)\?=/i) do |s|
              b_or_q = $1
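
For anyone who wants to try the Received: clean-up outside bsfilter, here is a standalone sketch of the same idea: pull out the envelope-from and "for" addresses, drop the timestamp, the queue ID and the qmail comment, then append the addresses again. `SPECIALS`, `ADDR_CHARS` and `clean_received` are hypothetical names, the sample header is invented, and the special characters are escaped through Regexp.escape rather than written inline, so this is not a copy of the patch.

```ruby
# Characters allowed in an address besides \w. As in the patch, this only
# checks the character set, not the structure of an address.
SPECIALS   = %q(!#$%&'*+-/=?^`{|}~.@)
ADDR_CHARS = /[\w#{Regexp.escape(SPECIALS)}]+/

# Hypothetical helper: reduce one unfolded Received: value to the parts
# that are worth learning.
def clean_received(value)
  efrom      = value[/envelope-from\s+(#{ADDR_CHARS})/, 1]
  foraddress = value[/for\s+<(#{ADDR_CHARS})>/, 1]
  cleaned = value.sub(/;\s.*\z/m, "")            # drop "; <timestamp> ..."
  cleaned = cleaned.gsub(/\bid\s+\w+/i, "")      # drop the MTA queue ID
  cleaned = cleaned.gsub(/\(qmail[^)]*\)/, "")   # drop the "(qmail ...)" comment
  cleaned = cleaned.squeeze(" ").strip
  [cleaned, efrom, foraddress].compact.join(" ")
end

sample = "from mail.example.net (mail.example.net [203.0.113.5]) " \
         "by mx.example.jp (8.12.11/8.12.11) with ESMTP id i2VFwXkd012345 " \
         "for <user@example.jp>; Thu, 1 Apr 2004 00:58:33 +0900 (JST) " \
         "(envelope-from sender@example.net)"
result = clean_received(sample)
puts result
```

The addresses are extracted before the timestamp is cut because, in this sample, the envelope-from comment sits after the ";".
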
Reply to #8911

tokenizer (2004-03-25 16:17 by a39 #8887)

This also relates to header parsing, so please let me hang it under this topic.

Currently, in header parsing and URL parsing, an IP address is stored as a complete address, but for an FQDN the period is treated as a token boundary, so every level of the domain part ends up as a separate piece.
Could the FQDN be stored as it is?

Current behavior
http://www.foo.bar.baz.example.com -> www foo bar baz example com

Proposal 1.
http://www.foo.bar.baz.example.com -> http://www.foo.bar.baz.example.com (as is)

Proposal 2.
http://www.foo.bar.baz.example.com
-> http://www.foo.bar.baz.example.com foo.bar.baz.example.com bar.baz.example.com baz.example.com example.com com

This is not quite a bigram, but the idea is to learn the shared higher-level domains as well. What already works well enough as things stand is

Proposal 3. same as bigram(?)
http://www.foo.bar.baz.example.com
-> www www.foo foo.bar bar.baz baz.example example.com com
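
Proposals 2 and 3 can be sketched in a few lines, working on the bare host name. `domain_suffixes` and `label_bigrams` are hypothetical names, and the bigram version assumes that proposal 3 means adjacent label pairs plus the first and last labels on their own.

```ruby
# Proposal 2: the host name itself plus every parent domain.
def domain_suffixes(fqdn)
  labels = fqdn.split(".")
  (0...labels.length).map { |i| labels[i..-1].join(".") }
end

# Proposal 3 (as read here): adjacent label pairs, with the first and
# last label also emitted on their own.
def label_bigrams(fqdn)
  labels = fqdn.split(".")
  pairs = labels.each_cons(2).map { |a, b| "#{a}.#{b}" }
  [labels.first, *pairs, labels.last]
end

host = "www.foo.bar.baz.example.com"
puts domain_suffixes(host).join(" ")
puts label_bigrams(host).join(" ")
```

With proposal 2, two hosts under the same parent domain share the parent tokens, which is the point of learning the higher-level domains.
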
Reply to #8858

get_headers bug? (2004-03-29 18:14 by a39 #8945)

About get_headers:

$ diff -c bsfilter.1.55~ bsfilter.1.55
*** bsfilter.1.55~ Mon Mar 29 18:09:13 2004
--- bsfilter.1.55 Mon Mar 29 18:09:30 2004
***************
*** 831,837 ****
        elsif (! current)
          break
        else
!         headers[current] += str.sub(/[\r\n]*\z/, '').sub(/\A\s*/, '')
        end
      end
      if ((headers["content-type"] =~ /\bboundary=\s*"(.*?)"/i) ||
--- 831,837 ----
        elsif (! current)
          break
        else
!         headers[current] += " " + str.sub(/[\r\n]*\z/, '').sub(/\A\s*/, '')
        end
      end
      if ((headers["content-type"] =~ /\bboundary=\s*"(.*?)"/i) ||
$

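Shouldn't it be like this?

In the original, when header continuation lines are joined, the token at the end of the previous line and the token at the start of the next line appear to get concatenated. (Confirmed with Received: headers from Sendmail and Postfix.)

However, with this fix applied, the spam detection rate seems to drop considerably.
I wonder why???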
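
A small reproduction of the concatenation, outside bsfilter. `unfold` is a hypothetical helper, the Received: header is invented, and `/\w+/` is only a simplified stand-in for bsfilter's token regexp.

```ruby
folded = [
  "Received: from mail.example.net (mail.example.net [203.0.113.5])",
  "\tby mx.example.jp (Postfix) with ESMTP",
  "\tid 1A2B3C4D; Mon, 29 Mar 2004 18:09:30 +0900 (JST)",
]

# Join folded lines, stripping leading white space from each continuation
# line and putting `separator` between the pieces.
def unfold(lines, separator)
  first, *rest = lines.map(&:chomp)
  rest.inject(first) { |acc, line| acc + separator + line.sub(/\A\s*/, "") }
end

without_space = unfold(folded, "")    # behaviour of the original code
with_space    = unfold(folded, " ")   # behaviour with the proposed fix

tokens_without = without_space.scan(/\w+/)
tokens_with    = with_space.scan(/\w+/)

# Tokens that exist only because two lines were glued together.
p(tokens_without - tokens_with)
```

Here "ESMTP" and "id" fuse into a single token, while ")" followed by "by" does not, because the parenthesis already acts as a boundary.
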
Reply to #8858

RE: get_headers bug? (2004-03-29 23:37 by nabeken #8947)

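Actually, this is a known bug.
If you fix it this way, the inserted space should become a problem in some cases, for example when a Japanese Subject is broken across lines in the middle. It ought to be fixed properly, but that is a hassle, so I have left it alone.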
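
One possible way to handle the encoded Subject case, as a sketch only: join continuation lines with a space, then drop the white space that merely separates two adjacent encoded-words, which RFC 2047 says is to be ignored. `ENCODED_WORD` and `unfold_header` are hypothetical names, and a raw (not MIME-encoded) Japanese Subject folded mid-text is not covered.

```ruby
# One RFC 2047 encoded-word: =?charset?B-or-Q?encoded-text?=
ENCODED_WORD = /=\?[^?\s]+\?[bq]\?[^?\s]*\?=/i

# Hypothetical helper: unfold a header value given as an array of lines.
def unfold_header(lines)
  value = lines.map { |line| line.chomp.sub(/\A\s+/, "") }.join(" ")
  # White space between two adjacent encoded-words is not part of the text.
  value.gsub(/(#{ENCODED_WORD})\s+(?=#{ENCODED_WORD})/) { $1 }
end

subject = [
  "=?ISO-2022-JP?B?GyRCJDMkcyRLJEEkTxsoQg==?=",
  " =?ISO-2022-JP?B?GyRCQCQzJhsoQg==?=",
]
plain = ["with ESMTP", "\tid 1A2B3C4D"]

puts unfold_header(subject)
puts unfold_header(plain)
```

Ordinary headers such as Received: still get the separating space, so tokens on either side of a fold stay apart.
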
Reply to #8945

RE: get_headers bug? (2004-04-01 09:15 by a39 #8989)

I added very quick-and-dirty support, for MIME (7bit) only.
Really, handling this inside tokenize_headers would probably be the smartest approach.

*** bsfilter.1.55~ Thu Mar 25 00:42:51 2004
--- bsfilter.1.55 Thu Apr 1 09:10:47 2004
***************
*** 831,837 ****
        elsif (! current)
          break
        else
!         headers[current] += str.sub(/[\r\n]*\z/, '').sub(/\A\s*/, '')
        end
      end
      if ((headers["content-type"] =~ /\bboundary=\s*"(.*?)"/i) ||
--- 831,843 ----
        elsif (! current)
          break
        else
!         if (current == "received")
!           headers[current] += " " + str.sub(/;\s.*[\r\n]*\z/, '').sub(/\A\s*/, '')
!         elsif (current == "subject" && str =~ /\s+=\?/)
!           headers[current] += str.sub(/[\r\n]*\z/, '').sub(/\A\s*/, '')
!         else
!           headers[current] += " " + str.sub(/[\r\n]*\z/, '').sub(/\A\s*/, '')
!         end
        end
      end
      if ((headers["content-type"] =~ /\bboundary=\s*"(.*?)"/i) ||
Reply to #8947

RE: get_headers bug? (2004-04-01 09:25 by a39 #8990)

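I posted the benchmark results in another topic. The drop in spam detection probability when tokens are separated across continuation lines may have been a problem with how far the DB had been trained.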
Reply to #8945
