Forums: users (Thread #4560)

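Cases where junk (salad) is inserted between </body> and </html> (continued from the old BBS) (2004-02-23 01:03 by Anonymous #8086)

This is Yamaguchi.

For what it's worth, I have confirmed quite a few cases on my end.
I think the people who slip salad in after </BODY> or </HTML> are counting on it being "invisible, right?" in an environment with an HTML parser, yet "visible, right?" to bsfilter.

It is fine while the numbers are small, but by the time someone thinks "this is a blind spot, isn't it", it is already too late... (isn't that the basics of security measures?)
How long do we have to keep up these fruitless countermeasures? ;-<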
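
To make the concern above concrete, here is a minimal Ruby sketch. It is not bsfilter's actual tokenizer: `naive_tokens` and the sample message are hypothetical, invented only for illustration. It shows that a filter which simply drops tags still picks up words placed after `</html>`, which the post assumes an HTML-aware mail reader would not display.

```ruby
# Hypothetical sample: a short spam body followed by "salad" after the closing tags.
SAMPLE = <<~HTML
  <html><body>Buy cheap pills now</body></html>
  meadow violin harvest lantern
HTML

# Naive tokenizer (an assumption, not bsfilter's code):
# replace every tag with a space, lowercase, and collect runs of letters.
def naive_tokens(html)
  html.gsub(/<[^>]*>/, ' ').downcase.scan(/[a-z]+/)
end

tokens = naive_tokens(SAMPLE)
salad  = %w[meadow violin harvest lantern]

puts "all tokens:  #{tokens.inspect}"
puts "salad seen:  #{(tokens & salad).inspect}"
```

Under these assumptions the salad words land in the same token list as the real body words, which is exactly what the sender is hoping for.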

RE: Cases where junk (salad) is inserted between </body> and </html> (old BB (2004-02-24 00:42 by nabeken #8122)

What about between the opening <html> and <body>?

## With mew + emacs-w3m + w3m-m17n, both the area before <body> and the area after </body> are in fact visible.
Reply to #8086

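RE: Cases where junk (salad) is inserted between </body> and </html> (old BB (2004-02-24 14:00 by a39 #8134)

I have spam with junk before <html>, and spam with junk between <html> and <body>.
Would there be any real harm in ignoring <head>...</head> entirely?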
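
The positions mentioned so far (before `<html>`, between `<html>` and `<body>`, and after `</body>`) can be inspected with a small Ruby sketch like the one below. The helper name `regions` and its hash keys are hypothetical; this is only a way to look at a message by hand, not how bsfilter itself parses mail.

```ruby
# Split an HTML message into the regions discussed in this thread.
# Returns nil when <html>, <body> or </body> is missing.
# (Hypothetical helper for manual inspection; not part of bsfilter.)
def regions(html)
  m = html.match(
    /\A(?<pre>.*?)<html[^>]*>(?<gap>.*?)<body[^>]*>(?<body>.*?)<\/body>(?<post>.*)\z/im
  )
  return nil unless m

  {
    before_html:  m[:pre],   # anything ahead of the opening <html>
    html_to_body: m[:gap],   # between <html> and <body> (normally <head>...</head>)
    body:         m[:body],  # the part a mail reader is expected to show
    after_body:   m[:post]   # everything after the first </body>
  }
end

sample = "alpha<HTML>beta<BODY>real text</BODY>gamma</HTML>delta"
regions(sample).each { |name, text| puts "#{name}: #{text.inspect}" }
```

Because the tags are matched case-insensitively and `.` is allowed to span newlines, the same helper works on multi-line messages with `<HTML>` or `<html>` spelling.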
Reply to #8122

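RE: Cases where junk (salad) is inserted between </body> and </html> (old BB (2004-02-25 01:19 by nabeken #8144)

I will have it pick up <head>-</head>. Whether picking it up is meaningful is left to the filter's statistical processing.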
Reply to #8134

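RE: Cases where junk (salad) is inserted between </body> and </html> (old BB (2004-03-09 01:01 by a39 #8507)

Is the content between <HEAD> and </HEAD> also collected uniformly under the body category?
For example, I think it is fairly common to find salad in droves between <TITLE> and </TITLE>, so I wonder whether the filter ends up picking that up and being misled by it after all. ;-(

p.s.
Lately I have the impression that spam containing no salad, or only a little, is on the rise...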
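
As an illustration of the `<TITLE>` case, the following Ruby sketch pulls the title words out of a message and compares them with everything a tag-stripping tokenizer would collect. The helpers `title_words` and `all_words` and the sample message are hypothetical and are not taken from bsfilter.

```ruby
# Hypothetical message with salad hidden in <TITLE>.
MSG = "<HTML><HEAD><TITLE>orchard pebble canyon</TITLE></HEAD>" \
      "<BODY>Cheap loans</BODY></HTML>"

# Words found inside <title>...</title> (empty array if there is no title).
def title_words(html)
  m = html.match(/<title[^>]*>(.*?)<\/title>/im)
  m ? m[1].downcase.scan(/[a-z]+/) : []
end

# Words found anywhere once tags are dropped (a single-category view).
def all_words(html)
  html.gsub(/<[^>]*>/, ' ').downcase.scan(/[a-z]+/)
end

puts "title words: #{title_words(MSG).inspect}"
puts "all words:   #{all_words(MSG).inspect}"
```

If title and body text share one category, as this sketch assumes, the title salad is counted alongside the visible body words; whether it actually misleads the filter depends on the statistics.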
Reply to #8144

RE: Cases where junk (salad) is inserted between </body> and </html> (old BB (2004-03-10 00:45 by nabeken #8528)

Yes, body.
Reply to #8507

emacs-w3m (2004-03-20 05:17 by a39 #8767)

I suppose that is because w3m was originally a pager.
Reply to #8122

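RE: Cases where junk (salad) is inserted between </body> and </html> (continued from the old BBS) (2004-03-03 15:04 by a39 #8343)

The current behavior strips the junk (salad) that follows </body>, but some messages have no </body> and simply end with </html>. So, ever since 1.38.4.12, I have personally been running a modified version that discards whatever follows </body> or </html>.

The change against the latest version is as follows. :)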
$ diff -r bsfilter.1.40 bsfilter.1.40fix
1210,1211c1210,1211
< # remove salad after body
< if (str =~ Regexp::compile('\A(.*)</body>[^<>]*?</html>[^<>]*?\z', Regexp::MULTILINE | Regexp::IGNORECASE, 'n'))
---
> # remove salad after body or html
> if (str =~ Regexp::compile('\A(.*)</(body|html)>[^<>]*?\z', Regexp::MULTILINE | Regexp::IGNORECASE, 'n'))
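
For readers who want to try the two patterns from the diff, here is a small Ruby sketch. The pattern strings are copied from the diff; the helper `strip_salad`, the sample strings, and the assumption that the first capture group is what gets kept are mine, since the diff does not show what bsfilter does after the match. The third `'n'` argument to `Regexp::compile` is omitted in this sketch.

```ruby
OPTS = Regexp::MULTILINE | Regexp::IGNORECASE

# Patterns copied from the diff above (old = 1.40, new = the proposed fix).
OLD = Regexp.new('\A(.*)</body>[^<>]*?</html>[^<>]*?\z', OPTS)
NEW = Regexp.new('\A(.*)</(body|html)>[^<>]*?\z', OPTS)

# Hypothetical helper: keep capture group 1 on a match, else return str as is.
def strip_salad(str, re)
  m = re.match(str)
  m ? m[1] : str
end

with_body    = "<html><body>hi</body></html>\nsalad words"
without_body = "<html>hi</html>\nsalad words"

[with_body, without_body].each do |msg|
  puts "old: #{strip_salad(msg, OLD).inspect}"
  puts "new: #{strip_salad(msg, NEW).inspect}"
end
```

In this sketch the old pattern leaves the message without `</body>` untouched, while the new one trims the trailing salad from both. One thing worth checking: because `(.*)` is greedy, the new pattern anchors on the last closing tag, so text sitting between `</body>` and `</html>` stays inside the captured group. Whether that matters depends on how bsfilter uses the match.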
Reply to #8086

RE: Cases where junk (salad) is inserted between </body> and </html> (old BB (2004-03-04 00:07 by nabeken #8377)

merged to 1.42
Reply to #8343

RE: Cases where junk (salad) is inserted between </body> and </html> (old BB (2004-03-04 02:13 by a39 #8380)

Thanks!

p.s.
Since 1.42 added --export-probability, it looks like I will be able to observe the state of the prob. DB. :)
Reply to #8377
