The script uses Ruby and the Hpricot library. The logic proceeds as follows:
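- Analyze the structure of the web page.
- Open the page with open-uri, taking care to send a complete Referer header; otherwise tubaba.com treats the request as hotlinking.
- Pass the returned HTML to Hpricot.
- Use Hpricot to locate the .rar files to download.
- Call the shell program wget to download each .rar file to a chosen location, again specifying the Referer.
- Call the shell program unrar to extract the archive, passing the password with -p.
- Wait a short while, then continue with the next .rar file.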
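To make the "locate the .rar files" step concrete, here is a minimal sketch that uses only the Ruby standard library. It is not the Hpricot-based approach of this post: a regular expression stands in for the HTML parser, which is only adequate for simple markup. The helper name `rar_links` and the sample HTML (including the `/uploads/icons_01.rar` path) are hypothetical and exist only for illustration.

```ruby
require "uri"

# Hypothetical helper: return the absolute URLs of all .rar links in +html+.
# Assumption: a regex replaces Hpricot here, so only plain quoted href
# attributes are recognized.
def rar_links(html, page_url)
  html.scan(/<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/i)
      .flatten
      .select { |href| href.downcase.end_with?(".rar") }
      .map { |href| URI.join(page_url, href).to_s }
end

# Made-up sample markup, loosely shaped like a table of download links.
sample = <<~HTML
  <table><tr>
    <td><a href="/uploads/icons_01.rar">Download</a></td>
    <td><a href="/picture/2006/0831/image_140_2.html">Next page</a></td>
  </tr></table>
HTML

puts rar_links(sample, "http://www.tubaba.com/picture/2006/0831/image_140.html")
```

Resolving each href against the page URL with `URI.join` means relative and site-absolute links are both turned into full URLs before they are handed to a downloader.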
The program was tested and runs on Ubuntu 7.04. The source code follows:
    #!/usr/bin/ruby

    require "hpricot"
    require "open-uri"

    # Fetch one gallery page, then download and unpack every .rar it links to.
    def fetch_page(url, i)
      puts "Analyzing #{url} ..."
      # Send a complete Referer, otherwise tubaba.com treats this as hotlinking.
      doc = Hpricot(open(url,
        "User-Agent" => "Internet Explorer 6.0",
        "From" => "myals@gmail.com",
        "Referer" => url))
      doc.search("table/tbody/tr/td/a").each do |link|
        href = link.attributes['href']
        # Skip anchors without an href and links that are not archives.
        next unless href && href.include?(".rar")
        rar = "/tmp/a/#{i}.rar"
        # wget must send a Referer as well.
        system "wget http://www.tubaba.com#{href} --referer=http://www.tubaba.com#{href} -O #{rar}"
        # -p passes the archive password to unrar.
        system "unrar x -ptubaba.com #{rar} /home/water/share/网页矢量图标"
      end
    end

    # The first page has no numeric suffix; pages 2 to 53 are image_140_N.html.
    fetch_page("http://www.tubaba.com/picture/2006/0831/image_140.html", 1)
    (2..53).each do |i|
      fetch_page("http://www.tubaba.com/picture/2006/0831/image_140_#{i}.html", i)
      sleep 2  # pause before the next page
    end
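The script above builds its wget and unrar calls by concatenating strings, so an href containing spaces or shell metacharacters would be interpreted by the shell. One possible variation, sketched below, builds each command as an argument array and passes it to the multi-argument form of `system`, which runs the program without going through a shell. The helper names `wget_args`, `unrar_args`, and `page_url` are hypothetical, and the `/uploads/icons_01.rar` path is a made-up example, not a real file on the site.

```ruby
# Hypothetical helper: argument list for downloading one archive.
# As in the script above, the Referer is set to the file's own URL.
def wget_args(href, dest, base = "http://www.tubaba.com")
  url = base + href
  ["wget", url, "--referer=#{url}", "-O", dest]
end

# Hypothetical helper: argument list for extraction; -p carries the password.
def unrar_args(archive, password, target_dir)
  ["unrar", "x", "-p#{password}", archive, target_dir]
end

# Page 1 has no numeric suffix; later pages are image_140_N.html.
def page_url(i, base = "http://www.tubaba.com")
  suffix = i == 1 ? "" : "_#{i}"
  "#{base}/picture/2006/0831/image_140#{suffix}.html"
end

# Print the commands instead of running them, so the sketch has no side effects.
p wget_args("/uploads/icons_01.rar", "/tmp/a/1.rar")
p unrar_args("/tmp/a/1.rar", "tubaba.com", "/home/water/share/网页矢量图标")
puts page_url(1), page_url(2)
```

To actually run a command, splat the array: `system(*wget_args(href, dest))`. `system` returns `true` only when the program exits successfully, so the unrar step can be skipped when the download fails.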