Web Scraping Search Engine Scripts

Last month's issue of IRW covered Web scraping as a Web mining activity and unveiled the Web Mining Studio, a suite of scrapers. One of these, the Scripts Scraper, lets anyone build a library of scripts from all over the Web. This month's issue of the newsletter, which will be a bit delayed, covers how we grabbed scripts used by search engines. Here is a sample from http://news.google.com.

Scripts Report

7 results found.

n HTML
1. <script type="text/javascript"> setupjsflags(); </script>
2. <script type="text/javascript"> news_logerrors = true </script>
3. <script type="text/javascript"> try { window["jstiming"]["load"].tick("bol"); } catch (e) { news_logerror(e, "csi:bol"); } </script>
4. <script type="text/javascript">function setupjsflags() { news_flags = {}; news_flag_xhrpathprefix = 0; news_flags[news_flag_xhrpathprefix] = "/news/xhr"; news_flag_usejsimagefetchtracking = 1; news_flags[news_flag_usejsimagefetchtracking] = false; news_flag_enableemail = 2; news_flags[news_flag_enableemail] = true; news_flag_experiments = 3; news_flags[news_flag_experiments] = ""; news_flag_pingcsi = 4; news_flags[news_flag_pingcsi] = true; news_flag_prefetchcitylist = 5; news_flags[news_flag_prefetchcitylist] = false; news_flag_maxcreatepagetitlelength = 7; news_flags[news_flag_maxcreatepagetitlelength] = 25; news_flag_enablestarring = 8; news_flags[news_flag_enablestarring] = true; news_flag_enable_create_page_suggestions = 9; news_flags[news_flag_enable_create_page_suggestions] = true; news_flag_enable_js_debug = 10; news_flags[news_flag_enable_js_debug] = false } function news_logerror(e, extramessage) { var url = "/news/xhr/log_error?ned=" + "us" + "&error=" + encodeuricomponent(e.name + ": " + e.message) + "&useragent=" + encodeuricomponent(navigator.useragent) + "&url=" + encodeuricomponent(window.location) + "&experiments=" + encodeuricomponent("") + "&stack=" + encodeuricomponent(e.stack) + "&errorlocation=" + encodeuricomponent(extramessage); new image().src = url; } function grabjsbundle(jsurl) { var scriptel = document.createelement("script"); scriptel.src = jsurl; scriptel.onerror = function() { if (window['news_beforeonloadfired']) { return; } news_logerror(new error("deferred js error"), "error in download of deferred js: " + jsurl); }; var head = document.getelementsbytagname('head')[0]; head.appendchild(scriptel); }</script>
5. <script type="text/javascript">var a=window,b="substring";if(a.location.hash=="#changed"){var c=a.location.href;c=c.substr(0,c.indexof("#"));var d=[];if(c.indexof("?")>-1)for(var e=c[b](c.indexof("?")+1).split("&"),f=0;f<e.length;f++)e[f][b](0,3)!="zx="&&e[f][b](0,3)!="pz="&&e[f][b](0,5)!="shid="&&d.push(e[f]);d.push("pz=1");d.push("zx="+math.random());a.location=a.location.pathname+"?"+d.join("&")}; </script>
6. <script type="text/javascript">var global_window=window;function timer(b){this.t={};this.tick=function(c,d,a){a=a?a:(new date).gettime();this.t[c]=[a,d]};this.tick("start",null,b)}var loadtimer=new timer;global_window.jstiming={timer:timer,load:loadtimer};try{global_window.jstiming.pt=global_window.gtbexternal&&global_window.gtbexternal.paget()||global_window.external&&global_window.external.paget}catch(e){}; </script>
7. <script type="text/javascript">window.gbar={};(function(){function g(a,b,c){var d="on"+b;if(a.addeventlistener)a.addeventlistener(b,c,false);else if(a.attachevent)a.attachevent(d,c);else{var h=a[d];a[d]=function(){var f=h.apply(this,arguments),e=c.apply(this,arguments);return f==undefined?e:e==undefined?f:e&&f}}};var i=window.gbar,k,l;function m(a){var b=window.encodeuricomponent&&(document.forms[0].q||"").value;if(b)a.href=a.href.replace(/([?&])q=[^&]*|$/,function(c,d){return(d||"&")+"q="+encodeuricomponent(b)})}i.qs=m;function n(a,b,c,d,h,f){var e=document.getelementbyid(a),j=e.style;if(e){j.left=d?"auto":b+"px";j.right=d?b+"px":"auto";j.top=c+"px";j.visibility=l?"hidden":"visible";if(h&&f){j.width=h+"px";j.height=f+"px"}else{n(k,b,c,d,e.offsetwidth,e.offsetheight);l=l?"":a}}}i.tg=function(a){a=a||window.event;var b=a.target||a.srcelement;a.cancelbubble=true;if(k!=null)o(b);else{a=document.createelement(array.every||window.createpopup?"iframe":"div");a.frameborder="0";a.src="javascript:''";k=b.parentnode.appendchild(a).id="gbs";g(document,"click",i.close);o(b);i.alld&&i.alld(function(){var c=document.getelementbyid("gbli");if(c){var d=c.parentnode;d.removechild(c);p(d)}})}};function q(a){var b,c=document.defaultview;if(c&&c.getcomputedstyle){if(a=c.getcomputedstyle(a,""))b=a.direction}else b=a.currentstyle?a.currentstyle.direction:a.style.direction;return b=="rtl"}function o(a){var b=0;if(a.classname!="gb3")a=a.parentnode;var c=a.getattribute("aria-owns")||"gbi",d=a.offsetwidth,h=a.offsettop>20?46:24,f=false;do b+=a.offsetleft||0;while(a=a.offsetparent);a=(document.documentelement.clientwidth||document.body.clientwidth)-b-d;d=q(document.body);if(c=="gbi"){var e=document.getelementbyid("gbi");i.alli&&i.alli(e);p(e);if(d){b=a;f=true}}else if(!d){b=a;f=true}l!=c&&i.close();n(c,b,h,f)}i.close=function(){l&&n(l,0,0)};function r(a,b){var c=a.firstchild?a.firstchild.classname:"gb2";a.insertbefore(b,a.firstchild).classname=c}function p(a){for(var b,c=window.navextra;c&&(b=c.pop());)r(a,b)}})();</script>
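
For readers who want to experiment, a report in the style of the sample above can be approximated with a few lines of Python. This is only a minimal sketch using the standard library's html.parser module; it is not how the Scripts Scraper itself works, which this issue does not show. The names ScriptCollector and scripts_report are hypothetical, and the page's HTML is assumed to have been downloaded already.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the attributes and inline body of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # finished (attrs, body) pairs, in page order
        self._attrs = None   # attributes of the <script> being read, else None
        self._chunks = []    # text pieces seen inside the current <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that sits inside a <script> element.
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._attrs is not None:
            # Collapse runs of whitespace so each script fits on one line.
            body = " ".join("".join(self._chunks).split())
            self.scripts.append((self._attrs, body))
            self._attrs = None


def scripts_report(html):
    """Return a numbered, one-line-per-script report (hypothetical format)."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    lines = [f"{len(parser.scripts)} results found."]
    for n, (attrs, body) in enumerate(parser.scripts, start=1):
        # Valueless attributes such as "async" are stored with value None.
        attr_text = "".join(
            f" {k}" if v is None else f' {k}="{v}"' for k, v in attrs.items()
        )
        lines.append(f"{n}. <script{attr_text}>{body}</script>")
    return "\n".join(lines)
```

Calling scripts_report(page_html) on a saved copy of a page returns the count line followed by one numbered entry per script, including external scripts that have only a src attribute and an empty body.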

2 Comments

  1. Hi E. Garcia,

    I recommend you check out Scrapy (http://scrapy.org), an open source framework for web scraping. It is written entirely in Python and powered by Twisted (http://twistedmatrix.com) and other open source libraries.

    Scrapy is growing very fast and seems to be used in the data.gov.uk project (see this tweet: http://twitter.com/bfirsh/status/8025368963).

    I think you can find a lot of useful stuff in Scrapy.

    Kind regards,
    Andres

  2. Hi, Andres:

    Thank you for stopping by and for the tip. We had heard about Scrapy before.

    I think I'll stick with the Web Mining Studio platform, which does all that Scrapy does, plus term weight analysis and a few other IR scoring tasks. With the upcoming integration of the Fractal CSS Design Studio (FCDS) into the WMS, it will do very fancy self-similar Web design stuff from extracted data. We are putting together a white paper on the FCDS, and it will be available soon.
