A handy resource:
1. If you are using Spybot Search & Destroy on Windows systems, enter
2. Spybot Search & Destroy will list all blocked connections, i.e., those that are redirected to the localhost.
3. You can manually add or delete entries.
4. It is another layer of security!
For details, check:
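The blocking mechanism behind this is the Windows hosts file: known bad domains are mapped to 127.0.0.1 so the browser never reaches them. Here is a minimal sketch of how such entries can be parsed; the sample domains are illustrative assumptions, not real blocklist entries.

```python
# Minimal sketch: list hosts-file entries that redirect domains to localhost,
# the same mechanism Spybot's immunization uses. The sample content below is
# a hypothetical example, not an actual blocklist.
def blocked_domains(hosts_text):
    """Return domains mapped to 127.0.0.1 in hosts-file text."""
    blocked = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        parts = line.split()
        if parts[0] == "127.0.0.1":
            blocked.extend(parts[1:])          # one IP may map several names
    return blocked

sample = """
# entries added by an immunization tool (hypothetical)
127.0.0.1  localhost
127.0.0.1  ads.example-tracker.com  popups.example-tracker.com
"""
print(blocked_domains(sample))
```

On Windows the file lives at C:\Windows\System32\drivers\etc\hosts; Spybot simply appends entries of this form.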
Yesterday we had a brainstorming session with our programmers on Google hacking. It is so easy to grab PHP code, passwords, and databases from all over the Web, thanks to sloppy coders. For instance, do a search for
or check the list at http://www.thenetworkadministrator.com/googlesearches.htm These types of searches will spit out directory trees.
There are many “smart cookies” posting derivatives of these lists all over the Web.
And how about typos?
Try filetype: searches with extra characters in the extensions, like
Servers will spit out entire PHP source files.
The great offenders are large sites like those belonging to .edu, .gov, .org, not to mention large .com and .net sites.
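The reason those "typo" extensions leak source is simple: servers execute .php but happily serve backup or editor-artifact variants as plain text. A small sketch of the idea follows; the suffix list and the URL are illustrative assumptions, not a definitive inventory.

```python
# Sketch: why typo/backup extensions leak source. A server executes .php,
# but variants left behind by editors and backups are served as plain text.
# The suffixes and URL below are hypothetical examples.
LEAKY_SUFFIXES = [".bak", ".old", "~", ".txt", ".save"]

def candidate_leaks(url):
    """Given a script URL, return variants a careless server might serve as text."""
    return [url + s for s in LEAKY_SUFFIXES]

for u in candidate_leaks("http://www.example.com/login.php"):
    print(u)
```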
Ho, Ho, Ho, Merry Christmas, Santa.
Today I updated my Tutorial on Correlation Coefficients to include a new section on the effect of sample size on the significance of correlation coefficients. This was motivated by some comments from search engine marketers on correlation strengths. (http://searchenginewatch.com/3641002). The new material might help those interested in learning whether a reported correlation coefficient is statistically different from zero. It is given below. Enjoy it.
The problem with correlation strength scales is that they say nothing about how the size of a sample impacts the significance of a correlation coefficient. This is a very important issue that is now addressed.
Consider three different correlation coefficients: 0.50, 0.35, and 0.17. Suppose we want to test whether there is a significant relationship between the two variables at hand. The null hypothesis (H0) to be tested is that these r values are not statistically different from zero (rho = 0). How do we proceed?
As recommended by Stevens (17), for rho = 0, H0 can be tested using a two-tailed (i.e., two-sided) t-test at a given confidence level, usually 95%. If tcalculated ≥ ttable, H0 is rejected. However, if tcalculated < ttable, H0 is not rejected and there is no significant correlation between the variables.
Here tcalculated is computed as r/SEr = r*SQRT[(n – 2)/(1 – r²)], while ttable values are obtained from the literature (http://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values ). Table 2 summarizes the results of testing the null hypothesis at different sample sizes.
Table 2. H0 tests at different sample sizes; two-tailed, 95% confidence.
n | df = n – 2 | r | SEr | t(calc) | t(0.95) | Reject H0 (rho = 0)?
The table shows at which sample size an r value becomes high enough to be statistically significant.
For n = 14, all three r values (0.50, 0.35, and 0.17) are not statistically different from zero.
For n = 30, r = 0.50 is statistically different from zero while r = 0.35 and r = 0.17 are not.
Conversely, r = 0.50 is not statistically different from zero when n is equal to or less than 14, while r = 0.35 is not different from zero when n is equal to or less than 30.
Finally, r = 0.17 is not statistically different from zero at any of the sample sizes tested.
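The test above is easy to reproduce. The sketch below computes t = r·SQRT[(n – 2)/(1 – r²)] and compares it against the standard two-tailed 95% critical values for df = 12 and df = 28 (the n = 14 and n = 30 cases discussed above).

```python
# Reproducing the H0 test: t(calc) = r * sqrt((n - 2) / (1 - r**2)),
# compared against the two-tailed 95% critical t value for df = n - 2.
# The critical values are the standard table entries for df = 12 and 28.
import math

T_CRIT_95 = {12: 2.179, 28: 2.048}  # two-tailed, alpha = 0.05

def reject_h0(r, n):
    """True if H0 (rho = 0) is rejected at the 95% confidence level."""
    df = n - 2
    t_calc = r * math.sqrt(df / (1 - r**2))
    return t_calc >= T_CRIT_95[df]

for n in (14, 30):
    for r in (0.50, 0.35, 0.17):
        print(n, r, reject_h0(r, n))
```

Running it confirms the pattern in the text: at n = 14 none of the three r values is significant, while at n = 30 only r = 0.50 crosses the threshold.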
LDA and Google’s ranks well correlated?
After the hilarious example of this guy with the SEOMOZ LDA tool (http://smackdown.blogsblogsblogs.com/2010/09/09/proof-that-the-new-seomoz-tools-is-at-least-half-accurate/ ) I can only laugh out loud. Has anyone tried something like that?
Regarding the new fiasco with their LDA tool. Oh, no, another one… (http://www.seomoz.org/blog/lda-correlation-017-not-032) : What can I say? They sound pathetic and apologetic. The words overhyped, shitty, sloppy, flawed, etc., are not enough to describe their “research work”.
What will happen now with those Mute Speakerphones that were misled? Those that listen to fools become one.
I don’t feel any sympathy for their 15 minutes of “honesty”. The damage was done already to naïve readers.
Also, note that this latest flaw was discovered by them. It was not the result of any peer review process from external referees, as those throwing a towel at them would like to believe.
As mentioned before, beware of SEOs statistical “studies” and their quack “science” (http://irthoughts.wordpress.com/2010/04/23/beware-of-seo-statistical-studies/ ), especially if coming from SEOMOZ.
Probably their snakeoil will make a comeback soon. (Oh, no. Again?)
If they still think they have a valid LDA implementation, why not announce it at David Blei’s Topic-Models list, wherein a community of LDA experts will review it and compare it against other implementations?
Two things can happen:
(a) It will be reviewed.
(b) It will be ignored.
I “invite” them to do so.
Please, just don’t show up with your snakeoil, yellow shoes, your seo mom, paid cheerleaders, vested investors, overhyped claims, etc, etc.
More on their hype machine here: http://skitzzo.com/archives/seomoz-hype-machine.php
It appears that even Danny Sullivan is not buying SEOmoz’s “research” on LDA. Reportedly, “He didn’t think it was the remarkable change that SEOmoz was making it out to be.” (http://outspokenmedia.com/internet-marketing-conferences/evening-forum-with-danny-sullivan/). He even confronted them and put into question their “highly correlated” numbers. And that was even before they recanted.
Back on 2009/04/03 we wrote a nice comparison between LDA, LSI, and vector space theory: http://irthoughts.wordpress.com/2009/04/03/vector-space-probabilistic-lsi-and-lda/ LDA was also discussed by its creator (David Blei) at the 2006 IPAM Document Space Workshop (http://www.miislita.com/ipam/ipam-document-space-workshop.pdf ). Years before, on 8/25/2006, we wrote in an old ASP-based blog a post warning users against SEOs selling snakeoil in the form of SVD, LSI, and LDA arguments. The problem with these approaches is that they don’t scale well for the Web. I ended that 2006 post with a prediction:
“At this point I got tired of highlighting more flaws in the claims of these search marketing firms. A sample list of the latest LSI myths is available for your perusal.
Next stop for these snakeoil marketers? How about PCA (Principal Component Analysis) or LDA (Latent Dirichlet Allocation)?”
That post was eventually referenced in a rebuttal I posted at that cesspool of quacks known as seomoz and later fully reproduced at this blog on 2007/05/03 (http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/).
It was only a matter of time for johnny-come-latelies to “discover” LDA, the Niagara Falls, and the Grand Canyon. Oh my God, what a “bombshell”.
Expect a new wave of marketers trying to game naïve cheerleaders and their clients with their latest crap.
Nothing new under the Sun. Will the next stop of these snakeoil marketers disguised as “scientists” be NMF? How about Diffusion Geometries?
One more thing, for those that really want to learn LDA: subscribe to Topic-Models at https://lists.cs.princeton.edu
This is a mailing list on LDA run by David Blei and others. I’ve been subscribed for many years now, and the discussion on the topic is really useful.
Here in the Caribbean we are “surviving” Hurricane Earl, so the IRW issue has been delayed. This issue and the upcoming two will be a giveaway covering different types of inverted index architectures and fast indexing techniques. This is what I plan to cover:
Part One: Inverted Index Types
Part Two: Fast Indexing Techniques
Part Three: Fast Posting Lists Intersecting & Sharding
The QA section will cover redirection harvesting. I’m now going to do something not done before: release the QA section in advance, so those not subscribed to IRW will realize what they are missing.
Q: What is Redirection Harvesting?
A: Redirection harvesting is a phishing technique wherein a hacker or spammer identifies a trusted site that redirects users to specific pages by appending name-value commands to the redirection mechanism.
The mechanism is often a form or a URL without a security layer for filtering appended URLs. The idea is to replace the landing URL with the hacker’s or spammer’s URL, which is often obfuscated.
Although it no longer works, the best-known example of this involved eBay. For details, check http://www.google.com/search?q=ebay+redirections
The URL mechanism abused was
Note that a trusted site (eBay) is the one doing the redirection.
Naïve or unaware users receiving an email with such doctored URLs might think these belong to eBay and that they will redirect to a page within eBay, when in fact they take users to a malicious page. Once there, users are exposed to all kinds of attacks.
Many large and popular sites, including educational and government sites, are still guilty of allowing this to happen. The lesson here is that redirection mechanisms without URL filtering layers can and will be abused.
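The missing "URL filtering layer" mentioned above can be as simple as an allowlist check on the redirect target's host before issuing the redirect. A minimal sketch follows; the allowed hosts are illustrative assumptions.

```python
# Sketch of a redirect-filtering layer: before redirecting to a URL taken
# from a query parameter, verify its host against an allowlist. The hosts
# below are hypothetical examples.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"www.example.com", "pages.example.com"}

def safe_redirect_target(url):
    """Return url only if its host is allowlisted; otherwise a safe default."""
    host = urlparse(url).netloc.lower().split(":")[0]  # strip any port
    return url if host in ALLOWED_HOSTS else "/"

print(safe_redirect_target("http://pages.example.com/item?id=42"))  # passes
print(safe_redirect_target("http://evil.example.net/phish"))        # blocked
```

Without a check like this, any URL appended to the redirect parameter is trusted, which is exactly what redirection harvesting exploits.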
Today I’ve updated the Tutorial on Correlation Coefficients to add a new section on correlation strength scales. I feel this is warranted.
In a 7/16/2010 Search Engine Watch post, a search marketer reported an r value of 0.67 and stated:
“If 1 is perfectly correlative, then 0.67 is certainly a strong correlative relationship and a figure of some interest, when we consider there are a couple hundred factors that reportedly contribute to rank.” (http://searchenginewatch.com/3641002).
This raises the question of how to characterize correlation strengths. Several attempts have been made at classifying r values as ‘weak’, ‘poor’, ‘moderate’, ‘strong’, or ‘very strong’ using scales of correlations.
The problem with these scales is that their boundaries are often defined using subjective arguments, not to mention that not all researchers agree with using such boundaries or scales at all.
Feel free to read the updated version now.
BTW, during the updating process I found an inadvertently missing “not” in one line of example 3 in the tutorial. It should read as follows: “The difference between r1 and r2 is not significant at the 95% confidence level.” This should be obvious from reading the null hypothesis. My mistake, nevertheless.
Another correction was made in the Olkin-Pratt formula for bias. In this case, there was a missing parenthesis: instead of 2n – 3, this should be written as 2(n – 3). The parenthesis was originally included in the Excel program used to draw the graphs, so the graphs were not affected.
In the future and if I can find the time, I may add an exercise section applied to IR and search engines.
The tutorial on correlation coefficients is complete and will be out at some point next week. It is also a response to a certain “rebuttal”.
After reading it you will understand why statistics is a loss for SEOs, at least for those SEOs that have embraced the quack Science from SEOMOZ.
There is a surprise there for those that misquote search results and abstracts of research papers they have never read thoroughly: That’s the signature of sloppy “research”.
Until then, Happy 4th of July.
If you are enrolled in the IE-Part 1 course, here is some reference material on Email Headers for today’s lecture:
Exposing email headers
Tracking the source of email spam
How to read email headers
Reading the email header
Reading email headers
Spamlinks: Reading email headers
ACCC: Reading Email Headers
E-mail Headers and SMTP Commands
All About Email Headers
Security Optimization Strategies in the Workplace