Back on 2009/04/03 we wrote a nice comparison of LDA, LSI, and Vector Space theory. https://irthoughts.wordpress.com/2009/04/03/vector-space-probabilistic-lsi-and-lda/ LDA was also discussed by its creator (David Blei) at the 2006 IPAM’s Document Space Workshop (http://www.miislita.com/ipam/ipam-document-space-workshop.pdf ). Years before, on 8/25/2006, we wrote a post in an old asp-based blog warning users against SEOs selling snakeoil in the form of SVD, LSI, and LDA arguments. The problem with these approaches is that they don’t scale well for the Web. I ended that 2006 post with a prediction:
“At this point I got tired of highlighting more flaws in the claims of these search marketing firms. A sample list of the latest LSI myths is available for your perusal.
Next stop
Next stop for these snakeoil marketers? How about PCA (Principal Component Analysis) or LDA (Latent Dirichlet Allocation)?”
That post was eventually referenced in a rebuttal I posted at that cesspool of quacks known as seomoz and later fully reproduced on this blog on 2007/05/03 (https://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/).
It was only a matter of time before some Johnny-come-lately would “discover” LDA, the Niagara Falls and the Grand Canyon. Oh my God, what a “bombshell”.
Expect a new wave of marketers trying to game naïve cheerleaders and their clients with their latest crap.
Nothing new under the Sun. Will the next stop of these snakeoil marketers disguised as “scientists” be NMF? How about Diffusion Geometries?
PS
One more thing, for those that really want to learn LDA: subscribe to Topic-Models at https://lists.cs.princeton.edu
This is a list forum on LDA run by David Blei and others. I’ve been subscribed for many years now and the discussion on the topic is really useful.
Thank you for the writeup. Academically I’m from IR in my undergrad, working now professionally in SEO. I’ve seen a huge response to SEOMoz’s LDA tool, and yet have seen literally nothing in terms of the hardcore IR science that goes behind the tool. Colorful language to be sure, but one of the first opinions I’ve heard from the IR community.
That’s usual from SEOs posing as “scientists”. After their hype their snakeoil will be obvious.
If you really want to get into real LDA and coding, subscribe to David Blei’s topic-models list at Princeton.
You probably won’t be glad to know that it seems your May 2007 article has been reused by an Indian SEO company for link building.
Not everyone who deals with SEO topics is snakeoil – in some ways I am, in that I don’t even try to understand the math.
I do try hard to understand the concepts and spot the holes in what is said; those holes often contain clues to what might be happening.
Observation more than reverse engineering.
Hi, there:
I’m not worried about others copying my content. It has happened many times before.
These days it is hard to find companies providing valid SEO services.
Glad to know you are an observer trying to grasp the IR concepts behind search engines.
Sean Golliher, CEO/editor of SEMJ magazine, has pointed out more flaws in the quack “statistical studies” of seomoz. http://www.seangolliher.com/2010/seo/185/
Sean, before leaving to the IR conference, glad you spotted their nonsensical “experiments”. They keep thinking that their best defense is the offense or repeating lies many times over and over. Ha, ha. Poor crooks.
On and on, I know we have discussed this before, but it will be helpful to share with others:
(a) it is not valid to compute an arithmetic average of correlation coefficients, simply because they are not additive. One would need to compute a Fisher weighted mean coefficient by making r-to-Z conversions, averaging all the Z values, and then doing a Z-to-r conversion to get a weighted mean correlation.
(b) Pearson’s can be applied to nonlinear problems if the data can be linearized, as would be the case, for example, with power or exponential laws. If the data cannot be linearized, then certainly Pearson’s cannot be applied; i.e., at least not to the raw data. Still, one could assign ranks and compute Spearman’s as Pearson’s on the ranked data. This is described at https://irthoughts.wordpress.com/2010/06/10/on-spearmans-correlation-coefficients-with-excel/
(c) PCA makes no assumption about the distribution of the original raw data to be treated. Contrary to popular opinion, PCA is not limited to linear data, as nonlinear PCA is possible.
All this is mentioned in my tutorial on correlation coefficients:
http://www.miislita.com/information-retrieval-tutorial/a-tutorial-on-correlation-coefficients.pdf
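For readers who prefer code to prose, points (a) and (b) can be sketched in a few lines of plain Python. This is only an illustration, not code taken from the tutorial: the function names are made up for this example, and the weights n - 3 in the Fisher mean are an assumption (a common choice, since the variance of Z is approximately 1/(n - 3)).

```python
import math

def fisher_mean_r(rs, ns):
    """Weighted mean correlation: r-to-Z, average the Z values, Z-to-r.
    Weights n_i - 3 are an assumed (commonly used) choice."""
    zs = [math.atanh(r) for r in rs]          # r-to-Z conversion
    ws = [n - 3 for n in ns]                  # sample-size weights
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    return math.tanh(z_bar)                   # Z-to-r conversion

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                 # mean rank of the tied group
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's coefficient computed as Pearson's on the ranked data."""
    return pearson(ranks(x), ranks(y))
```

With two made-up samples of 13 observations each and correlations 0.9 and 0.1, the arithmetic average is 0.5, while fisher_mean_r([0.9, 0.1], [13, 13]) comes out noticeably higher, which is the whole point of (a). For (b), take an exponential law such as y = exp(x): pearson on the raw data falls short of 1, while pearson on (x, log y) and spearman on the raw data both give 1 up to rounding.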
It is amazing how these “scientists” and snakeoil sellers keep deceiving the public. I guess they will never admit to their quack “experiments”.