IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: June 2008

On the Stability of Global Weights in the Presence of Spam

30 Monday Jun 2008

Posted by egarcia in Data Mining, Spam, Vector Space Models

In A Comparison of Document, Sentence, and Term Event Spaces, Blake analyses the stability of various global weight models (IDF, ISF, and ITF) with respect to document and journal corpuses. She uses stratified samples collected based on term frequency information. Abstracts and section partitions of full-text scientific articles were studied.

We are conducting studies along similar lines, but using web collections. Such studies are extremely important in a dynamic environment like the web, which is different from abstracts and scientific journal collections. Web collections often consist of documents of general interest or noisy content. Documents can be designed using dubious practices like keyword repetition techniques. Such techniques, commonly known as keyword spam, can introduce biased term frequency data.

It should be underscored that keywords relevant to products and services are also intentionally repeated across the web. In addition, not all document content found in web collections is valid. Some are near duplicates, have been artificially generated, or are the result of vested alliances like link exchange programs. Thus, the stability of IDF, ISF, and ITF with respect to web collections or subsets of these in such a noisy environment is still an open question.

Any pointers or references to research on this topic are greatly appreciated.

Answers to IR Quiz

27 Friday Jun 2008

Posted by egarcia in IR Quizzes, IR Tutorials, Queries

Answers to the IR Quiz are given below:

Term Independence Assumption:

If k1 and k2 are statistically independent they should occur by chance, co-occurring in only

(100)(200)/500 = 40 documents.

Thus, if they occur by chance, the number of documents mentioning the k1 k2 sequence should be unknown, but certainly no greater than 40.

Term Dependence Assumption:

If the terms actually co-occur in 70 documents, they are co-occurring more often than by chance (70 > 40). So, the terms are statistically dependent and positively correlated. It is a given that the k1 k2 terms sequence is present in 25 out of the 70 documents wherein the terms co-occur.
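The chance expectation and its comparison with the observed count can be checked in a few lines. This is a minimal sketch: the function name `expected_cooccurrence` and the variable names are illustrative and do not come from the quiz.

```python
def expected_cooccurrence(n1: int, n2: int, total: int) -> float:
    """Documents expected to contain both terms if the terms are independent.

    Under independence P(k1 AND k2) = P(k1) * P(k2), so the expected
    count is total * (n1 / total) * (n2 / total) = n1 * n2 / total.
    """
    return n1 * n2 / total


N = 500          # documents in the collection
n_k1 = 100       # documents mentioning k1
n_k2 = 200       # documents mentioning k2
observed = 70    # documents mentioning both k1 and k2

expected = expected_cooccurrence(n_k1, n_k2, N)
ratio = observed / expected  # a value above 1 suggests positive correlation

print(f"expected by chance: {expected:.0f}")
print(f"observed / expected: {ratio:.2f}")
```

A ratio above 1 only says the terms co-occur more often than chance predicts; it says nothing about how often they appear as the k1 k2 sequence.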

Detailed Results:

Results are given below, rounded off to two decimal places. First/second results respectively are for terms independence/dependence assumptions. You should be able to double check these results.

1. k1 NOT k2: 60, 30

2. k2 NOT k1: 160, 130

3. k1 OR k2 (unconditional OR): 260, 230

4. k1 OR k2 (conditional OR): 220, 160

5. NOT k1: 400, 400

6. NOT k2: 300, 300

7. NOT (k1 AND k2): 460, 430

8. k1 AND k2 NOT (k1 k2): NC, 45

9. EF-Ratio of the k1 k2 terms sequence: NC, 0.36

10. c12-index of the k1 k2 terms sequence: NC, 0.11

11. c12-index of k1 AND k2: 0.15, 0.30

12. IDF of k1: 0.70, 0.70

13. IDF of k2: 0.40, 0.40

14. IDF of k1 AND k2: 1.10, 0.85

15. IDF of k1 k2 terms sequence: NC, 1.30
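For readers who want to double check the figures, here is a hedged sketch that reproduces them. The formulas are inferred from the listed results and are not spelled out in the post: the unconditional OR is treated as an inclusive OR and the conditional OR as an exclusive OR, the c12-index is taken as hits divided by the number of documents containing k1 or k2, the EF-Ratio as sequence count divided by co-occurrence count, and IDF as a base-10 logarithm. All function names are illustrative.

```python
import math


def quiz_counts(total, n1, n2, n12):
    """Document counts for the Boolean queries, given the co-occurrence count n12."""
    union = n1 + n2 - n12
    return {
        "k1 NOT k2": n1 - n12,
        "k2 NOT k1": n2 - n12,
        "k1 OR k2 (unconditional)": union,       # assumed: inclusive OR
        "k1 OR k2 (conditional)": union - n12,   # assumed: exclusive OR
        "NOT k1": total - n1,
        "NOT k2": total - n2,
        "NOT (k1 AND k2)": total - n12,
    }


def c_index(n1, n2, n12, hits):
    """Assumed form: hits divided by the documents containing k1 or k2."""
    return hits / (n1 + n2 - n12)


def idf(total, n):
    """Assumed base-10 IDF, log10(N / n)."""
    return math.log10(total / n)


N, n1, n2, seq = 500, 100, 200, 25
independent = n1 * n2 // N   # co-occurrences expected by chance
dependent = 70               # co-occurrences actually observed

for label, n12 in (("independence", independent), ("dependence", dependent)):
    print(label, quiz_counts(N, n1, n2, n12))
    print("  c12-index of k1 AND k2:", round(c_index(n1, n2, n12, n12), 2))
    print("  IDF of k1 AND k2:", round(idf(N, n12), 2))

# Sequence-level figures are reported only under dependence,
# where the observed counts (70 co-occurrences, 25 sequences) apply.
print("k1 AND k2 NOT (k1 k2):", dependent - seq)
print("EF-Ratio:", round(seq / dependent, 2))
print("c12-index of sequence:", round(c_index(n1, n2, dependent, seq), 2))
print("IDF of sequence:", round(idf(N, seq), 2))
print("IDF of k1, k2:", round(idf(N, n1), 2), round(idf(N, n2), 2))
```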

Additional exercises open to discussion:

Calculate the associated odds, odds ratios, and logits.
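As one possible starting point for this exercise, the sketch below assumes the usual textbook definitions: odds = p / (1 - p), logit = natural log of the odds, and the odds ratio taken from the 2x2 table of k1 versus k2. The post does not say which definitions are intended, and the function names are illustrative.

```python
import math


def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)


def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))


def odds_ratio(total, n1, n2, n12):
    """Odds ratio from the 2x2 contingency table of k1 versus k2."""
    both = n12
    only_k1 = n1 - n12
    only_k2 = n2 - n12
    neither = total - (n1 + n2 - n12)
    return (both * neither) / (only_k1 * only_k2)


N, n1, n2 = 500, 100, 200

print("odds of k1:", odds(n1 / N))
print("logit of k1:", round(logit(n1 / N), 2))
print("odds ratio, independence:", odds_ratio(N, n1, n2, 40))
print("odds ratio, dependence:", round(odds_ratio(N, n1, n2, 70), 2))
```

Under these assumptions the odds ratio equals 1 for the independence case and exceeds 1 for the dependence case, which agrees with the positive correlation noted above.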

 

Complimentary version of IR Watch

26 Thursday Jun 2008

Posted by egarcia in Newsletters

By now subscribers should have the current issue of IR Watch – The Newsletter in their inbox.

Non-subscribers: A complimentary version with a few minor changes is available online at http://www.miislita.com. Take advantage of this freebie while you can. Let others know about IRW and what they are missing.

IRW might be subject to a few changes in the future.

Sneak Preview of IRWatch: Understanding IDF

23 Monday Jun 2008

Posted by egarcia in Queries, SEO Myths, Vector Space Models

“IDF is simply neither a pure heuristic, nor the
theoretical mystery many have made it out to be.
We have a pretty good idea why it works as well
as it does.” –Stephen E. Robertson

Here is a sneak preview of IR Watch for the month of June, 2008. It should be in subscribers’ inboxes during the day or, at the latest, tomorrow.

IDF is discussed within the context of co-occurrence theory and term independence/dependence assumptions. Issues and misconceptions related to this measure are addressed. Initially we planned to include ongoing work we are conducting on specificity measures, but we have chosen not to, since this is not the appropriate forum.

IRW-2008-06: Understanding Inverse Document Frequency (IDF)

In this issue:

Introduction
Robertson-Sparck Jones Early Work on IDF
What IDF Is Not
What IDF Really Is
On Terms Independence
On Terms Dependence
Few Examples
Estimating the IDF of a Phrase
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Link Sellers from 1995

20 Friday Jun 2008

Posted by egarcia in Marketing Research, Miscellaneous

I was looking for the oldest evidence of marketing firms formally selling links and came across this one from 1995 that predates Google and most current search engines. Back then the Internet-on-a-Disk newsletter was hot.

Their November 1995 issue http://bubl.ac.uk/archive/journals/ioad/n1395.htm  reports this:

A NEW KIND OF ADVERTISING — Webconnect http://www.worldata.com/webcon.htm These folks act as hyperlink brokers. They have signed up hundreds of Web sites. They go to potential advertisers and offer them a package deal. For $X per month, you can have hyperlinks to your Web site from Y Web sites which attract the kinds of audiences you want to appeal to. The revenue is shared with the Web sites, which have the right to refuse any advertiser they don’t feel is appropriate for them.

They contacted us about a month ago, and now already we have our first “advertiser”: The Encyclopedia Britannica. For including a hypertext link to their site (with a little graphic), we receive $45 a month. That’s not bad considering that we run our entire Web site on free space that we get with our $29 a month SLIP account with TIAC. So our one advertiser more than pays for our Internet access and our Web space. And the advertiser is a company we’re glad to help promote – they have a site that we have wanted to point to anyway as an important educational resource (http://www.eb.com)

*********************************
“HIT-VITATIONS” – WHAT’S GOING ON? AND HOW DO YOU PLAY THIS GAME? by Richard Seltzer, B&R Samizdat Express
I never expected that blatant commercial advertising would work on the Internet. The medium is much better suited for providing detailed information to people who want it, when they want it, and how they want it. Surprisingly, some of the much travelled on-ramp sites like Netscape are showing impressive results from “hyper-banner” advertising. I recently spoke with Kathleen Gilroy of Kathleen Gilroy Associates, a distance education company in Cambridge, Mass. In exchange for sponsorship of an Internet training program, she got a hyperlinked “banner” on the Netscape site. The result was 500,000 hits on her Web site in the first month (http://www.kga.com).

Well, if you learn anything from dealing with the Internet and human behavior there, it’s that you’ve got to expect the unexpected and adjust quickly to change.

So is advertising “in” now? Is that the way to go?

I’ve heard people comparing hits or visits at a Web site to responses to a direct mail campaign. That seems far-fetched — not the right ballpark, not the right order of magnitude in terms of predicting audience behavior.

The first-time visitor who clicks to your site by way of a hyper-banner does so on random impulse. You’ve generated some street traffic by making it easy for people to impulsively move in your direction from some other site — a click costs the user little time and almost no effort — little thinking is involved — curiosity is enough.

When you buy an ad on television or in a newspaper, you are buying an opportunity to catch the attention of an established audience. When you buy a hyper-banner on the Internet, you buy an opportunity to induce people to come to your site and be (at least once) part of your audience. You have not yet begun to catch their attention.

A reminder and invitation to check a website (not a direct ad for a product or service) is a step or two removed from traditional advertising. It is audience acquisition for another program.

Once they “hit” your site, you have an opportunity to catch their interest, to provide them with useful information or an enjoyable experience or a discussion with people of like mind. You have earned a chance to give them good reason to come back again and again to your site. If, at that point, you simply shove a blatant ad in their face or ask them to fill out a long form before you let them see or do anything else, you could be throwing away that opportunity.

In other words, a hyper-banner is a “hit-vitation,” an invitation to hit another site. And the success of this approach does not mean that blatant advertising is thriving on the Internet.

In the Hit-vitation business, you are in do-it-yourself mode. Your Web site is the equivalent of a publication or a broadcast station – run by you. You need to build an audience — by serving an audience — before you can expect to get results. And raw hits – randomly gleaned from pointers and paid-for banner links — are not an audience, they are just an opportunity to build an audience.

Generating hits by way of hyperlink invitations is analogous to acquiring a list of prospects for one-time direct-mail use. These people have not yet even seen, much less read, an ad or marketing material, and the vast majority, once at your site, will do the equivalent of throwing your marketing material in the wastebasket. In other words, this is a step removed from direct mail responses, and marketers should set their expectations of results accordingly.

At this point in the evolution of commerce on the Internet, the experience of the user with a Web site is simply too complex to reduce to statistics. For the long term, success should be measured not by hits or visits but by some index of user loyalty: how likely they are to return again and again. For today, remember that if you pay for a banner/link, you are sending out invitations to anyone and everyone to click on over to your site and take a look. And what that’s worth to you depends on what you have at your site and how useful and compelling people find it.

I still believe that the most interesting opportunities on the Internet are likely to come from serving audiences rather than selling advertising.

In my ideal model, you provide a place where people can interact with one another about matters of common interest; you provide related free information and useful pointers; and once you have built an audience and interact with those people regularly, you begin to provide them with services and products which they need. The better you serve them, the more likely you are to be successful. And in this mode very small operations could be very profitable and very beneficial as well.

****************************************************
REACTIONS TO “HIT-VITATIONS”
by Tom Camp, camp@zeke.enet.dec.com

Some interesting thoughts regarding “hit-vitation”. Another way to view these interesting “sign-posts” is from the perspective of someone driving down a city street loaded with signs for organizations (e.g. churches, clubs, etc.), businesses (stores, commercial sites, etc.) and leisure activities (theatres, parks, amusements, etc.).

The Internet allows individuals to return to the first days of driving (a.k.a. teenagers) when “cruising” in and of itself was compelling. While cruising, we looked at all the signs. They were new, exciting and had never been seen from the driver’s seat. A great way to just enjoy ourselves as we thrilled at the freedom. Computers prior to the Internet didn’t allow us much freedom, you know. We saw the same view of office applications and accounting programs, spreadsheets, lists, etc.

When we first drove our cars, we may have driven by those signs thousands of times and driven into a few parking lots and browsed in some stores. Slowing to check things out, talking with people on the sidewalk – just enjoying the thrill. The places we checked out had a high degree of relationship to our interests.

As we matured though, driving became routine and lost some of its thrill and excitement. We went from one place to another because we had a purpose. Sometimes that purpose was to browse or lose ourselves for a few hours in a Mall or store that we liked, but most often it was guided by a very specific purpose. When driven by such a purpose, every red light, yellow light, traffic jam and small yellow Volkswagen in front of us proved a maddening distraction. Eventually we stop only where we have a purpose.

Much of what we’re pursuing with the Internet today is an attempt to match our Internet content and services to purposes which people find compelling in their lives. For the consumer market this will not be an easy task. The business user will benefit significantly in the short term for all the reasons you’ve described before.

Obviously, we need to understand more about the habits and effects of maturation of the Internet driver. I know I still act like a teenager sometimes, clicking and clicking and clicking… But when I’m looking for specific information on a company or a product – I want it NOW (one click away). Long delays (regardless what the cause) drive me to look for a horn to blow or some gesture to make at some faceless Webmaster in the sky. I maintain my hot list and constantly scribble URLs to avoid those long lines.

As a marketeer, I know there is power in this new medium. Measuring its effectiveness will keep us all employed for many years to come. I agree with your concept that return is important. But as in life on the road, for some sites how often is not so important; pure hits may be. The type of site is a critical component in measuring how successful it is. For example, a site which provides information for a specific event might be effectively measured on total hits, while a commercial site offering a variety of information over time might be better measured by some combination of new hits and returns.

Over time we’ll see an evolution of sites and a maturity of users. As with any new market, niches will evolve that we can’t anticipate today and specialized services will develop to meet these needs. For us the challenge is to keep looking to identify these trends and help characterize them, measure their success and build (as we say) compelling solutions.

Just a few thoughts…

**********************************************************
HOW DO YOU DEFINE SUCCESS? THE REAL RESULTS DIRECTORY

People who run Web sites have many different objectives — from making the world a better place to live, to building a business or both — and hence they have very different definitions of success and methods of trying to achieve it. If you run a Web site and believe that it brings you results, send email to samizdat@samizdat.com and ask for “results.txt”, and we’ll send back a questionnaire. We’ll gather the responses (no hobbies and personal pages please — just sites designed to produce results), and we’ll make them available to all for free on our Web site at http://www.tiac.net/users/samizdat/results.html (Remember, we’re just getting started. There’s not much to see yet.)

We hope that by sharing our experiences we can help one another make better use of this strange and exciting new medium. And at the same time, this is a vehicle for those who run Web sites to let people know what they’re doing and why, and why people should visit.

We’re calling this project “Real Results: The directory of successful Web sites.” Please spread the word.

I am not claiming these are the earliest link brokers; they probably aren’t.

If anyone has pointers to earlier link brokers, let me know, as there is nothing new under the sun. Nevertheless, it is great fun reading about how online marketing components started. I always learn something new by reading or lecturing on pieces of online history.

Send me pointers to the original sources, not articles some marketer wrote or compiled about the history of the Internet. I am putting together material for a new lecture titled: The History of Online Marketing.

It is time for Yang to leave Yahoo!

19 Thursday Jun 2008

Posted by egarcia in Miscellaneous

After Yahoo’s CEO Jerry Yang blew it with Microsoft, and ahead of the likely August proxy fight between the Icahn and Yang sides, more executives are leaving the company. Flickr’s founders Caterina Fake and Stewart Butterfield have joined the exodus of senior Yahoo managers.

Also missing in action are:

Usama Fayyad, Chief Data Officer
Jeff Weiner, President of Yahoo’s Network division
Jeremy Zawodny, Software Developer/blogger and well known in SEO/SEM conferences

Meanwhile, Yahoo! is making the same dumb mistake many businesses from the ’90s made: slaving its revenue model to the revenue model of others (Google, in this case).

Google is effectively controlling Yahoo!’s revenue stream channels and getting away with it. Great.

Schofield reports that TechCrunch is compiling a Then-Now table of former Yahoo! executives.

Update: TechCrunch reports rumors that three more executives are leaving Yahoo! These are:

Vish Makhijani, the SVP and General Manager of Search
Brad Garlinghouse, head of Communications & Communities at Yahoo (Mail, Groups, Messenger, Flickr, and Zimbra)
Qi Lu, EVP engineering for Search and Advertising Technology Group and chief architect of the Panama platform.

Wait! It is getting worse!

Today TechCrunch announced that Joshua Schachter, founder of Delicious, is also leaving Yahoo!.

And in a new twist, Information Week reports that Jason Zajac, general manager of social media at Yahoo, is leaving.

IR Quiz

18 Wednesday Jun 2008

Posted by egarcia in Graduate Courses, IR Tutorials, Machine Learning

Here is a question I included in the final examination of the Search Engines Architecture course. I am modifying the question. It might serve as a little quiz for non-IR readers:

A collection consists of 500 documents. Some documents mention k1 and/or k2 keywords. Suppose 100 mention k1, 200 mention k2, 70 mention both k1 and k2, and 25 mention the k1 k2 terms sequence. Calculate the number of results for the following queries, first assuming terms independence and second assuming terms dependence. If the calculation is not possible from the provided data, write NC, ‘Not Computable’.

1. k1 NOT k2

2. k2 NOT k1

3. k1 OR k2 (unconditional OR)

4. k1 OR k2 (conditional OR)

5. NOT k1

6. NOT k2

7. NOT (k1 AND k2)

8. k1 AND k2 NOT (k1 k2)

9. EF-Ratio of the k1 k2 terms sequence

10. c12-index of the k1 k2 terms sequence

11. c12-index of k1 AND k2

12. IDF of k1

13. IDF of k2

14. IDF of k1 AND k2

15. IDF of k1 k2 terms sequence

Total Possible Scores: 15 points for correct results under terms independence and 15 points for correct results under terms dependence.

Grading Yourself: A (100 – 90), B (89 – 80), C (79 – 70), D (69 – 60), F (59 – 0)

Correct answers will be given during the week.

 

SEOs and their IDF Myths

17 Tuesday Jun 2008

Posted by egarcia in Marketing Research, SEO Myths, Spam

Now that the semester is over we can take on other projects. After a little break from the blog, it is good to be back. We are putting the final touches on this month’s issue of IR Watch – The Newsletter. During the break, dozens of new subscribers signed up.

The piece takes on several IDF myths and misconceptions promoted by SEOs and on what IDF is/is not. Here is an excerpt:

One recurrent misconception found across online media channels (search marketing blogs, forums, etc.) is the assertion that IDF can be used to assess how important or relevant a term might be to the content of a document. This claim has no basis.

It should be stressed that as a measure of term specificity over N, IDF is not a local, but a global measure. IDF evaluates the discriminating power of a term within a collection of documents. A term ti might be relevant or important to the content of a document. However, if this document is part of a collection wherein all documents repeat ti, the term loses its discriminating power since N = ni and IDFi = log(N/ni) = 0.
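The point about N = ni can be illustrated with a toy example. This is only a sketch: the four-document collection is hypothetical, the base-10 logarithm is an assumption, and N and n_i are read in the usual way as the number of documents in the collection and the number of documents containing the term.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """IDF_i = log(N / n_i); base 10 is an assumption made here."""
    return math.log10(total_docs / docs_with_term)


# Hypothetical toy collection: every document repeats "widgets",
# so N = n_i for that term.
collection = [
    {"cheap", "widgets", "sale"},
    {"widgets", "review"},
    {"widgets", "repair", "guide"},
    {"blue", "widgets"},
]

N = len(collection)
for term in ("widgets", "review"):
    n_i = sum(term in doc for doc in collection)
    print(term, n_i, round(idf(N, n_i), 2))
```

The value depends only on how many documents in the collection contain the term, not on how prominent the term is inside any single document.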

Somehow, these marketers are mistaking IDF for the RSJ model, or who knows what, possibly (as is often the case) to promote themselves or whatever they sell.

SEOs Scams: LSI, KW, and Markov Chains

03 Tuesday Jun 2008

Posted by egarcia in SEO Myths, Spam

I’m happy to learn that Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at Indian Institute of Technology Madras, India is using my LSI and Term Vector tutorials for his graduate courses:

http://aidb.cs.iitm.ernet.in/cs625/11.SVD-LSI.pdf

http://aidb.cs.iitm.ernet.in/cs625/10.VectorSpace-model.pdf

It is great to see that more and more IRs and graduate students are realizing how certain SEOs have misled the public and their clients by selling their snake oil in the form of “LSI optimization” and keyword density services. The most recent scam comes in the form of “Markov chain” services. As if they really knew about matrix algebra and Markov chain processes. Same old tricks…

It is not surprising to hear colleagues referring to these SEOs as vulgar crooks and scammers.
