Archive for the ‘Web Mining Course’ Category

Keyword Density Tools and SEOs

February 26, 2008

SEOs are still debating whether keyword density is good for something. The most recent debate is at http://www.hobo-web.co.uk/seo-blog/index.php/keyword-density-seo-myth/

Overall, the agreement is that it is not useful.

Two issues strike me, because they suggest a lack of understanding of how search engines work. They come down to the following questions:

1. Could KD be used by search engines or users to check for keyword spam?
2. Is Vector Space currently in use by modern search engines?

Let me clarify these points.

Could KD be used by search engines or web page creators to check for keyword spam?

Word repetition that search engines flag as keyword spam should be of more concern than what web page creators or a KD tool tag as keyword spam. After all, search engines, not designers of web pages, are the ones that assign a rank to documents. This ties in with the user-machine relevance perception mismatch and with the use of document linearization as a gap analysis. We have thoroughly discussed both in our IRWatch Newsletter, at this blog, and at Mi Islita.

However, this does not mean end users count for nothing, as they are the ones who pay the bills. And even if they don’t, why rank a page high just to see users go somewhere else after visiting it because it is not suitable for human consumption? So, rather than using a KD tool, just write as naturally and usefully for your prospective clients and readers as you can.

Regarding the use of KD tools to check for spam, this claim reminds me of certain SEO books, marketers, and community forums that insist on such nonsense just to keep their KD tools relevant and alive.

During the Web Mining Course we debunked these and similar SEO myths on an almost routine basis. For instance, grad students learned about several local weight models that attenuate frequencies, hence serving the purpose of both scoring local weights and dampening the effect of keyword repetition. Two for the price of one!

This is more cost-effective at neutralizing keyword repetition than computing (and comparing against) a whole new ratio, KD. Best of all, it does not require the two extra loops one would have to use to compute KD (one for every term i in a doc and another for every doc j across a collection). Thus, whatever the % ratio computed by a KD tool, it will be compacted/attenuated within the corresponding scale of the local weight model used. So, from the search engine side, KD is not even a cost-effective tool for fighting spam.
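As a rough illustration of the attenuation argument, the sketch below compares a KD ratio with a log-attenuated local weight as a term is repeated. The function names and the 1,000-word document length are assumptions made for illustration, and LOGA is taken here to be 1 + log10(f), one common form of that model.

```python
import math


def keyword_density(term_count: int, total_words: int) -> float:
    """KD as a percentage: occurrences of the term over all words in the doc."""
    return 100.0 * term_count / total_words


def loga_weight(term_count: int) -> float:
    """Log-attenuated local weight, assumed to be 1 + log10(f) for f > 0."""
    if term_count <= 0:
        return 0.0
    return 1.0 + math.log10(term_count)


if __name__ == "__main__":
    total_words = 1000  # hypothetical document length
    for f in (1, 10, 100):
        # The KD ratio grows linearly with f; the local weight grows only with log f.
        print(f, keyword_density(f, total_words), loga_weight(f))
```

Under these assumptions, going from 1 to 100 repetitions multiplies the KD ratio by 100 but only triples the local weight.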

To make sure students understood, I included the following three questions in the multiple-choice section of the Final Exam. (The problem-solving section of the test is even more interesting, but it is too long to include here.)

#10. It is a false statement:

a. Distance is anti-similarity.
b. Keyword density estimates keyword relevance.
c. In Vector Space Theory, a document is a vector of terms.
d. In Vector Space Theory, a query is a vector of terms.

#15. Which model does not attenuate frequencies?

a. SQRT
b. FREQ
c. LOGA
d. LOGN

#16. Consider two documents d1 and d2 wherein local term weights are computed using the LOGA model. d1 mentions a term once. How many times should this term be repeated in d2 to triple its d1 weight? Assume base-10 logarithms.

a. 3 times.
b. 30 times
c. 100 times
d. 1000 times

Answers: 10. b, 15. b, and 16. c. (sorry I’ve made a typo).
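Question 16 can be checked numerically. The sketch below assumes LOGA is 1 + log10(f), which is consistent with answer c; the helper names are hypothetical.

```python
import math


def loga_weight(f: float) -> float:
    """Assumed LOGA local weight: 1 + log10(f), for f > 0."""
    return 1.0 + math.log10(f)


def frequency_for_multiple(base_f: float, multiple: float) -> float:
    """Frequency needed in d2 so that its weight is `multiple` times the d1 weight.

    Solves 1 + log10(f2) = multiple * (1 + log10(base_f)) for f2.
    """
    target = multiple * loga_weight(base_f)
    return 10 ** (target - 1.0)


if __name__ == "__main__":
    # d1 mentions the term once, so its weight is 1; tripling it needs a weight of 3.
    print(frequency_for_multiple(1, 3))
```

With one mention in d1 the weight is 1, so the target weight is 3 and log10(f) must equal 2.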

Is Vector Space currently in use by modern search engines?

Suggesting the contrary is nonsense. Vector Space models are used on a regular basis to score and rank documents. Implementation across large collections is not that hard if you use the right scoring system with updating and precaching techniques on a term-doc matrix. In fact, this Spring I’ll be teaching the graduate course Search Engines Architecture.
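For readers who have not seen vector space scoring, here is a minimal sketch of ranking by cosine similarity over the columns of a tiny term-doc matrix. The weights, document ids, and function names are made up for illustration, and no updating or precaching is shown; it is not a description of how any particular search engine does it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, doc_vectors):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scores = {doc_id: cosine(query, vec) for doc_id, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical term weights over the terms ("keyword", "density", "search").
DOCS = {
    "d1": [1.0, 1.0, 0.0],
    "d2": [0.0, 1.0, 2.0],
    "d3": [3.0, 0.0, 0.0],
}
QUERY = [0.0, 0.0, 1.0]  # a query mentioning only "search"

if __name__ == "__main__":
    for doc_id, score in rank(QUERY, DOCS):
        print(doc_id, round(score, 3))
```

Both the documents and the query are treated as vectors of term weights, which is the point of exam question 10.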

I will blog the syllabus tomorrow, but it is already available from the Electrical & Computer Engineering and Computer Science Department of PUPR.edu. This is a lecture and lab session course. Students will build their own search engines, crawlers, parsers, stemmers, and vector space scoring systems using open source components and some components of their own.

On and on, SEOs still have no clue about what a search engine can or cannot do.

Web Mining Week 11

February 11, 2008

Week 11 Agenda

Web Mining Course Final Examination

Please be on time and bring with you:

1. a #2 pencil and eraser. Pens are not allowed.
2. a scientific calculator. PCs, laptops, software are not allowed.

Notes, books, or additional material are not allowed.

The test is based on material covered in the lectures, take-home tests, and reading material provided. It consists of two parts, as follows:

Part 1 consists of multiple choice exercises (40%).
Part 2 consists of problem solving exercises (60%).

Please review how to do linear algebra calculations.

LSI data will be provided so you won’t need to use an SVD calculator during the test.

Good luck and thank you for taking this graduate course.

Cheers

Dr. E. Garcia

Web Mining Week 10

February 4, 2008

Week 10 Agenda 

Intelligence through JavaScript Spam Techniques: From Doorways to Click Throughs Fraud (PPT presentation) - Covers several scripts for injecting redirection mechanisms into the Web surfing experience of end users. Also includes click fraud scripts and scripts for faking Web traffic.

Co-Occurrence Theory: From C-Indices to EF-Ratios (PPT presentation) - Covers the latest on co-occurrence theory in relation to c-indices and EF-Ratios and how to use these metrics for site optimization and keyword-brand association campaigns.

Review of Test 3 and Bonus.

Required Reading Material

1. http://airweb.cse.lehigh.edu/2007/papers/paper_115.pdf

2. http://www.miislita.com/semantics/c-index-2.html

Web Mining Week 9

January 28, 2008

Week 9 Agenda

Intelligence Searching for Penetration Testers (PPT Presentation)
Searching for Terrorist Threats and Identity Thefts, the SSN Way (PPT Presentation)
Mining VIN numbers, Email Headers, and other Undocumented Commands (PPT Presentation)

Required Reading Material

Provided during lecture.

Association & Scalar Clusters Tutorial - Part 1

January 22, 2008

I am writing a tutorial series on Cluster Analysis. It is my pleasure to announce that the Association and Scalar Clusters Tutorial - Part 1: Back Mapping Term Clusters to Documents was uploaded a few days ago.

Online publication was announced in advance to subscribers of IR Watch - The Newsletter, so they already have an edge over regular readers and visitors of Mi Islita.

Abstract follows:

In this tutorial you will learn how to extract association and scalar clusters from a term-document matrix. A “reaction” equation approach is used to break down the classification problem to a sequence of steps. From the initial matrix, two similarity matrices are constructed, and from these association and scalar clusters are identified. A back mapping technique is then used to classify documents based on their degree of pertinence to the clusters. Matched documents are treated as distributions over topics. Applications to topic discovery, term disambiguation, and document classification are discussed.
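To give a flavor of the two similarity matrices mentioned in the abstract, here is a minimal sketch that follows the usual textbook definitions: association values are term-term correlations computed from the term-document matrix and normalized as c_uv / (c_uu + c_vv - c_uv), and scalar values are cosines between rows of that normalized matrix. The tiny matrix and the function names are assumptions for illustration; the back mapping step and the exact procedure of the tutorial are not reproduced here.

```python
import math


def association_matrix(term_doc):
    """Unnormalized term-term correlations: c[u][v] = sum_j f[u][j] * f[v][j]."""
    n = len(term_doc)
    return [
        [sum(a * b for a, b in zip(term_doc[u], term_doc[v])) for v in range(n)]
        for u in range(n)
    ]


def normalize(c):
    """Normalized association values: s[u][v] = c_uv / (c_uu + c_vv - c_uv)."""
    n = len(c)
    s = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            denom = c[u][u] + c[v][v] - c[u][v]
            s[u][v] = c[u][v] / denom if denom else 0.0
    return s


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def scalar_matrix(s):
    """Scalar values: cosine between the rows (neighborhoods) of two terms."""
    n = len(s)
    return [[cosine(s[u], s[v]) for v in range(n)] for u in range(n)]


# Hypothetical term-document matrix: 3 terms (rows) by 3 documents (columns).
TERM_DOC = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 2],
]

if __name__ == "__main__":
    assoc = normalize(association_matrix(TERM_DOC))
    scalar = scalar_matrix(assoc)
    print(assoc[0][1], round(scalar[0][1], 3))
```

In this toy matrix the first two terms share a document, so they end up associated with each other and unrelated to the third term.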

During last night’s lecture (Web Mining Course), I applied the back mapping technique to scalar clusters generated from LSI. The technique provides additional information and reasons as to how and why documents score as observed after implementing SVD. A clear connection with Fuzzy Set Theory was made.

Students taking the Web Mining Course will find this tutorial quite handy.

Web Mining Week 8

January 21, 2008

Week 8 Agenda

Take-Home Work 3 and Web Mining Course FAQs
LSI and Scalar Cluster Analysis: An EXCEL Spreadsheet Approach (PPT presentation)
LSI and Fuzzy Sets = Fuzzy LSI
Introduction to Intelligence Searches (PPT presentation)
Bonus: My IPAM Lost Pictures at the 2006 Document Indexing Workshop

Required Reading Material

http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf 
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf

Global Term Weights based on Entropies

January 16, 2008

A grad student taking the Web Mining, Search Engines, and Business Intelligence course asked me to clarify global weights G defined as entropies.

Global weights based on entropies are frequently combined with local and normalization weights into overall weights.  These are then used to populate a term-doc matrix. The matrix can be used with term vector models to rank documents. The same matrix can be decomposed with SVD (LSI) and used to rank documents.

The following set of equations defines the global entropy weight of term i in a collection of just 3 documents (N=3). I am providing two extreme cases:

[Figure: Global Entropy Weights]

Evidently,

G = 0 if the term is equally mentioned in all documents of the collection.
G = 1 if the term is present in just one document.

Any other combination of frequencies yields G values somewhere between 0 and 1. Thus, the model gives higher weights to terms that are concentrated in a small number of documents, while lowering the weights of terms that are spread evenly across the collection.

Note that, by convention, p log p is set to 0 when p = 0 (log 0 is undefined, but p log p tends to 0 as p approaches 0). When p = 1, p log p = 0 directly, since log 1 = 0.
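The two extreme cases can be verified with a few lines of code. The sketch below assumes the usual form of the entropy weight, G = 1 + (sum over documents j of p_j log p_j) / log N, where p_j is the term's frequency in document j divided by its total frequency in the collection. That form reproduces both G = 0 and G = 1 above; the function name and the sample frequencies are made up for illustration.

```python
import math


def entropy_global_weight(freqs):
    """Entropy-based global weight of one term.

    `freqs` holds the term's frequency in each of the N documents.
    Assumed form: G = 1 + sum_j(p_j * log p_j) / log N, with p_j = f_j / sum(f).
    """
    n_docs = len(freqs)
    total = sum(freqs)
    if n_docs < 2 or total == 0:
        raise ValueError("need at least two documents and one occurrence")
    acc = 0.0
    for f in freqs:
        if f > 0:  # convention: p log p = 0 when p = 0
            p = f / total
            acc += p * math.log(p)
    return 1.0 + acc / math.log(n_docs)


if __name__ == "__main__":
    print(entropy_global_weight([2, 2, 2]))  # equally mentioned in all documents
    print(entropy_global_weight([5, 0, 0]))  # present in just one document
    print(entropy_global_weight([3, 1, 0]))  # somewhere in between
```

The base of the logarithm does not matter here, because it cancels between the numerator and log N.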

Web Mining Week 7

January 14, 2008

Week 7 Agenda

Review of Association and Scalar Clusters
Review of Vector Space Models
LSI & SVD: Demystifying LSI SEO Myths (OJOBuscador Congress, Madrid; PDF Presentation)
LSI & Keyword Research (PDF Presentation)
SVD Noise Filtering: Principal Component Analysis (PCA)

Required Reading Material

Tutorial Series
This is part one of a five-part tutorial series:
http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html

Fast Tracks
These are quick tutorials, with to-the-point calculations:
http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf

Blog Posts
These are IR blog posts designed to fight back against misinformation promoted by unethical SEOs and Spammers:
http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/
http://irthoughts.wordpress.com/page/1/?s=lsi
http://irthoughts.wordpress.com/page/2/?s=lsi

Blog Category
This is a blog category pointing to a collage of posts that demystify SEO nonsense about LSI. Some are about topics that overlap with LSI:
http://irthoughts.wordpress.com/category/latent-semantic-indexing/   

Web Mining and Search Engines Architecture Courses

January 11, 2008

Winter back to school.

Here is the schedule for the Web Mining, Search Engines, and Business Intelligence graduate course for the next weeks.

Jan 14 - LSI and SVD: A hands-on approach. Covers SEO LSI Myths

Jan 21 - Intelligence Searching: Ethical hacking and penetration testing with search engines

Jan 28 - Spam Intelligence: Ethical Spamming, spamdexing, and Adversarial IR strategies

Feb 4 - On-Topic Analysis and Co-Occurrence Theory

Feb 11 - TBA

Next Spring I will be teaching the advanced graduate course Search Engines Architecture.

This is a hands-on course where students will spend most of the time in the Software Testing Lab. We will build crawlers, databases, parsers, search interfaces, etc. Students doing or interested in working on projects/theses with me are encouraged to take the course.

Web Mining Week 6

December 17, 2007

Week 6 Agenda

Revisiting Understanding Ranking Algorithms (PPT Presentation)

Local Term Weight Models (PPT Presentation)
  Discussion of FREQ, BNRY, LOGN, LOGA, ATF1, SQRT, and other models.

Global Term Weight Models (PPT Presentation)
  Discussion of IDF, Prob IDF, Entropic, and other models.

Required Reading Material

http://csmr.ca.sandia.gov/~tgkolda/pubs/ornl-tm-13756.pdf  

Web Mining Week 5

December 10, 2007

Week 5 Agenda 

Student oral presentations on the search engine result pages experiment: Extracting keyword distribution patterns from search engine ranked documents

Discussion of tools developed by students: 

1. Binary Search Interface

2. HTML Cruncher

Term Weight Models – Understanding Ranking Algorithms (PPT Presentation) 

Required Reading Material

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html  

http://www.miislita.com/term-vector/term-vector-1.html

http://csmr.ca.sandia.gov/~tgkolda/pubs/ornl-tm-13756.pdf

Web Mining Week 4

December 3, 2007

Week 4 Agenda 

1. Association and Scalar Clusters (PPT Presentation).

2. Applications: Keyword Discovery through a Similarity and Co-Occurrence Matrix.

3. Notes About Vectors (PPT Presentation).

4. Practice Exercise: Extracting Keyword Clusters from Google’s Search Results. 

Required Reading Material 

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html

http://www.miislita.com/term-vector/term-vector-1.html

Web Mining Week 3

November 26, 2007

Week 3 Agenda

1. Introduction to Parsing: Building a Query Normalizer (PPT Presentation).
2. Understanding Search Engine Snippets via the SES 2005, San Jose presentation:
Patents on Duplicated Content (PDF Presentation).
3. Demonstration of Snippy Software: A Snippet Generator and Constrain Searcher.
4. Introduction to Association and Scalar Clusters: Keyword Clustering through a Similarity Matrix (PDF Presentation).

Required Reading Material

http://www.grammarbook.com/punctuation/apostro.asp
http://owl.english.purdue.edu/handouts/grammar/g_apost.html
http://tartarus.org/~martin/PorterStemmer/def.txt
http://www.json.org
http://www.json.org/js.html
http://www.crockford.com/javascript/javascript.html
http://www.miislita.com/search-engine-conferences/duplicated-content-patents.html
  Note: The Word and PDF versions of this talk are no longer available on the Web.

Bonus for Take-Home Work 1

This bonus is available to individual students, not to student groups. Each student must provide two separate html working scripts: (a) one containing the binary interface and (b) one containing the cruncher.

After learning this week how to build the search interface of a query normalizer, complete the following:

1. Modify this search interface so that it only accepts binary data (1’s or 0’s).
2. Modify the search interface so that it becomes an HTML cruncher application. Add customized prototype methods so that the cruncher removes tabs, carriage returns, newlines, and unnecessary white space found in a typical HTML document. In addition, it should be non-invasive; i.e., it should not affect the functionality of HTML documents containing scripts, CSS instructions, or comment lines. You might need to retest the cruncher thoroughly with real documents from the Web, preferably with spam documents and with the source code of emails. To grab the source of an Outlook Express email, open an email you have received and navigate to:

File > Properties > Details > Message Source

Then, right-click the message source and click Select All to highlight everything, and right-click again to click Copy. You can also press Ctrl+A to select all and Ctrl+C to copy. Paste the source into your HTML cruncher and crunch it.
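The two bonus items can be outlined as follows. The assignment asks for browser scripts, so the Python below is only a language-neutral sketch of the idea and not a model answer. The function names are hypothetical, and the list of protected blocks (script, style, pre, and comments) is an assumption about what "non-invasive" should cover.

```python
import re

# Blocks whose contents are left untouched (assumed list; extend as needed).
PROTECTED = re.compile(
    r"(<script\b.*?</script>|<style\b.*?</style>|<pre\b.*?</pre>|<!--.*?-->)",
    re.IGNORECASE | re.DOTALL,
)


def is_binary(text: str) -> bool:
    """Item 1: accept only input made of 1's and 0's."""
    return re.fullmatch(r"[01]+", text) is not None


def crunch(html: str) -> str:
    """Item 2: collapse tabs, carriage returns, newlines, and extra spaces.

    re.split with one capturing group alternates unprotected text (even
    indices) with protected blocks (odd indices), which are copied verbatim.
    """
    parts = PROTECTED.split(html)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)
        else:
            part = re.sub(r"[\t\r\n]+", " ", part)
            out.append(re.sub(r" {2,}", " ", part))
    return "".join(out).strip()


if __name__ == "__main__":
    sample = "<p>Hello\n\t  world</p>\n<script>\nvar a = 1;\n</script>"
    print(crunch(sample))
    print(is_binary("010011"), is_binary("0120"))
```

A regular expression is a blunt tool for HTML, so testing against real pages and email sources, as the assignment suggests, is the only way to see where a sketch like this breaks.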

Web Mining Week 2

November 19, 2007

Week 2 Agenda:

1. The User-Machine Relevance Perception Gap (PPT presentation)
2. Introduction to Document Indexing (PPT presentation)
3. Linearization: markup removal
4. Tokenization: punctuation removal
5. Filtration: stopword removal
6. Stemming: suffix/prefix removal
7. Tools to approximate document linearization
8. Demonstration of Minerazzi software
9. Take-Home Work 1: Document Gap Analysis
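The four indexing steps in the agenda (linearization, tokenization, filtration, stemming) can be sketched as a small pipeline. This is a toy version under stated assumptions: the stopword list is a tiny made-up sample, markup removal is a plain tag-stripping regex that ignores script contents, and the stemmer only strips a few suffixes, so it is not the Porter algorithm and does no prefix removal.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}  # tiny sample list
SUFFIXES = ("ing", "ed", "es", "s")  # toy suffix list, not a real stemmer


def linearize(html: str) -> str:
    """Linearization: replace markup tags with spaces."""
    return re.sub(r"<[^>]+>", " ", html)


def tokenize(text: str) -> list:
    """Tokenization: lowercase and keep alphanumeric runs, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_stopwords(tokens: list) -> list:
    """Filtration: drop stopwords."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token: str) -> str:
    """Stemming (toy): strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(html: str) -> list:
    return [stem(t) for t in filter_stopwords(tokenize(linearize(html)))]


if __name__ == "__main__":
    print(index_terms("<p>The crawlers are <b>parsing</b> documents.</p>"))
```

The output is the stream of index terms left after all four steps, which is a rough approximation of the linearized document rather than the page a user sees.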

Required Reading Material

IR Watch Newsletter; 2007-6: The User-Machine Relevance Perception Gap - This is a free newsletter back issue, available only for students taking the course.
http://www.useit.com/alertbox/reading_pattern.html
http://psychology.wichita.edu/surl/usabilitynews/91/eyegaze.html
http://www.miislita.com/fractals/keyword-density-optimization.html
http://irthoughts.wordpress.com/2007/05/09/keyword-density-the-devils-advocate/
http://irthoughts.wordpress.com/2007/05/07/keyword-density-kd-revisiting-an-seo-myth/
https://www.google.com/adsense/support/bin/answer.py?answer=17954
http://www.miislita.com/information-retrieval-tutorial/indexing.html
http://www.dcs.qmul.ac.uk/~mounia/CV/Papers/ker_ruthven_lalmas.pdf

Web Mining Week 1

November 12, 2007

Course Description 

The CECS 6824B/21 Special Topics in KDDM graduate course Web Mining: A First Course in Web Mining, Search Engines, and Business Intelligence (Department of Computer Engineering & Computer Sciences of Polytechnic University) starts today.

Syllabus: Available at http://www.pupr.edu/pdf/Web-Mining.pdf
Time: Monday, 6:30PM – 10:30PM
Location: Turing Laboratory, Room 301
These are also office hours; those working on projects and theses with me can attend.

Grading System 

Grading: Take-Home Work and Final Exam

Three partial take home tests. The lowest score is dropped.

The following scale is used to score a final grade G:

 G = (F)(w) + (ave P)(1 - w)

Where F is the score of the Final Exam and ave P is the average Partial score defined as follows:

ave P = (Bonus points + sum of two highest partial tests)/2

w = weight factor to curve scores.
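The grading formula can be worked through with a short example. All the scores and the value of w below are hypothetical, chosen only to show the arithmetic; the function name is also an assumption.

```python
def final_grade(final_exam: float, partials: list, bonus: float, w: float) -> float:
    """G = F*w + (ave P)*(1 - w), where ave P uses the two highest partial tests."""
    top_two = sorted(partials, reverse=True)[:2]  # the lowest score is dropped
    ave_p = (bonus + sum(top_two)) / 2
    return final_exam * w + ave_p * (1 - w)


if __name__ == "__main__":
    # Hypothetical student: final exam 80, partial tests 70, 90, 60, no bonus, w = 0.5.
    print(final_grade(80, [70, 90, 60], bonus=0, w=0.5))
```

In this made-up case the 60 is dropped, ave P is 80, and the final grade is 80.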

General Instructions for working in groups:

1. After completion of a project, each group will conduct a 15-minute PPT or PDF presentation. You only need to explain your results and main findings.
2. The day of the presentation the group should submit a written 2-page max report in English or Spanish, in a Word or PDF format. Use a 10-point Arial font and a single space, 1”-margin format.
3. The 2-page report should consist of a centered title, followed by co-author names, and a 50-word max abstract, a one-paragraph Introduction, Procedure (referenced), Results, Conclusion, and a Reference.
4. Raw data, figures, tables, or codes, if any, should be referenced (e.g., “See Figure 1.”) and appended in separate pages as an Appendix section. Number each of these (e.g., Figure 1, Table 1, etc.) and add a descriptive caption to each one.

If you are a student, read this blog (http://irthoughts.wordpress.com) for announcements and updates under the Web Mining Course category (http://irthoughts.wordpress.com/category/web-mining-course/).

Web Mining is a hands-on course; thus, all weekly agendas are tentative, flexible, and can be extended or shortened according to class needs. Posts at this blog will reflect these changes.

Week 1 Agenda:

1. Overview of the course
2. Falling in Love with Web Mining:
  A Brief History of the Internet and Search Engines (PPT presentation).
3. Search Engines and Search Marketing (PDF presentation)

Required reading material

http://www.cienciapr.org/news_view.php?id=711
http://irthoughts.wordpress.com/2007/06/29/a-week-before-greatness/
http://www.computerhistory.org/internet_history/
http://www.searchenginehistory.com/
http://searchenginewatch.com/showPage.html?page=3422781

Optional reading material:

It might help later in the course if you start reading a bit about building search, match, and replace applications with regular expressions. Feel free to use your favorite programming language. Any programming flavor is fine, as long as your applications can be interpreted by a browser (e.g., IE, Firefox, etc.). Keep in mind that this is a hands-on course.