IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Category Archives: IR Tools

The Binary Similarity Calculator

08 Friday Mar 2013

Posted by egarcia in Data Mining, IR Tools, Marketing Research, Software, Statistics and Mathematics

We have just launched The Binary Similarity Calculator, a new tool for computing binary-based similarity measures.

What it is

The Binary Similarity Calculator (BSC) can be used to compare binary sets, that is, groups consisting of only two types of items or states. These are item sets that can be represented as sequences of 1s and 0s.
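As an illustration of what comparing two binary sets can look like, here is a minimal sketch. The post does not say which measures the BSC implements, so the two shown here (the Jaccard coefficient and the simple matching coefficient) are assumptions chosen only because they are common textbook measures; the function names are hypothetical.

```python
def binary_counts(x, y):
    """Return the agreement counts (a, b, c, d) for two equal-length 0/1 sequences.

    a = positions where both are 1, b = 1 in x only,
    c = 1 in y only, d = positions where both are 0.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)
    return a, b, c, d


def jaccard(x, y):
    """Jaccard coefficient: shared 1s over positions where at least one is 1."""
    a, b, c, _ = binary_counts(x, y)
    if a + b + c == 0:
        return 1.0  # assumption: two all-zero sequences are treated as identical
    return a / (a + b + c)


def simple_matching(x, y):
    """Simple matching coefficient: all agreeing positions over all positions."""
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)


# Two Yes/No questionnaires encoded as 1s and 0s (made-up data).
answers_1 = [1, 1, 0, 0, 1]
answers_2 = [1, 0, 0, 1, 1]
print(jaccard(answers_1, answers_2))
print(simple_matching(answers_1, answers_2))
```

The two measures differ in whether shared 0s count as agreement, which is why the choice of measure matters for things like Yes/No questionnaires.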

Who can benefit from it

• Marketing analysts who need to examine Yes/No-type questionnaires about products and services.

• Teachers and examiners who must score Yes/No-type exams or assess plagiarism cases.

• Engineers, mathematicians, and physicists who must evaluate On/Off-type records.

• Statisticians, bioanalysts, and others involved with sequencing analysis.

• To sum up, anyone who uses binary sets.

Reconstruction and Iteration of Windows 16-Color VGA Palette

22 Friday Feb 2013

Posted by egarcia in Data Mining, Fractal Geometry, IR Tools, Programming

With The Color Miner, we have programmatically reconstructed the classic Windows 16-color VGA palette using a few basic algorithmic rules.

We also found that iterating the 16-color VGA palette with these rules converges to a 42-color palette, as given below.

Advantages?

The algorithms utilized allow one to:

  • reconstruct large palettes with a small set of seed colors.
  • store a small set of colors instead of a large palette file.
  • build basic palette generators and color tools.
  • use an initial palette to discover colors or propose new ones, then use these to expand the initial palette.

For additional information and to verify these findings, visit The Color Miner page.

The Color Miner – A Tool for Mining Colors

21 Thursday Feb 2013

Posted by egarcia in Data Mining, IR Tools, Marketing Research, Software

≈ 1 Comment

The Color Miner is a tool for mining colors from Web documents.

The traditional way of presenting color palettes to users is by rendering these as static arrays of colors. This limits users to staring at color squares to make color-color comparisons instead of engaging them in data mining and critical thinking, activities that promote discovery and learning in a research or school setting.

When we developed The Color Miner, we did so with a fractal design strategy in mind. As a result, we developed a tool capable of generating what we call fractalettes or palettes within palettes. That is, each cell of a generated palette behaves as a smaller palette, containing color space information and relationships for the current color.

We could iterate the individual cells forever, but in practice we found that a one-level iteration was a good start to encourage users to investigate color-color, space-space, and color-space relationships, and to find basic trends and information patterns. In general, this architecture can be iterated to organize additional attributes and relational data.

Launching The Net Miner Tool Set

21 Monday Jan 2013

Posted by egarcia in Data Mining, Hacking, IR Tools

As part of the migration of Mi Islita to a new home at http://www.miislita.com, I’m happy to announce the initial release of The Net Miner Tool Set, v. 1.

The tool set has been around for about a week with a few speed issues, but now you can enjoy it for good. So, what can you do with it? Well, visit the site, try it, and let me know if you like it or if there is room for improvement.

With The Net Miner Tool Set, anyone can now do some basic network security tests. You would be surprised to learn how many sites are exposing things like php.ini files, making life easy for attackers, or leaking unnecessary information in their configuration headers.
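To give an idea of what "leaking information in headers" means in practice, here is a minimal sketch of a header check. It is not how The Net Miner Tool Set works internally; the list of header names and the "contains a version digit" heuristic are assumptions made for illustration, and the function names are hypothetical. Only run the fetch helper against sites you are authorized to test.

```python
import urllib.request

# Assumption: a small, illustrative list of response headers that often
# reveal server software or versions. It is not exhaustive.
LEAKY_HEADERS = ("server", "x-powered-by", "x-aspnet-version", "x-aspnetmvc-version")


def find_leaky_headers(headers):
    """Return the headers that appear to disclose software version details.

    `headers` is a plain dict of header name -> value. A header is flagged
    when its name is in LEAKY_HEADERS and its value contains a digit
    (a crude stand-in for "mentions a version number").
    """
    flagged = {}
    for name, value in headers.items():
        if name.lower() in LEAKY_HEADERS and any(ch.isdigit() for ch in value):
            flagged[name] = value
    return flagged


def fetch_headers(url, timeout=10):
    """Fetch the response headers for `url` as a plain dict."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return dict(response.headers.items())


# Offline example with made-up header values.
sample = {
    "Server": "Apache/2.2.3 (CentOS)",
    "X-Powered-By": "PHP/5.2.17",
    "Content-Type": "text/html",
}
print(find_leaky_headers(sample))
```

Keeping the check separate from the fetch makes it easy to test the logic on saved headers without touching the network.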

Why time spent on a site is so important?

13 Tuesday Nov 2012

Posted by egarcia in IR Tools, Marketing Research, Miscellaneous, Software

That is a recurrent question being asked by some of my readers. Here is my answer.

Back in 1995, I wrote in the Dedication section of my doctoral thesis:

“If I have a theory, but no experimental results, I may have nothing. And if I have a theory without practical applications, I may have an artifact.”

So, don’t give your visitors hearsay, half-lies, or misrepresentations of facts found across the Web; give them things that they can really test and use, and that solve a real or urgent problem for them. Don’t waste your time repeating concepts that are interesting, perhaps catchy, but at the end of the day just useless.

In addition to textual and audiovisual content of good quality, give them TOOLS. However, provide tools that make them spend more time interacting with your site and that authoritative pages will recommend or link to.

This is important because the amount of time spent by users on a site is directly correlated with several web metrics/analytics, such as:

  • frequency cap – restriction on the number of times a specific visitor is shown a particular advertisement.
  • stickiness – the amount of time spent at a site over a given time period.
  • underdelivery – delivery of fewer impressions, visitors, or conversions than contracted for a specified period of time.
  • unique visitors – individuals who have visited a site (or network) at least once during a fixed time frame.
  • bandwidth – how much data (e.g., content, ads, creatives) can be transmitted in a time period over a communication channel, often expressed in kilobits per second (kbps). Data is any alphanumeric content. This includes parameters, variables or any text/pixel-based creative.

Other time-based metrics inherited from traditional media (TV, radio) and that are based on the time spent by users viewing a communication channel can be applied to web channels and sites; among others:

  • average audience – the average number of people who tuned in during the selected time period, expressed in thousands or as a percentage (also known as a Rating) of the total potential audience of the selected demographic. It is also known as a T.A.R.P. (Targeted Audience Rating Point).
  • channel share – the share one channel has of all viewing for a particular time period. The share, expressed as a percentage, is calculated by dividing the channel’s average audience by the average audience of all channels (PUTs). It is held in higher esteem by networks than by media buyers on a day-to-day basis; the latter refer to it only when apportioning budgets and evaluating a programme for sponsorship.
  • cumulative audience or reach – the total number of different people within the selected demographic who tuned into the selected time period for 8 minutes or more (i.e., reached at least once by a specific schedule or advertisement).
  • frequency – the average number of times that a person within the target audience has had the opportunity to see an advertisement over the campaign period.
  • time spent viewing or TSV – how many minutes/hours an audience has viewed a particular channel.
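Two of the definitions above are simple ratios, so a short worked sketch may help. The function names and the audience figures are made up for illustration; only the formulas come from the definitions in the list.

```python
def rating(average_audience, potential_audience):
    """Average audience as a percentage of the total potential audience."""
    return 100.0 * average_audience / potential_audience


def channel_share(channel_average_audience, all_channels_average_audience):
    """Channel's average audience as a percentage of the audience of all channels."""
    return 100.0 * channel_average_audience / all_channels_average_audience


# Hypothetical figures, in thousands of viewers.
print(rating(250, 5000))         # channel audience vs. everyone in the demographic
print(channel_share(250, 1000))  # channel audience vs. everyone watching any channel
```

The same channel can have a small rating and a large share at once, because the two ratios use different denominators.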

[Sources: WebSiteMagazine, WebMediaSolutions, NielsenMedia].

So, any tool that helps your visitors wisely increase the time they spend on your site, in an effective manner of course, cannot hurt you. For this to be true, however, the tool must be engaging, useful, and effortless, with a minimal learning curve; otherwise the experience can be frustrating for your visitors and a waste of their time.

New Whois User Interface

25 Tuesday Oct 2011

Posted by egarcia in Data Mining, IR Tools, Programming

Last night we uploaded a new user interface (UI) for the Minerazzi Multiple Whois Miner (http://www.minerazzi.com/labs/whois.php).

Added support for:

1. generic top-level domains (gTLDs).

2. country-code TLDs.

3. subdomain TLDs.

4. state persistence of form fields (without cookies, sessions, or JavaScript; just pure PHP).

As we keep improving the tool and adding new TLDs and whois servers to its index, we expect it to become a destination for our regular users.

The tool was designed in such a way that even support for the upcoming dotBrand Revolution is possible.

Enjoy it!

The Scope Hypothesis in IR: Who is Right?

13 Saturday Aug 2011

Posted by egarcia in Data Mining, IR Tools, IR Tutorials, Queries

≈ 8 Comments

In previous posts, we have presented two tutorials on Okapi BM25 and BM25F, which are based on the Verbosity and Scope Hypotheses.

However…

Here I would like to reference research at both sides of the Scope Hypothesis.

In the abstract of “Revisiting the relationship between document length and relevance” (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.3786&rep=rep1&type=pdf), Losada, D.E., Azzopardi, L. and Baillie, M. (2008) state:

“The scope hypothesis in Information Retrieval (IR) states that a relationship exists between document length and relevance, such that the likelihood of relevance increases with document length. A number of empirical studies have provided statistical evidence supporting the scope hypothesis. However, these studies make the implicit assumption that modern test collections are complete (i.e. all documents are assessed for relevance). As a consequence the observed evidence is misleading. In this paper we perform a deeper analysis of document length and relevance taking into account that test collections are incomplete. We first demonstrate that previous evidence supporting the scope hypothesis was an artefact of the test collection, where there is a bias towards longer documents in the pooling process. We evaluate whether this length bias affects system comparison when using incomplete test collections. The results indicate that test collections are problematic when considering MAP as a measure of effectiveness but are relatively robust when using bpref. The implications of the study indicate that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.”

Really….?

However in the abstract of “Enhancing ad-hoc relevance weighting using probability density estimation” (http://www.sigir2011.org/papershow.asp?PID=104), Zhou, Huang, and He (2011) state:

“Classical probabilistic information retrieval (IR) models, e.g. BM25, deal with document length based on a trade-off between the Verbosity hypothesis, which assumes the independence of a document’s relevance of its length, and the Scope hypothesis, which assumes the opposite. Despite the effectiveness of the classical probabilistic models, the potential relationship between document length and relevance is not fully explored to improve retrieval performance. In this paper, we conduct an in-depth study of this relationship based on the Scope hypothesis that document length does have its impact on relevance. We study a list of probability density functions and examine which of the density functions fits the best to the actual distribution of the document length. Based on the studied probability density functions, we propose a length-based BM25 relevance weighting model, called BM25L, which incorporates document length as a substantial weighting factor. Extensive experiments conducted on standard TREC collections show that our proposed BM25L markedly outperforms the original BM25 model, even if the latter is optimized.”
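For readers who want to see where document length enters the picture, here is a minimal sketch of the term-frequency factor of textbook BM25. This is the classic formulation, not the BM25L or BM25F variants discussed in the papers above, and the default values k1 = 1.2 and b = 0.75 are commonly quoted choices assumed here for illustration. The function name is hypothetical.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency factor with document-length normalization.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document
    avg_doc_len -- average document length in the collection
    k1          -- term-frequency saturation parameter
    b           -- strength of length normalization (0 = none, 1 = full)
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Same term frequency, different document lengths (made-up numbers).
short_doc = bm25_tf_component(tf=3, doc_len=100, avg_doc_len=200)
long_doc = bm25_tf_component(tf=3, doc_len=400, avg_doc_len=200)
print(short_doc, long_doc)

# With b = 0 the length normalization is switched off entirely.
print(bm25_tf_component(3, 100, 200, b=0.0), bm25_tf_component(3, 400, 200, b=0.0))
```

With b above zero, the same term count is worth less in a longer document; with b at zero, length is ignored. The parameter b is therefore where the trade-off between the two hypotheses is tuned.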

My take…

I haven’t reviewed BM25L vs. BM25F yet. Still, the question of the Scope Hypothesis is intriguing. From what I can tell (and this is solely my opinion), if an author writes more about a topic or several topics in a given document, he will more likely be using more instances of index terms. A cluster of the top index term density values (IDs) spread over said document should give some insight into its scope. We have developed a tool that computes these clusters. We are now testing whether that translates into improved relevance.

Assuming that Web IR systems out there (e.g., search engines) use these algorithms or derivatives of these: What would be the implications for content writers trying to understand algos based on the Verbosity and Scope Hypotheses? Hello, copywriters, SEOs, etc. This puppy is nice to watch.

The minerazzi crawler: now crawling script tags and few more tags

29 Friday Apr 2011

Posted by egarcia in Data Mining, IR Tools, Programming, Software

We are pleased to enable new crawling capabilities for the minerazzi web crawler:

04-29-11: Script tags detection capabilities added.
04-22-11: DocType, Base, and Link tags detection capabilities added.

If you are a Web developer, these new features of our crawler can help you to mine programming “gems” by examining, isolating, and collecting script lines that have been embedded in the source code of documents. And by crawling or accessing URLs of external files or link tags, programmers can view and dissect hidden scripts.
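The idea of isolating script lines and link tags from a document's source can be sketched with Python's standard html.parser module. This is only an illustration of the technique, not the minerazzi crawler's code; the class name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ScriptAndLinkCollector(HTMLParser):
    """Collect inline scripts, external script URLs, and link tag hrefs."""

    def __init__(self):
        super().__init__()
        self.inline_scripts = []    # code found between <script> and </script>
        self.external_scripts = []  # values of <script src="...">
        self.link_hrefs = []        # values of <link href="...">
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script":
            if attributes.get("src"):
                self.external_scripts.append(attributes["src"])
            self._in_script = True
            self._buffer = []
        elif tag == "link" and attributes.get("href"):
            self.link_hrefs.append(attributes["href"])

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            code = "".join(self._buffer).strip()
            if code:
                self.inline_scripts.append(code)
            self._in_script = False


# Made-up document source.
page = (
    '<html><head><link rel="stylesheet" href="/css/site.css">'
    '<script src="/js/app.js"></script>'
    "<script>var x = 1;</script></head><body></body></html>"
)
collector = ScriptAndLinkCollector()
collector.feed(page)
print(collector.inline_scripts)
print(collector.external_scripts)
print(collector.link_hrefs)
```

The external script and link URLs collected this way are the ones a user could then choose to crawl next.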

To learn more about these new features or previously added features, visit

http://www.minerazzi.com/labs/crawlinker.php

New additions to the minerazzi web crawler

18 Monday Apr 2011

Posted by egarcia in Data Mining, IR Tools, Programming, Software

On 03-25-11 we released the minerazzi web crawler and link checker tool. As a beta, it is not perfect and needs improvements. Actually, this is the online version of the crawler used by the minerazzi search architecture (beta). Hopefully, it will evolve into a diagnostic tool.

We mentioned that the online version would undergo several changes, all intended to provide an online “web crawler for the masses”. The idea is to put users in control of the crawling process, since current crawlers lack human intuition with regard to the next URL to crawl from a to-do list.

We are pleased to announce the following changes.

Changelogs
04-19-11: Robots Text File detection capabilities added. Meta data parsing changes. (*)
04-18-11: Title and Meta Tags detection capabilities added. Layout changes.
04-16-11: User Environment detection capabilities added.
04-14-11: Timer capabilities added.
04-10-11: Deduplication capabilities added.
04-09-11: Color palette reporting capabilities added.
04-05-11: DNS and MX reporting capabilities added.
04-03-11: Source code reporting capabilities added.
03-30-11: Relative URL resolving capabilities added.
03-28-11: Hypertext wrapping, ip, and headers reporting capabilities added.

(*) Just added.

There is something for everyone here.

Web Designers: Want to use a color palette from another site or tweak yours? Easy. Launch a crawl to a css file already discovered by the crawler.

Web Developers: Want to see diamonds? View the source of any file, including PDFs.

Data miners:  Need to mine links? Crawl a document. Better: launch a crawl to a site map file already discovered by the crawler.

Researchers: Want to check system configurations? Check IPs, DNS, MX, and header traces (including data from cookies/sessions, etc.).
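Two of the changelog items, relative URL resolving and deduplication, are easy to sketch with the Python standard library. What follows is a minimal illustration of link mining under those two rules, not the crawler's actual implementation; the names and the sample markup are hypothetical, and dropping URL fragments before deduplicating is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collect the raw href values of all <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute, deduplicated http(s) links in order of first appearance."""
    collector = AnchorCollector()
    collector.feed(html)
    seen = set()
    links = []
    for href in collector.hrefs:
        # Resolve relative URLs against the page URL, then drop any #fragment.
        absolute = urldefrag(urljoin(base_url, href)).url
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Made-up page with relative, duplicate, and non-http links.
page = (
    '<a href="../a.html">A</a> <a href="b.html">B</a> '
    '<a href="http://example.com/a.html#top">A again</a> '
    '<a href="b.html">B again</a> <a href="mailto:someone@example.com">Mail</a>'
)
print(extract_links(page, "http://example.com/dir/page.html"))
```

The resulting list is the kind of to-do list from which a user could pick the next URL to crawl.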

More updates are coming soon!

On Power Analysis and SEO Quack Science

21 Thursday Oct 2010

Posted by egarcia in Data Mining, IR Tools, Quack Science, SEO Myths, Statistics and Mathematics

≈ 1 Comment

One of the trickiest aspects of publishing statistical studies is the sample size to be used. Not stipulating a valid procedure for estimating a proper sample size can hurt, for instance, a grant proposal. Ethical committees are concerned about the right number of observations in a study, asking submitters to justify on statistical grounds how they arrived at a given sample size. Research projects with too few or too many observations or no sample size methodology at all often get rejected. This is something those conducting SEO quack “science” don’t seem to understand or are not aware of.

Samples that are too small are unethical, because the researcher cannot be specific enough about the size of, for example, the effect of a drug in a population. Samples that are too large are also unethical, because they represent a waste of funding. It is true that a large sample improves precision, but it might involve an unjustified cost. Stratification is preferred, but it gets too complicated with huge sample sizes, not to mention that statistical significance does not necessarily scale between samples.

As Rahul Dodhia from RavenAnalytics (http://ravenanalytics.com/Articles/Sample_Size_Calculations.htm ) indicates, a 2000-sample might not be very different from a 20000-sample, but a 200-sample may be very different from a 2000-sample, even though in each case the sample ratio is 10. So, a large sample is not always justified, even if such a sample size improves statistical significance and precision.

Consider the case of search engine ranking results. Upon a query, search engines are capable of finding many results, frequently in the range of thousands or millions of results per query. Still, search engines and retrieval systems show users a limited answer set. For instance, Google limits its viewable answer set to a maximum of 1,000 results (100 pages, 10 results/page).

As in most retrieval systems, relevant results accumulate in the first few result pages, forming clusters. This is in agreement with van Rijsbergen’s Cluster Hypothesis, which states that documents that cluster together have a similar relevance to a given query. Moving down the list of search results, one often finds cluster transitions wherein the quality and aboutness of documents are polluted with off-topic content.

Documents buried in a list of results often contain content irrelevant to the initial query or full of spam techniques. If one wants to conduct a statistical study of ranking results versus a particular document feature, one can do better by considering a sample from the first few result pages than from the entire answer set of 1,000 results.

In general, in a non-search engine scenario one cannot just arbitrarily select large samples to “force” the statistical significance of very low correlation coefficients and then use those values to draw conclusions. Furthermore, what is the selection criterion for using 1,000 or 10,000 results?

Simply stated: If 10,000 observations are arbitrarily selected, why not use 100,000 or 1,000,000 instead? We already know that very small correlation coefficients between any arbitrary pair of random variables will be significant at those huge sample sizes anyway. And?

As noted in a Wikipedia entry, “given a sufficiently large sample size, a statistical comparison will always show a significant difference unless the population effect size is exactly zero” (http://en.wikipedia.org/wiki/Effect_size ).

For example, a correlation coefficient of r = 0.04 would be significant at a 95% confidence level if coming from a 10,000-sample (t-calc = 4.003 >> t-table = 1.96) while a correlation coefficient of r = 0.01 would be significant at a 95% confidence level if coming from a 100,000-sample (t-calc = 3.162 >> t-table = 1.96). And? This proves nothing, especially when the magnitude of a “signal” approaches the magnitude of its “noise”.

As noted at the above Wikipedia entry, a correlation coefficient of 0.1 is strongly statistically significant when sample size is 1000, (t-calc = 3.175 >> t-table = 1.96) but reporting only the small p-value from this analysis could be misleading if a correlation of 0.1 is too small to be of interest in a particular application. (http://en.wikipedia.org/wiki/Effect_size ).

Statistical significance of extremely small r values is not surprising, as it is just a mathematical consequence of the fact that a t-value is a function (F) of a weighted ratio: the ratio of explained-to-unexplained variation weighted by the number of degrees of freedom:

F(r, n) = t = SQRT[(r^2/(1 – r^2))*(n – 2)]
F(r, n) = t = r*SQRT[(n – 2)/(1 – r^2)]

For a given r value, increasing n increases t. No surprise here. What a math equation tells you is one thing; what the nature and obvious boundaries of a physical system tell you is quite another.
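The t-values quoted above can be checked with a few lines of code. This is a direct transcription of the second form of the equation; the function name is hypothetical.

```python
import math


def t_value(r, n):
    """t statistic for a correlation coefficient r computed from n observations."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# The three (r, n) pairs discussed in the text.
for r, n in [(0.04, 10_000), (0.01, 100_000), (0.1, 1_000)]:
    print(f"r = {r}, n = {n}: t = {t_value(r, n):.3f}")

# The same tiny r at a modest sample size is nowhere near the 1.96 threshold.
print(f"r = 0.04, n = 100: t = {t_value(0.04, 100):.3f}")
```

The last line makes the point of this section concrete: the correlation did not get any stronger between n = 100 and n = 10,000; only the sample size changed.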

At trivially low r values, any claim with regard to the statistical significance or strength of some results proves nothing, and one cannot do much with such trivial r values. For instance, for r = 0.04, r^2 = 0.0016, meaning that 1 – r^2 = 0.9984, or 99.84% of the variation in the dependent variable (y) is not explained by variation in the independent variable (x).

In such a scenario, assessing the effect of x on y is a futile exercise. Such a model would be useless for drawing conclusions or predicting anything. And here is the point that many SEOs at SEOMOZ (http://www.seo.co.uk/seo-news/seo-tools/the-seomoz-lda-tool-%E2%80%93-our-disappointing-findings.html , Fishkin, Hendrickson, and others elsewhere) don’t seem to grasp:

When a correlation coefficient is useless for all practical purposes.

If the raw data constantly changes, that’s another “Chaos Layer” that compounds the problem.

Enter Cohen’s Power

According to Cohen’s work, when conducting a sample size study of correlation coefficients, one needs to consider the required confidence level and power of the test, the desired probability for Type I and Type II Errors, and the hypothesized or anticipated correlation coefficient (http://www.medcalc.be/manual/correlation_coefficient.php ). One cannot just use an arbitrary sample size for testing things.

In general, given any three of the following, the fourth one can be determined (http://www.statmethods.net/stats/power.html ):

1. sample size
2. effect size
3. significance level = P(Type I error) = probability of finding an effect that is not there
4. power = 1 – P(Type II error) = probability of finding an effect that is there
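As an example of solving for one of the four quantities given the other three, here is a minimal sketch that estimates the sample size needed to detect a hypothesized correlation coefficient. It uses the Fisher z approximation for a two-sided test, which is an assumption on my part; dedicated power analysis software may use exact methods and return slightly different numbers. The function name and default values are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect correlation r (two-sided test).

    Uses the Fisher z approximation:
        n = ((z_(1 - alpha/2) + z_power) / atanh(|r|))^2 + 3
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    n = ((z_alpha + z_power) / math.atanh(abs(r))) ** 2 + 3
    return math.ceil(n)


for r in (0.5, 0.3, 0.1, 0.04):
    print(f"r = {r}: n = {sample_size_for_correlation(r)}")
```

Under these assumptions, detecting r = 0.04 with 80% power takes roughly 4,900 observations. That a sample of that size can detect such an effect says nothing about whether the effect is worth detecting.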

One also needs to consider what is the statistical parameter that is undergoing the power analysis. One needs to ask questions like the following:

Are we testing means from a given group? http://www.nss.gov.au/nss/home.nsf/pages/Sample+Size+Calculator+Description?OpenDocument

Are we testing means from different groups? http://www.ncbi.nlm.nih.gov/pmc/articles/PMC137461/

Are we testing correlation coefficients? Read Simon’s take on the impact of sample size on the desired level of precision in correlation coefficients (http://www.childrens-mercy.org/stats/weblog2005/CorrelationCoefficient.asp ).

Are we interested in significance level, effect size, sample size, or power?

When conducting an effect size analysis, one must keep in mind that effect sizes estimate the strength of a possible relationship rather than assign a significance level. Effect sizes do not determine significance levels, or vice versa.

So, how do we go about implementing Power Analysis?

For those interested in implementing power analysis in the R language, I recommend the libraries at http://www.statmethods.net/stats/power.html

Software for conducting power analysis is also available elsewhere, as shown in the following table. My favorites are G*Power and SPSS SamplePower (http://www.spss.com/software/statistics/samplepower/).

Power Analysis Software (Source: http://www.epibiostat.ucsf.edu/biostat/sampsize.html)

| Software | License | Remarks | URL |
| --- | --- | --- | --- |
| G*Power | Free | Uses both exact and approximate methods to calculate power. It will deal with sample size/power calculations for t-tests, 1-way ANOVAs, regression, correlation, and chi-square goodness of fit. For t-tests and ANOVAs you find the effect size by supplying mean and variance information. For correlation coefficients the effect size is a function of r^2. | http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/ |
| PC-Size | Free | Deals with sample size/power calculations for t-tests, 1-way and 2-way ANOVA, simple regression, correlation, and comparison of proportions. | http://www.esf.edu/efb/gibbs/monitor/usingDSTPLANandPCSIZE.pdf and ftp://ftp.simtel.net/pub/simtelnet/msdos/statstcs/size102.zip |
| DSTPLAN | Free | Uses approximate methods to calculate power. It will calculate sample size/power for t-tests, correlation, a difference in proportions, 2xN contingency tables, and various survival analysis designs. | http://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=41 |
| PS | Free | Performs sample size/power calculations for t-tests, Chi-square, Fisher’s exact, McNemar’s, simple regression, and survival analysis. | http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize |
| TIBCO Spotfire S+ | Paid | The only commercially-supported statistical analysis software that delivers a cross-platform IDE for the award-winning S programming language, the ability to analyze gigabyte class data sets on the desktop, and a package system for sharing, reuse and deployment of analytics in the enterprise and in validated environments. Used widely in validated production environments (e.g., 21 CFR Part 11). | http://spotfire.tibco.com/products/s-plus/statistical-analysis-software.aspx |
| NQuery Advisor | Paid | Performs sample size/power calculations for t-tests, 1-way and 2-way ANOVAs, tests of contrasts in 1-way ANOVAs, univariate repeated measures designs, regression (simple, multiple and logistic), correlation, difference of proportions, 2xN contingency tables, and survival analyses. | http://www.statsol.ie/nquery/nquery.htm |
| PASS | Paid | Performs sample size/power calculations for z-tests, t-tests, 1, 2, and 3-way ANOVAs, univariate repeated measures designs, regression (simple, multiple and logistic), correlations, difference in proportions, 2xN contingency tables, survival analyses and simple non-parametric analyses. | http://www.ncss.com/pass.html |
| Stata | Paid | It has some simple built-in power and sample size functions. | http://www.stata.com/ |
| SPSS SamplePower | Paid | If your sample size is too small, you could miss important research findings. If it’s too large, you could waste valuable time and resources. Finds the right sample size for your research in minutes and tests the possible results before you begin your study. Strikes the right balance among confidence level, statistical power, effect size, and sample size. Compares the effects of different study parameters with its flexible analytical tools. | http://www.spss.com/software/statistics/samplepower/ |