The Binary Distance Calculator, a new tool for computing the distance or dissimilarity (lack of resemblance) between any two binary sets of the same size, is now available at http://www.miislita.com. Its FAQs section includes a clear definition of distance in the context of Information Retrieval and Mathematics.
This tool was developed to complement The Binary Similarity Calculator, one of our popular tools.
Indeed, the AZZOO measure reportedly outperforms all conventional measures in iris biometrics and handwritten character recognition applications.
At least, that’s what is claimed.
We have just launched The Binary Similarity Calculator, a new tool for computing binary-based similarity measures.
What it is
The Binary Similarity Calculator (BSC) can be used to compare binary sets, i.e., groups consisting of only two types of items or states. These are item sets that can be represented as sequences of 1s and 0s.
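For readers who want to see the arithmetic behind such comparisons, here is a minimal sketch (not the calculator’s actual code) that computes the usual 1-1, 1-0, 0-1, and 0-0 contingency counts for two equal-length binary sequences and derives a few well-known coefficients from them. The measures implemented by the calculators may differ.

```python
# Minimal sketch: common binary similarity/distance measures.
# The coefficients actually implemented by the calculators may differ.

def binary_measures(x, y):
    if len(x) != len(y):
        raise ValueError("binary sets must be of the same size")
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # 1-1 matches
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # 1-0 mismatches
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # 0-1 mismatches
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)  # 0-0 matches
    n = a + b + c + d
    return {
        "simple_matching": (a + d) / n,                        # similarity
        "jaccard": a / (a + b + c) if (a + b + c) else 0.0,    # similarity
        "hamming_distance": b + c,                             # dissimilarity
    }

print(binary_measures([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
```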
Who can benefit from it
• Marketing analysts who need to examine Yes/No-type questionnaires about products and services.
• Teachers and examiners who must score Yes/No-type exams or assess plagiarism cases.
• Engineers, mathematicians, and physicists who must evaluate On/Off-type records.
• Statisticians, bioanalysts, and others involved in sequencing analysis.
• To sum up, anyone who uses binary sets.
With The Color Miner, we have programmatically reconstructed the classic Windows 16-color VGA palette with a few basic algorithmic rules.
We also found that, when the 16-color VGA palette is iterated with these rules, the result converges to a 42-color palette, as given below.
The algorithms utilized allow one to do the following (see the sketch after this list):
- reconstruct large palettes with a small set of seed colors.
- store a small set of colors instead of a large palette file.
- build basic palette generators and color tools.
- use an initial palette to discover colors or propose new ones, then use these to expand the initial palette.
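The actual seed colors and rules used by The Color Miner are documented on its page, not in this post. Purely as an illustration of the idea (a handful of seed values plus simple per-channel rules regenerating a full palette), the sketch below rebuilds the classic 16-color VGA palette from the well-known intensity-bit rule; it is not necessarily the rule set The Color Miner uses.

```python
# Illustrative sketch only: one well-known rule set that regenerates the
# classic 16-color VGA palette from a few per-channel rules. The rules
# used by The Color Miner may be different.

def vga16():
    palette = []
    for i in range(16):
        bright = 0x55 if i & 0x8 else 0x00          # intensity bit
        r = (0xAA if i & 0x4 else 0x00) + bright    # red bit
        g = (0xAA if i & 0x2 else 0x00) + bright    # green bit
        b = (0xAA if i & 0x1 else 0x00) + bright    # blue bit
        if i == 6:                                  # classic exception: brown
            g = 0x55
        palette.append((r, g, b))
    return palette

for rgb in vga16():
    print("#%02X%02X%02X" % rgb)
```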
For additional information and to verify these findings, visit The Color Miner page.
The Color Miner is a tool for mining colors from Web documents.
The traditional way of presenting color palettes to users is to render them as static arrays of colors. This limits users to staring at color squares to make color-color comparisons instead of engaging them in data mining and critical thinking, activities that promote discovery and learning in a research or school setting.
When we developed The Color Miner, we did so with a fractal design strategy in mind. As a result, the tool can generate what we call fractalettes, or palettes within palettes. That is, each cell of a generated palette behaves as a smaller palette, containing color space information and relationships for the current color.
We could iterate the individual cells forever, but in practice we found that a one-level iteration is a good start for encouraging users to investigate color-color, space-space, and color-space relationships and to find basic trends and information patterns. In general, this architecture can be iterated to organize additional attributes and relational data.
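Since the fractalette behavior is described above only at a high level, here is one possible, purely illustrative reading of it: each cell of a palette expands, on demand, into a smaller palette of related colors. The tints-and-shades relationships below are placeholders; the color space relationships computed by The Color Miner may be different.

```python
# One possible reading of the "fractalette" idea: every cell of a palette
# expands, on demand, into a smaller palette of related colors.
# The relationships below (simple tints and shades) are illustrative only.

def tints_and_shades(rgb, steps=4):
    r, g, b = rgb
    shades = [(int(r * k), int(g * k), int(b * k))
              for k in (i / steps for i in range(1, steps))]
    tints = [(int(r + (255 - r) * k), int(g + (255 - g) * k), int(b + (255 - b) * k))
             for k in (i / steps for i in range(1, steps))]
    return shades + [rgb] + tints

def fractalette(palette):
    # one-level iteration: each cell becomes its own sub-palette
    return {rgb: tints_and_shades(rgb) for rgb in palette}

for base, cell in fractalette([(0xAA, 0x00, 0x00), (0x00, 0xAA, 0x00)]).items():
    print(base, "->", cell)
```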
As part of the migration of Mi Islita to its new home at http://www.miislita.com, I’m happy to announce the initial release of The Net Miner Tool Set, v. 1.
The tool set has been around for about a week with a few speed issues, but now you can enjoy it for good. So, what can you do with it? Well, visit the site, try it, and let me know if you like it or if there is room for improvement.
With The Net Miner Tool Set, anyone can now run some basic network security tests. You would be surprised to learn how many sites expose things like php.ini files, making life easy for attackers, or leak unnecessary information in their configuration headers.
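To make the idea concrete, here is a sketch, using only Python’s standard library, of the kind of basic check referred to above: inspecting response headers for chatty Server/X-Powered-By values and probing whether a php.ini file is publicly reachable. The path and header names are just examples, and checks like these should only be run against sites you own or are authorized to test.

```python
# Illustrative sketch of the kind of basic checks mentioned above.
# Only run checks like these against sites you own or may legally test.
import urllib.request
import urllib.error

def basic_exposure_check(base_url):
    # 1) Does the site leak software/version details in its headers?
    req = urllib.request.Request(base_url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            for name in ("Server", "X-Powered-By"):
                if resp.headers.get(name):
                    print(f"{name}: {resp.headers[name]}")
    except urllib.error.URLError as e:
        print("Request failed:", e)
        return

    # 2) Is a php.ini file publicly reachable? (example path only)
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/php.ini", timeout=10) as resp:
            if resp.status == 200:
                print("Warning: php.ini appears to be publicly accessible")
    except urllib.error.URLError:
        pass  # a 404/403 here is the desired outcome

basic_exposure_check("http://example.com")
```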
That is a recurrent question being asked by some of my readers. Here is my answer.
Back in 1995, I wrote in the Dedication section of my doctoral thesis:
“If I have a theory, but no experimental results, I may have nothing. And if I have a theory without practical applications, I may have an artifact.”
So, don’t give your visitors hearsay, half-truths, or misrepresentations of facts found across the Web, but things that they can really test and use, and that solve a real or urgent problem for them. Don’t waste your time repeating interesting, perhaps catchy, concepts that at the end of the day are just useless.
In addition to textual and audiovisual content of good quality, give them TOOLS. However, provide tools that make them spend more time interacting with your site and that authoritative pages will recommend or link to.
This is important because the amount of time spent by users on a site is directly correlated with several web metrics/analytics, such as the following (a small worked example appears after the list):
- frequency cap – a restriction on the number of times a specific visitor is shown a particular advertisement.
- stickiness – the amount of time spent at a site over a given time period.
- underdelivery – delivery of fewer impressions, visitors, or conversions than contracted for a specified period of time.
- unique visitors – individuals who have visited a site (or network) at least once during a fixed time frame.
- bandwidth – how much data (e.g., content, ads, creatives) can be transmitted over a communication channel in a given time period, often expressed in kilobits per second (kbps). Data here means any alphanumeric content, including parameters, variables, or any text/pixel-based creative.
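As a small worked example of some of the metrics above, the snippet below computes unique visitors and stickiness from a hypothetical page-view log. The data and field layout are made up for illustration only.

```python
# Hypothetical page-view log: (visitor_id, seconds_on_page). Data is made up.
log = [
    ("v1", 120), ("v2", 45), ("v1", 300), ("v3", 60), ("v2", 90),
]

unique_visitors = len({visitor for visitor, _ in log})    # distinct individuals
stickiness = sum(seconds for _, seconds in log) / 60      # total minutes on site this period

print(f"unique visitors: {unique_visitors}")
print(f"stickiness: {stickiness:.1f} minutes")
```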
Other time-based metrics inherited from traditional media (TV, radio), and based on the time users spend viewing a communication channel, can also be applied to web channels and sites; among others (again, a worked example follows this list):
- average audience – the average number of people who tuned in during the selected time period, expressed in thousands or as a percentage (also known as a Rating) of the total potential audience of the selected demographic. It is also known as a TARP (Target Audience Rating Point).
- channel share – the share one channel has of all viewing for a particular time period. The share, expressed as a percentage, is calculated by dividing the channel’s average audience by the average audience of all channels (PUTs). It is held in higher esteem by networks than by media buyers on a day-to-day basis, and is only referred to by the latter group when apportioning budgets and evaluating a programme for sponsorship.
- cumulative audience or reach – the total number of different people within the selected demographic who tuned in during the selected time period for 8 minutes or more (i.e., who were reached at least once by a specific schedule or advertisement).
- frequency – the average number of times that a person within the target audience has had the opportunity to see an advertisement over the campaign period.
- time spent viewing or TSV – how many minutes/hours an audience has viewed a particular channel.
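And here is a similarly hypothetical worked example for a few of the broadcast-inherited metrics: reach (8+ minutes of viewing), time spent viewing, and a simplified channel share. The viewing records and the share calculation (based on total viewing minutes rather than average audience) are simplifications for illustration.

```python
# Hypothetical viewing records: (person, channel, minutes_viewed). Data is made up.
sessions = [
    ("p1", "A", 30), ("p2", "A", 5), ("p3", "A", 12),
    ("p1", "B", 20), ("p4", "B", 40),
]

channel = "A"
viewers = [(p, m) for p, c, m in sessions if c == channel]

# cumulative audience / reach: distinct people who viewed 8+ minutes
reach = len({p for p, m in viewers if m >= 8})

# time spent viewing (TSV): total minutes viewed of this channel
tsv = sum(m for _, m in viewers)

# simplified channel share: this channel's viewing minutes over all channels'
share = 100 * tsv / sum(m for _, _, m in sessions)

print(f"reach: {reach} viewers, TSV: {tsv} min, share: {share:.1f}%")
```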
So, any tool that helps your visitors wisely improve their time spent on your site (in an effective manner, of course) cannot hurt you. For this to be true, however, the tool provided must be engaging, useful, effortless, and have a minimal learning curve; otherwise, the user experience of your visitors can be frustrating and a waste of time.
Last night we uploaded a new user interface (UI) for the Minerazzi Multiple Whois Miner (http://www.minerazzi.com/labs/whois.php).
We added support for:
1. generic top-level domains (gTLDs).
2. country-code TLDs (ccTLDs).
3. subdomain TLDs.
As we keep improving and adding new TLDs and whois servers to its index, we expect this to become a destination for our regular users.
The tool was designed in such a way that even support for the upcoming dotBrand Revolution is possible.
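For the curious, a whois lookup is a very simple protocol (RFC 3912): open a TCP connection to a whois server on port 43, send the query, and read the reply. The sketch below shows only the raw mechanics; the Multiple Whois Miner maintains its own index of which server to ask for each TLD, and the server shown here is just the well-known one for .com/.net.

```python
# Minimal sketch of a raw whois lookup (RFC 3912). The whois server used
# for each TLD varies; whois.verisign-grs.com handles .com/.net.
import socket

def whois_query(domain, server="whois.verisign-grs.com", port=43):
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall((domain + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

print(whois_query("example.com"))
```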
In previous posts, we have presented two tutorials on Okapi BM25 and BM25F, which are based on the Verbosity and Scope Hypotheses.
Here I would like to reference research on both sides of the Scope Hypothesis.
In the abstract of “Revisiting the relationship between document length and relevance” (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.3786&rep=rep1&type=pdf), Losada, D.E., Azzopardi, L., and Baillie, M. (2008) state:
“The scope hypothesis in Information Retrieval (IR) states that a relationship exists between document length and relevance, such that the likelihood of relevance increases with document length. A number of empirical studies have provided statistical evidence supporting the scope hypothesis. However, these studies make the implicit assumption that modern test collections are complete (i.e. all documents are assessed for relevance). As a consequence the observed evidence is misleading. In this paper we perform a deeper analysis of document length and relevance taking into account that test collections are incomplete. We first demonstrate that previous evidence supporting the scope hypothesis was an artefact of the test collection, where there is a bias towards longer documents in the pooling process. We evaluate whether this length bias affects system comparison when using incomplete test collections. The results indicate that test collections are problematic when considering MAP as a measure of effectiveness but are relatively robust when using bpref. The implications of the study indicate that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.”
However, in the abstract of “Enhancing ad-hoc relevance weighting using probability density estimation” (http://www.sigir2011.org/papershow.asp?PID=104), Zhou, Huang, and He (2011) state:
“Classical probabilistic information retrieval (IR) models, e.g. BM25, deal with document length based on a trade-off between the Verbosity hypothesis, which assumes the independence of a document’s relevance of its length, and the Scope hypothesis, which assumes the opposite. Despite the effectiveness of the classical probabilistic models, the potential relationship between document length and relevance is not fully explored to improve retrieval performance. In this paper, we conduct an in-depth study of this relationship based on the Scope hypothesis that document length does have its impact on relevance. We study a list of probability density functions and examine which of the density functions fits the best to the actual distribution of the document length. Based on the studied probability density functions, we propose a length-based BM25 relevance weighting model, called BM25L, which incorporates document length as a substantial weighting factor. Extensive experiments conducted on standard TREC collections show that our proposed BM25L markedly outperforms the original BM25 model, even if the latter is optimized.”
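Both abstracts revolve around how BM25 handles document length. As a refresher, the sketch below implements the standard BM25 term weight (not BM25L or BM25F): the b parameter is the knob that trades the Verbosity hypothesis (b near 1, long documents treated as merely verbose and penalized) against the Scope hypothesis (b near 0, length left unpenalized). The numbers fed to it are made up.

```python
# Refresher sketch of the standard BM25 term weight (not BM25L or BM25F).
import math

def bm25_term_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = 1.0 - b + b * (doc_len / avg_doc_len)   # length normalization, controlled by b
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)

# Same term frequency, different document lengths (illustrative values):
for dl in (100, 500, 2000):
    w = bm25_term_weight(tf=5, df=100, doc_len=dl, avg_doc_len=500, n_docs=10000)
    print(f"doc_len={dl:>5}: weight={w:.3f}")
```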
I haven’t reviewed BM25L vs. BM25F yet. Still, the question about the Scope Hypothesis is intriguing. From what I can tell (and this is solely my opinion), if an author writes more about a topic, or several topics, in a given document, it is more likely that he or she will use more instances of index terms. A cluster of the top index term density values (IDs) spread over said document should give some insight into its scope. We have developed a tool that computes these clusters. We are now testing whether that translates into improved relevance.
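The clustering tool mentioned above is not described in detail here, so take the following only as one possible reading of the idea: split a document into fixed-size segments, compute the density of an index term within each segment, and look at where the top-density segments fall. Everything in this sketch, including the segment size, is an assumption.

```python
# One possible reading of "clusters of top index term density values":
# compute a term's density per fixed-size segment, then pick the densest segments.

def term_density_profile(tokens, term, segment_size=50):
    profile = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        profile.append((start, segment.count(term) / len(segment)))
    return profile

def top_density_segments(profile, k=3):
    return sorted(profile, key=lambda p: p[1], reverse=True)[:k]

tokens = ("alpha beta scope " * 30 + "gamma delta " * 30).split()
print(top_density_segments(term_density_profile(tokens, "scope")))
```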
Assuming that the Web IR systems out there (e.g., search engines) use these algorithms or derivatives of them: what would be the implications for content writers trying to understand algorithms based on the Verbosity and Scope Hypotheses? Hello, copywriters, SEOs, etc. This puppy is nice to watch.