On Words and Strings
In String Frequency Distributions, Mark Liberman blogs about the flaws that arise when co-occurrence studies are reported without first defining what a “word” is.
I agree 100% with him. Using a Google data set without defining what a “word” is can be misleading. If no data clean-up is done first, the best that we can do is to call those studies “string frequency co-occurrence”.
A string is a linear sequence of symbols (characters, words, phrases, etc.). If we limit the term “string” to mean a string token, then a string is a sequence of characters. I prefer the following definition: a string (meaning a string token) is a non-space character or a sequence of these. Thus, a string need not be a dictionary word.
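As a minimal sketch of that definition, the Python below treats every maximal run of non-space characters as one string token and counts the tokens. The function names and the sample text are illustrative assumptions, and no data clean-up is attempted, so the counts are string frequencies rather than word frequencies.

```python
from collections import Counter


def string_tokens(text):
    """Split text into string tokens: maximal runs of non-whitespace characters."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops empty pieces, which matches the definition above.
    return text.split()


def string_frequencies(text):
    """Count how often each string token occurs in the text."""
    return Counter(string_tokens(text))


# Illustrative sample: "!=", ";" and "2004" are strings but not dictionary words.
sample = "co-occurrence != cooccurrence ; co-occurrence 2004"
freqs = string_frequencies(sample)
print(freqs.most_common())
```

Note that punctuation and numbers are counted on equal footing with words, which is exactly why a study run on uncleaned data is better described as a string frequency study.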
On Co-Occurrence Studies
Back in the early ’00s I conducted research on co-occurrence. In 2004, when Danny Sullivan created the Search Engine Watch Forums, I introduced SEOs to co-occurrence theory concepts. I described what is and what is not co-occurrence. I also introduced SEOs to two key concepts, the C-Index and the EF-Ratio, which were well explained at several SEO sites and SEO conferences through basic Venn diagram theory.
To make a long story short, the first of these ratios is simply the AND/OR ratio and the second is the EXACT/AND ratio. Both are computable from search results, but only as noisy signals, since search engines can expand the query and the answer sets in the background. Despite the noise, calculating these ratios provides some clue as to whether two or more keywords exhibit some form of relatedness in a database or corpus. Co-occurrence is also applicable to HTML elements such as links, anchor texts, titles, etc.
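The two ratios can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the hit counts are invented, and estimating the OR count by inclusion-exclusion from the Venn diagram is an assumption that holds only as far as the reported counts are trustworthy.

```python
def c_index(and_count, or_count):
    """C-Index as the AND/OR ratio of result counts."""
    if or_count <= 0:
        raise ValueError("or_count must be positive")
    return and_count / or_count


def ef_ratio(exact_count, and_count):
    """EF-Ratio as the EXACT/AND ratio of result counts."""
    if and_count <= 0:
        raise ValueError("and_count must be positive")
    return exact_count / and_count


def or_count_from_venn(n1, n2, and_count):
    """Estimate the OR count for two terms by inclusion-exclusion.

    n1 and n2 are the counts for each term alone; and_count is the
    count for documents containing both terms.
    """
    return n1 + n2 - and_count


# Invented counts for two hypothetical keywords k1 and k2.
n1, n2 = 1000, 500      # results for k1 alone and for k2 alone
n_and = 200             # results for: k1 AND k2
n_exact = 50            # results for the exact phrase "k1 k2"

n_or = or_count_from_venn(n1, n2, n_and)
print("C-Index :", c_index(n_and, n_or))
print("EF-Ratio:", ef_ratio(n_exact, n_and))
```

Because search engines may expand queries and answer sets in the background, counts like these are estimates, and the resulting ratios should be read as rough indicators rather than exact measurements.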
These days, a new generation of easy-to-impress SEOs is rediscovering co-occurrence as something “new” on the block. Please…
More than 10 years later, I’m completing research on new ratios and on their complementary relationships with the above two. This includes a search engine that calculates these ratios on the fly in its search result pages. More on this will soon follow.