In the Row-Pruning Algorithm Tutorial post I introduced an algorithm for identifying terms sequences hidden in collections. The tutorial is available in Mi Islita.com site. Several applications were also presented.
In a nutshell, given a set of terms T={t1, t2..tm} extracted from a document collection, C={d1, d2..dn} the goal is to combine each of these terms with all other terms of T and then find the family of term sequences more frequently present in C. Such families are of the form
“k1”
“k1 + k2”
“K1 + k2 +..km”
The double quotes indicate that these are term sequences.
The algorithm then is aimed at establishing the identity of the k’s using T and C. Once k1 is identified, the algorithm reduces to conducting a binary partition on the corresponding largest answer sets. Normally T consists of few terms (m < threshold value) and as such is a subset of the index of terms. Whenever possible these should be on-topic terms and preselected by a suitable algorithm or heuristic criterion.
Confidence-based pruning is used, but one could as well use other measures like co-occurrence, support, or similar measures. One does not need to restrict the algorithm to just terms.
I see many applications to this, including visitations:
“Users clicking on X and then on Y tend to click next on Z“
“Customers visiting store X and then store Y tend to visit next store Z.”
and so forth…