A positional inverted index is essentially a set of posting lists that store term weights, term positions, document IDs (docids), and similar data for a collection of documents.

Posting lists can also be generated from a single piece of text.

Such lists come in handy when we want to conduct text forensics or analyze writing styles: for instance, to check for evidence of plagiarism, to attribute authorship, or to analyze how a writer distributes stopwords, rare words, or specific combinations of terms across paragraphs, chapters, and so on.

However, counting words and term positions by hand can be time-consuming unless you have a tool that does it for you.

We have developed precisely such a tool. It is available at

http://www.minerazzi.com/tools/posting/lists.php

The tool generates posting lists of the form:

term: {frequency value, [array of positions]}

where the frequency value serves as the term weight and each term is associated with an array of its positions in the text.
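To make the format concrete, here is a minimal Python sketch that builds posting lists of this form from a single piece of text. It is not the tool's actual code: the function name `posting_lists`, the tokenization rule (lowercased runs of letters, digits, and apostrophes), and the 1-based position numbering are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def posting_lists(text):
    """Return {term: (frequency, [positions])} for a single piece of text."""
    # Assumed tokenization: lowercase, then runs of letters, digits, apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())

    positions = defaultdict(list)
    # Assumed convention: positions are 1-based token offsets.
    for pos, term in enumerate(tokens, start=1):
        positions[term].append(pos)

    # The raw frequency is simply the number of recorded positions.
    return {term: (len(p), p) for term, p in positions.items()}


lists = posting_lists("To be, or not to be: that is the question.")
for term, (freq, pos) in lists.items():
    print(f"{term}: {{{freq}, {pos}}}")
```

With this tokenization, the entry for "to" prints as `to: {2, [1, 5]}`. A different tokenizer, for example one that strips stopwords or splits on hyphens, would shift the positions accordingly.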

At this time, the tool analyzes plain text only.

With minor modifications, it can be used to build a positional inverted index where Robertson’s BM25 weights are stored.
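As a rough illustration of what such a modification could look like, the Python sketch below builds a positional inverted index over a small collection and stores a BM25 weight in place of the raw frequency. It is a sketch under stated assumptions, not the tool's implementation: the names `tokenize` and `build_bm25_index` are hypothetical, the parameters `k1=1.2` and `b=0.75` are commonly used defaults, and the IDF uses the non-negative variant `log(1 + (N - df + 0.5) / (df + 0.5))`, which is one of several BM25 formulations in use.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    # Assumed tokenization: lowercase, then runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_bm25_index(docs, k1=1.2, b=0.75):
    """Return {term: {docid: (bm25_weight, [positions])}}.

    `docs` maps docid -> plain text and is assumed to contain at least
    one non-empty document.
    """
    tokenized = {docid: tokenize(text) for docid, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # First pass: record 1-based positions of each term in each document.
    positions = defaultdict(dict)
    for docid, tokens in tokenized.items():
        for pos, term in enumerate(tokens, start=1):
            positions[term].setdefault(docid, []).append(pos)

    # Second pass: replace raw frequencies with BM25 weights.
    index = {}
    for term, postings in positions.items():
        df = len(postings)  # number of documents containing the term
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        index[term] = {}
        for docid, pos_list in postings.items():
            tf = len(pos_list)
            length_norm = k1 * (1.0 - b + b * len(tokenized[docid]) / avg_len)
            weight = idf * tf * (k1 + 1.0) / (tf + length_norm)
            index[term][docid] = (weight, pos_list)
    return index


docs = {"d1": "the cat sat", "d2": "the dog sat on the mat"}
index = build_bm25_index(docs)
for docid, (weight, pos_list) in index["the"].items():
    print(docid, round(weight, 4), pos_list)
```

Each posting keeps the same shape as before, a weight followed by an array of positions, except that it is now keyed by docid so that one term can point to several documents.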
