Document Linearization (DL) creates a pseudo document from which words are extracted, e.g., for vocabulary construction or indexing.
DL involves removing markup tags, punctuation, and stopwords, lowercasing the text, and sometimes stemming (reducing words to common roots) and code/decoration filtering (removing code and style lines). The end result can be used as part of analyses aimed at exploiting semantics and user intent.
Frequently, a pseudo document is represented as a stream of tokens or lowercased terms without any punctuation in it.
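To make the steps above concrete, here is a minimal sketch of DL in Python using only the standard library. It is an illustration under stated assumptions, not the code behind the Minerazzi tools: the `linearize` function, the `_TextCollector` helper, and the tiny `STOPWORDS` list are hypothetical names chosen for this example, stemming is left out, and tokens are restricted to ASCII letters and digits.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def linearize(html, stopwords=STOPWORDS):
    """Return a list of lowercased tokens with markup, code/style lines,
    punctuation, and stopwords removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Lowercase the visible text (the "plain text" stage).
    text = " ".join(collector.parts).lower()
    # Keep runs of ASCII letters/digits; this drops punctuation.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Drop stopwords to get the final term stream.
    return [t for t in tokens if t not in stopwords]


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<h1>The Art of Search</h1><p>Indexing, and retrieval!</p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(linearize(page))
```

A stemmer could be applied to the returned tokens as an optional last step, and a production version would need a fuller stopword list and Unicode-aware tokenization.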
In addition to information retrieval specialists, end users can benefit from DL. For instance, collection curators can use DL to identify keywords representative of a refined collection. Search and digital marketers can use DL to find relevant keywords to be placed in ads and optimized content.
We added three tools to all of our Minerazzi miners (productivity search engines) to help users do DL on the fly for a single search result: the Plain Text Extractor, the Tokens Extractor, and the Words Extractor. These are found in the “Crawlers & Extractors” section displayed under each search result of a miner.
To access them, just do a search in any miner at http://www.minerazzi.com and, under a search result, click file icons labeled “text”, “tokens”, or “words”. For instance, try with our most recent RSS/Atom Feeds miner at
Support for PDF files & graphical analysis will be available in the near future.