IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: December 2009

Fractal CSS Design

28 Monday Dec 2009

Posted by egarcia in Programming

Now that 2009 is almost over, I am pleased to announce the publication of Fractal CSS Design, a paper that introduces a technique for generating CSS layout patterns based on fractal theory. Applications to the design of Web pages are presented. The technique introduced allows one to create tableless layouts.

While the concept of using CSS rules for emulating HTML table layouts is not new and is well documented, the main difference between our approach and other solutions is that ours is purely algorithmic and based on fractal theory; i.e., it is the result of defining a small set of CSS rules for individual motifs, iterating these, and using the resultant fractal layouts as building blocks for larger layouts, which frequently turn out to be multifractals.

I will elaborate more on this in the IR Watch Newsletter.

Ho, Ho, Ho: My Tips as Gifts

22 Tuesday Dec 2009

Posted by egarcia in Marketing Research, Miscellaneous

Ho, Ho, Ho: My Tips as Gifts. Santa’s here.

Here are my holiday gifts to you in the form of useful tips.

Headers

Some search engines weight text inside strong, header, and anchor tags. So instead of this:

<h1>keyword 1</h1>

try this:

<h1><strong>keyword 1</strong></h1>

or this:

<h1><a …><strong>keyword 1</strong></a></h1>

CSS-style them to your heart's content. Try with several keywords and headers (h1, h2, h3, etc.).

Element Size Rendering

On most browsers, medium-size text is 16 pixels, which is rendered as 1em when the body font-size is 100%.

Thus, for any body font-size %, X px*(1em/16px)*(body %) = Y em.
When body font-size is set to 100%, converting px into ems reduces to dividing by 16. If no nested em units are used, then

20px*(1em/16px) = 1.25em, rendered as 20px
19px*(1em/16px) = 1.1875em, rendered as 19px
18px*(1em/16px) = 1.125em, rendered as 18px
17px*(1em/16px) = 1.0625em, rendered as 17px
16px*(1em/16px) = 1em, rendered as 16px
15px*(1em/16px) = 0.9375em, rendered as 15px
14px*(1em/16px) = 0.875em, rendered as 14px
13px*(1em/16px) = 0.8125em, rendered as 13px
12px*(1em/16px) = 0.75em, rendered as 12px
11px*(1em/16px) = 0.6875em, rendered as 11px
8px*(1em/16px) = 0.5em, rendered as 8px
1px*(1em/16px) = 0.0625em, rendered as 1px

At this body %, the minimum size is 1px = 0.0625em. To render 1px as a smaller em value, lower the body font-size %; e.g., if it is 80%:

1px*(1em/16px)*0.80 = 0.05em

Nesting elements with ems can also do the trick (*Please, see note).

BTW, watch out for nested ems, as you need to account for these. For instance, for body 100% all of these are rendered at the same size:

<pre><p style="font-size:1em;">edel</p>
<p style="font-size:16px;">edel</p>
<p style="font-size:2em;"><span style="font-size:0.5em;">edel</span></p></pre>
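The arithmetic in this tip can be checked with a few lines of Python. This is only a sketch: the function names px_to_em and rendered_px are made up for illustration, and the 16px base is the assumed browser default mentioned above. Note that px_to_em divides by the body fraction, so it returns the em value that keeps the pixel size exact at any body %.

```python
BASE_PX = 16  # assumed browser default for medium text


def px_to_em(px, body_percent=100.0):
    """Em value that renders exactly `px` pixels for a given body font-size %."""
    return px / (BASE_PX * body_percent / 100.0)


def rendered_px(em_factors, body_percent=100.0):
    """Pixel size produced by a chain of nested em values (outermost first)."""
    size = BASE_PX * body_percent / 100.0
    for factor in em_factors:
        size *= factor  # each nested em multiplies the inherited size
    return size


print(px_to_em(20))            # 1.25
print(rendered_px([1]))        # 16.0
print(rendered_px([2, 0.5]))   # 16.0, same size as a plain 1em
```

At body 80%, round(px_to_em(1, 80), 6) gives 0.078125, the em value needed to keep 1px at exactly 1px.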

Under the Counter Prescription Medication

Extra Strength Tylenol PM contains  500 mg acetaminophen (pain reliever) plus 25 mg diphenhydramine HCl (antihistamine, nighttime sleep aid).

Regular Tylenol contains 325 mg acetaminophen

Regular Benadryl Allergy contains 25 mg diphenhydramine HCl

Which one would you take and when?

PS. I modified/corrected  some lines.

* Conversely, to retain 1px exactly as 1px at the 80% level, use 1px*(1em/16px)/0.80 = 0.078125em.

* Some authors like to set body to 62.5% and then simply divide all pixel values by 10 to get ems, which is easier to remember. Your choice. I prefer the 100% mark across all browsers.

* The pre tags above are there to ensure rendering in the post and are not needed in the actual HTML code. I thought this was obvious.

*You can also try

<h1><strong><a …>keyword</a></strong></h1>.

The above nesting techniques can also be tried with in-page navigation (jump links) and even with img tags. If you care about W3C conformance, validate your code before putting any nesting technique to use.

TREC Call For Participation

21 Monday Dec 2009

Posted by egarcia in Conferences

Dr. Ellen Voorhees from TREC at NIST.gov sent me their most recent Call For Participation. To help with its dissemination, I am posting it in its entirety so that it reaches a wider audience through IRThoughts and Mi Islita.com.

              CALL FOR PARTICIPATION

                    TEXT RETRIEVAL CONFERENCE (TREC)

                      February 2010 – November 2010

                          Conducted by:
      National Institute of Standards and Technology (NIST)

The Text Retrieval Conference (TREC) workshop series encourages
research in information retrieval and related applications by
providing a large test collection, uniform scoring procedures,
and a forum for organizations interested in comparing their
results.  Now in its nineteenth year, the conference has become
the major experimental effort in the field.  Participants in
the previous TREC conferences have examined a wide variety
of retrieval techniques and retrieval environments,
including cross-language retrieval, retrieval of web documents,
multimedia retrieval, and question answering.  Details about TREC
can be found at the TREC web site,  http://trec.nist.gov.

You are invited to participate in TREC 2010.  TREC 2010 will
consist of a set of tasks known as “tracks”.  Each track focuses
on a particular subproblem or variant of the retrieval task as
described below.  Organizations may choose to participate in any or
all of the tracks.  Training and test materials are available from
NIST for some tracks; other tracks will use special collections that
are available from other organizations for a fee.

Dissemination of TREC work and results other than in the (publicly
available) conference proceedings is welcomed, but the conditions of
participation specifically preclude any advertising claims based
on TREC results.  All retrieval results submitted to NIST are
published in the Proceedings and are archived on the TREC web site.
The workshop in November is open only to participating groups that
submit retrieval results for at least one track and to selected
government personnel from sponsoring agencies.

Schedule:
——–

  By February 18 — submit your application to participate in
        TREC 2010 as described below.  Submitting an application
        will add you to the active participants’ mailing list.
        On Feb 24, NIST will announce a new password for the “active
        participants” portion of the TREC web site.  Included
        in this portion of the web site is information regarding
        the permission forms needed to obtain the TREC document
        disks.

   Beginning March 2 — document disks used in some existing
        TREC collections distributed to participants who have
        returned the required forms.  Please note that no disks
        will be shipped before March 2.

   July–August  — results submission deadline for most tracks
        Specific deadlines for each track will be included in
        the track guidelines, which will be finalized in the spring.

   September 9  (estimated) — speaker proposals due at NIST.

   September 30 (estimated) — relevance judgments and individual
        evaluation scores due back to participants.

   Nov 16-19 — TREC 2010 conference at NIST in Gaithersburg, Md. USA

Task Description:
—————-

Below is a brief summary of the tasks.  Complete descriptions of
tasks performed in previous years are included in the Overview
papers in each of the TREC proceedings (in the Publications section
of the web site).

The exact definition of the tasks to be performed in each track for
TREC 2010 is still being formulated.  Track discussion takes place
on the track mailing list.  To be added to a track mailing list,
follow the instructions for contacting that mailing list as
given below.  For questions about the track, send mail to the
track coordinator (or post the question to the track mailing list
once you join).

TREC 2010 will contain seven tracks.  The blog, chemical IR,
entity, legal, relevance feedback, and web tracks will
continue from TREC 2009.  The million query track will be
incorporated into the web track.  TREC 2010 will also contain
a new “session” track.

Blog Track — The purpose of the blog track is to explore information
    seeking behavior in the blogosphere.

    Track coordinators: Craig Macdonald, Iadh Ounis, Ian Soboroff
                        trecblog-organisers@dcs.gla.ac.uk
    Mailing list:  send a mail message to listproc@nist.gov
        such that the body consists of the line
        subscribe trec-blog <FirstName> <LastName>

Chemical IR Track — The goal of the chemical IR track is to develop
    and evaluate technology for large scale search in chemical
    documents including academic papers and patents to better
    meet the needs of professional searchers: specifically patent
    searchers and chemists.

    Track co-ordinators: John Tait, john.tait@ir-facility.org
                         Jimmy Huang, jhuang@yorku.ca
                         Jianhan Zhu, j.zhu@adastral.ucl.ac.uk
                         Mihai Lupu, m.lupu@ir-facility.org

    Track Web Page: http://www.ir-facility.org/the_irf/trec_chem.htm
    Mailing List: follow the link on the web page to join the list

Entity Track — The overall aim of this track is to perform
    entity-related search on Web data.  These search tasks
    (such as finding entities and properties of entities) address
    common information needs that are not that well modeled as
    ad hoc document search.

    Track coordinators: Krisztian Balog, k.balog@uva.nl
                        Paul Thomas, Paul.Thomas@csiro.au
                        Arjen P. de Vries, arjen@acm.org
                        Thijs Westerveld, thijs.westerveld@teezir.nl
    Track web page: http://ilps.science.uva.nl/trec-entity/
    Mailing list: visit http://groups.google.com/group/trec-entity
        to apply for membership.

Legal Track — The goal of the legal track is to develop search technology
    that meets the needs of lawyers to engage in effective discovery
    in digital document collections.

    Track coordinators: Gord Cormack, gvcormac@uwaterloo.ca
                        Maura Grossman, MRGrossman@wlrk.com
                        Bruce Hedin, bhedin@h5.com
                        Doug Oard, oard@umd.edu
    Track web page: http://trec-legal.umiacs.umd.edu
    Mailing list: contact oard@umd.edu to be added to the list.

Relevance Feedback Track — The goal of the relevance feedback track
   is to provide a framework for exploring the effects of different
   factors on the success of relevance feedback.

   Track coordinators: Chris Buckley, cabuckley@sabir.com
                       Matt Lease, ml@ischool.utexas.edu
                       Mark Smucker, msmucker@engmail.uwaterloo.ca
   Track web page:  http://groups.google.com/group/trec-relfeed
   Mailing list: follow the instructions given on the track web page
                 to join the email list

Session Track –  The Session track has two primary goals: (1) to test
        whether systems can improve their performance for a given query
        by using a previous query (and search results from the search
        session), and (2) to evaluate system performance over an entire
        query session instead of a single query.

    Track coordinators: Ben Carterette, carteret@cis.udel.edu
                        Paul Clough, p.d.clough@sheffield.ac.uk
                        Evangelos Kanoulas, ekanou@ccs.neu.edu
                        Mark Sanderson, m.sanderson@sheffield.ac.uk

    Track web page: http://ir.cis.udel.edu/sessions
    Mailing list: Use the link given on the track web page to
                  join the email list

Web Track –  The Web track explores Web-specific retrieval tasks,
    including diversity and efficiency tasks, over collections of
    up to 1 billion Web pages.
    Track coordinators: Nick Craswell, nickcr@microsoft.com
                        Charles Clarke, claclark@plg.uwaterloo.ca
    Mailing list:  send a mail message to listproc@nist.gov
        such that the body consists of the line
        subscribe trec-web <FirstName> <LastName>

Conference Format
—————–

The conference itself will be used as a forum both for presentation
of results (including failure analyses and system comparisons),
and for more lengthy system presentations describing retrieval
techniques used, experiments run using the data, and other issues
of interest to researchers in information retrieval.  As there
is a limited amount of time for these presentations, the TREC
program committee will determine which groups are asked to speak
and which groups will present in a poster session.  Groups that
are interested in having a speaking slot during the workshop
should submit a 200-300 word abstract in September describing
the experiments they performed.  The program committee will use
these abstracts to select speakers.

Data
—-
Many of the existing TREC English collections (documents, topics,
and relevance judgments) are available for training purposes and
may also be used in some of the tracks.  Parts of the training
collection (Disks 1-3) were assembled from Linguistic Data
Consortium (LDC) text, and a signed User Agreement will be required
from all participants.  The documents are an assorted collection
of newspapers, newswire, journals, and technical abstracts.
A second agreement is needed for disks (4-5).

All documents are typical of those seen in a real-world situation
(i.e. there will not be arcane vocabulary, but there may be
missing pieces of text or typographical errors).  For most tracks,
the relevance judgments against which each system’s output will be
scored will be made by experienced relevance assessors based on the
output of all TREC participants using a pooled relevance methodology.
See the Overview paper in the TREC-8 proceedings (on the TREC
web site) for a detailed discussion of pooling.
  
  
Application details:
——————–
  
Organizations wishing to participate in TREC 2010 should respond
to this call for participation by submitting an application.
Participants in previous TRECs who wish to participate
in TREC 2010 must submit a new application.

To apply, follow the instructions at
     http://ir.nist.gov/trecsubmit.open/application.html
to submit an online application.  The application system
will send an acknowledgement to the email address
supplied in the form once it has processed the form.

Any questions about conference participation should be sent
to the general TREC email address, trec@nist.gov .

Hackarandoso 1

16 Wednesday Dec 2009

Posted by egarcia in Hacking

30 million accounts compromised from Facebook, MySpace, and Orkut. Are you one of the victims?

TJX hacker turned into a snitcher. Snitcher = sapo, chota, traitor.

Adobe PDF attachments as hacking vehicles. A do-be yourself.

Computing Email Time Stamp Differences

15 Tuesday Dec 2009

Posted by egarcia in Newsletters

The current issue of IR Watch features this question. In a nutshell:

1. View the source code of an email and identify the time stamps from any two mail server records.  Time stamps are often given in the following format:

hr:min:sec +/- GMT

where hr is hours, min is minutes, and sec is seconds. +/- GMT is the offset from Greenwich Mean Time, which can be positive (+), negative (-), or zero.

2. To compute the time difference between the two records, do as follows:

Step 1: Convert time stamps to GMT times.

Step 2: Convert the new times into seconds.

Step 3: Take the difference.

You should be able to convert this difference back to an hr:min:sec format.

This can give you an estimate of the email delivery time between the two records; that is, if you trust the email header information.
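The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the newsletter's own script: the function names are made up, the stamps are assumed to look like "hr:min:sec +hhmm" (the numeric zone form commonly seen in mail headers), and the date part of the header is ignored.

```python
SECONDS_PER_DAY = 24 * 3600


def to_gmt_seconds(stamp):
    """Steps 1 and 2: shift 'hr:min:sec +hhmm' to GMT and express it in seconds."""
    clock, offset = stamp.split()
    hr, minute, sec = (int(part) for part in clock.split(":"))
    sign = -1 if offset.startswith("-") else 1
    digits = offset.lstrip("+-")
    offset_seconds = sign * (int(digits[:2]) * 3600 + int(digits[2:]) * 60)
    # Local time = GMT + offset, so GMT = local time - offset.
    return hr * 3600 + minute * 60 + sec - offset_seconds


def time_difference(earlier, later):
    """Step 3: difference between two stamps, returned as hr:min:sec."""
    # The modulo handles a hop that crosses midnight GMT; it assumes the
    # second record really is the later one and the gap is under 24 hours.
    diff = (to_gmt_seconds(later) - to_gmt_seconds(earlier)) % SECONDS_PER_DAY
    hours, remainder = divmod(diff, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(time_difference("10:15:30 -0500", "15:15:42 +0000"))  # 00:00:12
```

Because the date is dropped, server clocks that disagree (a later record stamped earlier than the previous one) would show up here as a gap of almost 24 hours rather than a negative value.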

IRW:2009-12: DNS Intelligence Retrieval

11 Friday Dec 2009

Posted by egarcia in Newsletters

The current issue of the IR Watch Newsletter should arrive in subscribers' inboxes today or tomorrow at the latest.

Featured Article:

Every Internet domain must be mapped to Name Servers (NS) containing specific DNS configuration files. In this issue of the newsletter we explain how to retrieve the content of these files. The information gathered can then be used for intelligence purposes.

Enjoy it.

Google DNS Services: The Google Internet?

03 Thursday Dec 2009

Posted by egarcia in Internet Engineering

DNS is one of the nerves of the Internet, as it is used to resolve domain names to IP addresses. This is a critical service provided by ISPs and domain name registrar services.

Apparently Google is taking a step closer to becoming The Internet by providing free, open DNS services to any web site in the world. This news is making almost all ISPs, domain registrar companies, and large organizations like Microsoft nervous.

http://www.pcmag.com/article2/0,2817,2356618,00.asp

They claim they are doing it to speed up the Web and to improve security (DoS, DNS cache, and DNS request attacks; topics covered in the Internet Engineering 1 course).

http://code.google.com/speed/public-dns/docs/intro.html

http://code.google.com/speed/public-dns/docs/security.html

Their DNS service  might evolve into a global connectivity ecosystem.

That’s scary.

A Simple Search Strategy to beat them all

01 Tuesday Dec 2009

Posted by egarcia in Data Mining, Machine Learning, Programming, Queries, Vector Space Models

Now that I’m out of school, I am doing what I love the most: programming and testing IR systems.

I’m currently testing a ranking algorithm for an IR system built over the last few years. The answer set is based on a simple matching (SM) search strategy.

Do not mistake a simple matching strategy for a simple or basic search approach, as it can evolve into the most complex one.

Unlike classic boolean searches (i.e., AND, OR, XOR),  SM is suitable for constructing answer sets and subsets based on coordination levels. Add a supporting scoring function (tf-IDF derivatives, RSJ-PM, BM25, etc) and… TA DA: a customizable clustering algorithm for retrieving and ranking search results.

Proper fine tuning allows presenting end-users with answer sets wherein AND results are accumulated at the top of the search results. As users move down the search results, they are presented with OR results and the search experience is perceived as if the system expands the answer set by switching query modes.
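A minimal sketch of the coordination-level idea, not the ranking algorithm under test: documents are grouped by how many distinct query terms they match, and a placeholder score breaks ties within each level. The function name simple_match, the whitespace tokenizer, and the summed term counts standing in for tf-IDF or BM25 are all assumptions made for illustration.

```python
from collections import Counter


def simple_match(query, docs):
    """Rank documents by coordination level, then by a placeholder score."""
    terms = set(query.lower().split())
    results = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        matched = [term for term in terms if counts[term] > 0]
        if not matched:
            continue  # document is outside the answer set
        level = len(matched)                            # coordination level
        score = sum(counts[term] for term in matched)   # stand-in for tf-IDF, BM25, etc.
        results.append((doc_id, level, score))
    # Full matches (AND-like) come first, partial matches (OR-like) follow.
    return sorted(results, key=lambda r: (-r[1], -r[2], r[0]))


docs = {
    "d1": "fractal css layout",
    "d2": "css css tables",
    "d3": "fractal geometry",
    "d4": "search engines",
}
for row in simple_match("fractal css", docs):
    print(row)
```

With this toy collection, d1 (both terms) is listed first, followed by d2 and d3 (one term each), while d4 is left out of the answer set.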

I’ve also added a query reduction mechanism for discovering related searches. Brazilian Wax, nice!

In preliminary tests, results compare favorably with answer sets from search engines that claim to do search expansion/reduction, query mode switching, or clustering.

The next step is to check whether, with a large corpus and a thesaurus, results compare favorably with results from search engines that claim to use semantics.

So far, mine is cost effective and does not require extra libraries.

PS: I forgot to mention that my ranking algorithm is not based on computing vectors or cosine similarities, so any overhead from a Vector Space Model is avoided. That’s the icing on the cake!
