Remove "boilerplate" (templated HTML) from web pages while indexing

Most web sites have pages with a lot of repetitive formatting, sidebars, and so on. When you index them, it would be nice to strip this out and retain only the "real" text of each page. LucidWorks does not have a general-purpose solution for this, but we can recommend two approaches:

  1. Crawl the mobile version of the site, if possible. 
  2. Use the Boilerpipe library to preprocess the page.

To get the mobile version of the site, you will need a user-agent that triggers the mobile content. In conf/lwe-core/defaults.yml, you will find these lines:

http.agent.browser: Mozilla/5.0
http.agent.email: crawler at example dot com
http.agent.name: LucidWorks
http.agent.url: ''
http.agent.version: ''

Set these to match the user agent you need. The Fennec browser from Mozilla lets you experiment with user-agents.
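Before changing the crawler settings, it can help to check which user-agent string actually triggers the mobile content on your site. The following is a minimal Python sketch, not part of LucidWorks: the function names and the example mobile user-agent string are assumptions chosen for illustration, and a smaller response is only a rough hint that the mobile version was served.

```python
import urllib.request

# The default browser value from defaults.yml.
DESKTOP_AGENT = "Mozilla/5.0"

# Example mobile user-agent string (an assumption for illustration only;
# substitute whatever string your target site responds to).
MOBILE_AGENT = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 "
    "Mobile/15E148 Safari/604.1"
)


def build_request(url, user_agent):
    """Build a request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_size(url, user_agent, timeout=10):
    """Fetch the page and return the size of the response body in bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return len(response.read())


def compare_agents(url):
    """Return the body sizes served to the desktop and mobile agents."""
    return {
        "desktop": fetch_size(url, DESKTOP_AGENT),
        "mobile": fetch_size(url, MOBILE_AGENT),
    }
```

Calling compare_agents() with a page URL from your site returns the two sizes. If they are identical, the site probably ignores the user-agent and the mobile-crawl approach will not help.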

The Boilerpipe library is available from Google Code. It is integrated into Tika, but Solr does not expose this integration. You can write your own program with Tika to pre-process the pages and then upload them. If you use 'Original Content' to store the actual binary of the page (base-64 encoded for web pages), you can download the document, process it with Tika/Boilerpipe, and re-upload it.

http://code.google.com/p/boilerpipe/
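
The download, process, and re-upload cycle can be sketched in Python as follows. This is only an outline under stated assumptions: the extractor below is a crude stand-in that skips a fixed set of tags, not Boilerpipe's own heuristics, and the function names are hypothetical. The download and re-upload steps depend on your installation and are left out.

```python
import base64
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate in this sketch.
# This fixed list is an assumption, not how Boilerpipe decides.
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class CrudeTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def decode_original_content(encoded, charset="utf-8"):
    """Decode the base-64 'Original Content' value back into HTML text."""
    return base64.b64decode(encoded).decode(charset, errors="replace")


def extract_main_text(html):
    """Return the text left after dropping the skipped tags."""
    parser = CrudeTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

In a real pre-processing program, you would replace extract_main_text() with a call into Tika/Boilerpipe and then re-upload the cleaned text.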
