Web Crawl Datasource best practices

Testing was done to determine how many Web Crawler datasources should be run simultaneously (i.e., in parallel).  That testing led to a best-practices rule of thumb.

Testing was performed on Fusion 1.2.4, against Wikipedia.org, using a virtual Linux server assigned 4 CPU threads.  The rule of thumb also assumes that the primary bottleneck is CPU and connector related (i.e., if you're getting OOMs in some other process, fix that issue first).

Rule of Thumb:

The number of running crawler fetch threads should be no greater than the number of CPU threads on the OS.

Attempting to go beyond that leads to:

  • No significant increase in throughput.  The CPU bottleneck simply means less work is done per thread, so the number of documents crawled does not increase meaningfully.
  • TimeoutException and NoHttpResponseException errors, which seem to be related to the crawler (not the target web server), will appear and increase in frequency as the number of threads is increased beyond that 'limit'.
  • UI responsiveness will decrease as the number of threads is increased beyond that 'limit'.
  • Those last two negative side effects don't seem to occur when only 1 datasource is running at a time, and they grow as the number of datasources (and therefore total threads) increases.  So while running 1 DS with 15 threads may not show any negative side effects, running 3 DS with 5 threads each (still 15 threads total) will have some fetch failures, and running 5 DS with 3 threads each (again, 15 threads total) will have even more fetch failures.  A small thread-allocation sketch follows this list.
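To make the arithmetic behind the rule of thumb concrete, here is a minimal Python sketch (not part of Fusion) that splits a machine's CPU threads evenly across the datasources you plan to run in parallel.  The function name and the even-split policy are illustrative assumptions; the point is simply that the per-datasource fetch-thread setting multiplied by the number of concurrently running datasources should not exceed the CPU thread count.

    import os

    def plan_fetch_threads(num_datasources, cpu_threads=None):
        """Suggest a per-datasource fetch-thread count so that the total
        across all concurrently running datasources does not exceed the
        number of CPU threads (the rule of thumb above)."""
        if cpu_threads is None:
            cpu_threads = os.cpu_count() or 1
        # Each datasource needs at least one fetch thread; beyond that,
        # divide the CPU threads evenly and drop any remainder.
        per_datasource = max(1, cpu_threads // num_datasources)
        if per_datasource * num_datasources > cpu_threads:
            print("Warning: %d datasources exceed %d CPU threads; expect "
                  "fetch errors and reduced UI responsiveness."
                  % (num_datasources, cpu_threads))
        return per_datasource

    # On the 4-CPU-thread VM used in the testing above:
    #   plan_fetch_threads(1, 4) -> 4
    #   plan_fetch_threads(3, 4) -> 1
    #   plan_fetch_threads(5, 4) -> 1, plus a warning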

It's also worth noting that these negative side effects may not matter for some users and use cases.  For example, if the crawler is controlled via the API rather than the UI, reduced UI responsiveness is not a concern.  And if incremental crawls are configured, it is very unlikely that a fetch error will hit the exact same document twice, so subsequent crawls are likely to erase the impact of the fetch failures that occurred on the first crawl, which may be an acceptable trade-off.
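As a hedged illustration of controlling the crawler through the API rather than the UI, the sketch below starts a single datasource job over HTTP.  The host, port, credentials, datasource ID, and the job endpoint path are all placeholder assumptions modeled on the Fusion 1.x connectors API; check the REST API reference for your Fusion version before relying on it.

    import base64
    import urllib.request

    # Placeholder values -- substitute the host, port, credentials, and
    # datasource ID from your own Fusion installation.  The endpoint path
    # and HTTP method can differ between Fusion versions.
    JOB_URL = "http://localhost:8764/api/apollo/connectors/jobs/my-web-datasource"
    USER, PASSWORD = "admin", "password123"

    def start_crawl():
        """Start the crawl job for one datasource without using the UI."""
        request = urllib.request.Request(JOB_URL, method="POST")
        token = base64.b64encode(("%s:%s" % (USER, PASSWORD)).encode()).decode()
        request.add_header("Authorization", "Basic " + token)
        with urllib.request.urlopen(request) as response:
            print("Job start request returned HTTP", response.status)

    if __name__ == "__main__":
        start_crawl()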


2 Comments

  • Garth Grimm

    When configuring the datasource, you can use the Advanced options, and look for the Fetch section. There are fields there where you can control the number of fetchers used.

  • Matt Kuiper

    The version number listed above for Fusion is "Fusion 1.2.4"; is this correct?
