Goal:
I want to crawl only the items that have been modified according to the sitemap. What steps can I take to achieve that?
Environment:
Fusion 4 (Web V1 connector)
Guide:
Follow these steps:
-
Create a Web datasource (DS) to crawl the sitemap URLs.
-
Set the same sitemap URL in startLinks and in the Sitemap URLs section. Each item in the sitemap has the following format:
<url>
<loc>http://localhost:8000/test/page1.html</loc>
<lastmod>2024-01-13T12:15:00+05:30</lastmod>
</url>
-
Set the property Recrawl all items to false.
-
Set the property Process Sitemap URLs to true (under the Recrawl rules group; Advanced properties must be enabled).
-
Save the datasource.
-
Clear the datasource.
-
Start the crawl.
-
Make changes to the website. Add a couple of new sitemap items whose lastmod time is later than the completion time of the last successful crawl. For example, if the last crawl completed successfully on Jan 16, 2024 at 10:00 AM, add some items whose lastmod value is Jan 16, 2024 at 10:01 AM. Then trigger a crawl, say on Jan 16, 2024 at 10:05 AM. During that crawl, the crawler should incrementally crawl the new items and index them.
-
Start crawling the datasource again. Now only the modified pages should be processed.
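To predict which pages the incremental crawl should pick up, it can help to compare the lastmod values in the sitemap against the last crawl completion time yourself. The following is a minimal Python sketch of that comparison, not the connector's actual implementation. The function names, the page2.html entry, and the +05:30 offset used for the last crawl time are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}url' -> 'url'."""
    return tag.rsplit("}", 1)[-1]


def modified_since(sitemap_xml, last_crawl):
    """Return the <loc> values whose <lastmod> is later than last_crawl."""
    changed = []
    for element in ET.fromstring(sitemap_xml).iter():
        if local_name(element.tag) != "url":
            continue
        fields = {local_name(c.tag): (c.text or "").strip() for c in element}
        if not fields.get("loc") or not fields.get("lastmod"):
            continue  # skip entries that cannot be compared
        lastmod = datetime.fromisoformat(fields["lastmod"])
        if lastmod.tzinfo is None:
            # Assumption: treat timestamps without an offset as UTC.
            lastmod = lastmod.replace(tzinfo=timezone.utc)
        if lastmod > last_crawl:
            changed.append(fields["loc"])
    return changed


# Hypothetical sitemap: page1 is unchanged, page2 was added after the crawl.
SITEMAP = """<urlset>
  <url>
    <loc>http://localhost:8000/test/page1.html</loc>
    <lastmod>2024-01-13T12:15:00+05:30</lastmod>
  </url>
  <url>
    <loc>http://localhost:8000/test/page2.html</loc>
    <lastmod>2024-01-16T10:01:00+05:30</lastmod>
  </url>
</urlset>"""

# Assumed last successful crawl: Jan 16, 2024 at 10:00 AM (+05:30).
IST = timezone(timedelta(hours=5, minutes=30))
last_crawl = datetime(2024, 1, 16, 10, 0, tzinfo=IST)

print(modified_since(SITEMAP, last_crawl))
```

With the sample data above, only page2.html is reported, which matches the expectation that the second crawl processes just the newly added or modified items.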