Goal
In some cases you may want to modify the IDs of documents before they are indexed. This can cause unexpected problems, because some datasource types, such as the Web Crawler, use the URL of a crawled page as its document ID by default, while the corresponding entry in the crawlDB stores the original, unmodified URL. On later incremental crawls the two entries no longer match, which can prevent updates and deletions from being applied.
Environment
Fusion (all versions)
Guide
Consider the following example. A web page (https://doc.lucidworks.com/) is crawled, and an index pipeline stage overwrites the ID field on the document, replacing the URL with an ID from an external database: "a1f092d0-376a-440a-9445-48e3dbbfcbd0". A document like the following is created:
{
    id: "a1f092d0-376a-440a-9445-48e3dbbfcbd0",
    url: "https://doc.lucidworks.com",
    body_t: "This Body",
    ...
}
Following this, an entry for "https://doc.lucidworks.com" is added to the crawlDB. The web page is later unpublished, so a subsequent crawl receives a 404. The PipelineDocument object for "https://doc.lucidworks.com" is then sent through the index pipeline with a DELETE command attached, and the Solr Index stage issues a delete-by-query request to Solr that uses the URL as the ID. Because no document with that ID exists, the document remains in the index. Finally, the crawlDB entry for that URL is removed. As a result, this document will never be removed or updated by future crawls.
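The sequence above can be illustrated with a small simulation. This is only a sketch: the `index` and `crawlDb` objects and the `deleteById` function are hypothetical in-memory stand-ins, not Fusion or Solr APIs, and they model the delete as a simple lookup by ID.

```javascript
// Toy in-memory stand-ins for the Solr index and the crawlDB (assumptions, not Fusion APIs).
var index = {};    // indexed documents keyed by their id field
var crawlDb = {};  // crawlDB entries keyed by the original URL

var url = "https://doc.lucidworks.com";
var externalId = "a1f092d0-376a-440a-9445-48e3dbbfcbd0";

// First crawl: the pipeline stage swaps the URL for the external ID before indexing,
// while the crawlDB records the unmodified URL.
index[externalId] = { id: externalId, url: url, body_t: "This Body" };
crawlDb[url] = true;

// Hypothetical delete that removes a document by id and reports whether it existed.
function deleteById(id) {
    var existed = Object.prototype.hasOwnProperty.call(index, id);
    delete index[id];
    return existed;
}

// Later crawl: the page returns a 404, so a delete keyed on the URL is issued.
var deleted = deleteById(url); // misses, because no document has the URL as its id
delete crawlDb[url];           // the crawlDB entry is removed regardless

// The stale document is still indexed, and nothing in the crawlDB points to it.
var orphaned = Object.keys(index);
```

In this sketch `deleted` ends up `false` and `orphaned` still contains the external ID, which mirrors the orphaned document described above.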
This mismatch can be solved by adding an extra stage to the pipeline. For the purposes of this example, suppose the index pipeline stage appends an ID retrieved from an external database to the URL. A JavaScript stage can then check incoming documents for attached commands and ensure that the correct ID is used in the delete-by-query request to Solr, as shown below:
function (doc, ctx) {
    var commands = doc.getCommands();
    /*
    Placeholder value: replace this assignment with a call to the
    external database that returns the ID for this document.
    */
    var my_id = "a1f092d0-376a-440a-9445-48e3dbbfcbd0";
    if (commands.length > 0) {
        // Only the first attached command is inspected.
        var this_command = commands[0];
        if (this_command.getName() == 'delete_query') {
            // Swap the URL-based delete for one that targets the modified ID.
            var new_doc = doc.removeCommands();
            new_doc.addCommand(new com.lucidworks.apollo.common.pipeline.Command("delete_query", java.util.Collections.singletonMap("query", "{!field f='id' v='https\:\/\/doc.lucidworks.com\/" + my_id + "'} OR _query_:{!prefix f='id' v='https\:\/\/doc.lucidworks.com\/" + my_id + "#'}")));
            return new_doc;
        }
    }
    // Documents without a delete command pass through unchanged.
    return doc;
}
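To make the query string in the stage above easier to read and check, it could be built by a small helper. The sketch below is an assumption for illustration only: `buildDeleteQuery` is a hypothetical name, not part of Fusion. In a JavaScript string literal, `\:` and `\/` evaluate to plain `:` and `/`, so the helper writes the URL unescaped and should produce the same text as the literal used in the stage.

```javascript
// Hypothetical helper (not a Fusion API): builds the delete query for a document
// whose ID is the base URL followed by an externally assigned ID.
function buildDeleteQuery(baseUrl, externalId) {
    var id = baseUrl + externalId;
    // Match the exact ID, or any ID that starts with the same value followed by '#'.
    return "{!field f='id' v='" + id + "'}" +
        " OR _query_:{!prefix f='id' v='" + id + "#'}";
}

var query = buildDeleteQuery(
    "https://doc.lucidworks.com/",
    "a1f092d0-376a-440a-9445-48e3dbbfcbd0"
);
```

Inside the stage, the resulting `query` value would be passed as the "query" entry of the map given to the Command constructor.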