Replacing Unwanted Characters in a Document's ID

In some cases, you'll want to remove special or otherwise unwanted characters from a field in your document, such as its ID.  You can do this by adding a small, custom JavaScript stage to your indexing pipeline.  The code is fairly straightforward:

 

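function (doc) {
    var id = doc.getId();
    if (id != null) {
        // Coerce to a JavaScript string so replace() takes a regular expression
        var newID = String(id);
        // Only touch the document if the ID actually contains a double quote
        if (newID.indexOf('"') !== -1) {
            // The g flag removes every double quote, not just the first
            newID = newID.replace(/"/g, "");
            doc.setId(newID);
        }
    }
    // Documents without an ID pass through unchanged
    return doc;
}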

What's going on here is pretty simple.  We take the incoming PipelineDocument (doc), pull out its ID with doc.getId(), remove the unwanted characters (in this case, double quotes), and then write the cleaned ID back to the PipelineDocument with doc.setId().
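To try this logic outside of Fusion, you can run it against a stand-in document.  The sketch below is only an illustration and is not Fusion's API: makeMockDoc is a hypothetical helper that mimics just the getId() and setId() calls, and stripQuotesFromId is simply a name given to a version of the stage logic that strips every double quote, so that it can be called directly.

```javascript
// Hypothetical stand-in for a PipelineDocument; only getId/setId are mimicked.
function makeMockDoc(id) {
    var currentId = id;
    return {
        getId: function () { return currentId; },
        setId: function (newId) { currentId = newId; }
    };
}

// The stage logic, given a name here so it can be called directly.
function stripQuotesFromId(doc) {
    var id = doc.getId();
    if (id != null) {
        var newID = String(id).replace(/"/g, "");
        // Only write the ID back when something was actually removed
        if (newID !== String(id)) {
            doc.setId(newID);
        }
    }
    return doc;
}

var result = stripQuotesFromId(makeMockDoc('"http://example.com/page"'));
console.log(result.getId()); // http://example.com/page
```

A document with no ID passes through untouched, so the function is safe to run on every document in the pipeline.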

To implement the above code, you'll want to:

     1) Select the indexing pipeline from your datasource. 

     2) Add a JavaScript stage to your pipeline, right after the Apache Tika Parser. 

     3) Add the code to your JavaScript stage, and modify it to suit your needs. 
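As one example of adapting the code in step 3, the sketch below removes a wider set of characters.  The UNWANTED pattern and the characters in it (double quotes, single quotes, and angle brackets) are assumptions chosen for illustration, and cleanId and cleanIdStage are hypothetical names; substitute whichever characters cause problems in your own IDs.

```javascript
// Assumed set of unwanted characters: " ' < >
var UNWANTED = /["'<>]/g;

// Hypothetical helper: returns the ID with every unwanted character removed.
function cleanId(id) {
    // Leave missing IDs untouched
    if (id == null) {
        return id;
    }
    return String(id).replace(UNWANTED, "");
}

// Stage-style wrapper; in a JavaScript stage this would be the anonymous function.
function cleanIdStage(doc) {
    var id = doc.getId();
    if (id != null) {
        doc.setId(cleanId(id));
    }
    return doc;
}

console.log(cleanId('"<doc-1>"')); // doc-1
```

Keeping the character set in a single pattern means that changing what counts as "unwanted" is a one-line edit.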

That's all there is to it.  Happy indexing! 
