'Dedupe' in Fusion

Lucidworks Fusion provides a useful "Dedupe" feature, which prevents Fusion from indexing duplicate content (based on specified criteria) into Solr.

This article covers configuring the Dedupe feature in Fusion versions up to 2.4.4, how it works internally, some useful dedupe scripts, and the challenges that come with it.

Configuration:

A detailed explanation of how to configure Dedupe can be found in the official Fusion documentation: Dedupe in Connectors

Working:

First and foremost, the Dedupe feature requires no resources beyond those Fusion already uses. As explained in the documentation, the Dedupe feature indexes a "dedupeSignature_s" field in Solr, whose value depends on the dedupe field name you have specified or on the logic you have implemented in the dedupe script.

Suppose we have specified the dedupe field name as <body> while crawling a website (Web Datasource). Fusion will index the document with a "dedupeSignature_s" field, using the content of the <body> tag of the HTML page as its value. If the pipeline then encounters an HTML page with the same content signature in the <body> tag, Fusion simply ignores that document and does not index it.

If you have not enabled the Dedupe feature, then unlike the case above, a new document is indexed with a different doc "id" but the same "body" content as the document already in the index.
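
The behaviour described above can be pictured with a small standalone model. This is only a conceptual sketch in plain JavaScript, not Fusion's internal implementation: makeDedupeFilter, shouldIndex, and the in-memory Set are hypothetical names introduced for illustration, and the documents are plain objects rather than real crawler content.

```javascript
// Conceptual model only: Fusion's real dedupe logic is internal to the connector.
// makeDedupeFilter and shouldIndex are hypothetical names used for illustration.
function makeDedupeFilter(dedupeField) {
  // Stands in for the dedupeSignature_s values already present in the index.
  var seenSignatures = new Set();

  return function shouldIndex(doc) {
    var signature = doc[dedupeField];
    if (signature === undefined || signature === null) {
      return true; // assumption: without a signature there is nothing to compare
    }
    if (seenSignatures.has(signature)) {
      return false; // same signature already indexed: skip this document
    }
    seenSignatures.add(signature);
    return true;
  };
}

var shouldIndex = makeDedupeFilter("body");
var results = [
  shouldIndex({ id: "page-1", body: "content" }),
  shouldIndex({ id: "page-2", body: "content" }), // duplicate body
  shouldIndex({ id: "page-3", body: "other content" })
];
console.log(results); // [ true, false, true ]
```

With Dedupe disabled, the equivalent of shouldIndex would return true for all three documents, so "page-2" would be indexed alongside "page-1" with identical body content.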

Useful scripts:

As mentioned in the documentation, you can specify either a dedupe field or a dedupe script, whichever fits the use case.

If the content to be deduped is straightforward, for example an element of an HTML page or a column of a CSV document, specify the element tag name or the CSV column name in the Dedupe field.

For more complex scenarios, leave the dedupe field blank and use a dedupe script, in which you must define a 'genSignature(content){}' function. The function must return a string.

The following sample scripts are based on the web datasource:

1. Dedupe documents on <body> element tag of HTML page (case insensitive):

function genSignature(content) {
    var signature = "";
    // check for null, then specify the field name
    if (content !== null && content.hasField("body")) {
        var values = content.getStrings("body").toArray();
        values.sort();
        for each (var value in values) {
            signature += value.toLowerCase(); // case insensitive
        }
    }
    return signature.length > 0 ? signature : null;
}

2. Dedupe documents on more than one element tag of HTML page (case insensitive):

function genSignature(content) {
    var signature = "";
    var values, value;
    if (content !== null && content.hasField("body")) { // specify field one
        values = content.getStrings("body").toArray();
        values.sort();
        for each (value in values) {
            signature += value.toLowerCase();
        }
    }
    if (content !== null && content.hasField("title")) { // specify field two
        values = content.getStrings("title").toArray();
        values.sort();
        for each (value in values) {
            signature += value.toLowerCase();
        }
    }
    // ----- and so on, add the field names you want -----
    return signature.length > 0 ? signature : null;
}

As the script shows, the final 'signature' variable is the concatenation of the content of all the specified tags/fields.
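
One consequence of plain concatenation is that two different pages can produce the same signature when the boundary between fields shifts. The following is a minimal sketch in plain JavaScript, runnable outside Fusion, of how a separator between fields avoids that. The joinFields helper and the "\u0001" separator are assumptions made for this illustration and are not part of Fusion.

```javascript
// Hypothetical helper: lower-cases each field value and joins them with a separator.
function joinFields(fieldValues, separator) {
  return fieldValues
    .map(function (value) { return String(value).toLowerCase(); })
    .join(separator);
}

// Two different pages whose plain concatenation is identical.
var pageA = ["ab", "c"]; // body = "ab", title = "c"
var pageB = ["a", "bc"]; // body = "a",  title = "bc"

var plainA = joinFields(pageA, "");
var plainB = joinFields(pageB, "");
var separatedA = joinFields(pageA, "\u0001");
var separatedB = joinFields(pageB, "\u0001");

console.log(plainA === plainB);         // true: the two pages collide
console.log(separatedA === separatedB); // false: the separator keeps them apart
```

In a dedupe script, the same idea would amount to appending a separator string to 'signature' after each field's loop.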

3. Dedupe documents on an 'attribute' value of HTML page:

Sample HTML source code:

</head><body bgcolor="#1E2F62" link="#778ED6" vlink="#8295BA" alink="#8295BA"><div align="center"><div align="center"><table width="802" bgcolor="#000000" background="images/bg-black.gif" style="border: 1px solid #778ED6" id="table4"><tr><td><div align="center"> --------

We are deduping on the "link" attribute of the "body" element:

function genSignature(content) {
    var signature = "";
    if (content !== null && content.hasDocument()) {
        var e = content.getDocument().select("body");
        if (null !== e && !e.isEmpty()) {
            signature = e.first().attributes().get("link");
        }
    }
    return signature;
}

4. Dedupe documents on a nested element tag of HTML page:

Sample HTML source code:

<html><body><div>
<span class="abc" date="11-07-2016">Span1</span>
<span class="def" date="11-08-2016">Span2</span>
</div></body></html>

We are deduping on the 'span' element with class 'abc':

function genSignature(content) {
    var signature = "";
    if (content !== null && content.hasDocument()) {
        var e = content.getDocument().select("body > div > span[class=abc]");
        if (null !== e && !e.isEmpty()) {
            // matched elements are visited in document order
            for each (var value in e) {
                signature += value.text().toLowerCase();
            }
        }
    }
    return signature.length > 0 ? signature : null;
}

More relevant scripts will be added in future.

Challenges:

Assume we are using the web datasource to index an HTML page and deduping on the <body> tag:

<html><title>title1</title><body>content</body></html>

The first document to be indexed is:

"title":"title1"
"body":"content"

Now we modify the same HTML page (the actual web page) so that <title> is "title2" while <body> is still "content", the same as before.

<html><title>title2</title><body>content</body></html>

On recrawling the same web page, Fusion discards the change and the same old content remains in Solr:

"title":"title1"
"body":"content"

The reason is that we were deduping on the <body> HTML tag.

No matter how many times we recrawl, even with "Force recrawl" enabled on the Fusion web datasource, the document in Solr will not be updated until the content of <body> changes.
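
The scenario can be reproduced outside Fusion by comparing signatures directly. The sketch below uses plain JavaScript objects in place of Fusion's content objects; bodySignature and bodyAndTitleSignature are hypothetical functions written for this illustration. It only shows whether the signature changes between crawls. How Fusion then handles a document whose signature has changed is not covered by this article, so treat the multi-field variant as something to test rather than a confirmed fix.

```javascript
// Plain objects stand in for the crawled page; these are not Fusion's content objects.
var firstCrawl = { title: "title1", body: "content" };
var recrawl = { title: "title2", body: "content" };

// Signature built from <body> only, as in the scenario above.
function bodySignature(page) {
  return page.body.toLowerCase();
}

// Hypothetical alternative: signature built from both <body> and <title>.
function bodyAndTitleSignature(page) {
  return page.body.toLowerCase() + "|" + page.title.toLowerCase();
}

var bodyOnlyUnchanged = bodySignature(firstCrawl) === bodySignature(recrawl);
var combinedUnchanged =
  bodyAndTitleSignature(firstCrawl) === bodyAndTitleSignature(recrawl);

console.log(bodyOnlyUnchanged); // true: the title change is invisible to the signature
console.log(combinedUnchanged); // false: the title change alters the signature
```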

Hence, keep the functionality and working of Dedupe, along with these challenges, in mind when incorporating Dedupe into your Fusion datasource configuration.
