When indexing vast volumes of data, it's almost certain that you'll come across duplicated data. However, it's important to note that duplicate data doesn't necessarily mean that two documents are identical; rather, they often share the same relevance or purpose within your business context.
You can certainly dedupe documents before indexing them into Solr/Fusion, but this is not always easy: you need to keep track of the dedupe criteria of every document somewhere, and that becomes harder as the number of documents grows. To help with this, Solr/Fusion provides a handy way to dedupe at index time.
Goal:
This article uses a simple approach to illustrate how deduplication can be configured, and analyzes how it works.
Consider the following example scenario: after the crawling process, multiple documents were found to be indexed for the same page with different ID URL values.
Example 1:
"id": "https://solr.apache.org/guide/solr/latest/query-guide/faceting",
"subject_t": [ "faceting" ],
"title_s": [ "Faceting :: Apache Solr Reference Guide "],
"id": "https://solr.apache.org/guide/solr/latest/query-guide/faceting/!ut/p/z0/04_Sj9CPykssy0xPLMnMz0vMAfIjo8ziDT1NDTwsnA0MPPwCXAzM_D0szI3MnYxNLIz0g1Pz9AuyHRUBXnyf3w!!/?st=&uri=nm%3Aoid%3AZ6_1I50H8C00HNPD06OH8727B3482",
"subject_t": [ "faceting" ],
"title_s": [ "Faceting :: Apache Solr Reference Guide "],
In the scenario described above, the ideal outcome is a single document in the index, because both documents represent the same page.
Environment:
Fusion Example: Any Fusion 4.x or 5.x environment
Solr Example: Any Solr 7.x, 8.x, or 9.x environment.
Guide:
Dedupe works by maintaining a signature for each document and designating one document per signature as the copy to keep. It ensures that exactly one document appears in the index for each signature.
The combination of the subject_t and title_s fields can be treated as the uniqueness indicator for a given document, so we can use the values of these fields to generate a unique signature for the document.
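To make the idea concrete, here is a minimal sketch of how a signature can be derived from the chosen fields. It is an illustration only: signatureOf is a hypothetical helper, the document IDs are made up, and MD5 from Node's crypto module stands in for the hash, so the values will not match what Solr or Fusion actually produce.

```javascript
const crypto = require("crypto");

// Hypothetical helper: hashes the values of the chosen fields, in order.
// MD5 is a stand-in here; it is not the hash Lookup3Signature uses.
function signatureOf(doc, fields) {
  const hash = crypto.createHash("md5");
  for (const field of fields) {
    // A field may hold a single string or an array of strings.
    const values = [].concat(doc[field] ?? []);
    for (const value of values) {
      hash.update(String(value), "utf8");
    }
  }
  return hash.digest("hex");
}

const fields = ["title_s", "subject_t"];

// Illustrative documents: same page content, different (made-up) IDs.
const docA = {
  id: "https://example.com/faceting",
  subject_t: ["faceting"],
  title_s: ["Faceting :: Apache Solr Reference Guide "],
};
const docB = {
  id: "https://example.com/faceting?session=abc",
  subject_t: ["faceting"],
  title_s: ["Faceting :: Apache Solr Reference Guide "],
};

// The id is not part of the signature, so both documents hash identically.
console.log(signatureOf(docA, fields) === signatureOf(docB, fields)); // true
```

Because the id field is left out of the signature, the two URL variants of the same page collapse onto one signature value.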
In Solr:
There are two places in Solr to configure de-duplication: in solrconfig.xml and in the schema file.
1) In solrconfig.xml, add/enable the updateRequestProcessorChain.
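<!-- The chain is named "dedupe" so it can be selected with update.chain=dedupe;
     default="true" also applies it to update requests that name no chain. -->
<updateRequestProcessorChain name="dedupe" default="true">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <!-- Field that stores the computed signature -->
    <str name="signatureField">dedupeSignature_s</str>
    <!-- Fields whose values are hashed into the signature -->
    <str name="fields">title_s,subject_t</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>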
2) Define a field in your schema to hold the fingerprint/signature. Example: dedupeSignature_s
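If the collection uses a managed schema, one way to define the field is through Solr's Schema API. The sketch below only builds the add-field command; buildAddFieldCommand is a hypothetical helper, and the choice of a single-valued, indexed and stored string field is an assumption for this example. Many configsets already include a *_s dynamic field, in which case no explicit definition may be needed.

```javascript
// Builds a Schema API "add-field" command for the signature field.
// The field settings are assumptions for this sketch: an indexed,
// stored, single-valued string field.
function buildAddFieldCommand(fieldName) {
  return {
    "add-field": {
      name: fieldName,
      type: "string",
      indexed: true,
      stored: true,
      multiValued: false,
    },
  };
}

const command = buildAddFieldCommand("dedupeSignature_s");

// To apply it, POST this JSON to <solr-base-url>/solr/<collection>/schema
console.log(JSON.stringify(command, null, 2));
```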
The SignatureUpdateProcessorFactory request processor calculates a signature from the combination of the title_s and subject_t fields and stores it in a new field called dedupeSignature_s.
Lookup3Signature is the class that implements the hashing algorithm used to generate the signature. You could use other classes, such as MD5Signature or TextProfileSignature.
Now you can index some data and you will see the generated signature field dedupeSignature_s with a value.
For Example 1:
"id": "https://solr.apache.org/guide/solr/latest/query-guide/faceting",
"subject_t": [ "faceting" ],
"title_s": [ "Faceting :: Apache Solr Reference Guide "],
"dedupeSignature_s": ["286004b0d7fd7de4"],
Note: the update.chain=dedupe request parameter enables the processor chain by its name, dedupe. Without it, the processors won't run unless the chain is declared as the default (default="true") in solrconfig.xml.
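As a sketch of how the parameter is passed, the snippet below builds an update URL that routes documents through the named chain. The host, port, and collection name are placeholders, and buildUpdateUrl is a hypothetical helper; adjust them to your environment.

```javascript
// Builds the update URL that sends documents through a named processor chain.
function buildUpdateUrl(solrBaseUrl, collection, chainName) {
  const params = new URLSearchParams({
    "update.chain": chainName,
    commit: "true",
  });
  return `${solrBaseUrl}/solr/${encodeURIComponent(collection)}/update?${params.toString()}`;
}

// Placeholder host and collection name, assumed for illustration.
const url = buildUpdateUrl("http://localhost:8983", "mycollection", "dedupe");
console.log(url);
// http://localhost:8983/solr/mycollection/update?update.chain=dedupe&commit=true
```

To index, POST a JSON array of documents to this URL with the Content-Type: application/json header.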
Now, if a new document arrives with the same values for 'subject_t' and 'title_s' but a different 'id' value, it will overwrite the previous one because the 'dedupeSignature_s' value is identical. In this scenario, 'dedupeSignature_s' keeps the same value as in the previous document, while the other, non-signature fields (including 'id') take the values from the new document.
Please refer to this document for additional details on parameters and deduplication configuration in Solr: https://solr.apache.org/guide/solr/latest/indexing-guide/de-duplication.html
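The overwrite behavior can be modeled with a toy in-memory index that keeps one document per signature value. This is a simplified illustration, not how Solr is implemented; DedupingIndex and the document IDs are hypothetical.

```javascript
// Toy model of an index that keeps one document per signature value.
class DedupingIndex {
  constructor(signatureField) {
    this.signatureField = signatureField;
    this.docsBySignature = new Map();
  }

  // A later document with the same signature replaces the earlier one.
  add(doc) {
    this.docsBySignature.set(doc[this.signatureField], doc);
  }

  all() {
    return [...this.docsBySignature.values()];
  }
}

const index = new DedupingIndex("dedupeSignature_s");

// Made-up IDs; both documents carry the same signature value.
index.add({ id: "https://example.com/faceting", dedupeSignature_s: "286004b0d7fd7de4" });
index.add({ id: "https://example.com/faceting?session=abc", dedupeSignature_s: "286004b0d7fd7de4" });

console.log(index.all().length); // 1
console.log(index.all()[0].id);  // https://example.com/faceting?session=abc
```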
In Fusion:
Follow the steps below (see the attached screenshots) to enable dedupe in Fusion.
- Select the collection and datasource in which you want to implement the dedupe.
- In the datasource, select the Advanced option ( Select_Advance_option.png ). As you scroll down the same page, two more options appear: dedupe and field mapping (use field mapping to change the default signature field, dedupeSignature_s, to a different field).
- Select the dedupe option and then add the field (title_t) as the dedupe field parameter and select the checkboxes as shown in the screenshot ( Configure_Dedupe.png ).
- Now save these changes and then clean the datasource and rerun the indexing job.
Fusion can be configured to deduplicate documents based on:
- The entire contents of the document
- The contents of a specified field (as shown in the screenshot above)
- Custom deduplication based on a document signature generated by a user-supplied JavaScript function, genSignature(), which returns a string.
Here is an example of a genSignature() function. This example finds duplicates based on the h2 fields in each document.
function genSignature(content) {
    var signature = "";
    if (content.hasField("h2")) {
        var values = content.getStrings("h2").toArray();
        values.sort();
        for each (var value in values) {
            signature += value;
        }
    }
    return signature.length > 0 ? signature : null;
}
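The same pattern could be adapted to this article's scenario by building the signature from the title_s and subject_t fields. The sketch below is a variant written in standard JavaScript so it can be tried outside Fusion; mockContent is a hypothetical stand-in for the document object Fusion passes in, modeling only the hasField() and getStrings() calls used above.

```javascript
// Hypothetical stand-in for the document object passed to genSignature().
function mockContent(fields) {
  return {
    hasField: function (name) {
      return Object.prototype.hasOwnProperty.call(fields, name);
    },
    getStrings: function (name) {
      return { toArray: function () { return fields[name].slice(); } };
    },
  };
}

// Variant of genSignature() that combines the title_s and subject_t values.
function genSignature(content) {
  var signature = "";
  var fieldNames = ["title_s", "subject_t"];
  for (var i = 0; i < fieldNames.length; i++) {
    if (content.hasField(fieldNames[i])) {
      var values = content.getStrings(fieldNames[i]).toArray();
      values.sort();
      for (var j = 0; j < values.length; j++) {
        signature += values[j];
      }
    }
  }
  // Returning null signals that no signature could be built.
  return signature.length > 0 ? signature : null;
}

var page = mockContent({
  title_s: ["Faceting :: Apache Solr Reference Guide "],
  subject_t: ["faceting"],
});
console.log(genSignature(page));
// Faceting :: Apache Solr Reference Guide faceting
```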
Please refer to the documents below for additional details on parameters and deduplication configuration in Fusion:
- https://doc.lucidworks.com/fusion-connectors/5.5/57/lucid-anda-connector-framework?q=Dedupe#Lucid.andaConnectorFramework-DedupeConfigurationProperties
- https://doc.lucidworks.com/how-to/754/deduplicate-web-content-using-canonical-tags?q=dedupe
- https://doc.lucidworks.com/how-to/913/extract-content-from-web-pages?q=Dedupe