Introduction
The JDBC connector fetches documents from a relational database via SQL queries. Under the hood, this connector implements the Solr DataImportHandler (DIH) plugin. This document focuses mainly on Fusion 5.9.0, but the guidance applies to Fusion 5.4.0 and later.
Proper configuration of the JDBC datasource is crucial to avoid errors during its execution.
Key considerations for configuring the JDBC connector include:
- Using the appropriate JDBC driver.
- Ensuring correct column names.
- Applying the appropriate date format.
- Providing the correct JDBC request/connection string (see the examples after this list).
- Managing database timeouts.
- Allocating resources effectively.
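As an illustration of the connection-string point above, these are the standard JDBC URL formats for a few common databases (the hostnames, ports, and database/service names are placeholders):
MySQL - jdbc:mysql://dbhost:3306/mydb
Oracle (thin driver) - jdbc:oracle:thin:@//dbhost:1521/servicename
Microsoft SQL Server - jdbc:sqlserver://dbhost:1433;databaseName=mydb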
Therefore, the configuration of the JDBC connector plays a significant role in its performance.
Additional documentation worth consulting when configuring and troubleshooting JDBC V2 connector datasource jobs:
JDBC-Sql-V2-connector
JDBC-V2-connector-configuration-reference
Troubleshoot-a-jdbc-datasource
This document focuses on common errors and troubleshooting scenarios that can cause a JDBC datasource job to fail, complementing the comprehensive documentation above.
Scenario 1: JDBC V2 - non-Oracle crawl throws errors
"com.lucidworks.connector.plugins.jdbc.exception.JdbcRuntimeException:
Cannot resolve Oracle CLOB data."
Logs from the connector-backend and JDBC connector plugin pods show the following error:
ERROR [fetch-input-receiver.opentext_gallery-95:com.lucidworks.connector.plugins.jdbc.fetcher.processor.PageProcessor@76] -
Error while iterating results com.lucidworks.connector.plugins.jdbc.exception.JdbcRuntimeException:
Cannot resolve Oracle CLOB data.
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.internal.OracleClob
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
Explanation:
The JDBC V2 connector (2.3.0) introduced support for CLOB objects, but only for Oracle CLOB objects; against other databases, the plugin fails with the ClassNotFoundException shown above. Connector version 2.4.0 extends CLOB support to non-Oracle databases, such as IBM CLOB objects, so upgrading the connector resolves this error.
(Reference: https://doc.lucidworks.com/fusion-connectors/5.8/4/release-notes)
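If upgrading the connector is not immediately possible, one possible workaround (not from the release notes; the table and column names below are hypothetical) is to cast CLOB columns to plain character data in the SELECT so the driver never returns a CLOB object:
-- IBM DB2 example: cast the CLOB column to VARCHAR so the driver returns a plain string
-- (truncates content beyond 32000 characters)
SELECT id, title, CAST(body AS VARCHAR(32000)) AS body FROM documents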
Scenario 2: Timeout errors with some remote JDBC v2 datasources
The JDBC V2 connector successfully indexes millions of documents but encounters timeout errors toward completion. Fusion displays error messages like:
errorMessage: "The following components failed: [class com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor : Job terminated due to no plugin activity within 600 seconds]"
errorMessage: "The following components failed: [class com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor : Job terminated due to no plugin activity within 5000 seconds]"
"cannot obtain status from the job backend in 60000 ms"
When these errors occur, review the logs of the following pods:
- Connector
- Connector-backend
- Connector plugin
- Kafka
Kafka:
In the Kafka logs, look specifically for occurrences of "java.net.UnknownHostException". If found, restart each Kafka pod before initiating the subsequent crawl.
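A minimal sketch of restarting the Kafka pods with kubectl; the pod and namespace names are placeholders, and this assumes Kafka runs as a StatefulSet so deleted pods are recreated automatically:
# list the Kafka pods in the Fusion namespace
kubectl get pods -n <fusion-namespace> | grep kafka
# delete each Kafka pod; the StatefulSet controller recreates it
kubectl delete pod <kafka-pod-name> -n <fusion-namespace>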
Verifying the resource limits:
Ensure that the pods listed above have the required resource configurations. Start by checking the describe output of each involved pod.
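For example, with kubectl (pod and namespace names are placeholders):
# show resource settings and the most recent termination state for a pod
kubectl describe pod <pod-name> -n <fusion-namespace>
# a container killed for exceeding its memory limit reports
# Last State: Terminated, Reason: OOMKilled in the describe output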
If any errors like OOMKilled occur, begin by adjusting the resources. Take the following steps:
- Increase resource limits and requests.
- Check the required CPU/memory and adjust accordingly.
Example resource configuration per pod type (a Helm values sketch follows the table):
Pod (replicas) | JVM heap | CPU limit | Memory limit | CPU request | Memory request
fusion-indexing (4 pods) | -Xms8g -Xmx8g | 4 | 16G | 4 | 12G
connectors-backend (2 pods) | -Xms12g -Xmx12g | 6 | 24G | 4 | 12G
connectors (2 pods) | -Xms6g -Xmx6g | 4 | 24G | 2 | 12G
connector-plugin (2 pods) | -Xms6g -Xmx6g | 10 | 15G | 6 | 4G
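As a sketch of how one of these entries might be expressed in a Helm values override; the top-level key and the JVM-options key name depend on your Fusion chart version, so treat both as assumptions to verify before applying:
# hypothetical values override for the connector-plugin pods
connector-plugin:
  javaToolOptions: "-Xms6g -Xmx6g"   # key name is an assumption; check your chart's values
  resources:
    limits:
      cpu: 10
      memory: 15G
    requests:
      cpu: 6
      memory: 4G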
Verifying the datasource configuration:
The following datasource settings helped in this scenario (a configuration sketch follows this list):
- Reduce the connection pool size (Max connections) to 10.
- Reduce the batch size to 100.
- Increase the indexing timeout to 300000s.
- Under Core Properties (Advanced), set Fetch threads to 5.
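As a rough sketch of where these settings live in the datasource configuration; the property names below are hypothetical, so confirm the actual keys in the JDBC-V2-connector-configuration-reference linked above:
maxConnections: 10        # hypothetical key - connection pool size (Max connections)
batchSize: 100            # hypothetical key - rows fetched per batch
indexingTimeout: 300000   # hypothetical key - indexing timeout
fetchThreads: 5           # hypothetical key - Core Properties (Advanced) fetch threads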
After making the changes, clear the Job state collection for the datasource before rerunning:
1. Go to Collections Manager → Collection name.
2. Find the job state collection for the datasource: <datasource_name>_job_state.
3. Click on Settings → Clear collection.
If the crawl still fails, clear the job_state collection again and rerun the crawl without clearing the datasource.
Note that only the job_state collection should be cleared; this removes the crawldb data for that run. Clearing the datasource itself, by contrast, triggers a fresh crawl.
Scenario 3: Plugin can not handle the request
INTERNAL: Plugin can not handle the request
Timeout on blocking read for 60000000000 NANOSECONDS}]
(60000000000 nanoseconds is 60 seconds, i.e., the plugin's blocking read timed out after one minute.)
Verifying the query:
The SQL field is defined as "a SQL SELECT statement to choose the records to be retrieved. For paginated queries, use the special variables ${limit} and ${offset}."
For detailed instructions, please refer to the jdbc-sql-V2-connector documentation. For Oracle queries, the syntax is explained in the SQL Beginner's Guide (an Oracle pagination sketch also follows the examples below).
When dealing with a large dataset, such as approximately 4 million records, it becomes necessary to divide the task into 3-4 smaller jobs. For instance, breaking the query down into chunks of 100,000 records keeps each run small enough to complete reliably (see the sketch below).
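A minimal sketch of such chunking, using MySQL syntax; the table name, key column, and range boundaries are hypothetical, and each datasource job would cover a different key range:
-- Job 1 of several: crawl only rows whose ids fall in the first range,
-- paginated within the range via the connector's ${limit}/${offset} variables
SELECT * FROM test_table
WHERE id BETWEEN 1 AND 100000
ORDER BY id
LIMIT ${limit} OFFSET ${offset}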
Please note that pagination will not function correctly if the query lacks the ${limit} and ${offset} variables described in the field definition above; the specific syntax is driver dependent.
Examples:
MySQL - SELECT * FROM test_table LIMIT ${limit} OFFSET ${offset}
Microsoft SQL Server - SELECT * FROM test_table ORDER BY primary_key OFFSET ${offset} ROWS FETCH NEXT ${limit} ROWS ONLY
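Since the examples above cover MySQL and SQL Server but not Oracle, here is a hedged Oracle sketch, assuming Oracle 12c or later (earlier versions need a ROWNUM subquery instead); the table and key names are placeholders:
-- Oracle 12c+ row-limiting clause; requires a deterministic ORDER BY
SELECT * FROM test_table
ORDER BY primary_key
OFFSET ${offset} ROWS FETCH NEXT ${limit} ROWS ONLY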