Issue
- The V2 connector does not stop when the indexing pods are scaled.
- The V2 connector does not index when the connector plugin pods are scaled.
Environment
Fusion 5.9.0, 5.9.2, 5.9.4 and later
Diagnosis
A quick look at V2 connector behaviour.
There are three components in relation to this infrastructure:
- Connector plugin
  - Each connector type has its own pod, created from the zip file of the plugin.
  - Handles crawling the data from the source.
- Indexing pod (tied to Solr)
  - As documents are emitted by the connector plugin, they are sent to the indexing pod.
  - It runs the index pipeline, usually ending with the “Solr Indexer” stage, which sends the result to Solr.
- Connectors-backend (the connectors pod)
  - Handles communication between these components.
  - Drives the activity seen in the UI: starting the job, tracking job progress, and, importantly, terminating the job if any issues arise.
These are all functionally separate components communicating with each other. The important point is that the connectors-sdk, and therefore the TestConnector, is only part of the connector plugin; the rest of the chain involves the other components.
It is structured this way for a number of reasons:
- Originally, all connectors were hosted in one pod, the classic-rest-service. These are now the V1 connectors.
- Each connector having its own pod allows for clearer logging, per-connector resourcing, and more configuration options.
- It also separates indexing from the connector plugin's data crawl. This matters because datasource jobs can use different index pipelines, and the separation makes it easier to see which section of the crawl is failing: if there are failures in the indexing pod, it is clear that the index pipelines or Solr have issues.
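Because the components run as separate pods, a quick way to localize a failure is to check each pod's logs individually. A minimal sketch with kubectl; the deployment names (modeled on the pod names in the logs below) and the `fusion` namespace are assumptions, so adjust them to your install:

```shell
# Crawl side: the per-connector plugin pod (e.g. the Web V2 connector).
kubectl logs deployment/connector-plugin-web-v2 -n fusion --tail=200

# Index pipeline / Solr side: the indexing pod.
kubectl logs deployment/fusion-indexing -n fusion --tail=200

# Job coordination: the connectors-backend pod.
kubectl logs deployment/connectors-backend -n fusion --tail=200
```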
An important note: as mentioned, the connectors-backend coordinates the effort. It tells the plugin to start the data crawl, then coordinates with the indexing pod to handle indexing to Solr. The backend is what is called when you click Start on the datasource job.
Because the connectors-backend handles the job run, and the crawl is split across different components, the key question is: what constitutes document-emission completion?
- The plugin can retrieve all the documents, but indexing then fails.
- The index pipeline can process all the documents, but Solr has failures.
- There can be failures in Kafka itself, the messaging service handling all the communication between components.
To handle this, the backend runs heartbeat checks against the plugin pod and the indexing pod. Eventually the plugin pod sends a crawl-finished status, which establishes the final count from the plugin side. The backend then checks with the indexing pod to see whether the counts match up (deletes are also counted); essentially, all documents should be processed. Once the counts match, the backend marks the job as done.
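This completion decision is visible in the backend's logs as the `completionIndicators` map (see the backend log excerpts below); every flag in the map must be `true` before the job is marked done. One way to watch it live, assuming the backend deployment is named `connectors-backend` in a `fusion` namespace (both names are assumptions; adjust to your install):

```shell
# Watch the completion indicators for a running datasource job.
# Each log line shows flags like no-pending-fetch-calls and
# no-pending-doc-results; the job finishes only when all are true.
kubectl logs deployment/connectors-backend -n fusion --since=10m \
  | grep "completionIndicators"
```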
Reference logs and screenshots
connector-plugin deployment YAML
Install the Web V2 connector and add a start link.
Here you can see documents being indexed into the collection.
After scaling the replica count in the deployment.yaml file and updating the plugin instances to 3 in the fetch settings, no documents are indexed into the collection. Logs are below.
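The scaling step above can also be done with kubectl instead of editing deployment.yaml by hand. The deployment name here is an assumption based on the pod names in the logs below:

```shell
# Scale the Web V2 connector plugin deployment to 3 replicas and verify.
kubectl scale deployment/connector-plugin-web-v2 --replicas=3 -n fusion
kubectl get pods -n fusion | grep connector-plugin-web-v2
```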
Success: behaviour with the default setting (replica count 1)
Connector Plugin pod logs
2024-09-20T13:40:46.903Z - INFO [testing123-fetcher-0:crawler.Crawler@427] - fetch(): [ https://www.wikipedia.org/ ] reqTime=336ms
Connector backend logs
[lw.init-in-content-testing123-0:com.lucidworks.connectors.service.components.job.processor.DefaultPhaseRunner@215] - Initial data loader finished successfully
2024-09-23T09:18:27.360Z - INFO [DefaultDataProcessor:com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor@663] - phaseRunner.getUnprocessedCount() -> 0
2024-09-23T09:18:27.368Z - INFO [kafka-producer-network-thread | fetch-inputs.testing123-fetch-input-producer-7@support-shared-594-connectors-backend-5f7bc66c98-bl952:org.apache.kafka.clients.Metadata@402] - [Producer clientId=fetch-inputs.testing123-fetch-input-producer-7@support-shared-594-connectors-backend-5f7bc66c98-bl952] Resetting the last seen epoch of partition fusion.connectors.fetch-inputs.testing123-1 to 0 since the associated topicId changed from null to VJvX7LCySNytVwQuFMxA0Q
2024-09-23T09:18:27.388Z - WARN [output-processor-testing123-43:com.lucidworks.connectors.service.components.job.processor.IncomingItemProcessor@351] - Received item with an empty result Id, ignoring. input.id=, result.id=
2024-09-23T09:18:27.861Z - INFO [lw.in-content-testing123-1:com.lucidworks.connectors.service.components.job.processor.DefaultPhaseRunner@201] - cache information: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
2024-09-23T09:18:27.948Z - INFO [lw.out-content-testing123-0:com.lucidworks.connectors.service.components.job.processor.DefaultPhaseRunner@439] - Dequeued batch of 1 emittable items in 65086 ms. Pending count=0
2024-09-23T09:18:27.955Z - INFO [kafka-producer-network-thread | fusion.connectors.datasource-testing123-producer-7@support-shared-594-connectors-backend-5f7bc66c98-bl952:org.apache.kafka.clients.Metadata@402] - [Producer clientId=fusion.connectors.datasource-testing123-producer-7@support-shared-594-connectors-backend-5f7bc66c98-bl952] Resetting the last seen epoch of partition fusion.connectors.datasource-testing123-0 to 0 since the associated topicId changed from null to q4c2IGd4RgKV-AIU54-x_Q
2024-09-23T09:18:28.528Z - INFO [lw.processing-monitor.testing123:com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor@272] - lastReceiveTimeMS=577, documentReceivedCount=1, deleteReceivedCount=0, pendingFetchCalls=1, indexResultsCount=0, submittedItemCounter=4, emitCount=4, completionIndicators={item-processor-not-busy=false, phase-runner-not-busy=true, no-pending-items=true, no-pending-doc-results=false, fetching-started=true, no-pending-fetch-calls=false}
Success: Kafka logs
[2024-10-01 08:40:30,956] INFO [GroupCoordinator 2]: Dynamic member with unknown member id joins group fetch-inputs.testing123-consumer in Empty state. Created a new member id fetch-inputs.testing123-fetch-input-consumer-7@support-shared-594-connector-plugin-web-v2-7866cf9896-txzwc-3dd39cae-4b2e-4b70-9a46-797c1f4bbe83 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2024-10-01 08:40:30,958] INFO [GroupCoordinator 2]: Preparing to rebalance group fetch-inputs.testing123-consumer in state PreparingRebalance with old generation 2 (__consumer_offsets-45) (reason: Adding new member fetch-inputs.testing123-fetch-input-consumer-7@support-shared-594-connector-plugin-web-v2-7866cf9896-txzwc-3dd39cae-4b2e-4b70-9a46-797c1f4bbe83 with group instance id None; client reason: not provided) (kafka.coordinator.group.GroupCoordinator)
[2024-10-01 08:40:33,959] INFO [GroupCoordinator 2]: Stabilized group fetch-inputs.testing123-consumer generation 3 (__consumer_offsets-45) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2024-10-01 08:40:34,250] INFO [GroupCoordinator 2]: Assignment received from leader fetch-inputs.testing123-fetch-input-consumer-7@support-shared-594-connector-plugin-web-v2-7866cf9896-txzwc-3dd39cae-4b2e-4b70-9a46-797c1f4bbe83 for group fetch-inputs.testing123-consumer for generation 3. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2024-10-01 08:41:30,399] INFO [UnifiedLog partition=__consumer_offsets-9, dir=/bitnami/kafka/data] Deleting segment files LogSegment(baseOffset=0, size=1247, lastModifiedTime=1726260658227, largestRecordTimestamp=1726260658228) (kafka.log.LocalLog$)
Success: indexing pod logs
2024-10-01T08:41:31.782Z - INFO [pool-23-thread-1:com.lucidworks.apollo.pipeline.index.stages.LoggingIndexStage@41] - message: PipelineDocument [id=https://www.wikipedia.org/, fields={_lw_container_id_s=[PipelineField [name=_lw_container_id_s
Failure: after scaling the Web V2 plugin pods to 3, one of the pods hits a fetch failure, captured below.
Connector plugin pod
[pool-4-thread-1:com.lucidworks.connectors.client.pubsub.core.DefaultClientPubSub@56] - Received stop request
2024-09-20T13:30:08.390Z - INFO [pool-14-thread-1:crawler.control.impl.local.LocalCrawlController@177] - abort() called
2024-09-20T13:30:08.390Z - INFO [pool-14-thread-1:crawler.control.impl.local.LocalCrawlController@182] - ABORTING...
2024-09-20T13:30:08.395Z - ERROR [testing123-fatal-error-handler:crawler.control.impl.local.LocalCrawlController$1@229] - FATAL ERROR
crawler.control.CrawlControllerException: Aborted
2024-09-20T13:30:08.414Z - INFO [testing123-fetcher-0:crawler.Crawler$FetchCallable@531] - fetch() failure:
id: https://www.wikipedia.org/
parentID: null
batchID: 0
depth: 0
fetchedDate: Fri Sep 20 13:30:08 GMT 2024
lastModified: Thu Jan 01 00:00:00 GMT 1970
contentSignature: null
signature: 0
linked: true
discarded: false
errorCount: 1
error:
crawler.common.CrawlItemException: phase=FETCH; This item failed during fetch()
Caused by: java.lang.InterruptedException
2024-09-20T13:30:08.419Z - INFO [pool-14-thread-1:com.lucidworks.connectors.models.JobStatus@283] - end job id: AAozR1Tl08 took 00:00:00.669 counters: state: ABORTED
Connector backend pod logs
2024-09-20T14:44:02.765Z - INFO [lw.init-in-content-testing123-0:com.lucidworks.connectors.service.components.job.processor.DefaultPhaseRunner@215] - Initial data loader finished successfully
2024-09-20T14:44:02.767Z - INFO [DefaultDataProcessor:com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor@663] - phaseRunner.getUnprocessedCount() -> 0
2024-09-20T14:44:02.774Z - INFO [kafka-producer-network-thread | fetch-inputs.testing123-fetch-input-producer-7@support-shared-594-connectors-backend-5f7bc66c98-bl952:org.apache.kafka.clients.Metadata@402] - [Producer clientId=fetch-inputs.testing123-fetch-input-producer-7@support-shared-594-connectors-backend-5f7bc66c98-bl952] Resetting the last seen epoch of partition fusion.connectors.fetch-inputs.testing123-0 to 0 since the associated topicId changed from null to VJvX7LCySNytVwQuFMxA0Q
2024-09-20T14:44:02.841Z - WARN [output-processor-testing123-40:com.lucidworks.connectors.service.components.job.processor.IncomingItemProcessor@351] - Received item with an empty result Id, ignoring. input.id=, result.id=
2024-09-20T14:44:03.346Z - INFO [lw.in-content-testing123-1:com.lucidworks.connectors.service.components.job.processor.DefaultPhaseRunner@201] - cache information: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
2024-09-20T14:44:03.433Z - INFO [lw.out-content-testing123-0:com.lucidworks.connectors.service.components.job.processor.DefaultPhaseRunner@439] - Dequeued batch of 1 emittable items in 65083 ms. Pending count=0
2024-09-20T14:44:04.055Z - INFO [lw.processing-monitor.testing123:com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor@272] - lastReceiveTimeMS=606, documentReceivedCount=0, deleteReceivedCount=0, pendingFetchCalls=0, indexResultsCount=0, submittedItemCounter=4, emitCount=4, completionIndicators={item-processor-not-busy=true, phase-runner-not-busy=false, no-pending-items=true, no-pending-doc-results=true, fetching-started=true, no-pending-fetch-calls=true}
2024-09-20T14:44:04.785Z - INFO [DefaultDataProcessor:com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor@387] - Done fetching and indexing, stats: {item-processor-not-busy=true, phase-runner-not-busy=true, no-pending-items=true, no-pending-doc-results=true, fetching-started=true, no-pending-fetch-calls=true}
2024-09-20T14:44:04.788Z - WARN [DefaultDataProcessor:com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor@679] - Job is stopping!
Kafka pod
[2024-10-01 08:54:05,654] INFO [Controller id=2] Processing automatic preferred replica leader election (kafka.controller.KafkaController)
[2024-10-01 08:54:05,654] TRACE [Controller id=2] Checking need to trigger auto leader balancing (kafka.controller.KafkaController)
[2024-10-01 08:54:05,655] DEBUG [Controller id=2] Topics not in preferred replica for broker 1 Map(fusion.system.crud_events-1 -> Vector(1, 2)) (kafka.controller.KafkaController)
[2024-10-01 08:54:05,655] TRACE [Controller id=2] Leader imbalance ratio for broker 1 is 0.015625 (kafka.controller.KafkaController)
[2024-10-01 08:54:05,655] DEBUG [Controller id=2] Topics not in preferred replica for broker 0 Map(fusion.system.crud_events-2 -> Vector(0, 1), fusion.system.crud_events_notification-3 -> Vector(0, 1), fusion.system.crud_events_notification-0 -> Vector(0, 2)) (kafka.controller.KafkaController)
[2024-10-01 08:54:05,655] TRACE [Controller id=2] Leader imbalance ratio for broker 0 is 0.05084745762711865 (kafka.controller.KafkaController)
[2024-10-01 08:54:05,655] DEBUG [Controller id=2] Topics not in preferred replica for broker 2 Map(fusion.system.crud_events_notification-1 -> Vector(2, 1), fusion.system.crud_events-3 -> Vector(2, 1)) (kafka.controller.KafkaController)
[2024-10-01 08:54:05,655] TRACE [Controller id=2] Leader imbalance ratio for broker 2 is 0.03508771929824561 (kafka.controller.KafkaController)
[2024-10-01 08:56:05,620] INFO [GroupCoordinator 2]: Dynamic member with unknown member id joins group server-requests.lucidworks.web-v2-consumer in Stable state. Created a new member id server-requests.lucidworks.web-v2-requests-consumer-7@support-shared-594-connector-plugin-web-v2-7866cf9896-qc4qh-720a3592-fa45-4c2b-a8ba-ea65ea24f7ed and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2024-10-01 08:56:08,567] INFO [GroupCoordinator 2]: Preparing to rebalance group server-requests.lucidworks.web-v2-consumer in state PreparingRebalance with old generation 18 (__consumer_offsets-9) (reason: Adding new member server-requests.lucidworks.web-v2-requests-consumer-7@support-shared-594-connector-plugin-web-v2-7866cf9896-jp7fc-b7c4a6e7-6b3f-44c0-9f0c-7bb4b7e35170 with group instance id None; client reason: not provided) (kafka.coordinator.group.GroupCoordinator)
[2024-10-01 08:56:09,715] INFO [GroupCoordinator 2]: Stabilized group server-requests.lucidworks.web-v2-consumer generation 19 (__consumer_offsets-9) with 3 members (kafka.coordinator.group.GroupCoordinator)
[2024-10-01 08:56:09,717] INFO [GroupCoordinator 2]: Assignment received from leader server-requests.lucidworks.web-v2-requests-consumer-7@support-shared-594-connector-plugin-web-v2-7866cf9896-txzwc-1e647b7d-a94d-450a-87de-0a9b17a2d0ad for group server-requests.lucidworks.web-v2-consumer for generation 19. The group has 3 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
No logs appear in the indexing pod or the kafka-0 pod.
Resolution
1. V2 connector not stopping when the indexing pods are scaled
When the indexing pods are scaled while a job is running, the connector halts or gets stuck in the Running state after document emission (and indexing) has completed.
Note: this patch can also be applied when the indexing pods are not being scaled but the connector still does not stop because of a runtime exception in an index pipeline stage. In that case, first try increasing the indexing activity timeout; if that does not work, apply these patches.
Patch Provided : Yes
Affected Fusion Version : 5.9.2
Patch Details:
There are two patches, one for connectors-backend and one for fusion-indexing. These are the images:
gcr.io/lw-support-team/fusion-indexing:sust-773-port-fus-5698-f46e69a2
gcr.io/lw-support-team/connectors-backend:sust-773-port-fus-5698-1d227196
- Edit the relevant deployment for the service (connectors-backend or fusion-indexing).
- Replace the image with the patch image.
- Save and wait for the pods to restart.
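The steps above can be scripted with kubectl. The container names passed to `kubectl set image` and the `fusion` namespace are assumptions; confirm the actual container names first with `kubectl get deployment <name> -o jsonpath='{.spec.template.spec.containers[*].name}'`:

```shell
# Swap in the patch images and wait for the rollouts to complete.
kubectl set image deployment/connectors-backend \
  connectors-backend=gcr.io/lw-support-team/connectors-backend:sust-773-port-fus-5698-1d227196 -n fusion
kubectl set image deployment/fusion-indexing \
  fusion-indexing=gcr.io/lw-support-team/fusion-indexing:sust-773-port-fus-5698-f46e69a2 -n fusion
kubectl rollout status deployment/connectors-backend -n fusion
kubectl rollout status deployment/fusion-indexing -n fusion
```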
2. V2 connector not indexing when the plugin pods are scaled
When the Web V2 connector plugin pods are scaled up, indexing stops happening, while it works fine when the plugin pods are kept at the default of one.
Patch Provided: No. This is considered a bug that the Engineering team still needs to address.
Affected Fusion Versions: 5.9.0, 5.9.4, and later/earlier versions (not yet confirmed)
Steps taken to replicate the issue with the Web V2 connector (scaling the plugin pods)
Default settings with replica Count : 1