Understanding the Error:
When a web crawling job fails, it's essential to understand the error message. Here's an example of such an error:
INFO [XLOS_pwccom-fetcher-4:crawler.Crawler$FetchCallable@453] - fetch() failure:
id: https://www.LWDOCS.com/us/en/tech-effect/cloud/cloud-business-survey/consumer-markets.html
parentID: https://www.LWDOCS.com/us/en/industries/consumer-markets/library/consumer-markets-trends.html
batchID: 2724
depth: 3
fetchedDate: Mon Sep 25 13:43:01 UTC 2023
lastModified: Thu Jan 01 00:00:00 UTC 1970
contentSignature: null
signature: 0
linked: true
discarded: false
errorCount: 8
error:
crawler.common.CrawlItemException: phase=FETCH; This item failed during fetch()
at crawler.Crawler$FetchCallable.callWithContext(Crawler.java:452)
at crawler.Crawler$FetchCallable.callWithContext(Crawler.java:418)
at com.lucidworks.connectors.logging.ContextAwareCallable.call(ContextAwareCallable.java:22)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Non-OK HTTP status: 403
at crawler.fetch.impl.http.WebFetcher.fetchWithRedirects(WebFetcher.java:491)
at crawler.fetch.impl.http.WebFetcher.fetch(WebFetcher.java:457)
at crawler.Crawler$FetchCallable.callWithContext(Crawler.java:444)
... 8 more
In this case, the root cause is the line "Caused by: java.io.IOException: Non-OK HTTP status: 403": the crawler attempted to fetch a resource, but the web server responded with HTTP 403 (Forbidden), denying access to the page.
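Before changing anything in Fusion, it can help to reproduce the request outside the crawler. Below is a minimal check with curl, using the failing URL from the error above; the User-Agent value is an assumption, so substitute whatever your connector actually sends:

# Print only the final HTTP status; -L follows redirects the way the crawler does
curl -s -o /dev/null -w "%{http_code}\n" -L \
  -A "Mozilla/5.0 (compatible; FusionCrawler)" \
  "https://www.LWDOCS.com/us/en/tech-effect/cloud/cloud-business-survey/consumer-markets.html"

If curl returns 200 while the crawler sees 403, the server is most likely filtering on User-Agent or IP address rather than rejecting all requests.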
Troubleshooting Steps:
Here are steps to troubleshoot and address this issue:
- Verify Web Connector Configurations: Ensure that your Web connector datasource is set up correctly; a misconfigured start link, authentication setting, or crawl scope can all lead to access failures. One way to review the configuration is shown below.
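As a sketch, you can pull the datasource definition over the Fusion REST API and inspect it. The path below reflects the standard Connectors Datasources endpoint; FUSION_HOST, the port, the credentials, and the web-datasource ID are all placeholders for your environment:

# Fetch the datasource configuration for review (placeholder host, credentials, and ID)
curl -u admin:password123 "https://FUSION_HOST:6764/api/connectors/datasources/web-datasource"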
- Adjust Limits and Configurations: Review and adjust crawl limits (fetch threads, request delays, crawl depth) as needed; overly aggressive settings can trigger rate limiting or outright blocking on the target site, and some servers return 403 to unfamiliar user agents. An illustrative sketch follows this item.
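The following is an illustrative sketch only, with hypothetical property names; verify the actual schema for your Web connector version in the Fusion UI or API before applying anything:

# Hypothetical property names for illustration; check your connector's real schema
properties:
  startLinks:
    - "https://www.LWDOCS.com/us/en/"
  crawlDepth: 3        # matches the "depth: 3" seen in the error above
  fetchThreads: 2      # fewer concurrent fetches reduces the chance of rate limiting
  fetchDelayMs: 1000   # pause between requests to the same host
  userAgent: "Mozilla/5.0 (compatible; FusionCrawler)"  # some servers 403 unknown agents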
- Check Classic Rest Services YAML Values: The Web connector relies on the Classic Rest Service. Ensure the YAML values for the Classic Rest Service are correctly configured; if they are still at their defaults, you may need to set resource values like the following for the classic-rest-service container using the Kubernetes commands below.
Containers:
  classic-rest-service:
    Container ID:  containerd://1a424124c1160a98dbbf587839911458533962f2b7651c078f9ae06aa5e55381
    Image:         lucidworks/classic-rest-service:5.9.1
    Image ID:      docker.io/lucidworks/classic-rest-service@sha256:1f9976ec7b76e07ae6e35b69b2095974e725aa419201792db072ac5cb02d01f3
    Port:          9000/TCP
    Host Port:     0/TCP
    State:         Running
      Started:     Tue, 17 Oct 2023 16:55:06 +0530
    Ready:         False
    Restart Count: 0
    Limits:
      cpu:     1200m
      memory:  6Gi
    Requests:
      cpu:     600m
      memory:  4Gi
You can add the above values using the following Kubernetes command:
kubectl edit sts <classic-rest-statefulset-name> -n <your-namespace>
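Inside the editor, the values belong under the container spec. Here is a minimal sketch of the relevant section, using the container name and values from the describe output above:

spec:
  template:
    spec:
      containers:
        - name: classic-rest-service
          resources:
            requests:
              cpu: 600m      # baseline CPU reserved for the container
              memory: 4Gi    # baseline memory reserved for the container
            limits:
              cpu: 1200m     # hard ceiling before CPU throttling
              memory: 6Gi    # exceeding this triggers an OOM kill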
Alternatively, you can apply the same settings through the GCP or AWS console UI.
- Restart Classic Rest Pod: After you update the Classic Rest Service values, the associated pod restarts. Confirm the rollout completes and the pod comes back up successfully, as shown below.
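Assuming the same placeholder names as above, you can watch the rollout and confirm the new pod reaches the Ready state:

# Wait for the StatefulSet to finish rolling the pod
kubectl rollout status sts <classic-rest-statefulset-name> -n <your-namespace>

# Confirm the classic-rest-service pod is Running and Ready
kubectl get pods -n <your-namespace> | grep classic-rest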
- Verification and Testing: Once the pod has restarted, verify the changes took effect and re-run your web crawling job to confirm it is functioning as expected. A quick check is shown below.
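To confirm the new resource values actually took effect on the running pod before re-running the crawl (the pod name is a placeholder):

# Show the applied requests/limits on the running pod
kubectl describe pod <classic-rest-pod-name> -n <your-namespace> | grep -A 2 -E "Limits|Requests"

Then re-run the web crawl datasource and check the job logs for any remaining 403 entries; the curl check from earlier is also a quick way to spot-check individual URLs.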
By following these troubleshooting steps, you can resolve intermittent web crawling failures in Fusion 5 and above, particularly when encountering a 403 HTTP status code.