Issue:
What are the best practices to follow to ensure the stability of an environment as a Fusion implementation gets ready to go live?
Environment:
Fusion 4.x
Resolution:
- Ensure that all testing has been completed on the staging environment.
  - Run ALL representative uncached queries and check for any exceptions thrown.
  - Run several load tests approximating the MAXIMUM load you expect your production environment to incur (a minimal sketch follows this list).
  - See the following links for more information:
    - https://lucidworks.com/post/solr-sizing-guide-estimating-solr-sizing-hardware/
    - https://support.lucidworks.com/hc/en-us/articles/360052117234-Solr-and-Fusion-Query-Load-Testing
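For a quick smoke test before a full load-testing tool is brought in, a loop like the following can replay a file of representative queries against Solr. The host, collection name, and `queries.txt` file are placeholders; this is only a minimal sketch under those assumptions, not a substitute for the load-testing guidance linked above.

```sh
#!/usr/bin/env bash
# Minimal query-replay sketch; host, collection, and queries.txt are placeholders.
# Use a dedicated load-testing tool (e.g. JMeter or Gatling) for production sign-off.
SOLR_URL="http://localhost:8983/solr/YOUR_COLLECTION/select"

while IFS= read -r q; do
  # Fire each query and record the HTTP status and total response time.
  curl -s -o /dev/null -G --data-urlencode "q=${q}" \
    -w "q=${q} status=%{http_code} time=%{time_total}s\n" "${SOLR_URL}"
done < queries.txt
```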
- Ensure your ulimits are appropriately set.
  - This check should encompass ALL machines, virtual or otherwise, that will be used in the production environment. Do not assume that because one machine in your environment is set correctly, all instances will be identical.
  - Specifically, pay attention to (see the check below):
    - max memory (unlimited)
    - virtual memory (unlimited)
    - max user processes (unlimited or 64k+)
    - open files (unlimited or 32k+)
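One way to verify these limits is to check them as the user that actually runs Fusion/Solr on each host, and to persist them in /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/). The "fusion" user name and the values below are illustrative; match them to the targets listed above.

```sh
# Check the effective limits for the service account that runs Fusion/Solr
# ("fusion" is a placeholder user name).
sudo -u fusion bash -c 'ulimit -a'

# Illustrative persistent entries in /etc/security/limits.conf
# (or a file under /etc/security/limits.d/):
#   fusion  soft  nofile  65536
#   fusion  hard  nofile  65536
#   fusion  soft  nproc   65536
#   fusion  hard  nproc   65536
# Max memory and virtual memory are usually left at their default of unlimited.
```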
- Ensure your GC settings and JVM parameters are appropriately tuned for your cluster.
  - Consider using a GC viewer to observe GC performance over extended periods under a heavy load. Observing the GC against minimal or normal loads may not provide an accurate snapshot.
  - Make sure all instances have been allocated appropriate RAM and CPU resources.
    - Ensure RAM allocation is more than sufficient for max heap size and OS disk cache. Remember that the instance will also require readily available resources for your various Solr indices, as well as the OS and subsystem processes.
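As a starting point, GC logging can be enabled alongside the collector settings so that a GC viewer has something to analyze. The flags below are illustrative Java 8-style settings only (in Solr they are commonly set through variables such as GC_TUNE and GC_LOG_OPTS in solr.in.sh); the right heap size and collector depend on your cluster and must be validated under load.

```sh
# Illustrative example only (Java 8 flag syntax); tune heap and collector for your cluster.
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"

# GC logging so pause times can be reviewed in a GC viewer after a load test.
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M"
```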
- After ensuring that hardware resources are appropriate, profile your deployment.
  - Check the profiled data and make sure there are no long pauses. Long pauses indicate that inordinate amounts of resources are being consumed by a large garbage collection process. Ideally, overall data throughput should be consistent and stable.
  - The heap size must be large enough to accommodate usage spikes, but at the same time must not be so large that it retains unnecessary resources and thus creates bottlenecks elsewhere in your environment.
  - In the GC logs (see the check below):
    - For CMS: no concurrent mode failures.
    - For G1GC: no to-space exhaustion errors.
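A quick way to look for these two failure modes is to search the GC logs directly; "concurrent mode failure" and "to-space exhausted" are the messages the JVM writes for them. The log path below is a placeholder.

```sh
# CMS: full stop-the-world collections triggered because the concurrent cycle lost the race.
grep -c "concurrent mode failure" /path/to/gc.log*

# G1GC: evacuation failures because the to-space was exhausted.
grep -c "to-space exhausted" /path/to/gc.log*
```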
- Make sure purging is enabled in ZooKeeper.
  - Check the ZK snapshot purging schedule:
    - Open the `zoo.cfg` file.
    - Uncomment this line so that it takes effect: `#autopurge.snapRetainCount=3` (and confirm that `autopurge.purgeInterval` is set to a non-zero number of hours, since 0 disables purging).
    - Save and close the `zoo.cfg` file.
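A quick way to confirm the result is to check that both autopurge properties are present and uncommented. The file path below is a placeholder for wherever your ZooKeeper (or Fusion-bundled ZooKeeper) configuration lives.

```sh
# Verify that ZooKeeper autopurge is enabled (path is a placeholder; adjust to your install).
grep -E '^autopurge' /path/to/zookeeper/conf/zoo.cfg
# You want to see both lines uncommented, for example:
#   autopurge.snapRetainCount=3   # number of snapshots/transaction logs to keep
#   autopurge.purgeInterval=24    # purge interval in hours; 0 (the default) disables purging
```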
- Ensure enough physical hard drive space is allocated to handle the growth of both your various indices and your log files. In the past, we have seen several clients encounter indexing issues caused by out-of-memory errors. While memory errors can be code related (e.g. memory leaks), such issues are often related to a lack of hardware resources, either disk space or RAM. To avoid such errors:
  - Ensure log files are rotated at frequent intervals.
  - Ensure rotated log files are archived/removed at frequent intervals (see the example below).
  - Ensure that the archival space allocated for both logs and indices is separate from operational disk space. This allows archiving to occur without losing disk space critical to day-to-day operating requirements.
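The following is a small sketch of the kind of routine check this implies. All paths and the 14-day retention period are placeholders, and log rotation itself is normally handled by the logging framework or logrotate.

```sh
# Spot-check free space on the volumes holding indices and logs (paths are placeholders).
df -h /path/to/solr/data /path/to/fusion/var/log

# Illustrative cleanup: move rotated logs older than 14 days to a separate archive volume.
find /path/to/fusion/var/log -name "*.log.*" -mtime +14 \
  -exec mv -t /path/to/archive/logs {} +
```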
- Lucidworks recommends a MINIMUM of 8 cores in your production cluster.
  - A lack of CPU resources risks overloading your environment, which may cause critical downtime. If you see CPU usage spike above 80 percent during profiling, your hardware is most likely underpowered for its purpose.
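While a load test is running, CPU saturation is easy to watch with standard tools; the interval and sample count below are arbitrary.

```sh
# Report overall CPU utilization every 5 seconds, 120 samples (about 10 minutes).
# sar ships with the sysstat package; top or vmstat work just as well.
sar -u 5 120
```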
- Double-check your commit and cache-warming settings in solrconfig.xml (see the check below).
  - Set up caches and cache warming to accommodate your query performance.
  - Committing at overly frequent intervals will cause performance issues.
  - Not committing often enough puts your application at risk of losing data in the event of, for example:
    - Power outages
    - Hardware failures
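One way to review the effective commit settings without hunting through files is the Solr Config API. The host and collection name are placeholders; depending on your Solr version you may need to GET the full /config response and read its updateHandler section instead.

```sh
# Inspect the effective autoCommit / autoSoftCommit settings for a collection.
curl -s "http://localhost:8983/solr/YOUR_COLLECTION/config/updateHandler"
# Review autoCommit maxTime/maxDocs and openSearcher, plus autoSoftCommit maxTime,
# and weigh indexing throughput against how quickly updates must become durable and visible.
```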
- Set up monitoring on your cluster (in particular for Solr and ZooKeeper).
  - Verify that your production environment is using a high-availability network connection.
  - Check request/response times between Solr->ZK, Solr->Solr, and ZK->ZK.
  - Verify that your application produces low-latency stats from ZK when viewed under `mntr` or `cons` (see the example below).
  - Ensure that you do not see an inordinately high number of connections to ZK. This is often related to excessive client indexing and can potentially cause performance issues.
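`mntr` and `cons` are ZooKeeper's built-in four-letter-word commands and can be queried with netcat as shown below. The hostname is a placeholder and 2181 is the default client port; on newer ZooKeeper releases these commands must be allowed via `4lw.commands.whitelist` in zoo.cfg.

```sh
# Latency (avg/min/max), outstanding requests, znode and watch counts, node role:
echo mntr | nc zk-host.example.com 2181

# Per-connection details, including how many connections each client holds:
echo cons | nc zk-host.example.com 2181
```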
- ZooKeeper should run as a quorum of an odd number of instances, none of which are on the same box.
  - Lucidworks recommends that all your ZooKeeper instances live separately from your Fusion/Solr instances. This provides redundancy.
  - You should always have an odd number of delegates forming your quorum, to avoid ties should there be a run-off when electing a leader.
  - Lucidworks recommends quorums of 3 or 5 instances, depending on your application requirements. More instances (e.g. 7, 9, 11, etc.) are supported, yet considered largely unnecessary (an example ensemble definition follows).
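For reference, a 3-node ensemble is declared in each node's zoo.cfg with one `server.N` entry per member. The hostnames and file path below are placeholders, and 2888/3888 are the conventional quorum and leader-election ports.

```sh
# Verify the ensemble definition on each ZooKeeper host (path is a placeholder):
grep -E '^server\.' /path/to/zookeeper/conf/zoo.cfg
# Expected entries for a 3-node quorum:
#   server.1=zk1.example.com:2888:3888
#   server.2=zk2.example.com:2888:3888
#   server.3=zk3.example.com:2888:3888
```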
- Check your logs to ensure there are none of the following errors: