Fix Oplog Issues¶
You can configure the following alert conditions on the project-level alert settings page to trigger alerts.
Replication Oplog Window is (X) occurs if the approximate amount of
time available in the primary replication oplog meets or goes below
the specified threshold. This refers to the amount of time that the
primary can continue logging given the current rate at which oplog
data is generated.
Oplog Data Per Hour is (X) occurs
if the amount of data per hour being written to a primary's
replication oplog meets or exceeds the specified threshold.
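To make these two thresholds concrete, here is a small sketch with hypothetical numbers (the cluster size, rates, and thresholds below are illustrative, not defaults): the oplog window is simply the oplog's capacity divided by the rate at which oplog data is generated.

```python
def oplog_window_hours(oplog_size_gb, oplog_gb_per_hour):
    """Approximate oplog window: how long the primary can keep logging
    at the current churn rate before old entries are overwritten."""
    return oplog_size_gb / oplog_gb_per_hour

# Hypothetical cluster: a 10 GB oplog receiving 2 GB of oplog data per hour.
window = oplog_window_hours(10, 2)
print(window)  # 5.0 hours of replication history

# "Replication Oplog Window is (X)" fires when the window meets or drops
# below the configured threshold; "Oplog Data Per Hour is (X)" fires when
# the hourly rate meets or exceeds its threshold.
assert window <= 5.0  # a 5-hour window threshold would fire here
assert 2 >= 2         # a 2 GB/hour rate threshold would also fire
```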
Common events that may lead to increased oplog activity include bulk inserts, large numbers of updates, and mass deletes.
Fix the Immediate Problem¶
Consider the following actions to help resolve Replication Oplog Alerts:
- Increase the oplog size by editing your cluster's configuration to ensure it is higher than the peak value from the Oplog GB / Hour graph in the cluster metrics view.
Increase the oplog size if you foresee intense write and update operations occurring in a short time period.

Note: You may need to increase your cluster's storage to free enough space to resize the oplog.
- Ensure that all write operations specify a write concern of `majority`, so that each write is replicated to a majority of nodes before the application moves on to the next write operation. This throttles the rate of traffic from your application by preventing the primary from accepting writes more quickly than the secondaries can apply them.
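The back-pressure effect described above can be sketched with a standard-library-only simulation (this is not MongoDB code; in a real application you would set the write concern on the client or collection): each write blocks until a simulated secondary acknowledges it, so the writer can never outrun replication.

```python
import queue
import threading

def run_primary(n_writes, ack_timeout=5.0):
    """Simulate a primary that, like a majority write concern, blocks
    each write until a secondary has replicated and acknowledged it."""
    repl_queue = queue.Queue()  # oplog entries shipped to the secondary
    acks = queue.Queue()        # acknowledgments flowing back
    applied = []                # ops the secondary has applied, in order

    def secondary():
        while True:
            entry = repl_queue.get()
            if entry is None:   # shutdown sentinel
                return
            applied.append(entry)  # apply the op
            acks.put(entry)        # acknowledge back to the primary

    t = threading.Thread(target=secondary, daemon=True)
    t.start()
    for i in range(n_writes):
        repl_queue.put(i)
        # The "write concern": wait for the ack before the next write,
        # so the primary cannot accept writes faster than the secondary
        # applies them.
        acks.get(timeout=ack_timeout)
    repl_queue.put(None)
    t.join()
    return applied

print(len(run_primary(100)))  # every write was replicated before returning
```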
Implement a Long-Term Solution¶
Refer to the following for more information on understanding the oplog:
Monitor Your Progress¶
You might observe the following scenarios when these alerts trigger:
- The Oplog GB / Hour graph in the metrics view spikes upward.
- The Replication Oplog Window graph in the metrics view is low.
You might also see the following message in a secondary's logs:

```
We are too stale to use <node>:27017 as a sync source.
```
Typically, this message indicates that the node has "fallen off the oplog": it can no longer find the entries it needs in the primary's oplog and cannot keep up with the oplog data being generated. In this case, the node requires an initial sync to recover and to ensure that its data is consistent with the rest of the replica set. You can check the state of a node using the