
Replication Oplog Alerts

Replication Oplog alerts can be triggered when the amount of oplog data generated on a primary cluster member is larger than the cluster’s configured oplog size.

Description

The Replication Oplog Window is (X) alert occurs if the approximate amount of time available in the primary's replication oplog meets or falls below the specified threshold. This value approximates how many hours of operations the oplog can hold at the current rate at which oplog data is generated.
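To illustrate the relationship between oplog size, write rate, and window, here is a minimal sketch that estimates the window as oplog size divided by the hourly oplog rate and checks it against a threshold. This is a simplification that assumes the oplog is full and the write rate is steady. The function names and figures are hypothetical and are not the calculation Atlas performs.

```python
def estimate_oplog_window_hours(oplog_size_gb: float, oplog_gb_per_hour: float) -> float:
    """Approximate hours of operations the oplog can hold at a steady write rate.

    Simplification (assumption): the oplog is full and the rate stays constant.
    """
    if oplog_size_gb <= 0:
        raise ValueError("oplog_size_gb must be positive")
    if oplog_gb_per_hour <= 0:
        # No oplog traffic, so the window is effectively unbounded.
        return float("inf")
    return oplog_size_gb / oplog_gb_per_hour


def oplog_window_alert(window_hours: float, threshold_hours: float) -> bool:
    """True when the window meets or goes below the alert threshold."""
    return window_hours <= threshold_hours


if __name__ == "__main__":
    # Hypothetical figures: a 50 GB oplog receiving 10 GB of oplog data per hour.
    window = estimate_oplog_window_hours(50, 10)
    print(window)  # 5.0
    print(oplog_window_alert(window, threshold_hours=24))  # True
```

Doubling the write rate halves the estimated window, so a burst of writes can trigger this alert even when the oplog size has not changed.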

The Oplog Data Per Hour is (X) alert occurs if the amount of data written to the primary's replication oplog per hour meets or exceeds the specified threshold.
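As a rough illustration of this metric, the sketch below derives an hourly rate from two samples of a cumulative count of bytes written to the oplog and compares it with a threshold. It assumes you already have such a counter from some monitoring source; the counter, the function names, and the choice of binary gigabytes are assumptions for illustration only.

```python
from datetime import datetime, timedelta

BYTES_PER_GB = 1024 ** 3  # assumption: GB treated as binary gigabytes (GiB)


def oplog_gb_per_hour(bytes_start: int, bytes_end: int,
                      start: datetime, end: datetime) -> float:
    """Average oplog GB written per hour between two samples of a cumulative byte counter."""
    elapsed_hours = (end - start).total_seconds() / 3600
    if elapsed_hours <= 0:
        raise ValueError("end must be later than start")
    if bytes_end < bytes_start:
        raise ValueError("cumulative counter decreased; was it reset?")
    return (bytes_end - bytes_start) / BYTES_PER_GB / elapsed_hours


def oplog_data_per_hour_alert(gb_per_hour: float, threshold_gb_per_hour: float) -> bool:
    """True when the hourly rate meets or exceeds the alert threshold."""
    return gb_per_hour >= threshold_gb_per_hour


if __name__ == "__main__":
    # Hypothetical samples: 6 GB of oplog data written in 30 minutes.
    t0 = datetime(2024, 1, 1, 12, 0)
    t1 = t0 + timedelta(minutes=30)
    rate = oplog_gb_per_hour(0, 6 * BYTES_PER_GB, t0, t1)
    print(rate)  # 12.0
    print(oplog_data_per_hour_alert(rate, threshold_gb_per_hour=10))  # True
```

Note that a short burst is extrapolated to a full hour here, which is one reason a brief spike of writes can trip this alert.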

Possible Observations

These are a few common observations seen when these alerts are triggered:

  • The Oplog GB / Hour graph in the metrics view spikes upward.

  • The Replication Oplog Window graph in the metrics view is low.

  • The Atlas MongoDB Logs of secondary or unhealthy nodes display the following message:

    We are too stale to use <node>:27017 as a sync source.
    
  • An Atlas node reports a state of STARTUP2 or RECOVERING for an extended period of time.

    Typically, this indicates that the node has "fallen off the oplog": it could not keep up with the oplog data generated by the primary, and the oplog entries it still needs to apply have been overwritten. In this case, the node requires an initial sync to recover and to ensure that data is consistent across all nodes. You can check the state of a node using the rs.status() shell method.
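The observations above can be checked programmatically. The sketch below assumes you have already fetched the output of rs.status() as a dictionary (only the members array with its name and stateStr fields is used) and have a list of log lines as plain strings. It flags members that stay in STARTUP2 or RECOVERING longer than a chosen limit across repeated polls, and finds log lines containing the "too stale" message. The polling scheme, the 60-minute limit, and all function and class names are hypothetical.

```python
from typing import Any, Dict, Iterable, List

STUCK_STATES = {"STARTUP2", "RECOVERING"}
STALE_MESSAGE = "We are too stale to use"


def members_in_stuck_states(rs_status: Dict[str, Any]) -> List[str]:
    """Names of members whose stateStr is STARTUP2 or RECOVERING."""
    return [
        member["name"]
        for member in rs_status.get("members", [])
        if member.get("stateStr") in STUCK_STATES
    ]


class StuckMemberTracker:
    """Remembers when each member was first seen in a stuck state across polls."""

    def __init__(self, max_minutes: float) -> None:
        self.max_minutes = max_minutes
        self.first_seen: Dict[str, float] = {}  # member name -> poll time (minutes)

    def update(self, rs_status: Dict[str, Any], now_minutes: float) -> List[str]:
        """Record one poll; return members stuck for at least max_minutes."""
        stuck = set(members_in_stuck_states(rs_status))
        # Forget members that have left STARTUP2/RECOVERING.
        for name in list(self.first_seen):
            if name not in stuck:
                del self.first_seen[name]
        for name in stuck:
            self.first_seen.setdefault(name, now_minutes)
        return sorted(
            name for name, since in self.first_seen.items()
            if now_minutes - since >= self.max_minutes
        )


def stale_sync_source_lines(log_lines: Iterable[str]) -> List[str]:
    """Log lines containing the 'too stale' sync-source message."""
    return [line for line in log_lines if STALE_MESSAGE in line]


if __name__ == "__main__":
    # Hypothetical, heavily trimmed rs.status() output.
    status = {"members": [
        {"name": "node-0:27017", "stateStr": "PRIMARY"},
        {"name": "node-1:27017", "stateStr": "RECOVERING"},
        {"name": "node-2:27017", "stateStr": "SECONDARY"},
    ]}
    tracker = StuckMemberTracker(max_minutes=60)
    print(tracker.update(status, now_minutes=0))   # []
    print(tracker.update(status, now_minutes=90))  # ['node-1:27017']
```

Tracking duration across polls avoids flagging a node that passes through STARTUP2 or RECOVERING only briefly, which can happen during normal operations.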

Common Triggers

These are a few common events which may lead to increased oplog activity:

  • Intensive write and update operations in a short period of time.
  • The cluster’s configured oplog size is smaller than the value in the Oplog GB / Hour graph observed in the cluster metrics view.
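The second trigger amounts to a simple comparison: if the configured oplog size is below the peak hourly oplog volume, the oplog cannot hold even one hour of traffic at that peak. A minimal sketch, assuming you have copied hourly readings from the Oplog GB / Hour graph into a list (the readings and function names are hypothetical):

```python
from typing import Sequence


def peak_oplog_gb_per_hour(hourly_gb: Sequence[float]) -> float:
    """Highest value in a series of Oplog GB / Hour samples."""
    if not hourly_gb:
        raise ValueError("need at least one hourly sample")
    return max(hourly_gb)


def oplog_smaller_than_peak(oplog_size_gb: float, hourly_gb: Sequence[float]) -> bool:
    """True when the configured oplog could not hold even one hour of peak traffic."""
    return oplog_size_gb < peak_oplog_gb_per_hour(hourly_gb)


if __name__ == "__main__":
    # Hypothetical hourly readings copied from the Oplog GB / Hour graph.
    samples = [2.0, 3.5, 18.0, 4.0]
    print(peak_oplog_gb_per_hour(samples))         # 18.0
    print(oplog_smaller_than_peak(16.0, samples))  # True
```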

Possible Solutions

These are a few possible actions to help resolve Replication Oplog alerts:

  • Increase the oplog size by editing your cluster’s configuration to ensure it is higher than the peak value from the Oplog GB / Hour graph in the cluster metrics view.

  • Increase the oplog size if you foresee intense write and update operations occurring in a short time period.

    Note

    You may need to increase your cluster's storage so that there is enough free space to resize the oplog.

  • Ensure that all write operations specify a write concern of majority. With this write concern, a write is acknowledged only after it has been replicated to a majority of the replica set's data-bearing voting members, so the application waits before moving on to the next write operation. This controls the rate of traffic from your application and prevents the primary from accepting writes more quickly than the secondaries can apply them.
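The sizing advice above can be turned into a quick calculation. The sketch below picks the smallest whole number of GB strictly above the peak Oplog GB / Hour value multiplied by a target window and a safety margin, then roughly checks whether the increase fits in the free disk space. The 1-hour default target, the 1.25 headroom factor, and the assumption that the oplog can grow to its full configured size on disk are illustrative choices, not Atlas recommendations; the function names are hypothetical.

```python
import math


def recommended_oplog_size_gb(peak_gb_per_hour: float,
                              target_window_hours: float = 1.0,
                              headroom: float = 1.25) -> int:
    """Smallest whole number of GB strictly above peak rate x target window x headroom."""
    if peak_gb_per_hour < 0 or target_window_hours <= 0 or headroom < 1:
        raise ValueError("invalid sizing inputs")
    return math.floor(peak_gb_per_hour * target_window_hours * headroom) + 1


def fits_in_free_storage(new_oplog_gb: float, current_oplog_gb: float,
                         free_disk_gb: float) -> bool:
    """Rough check: assumes the oplog can grow to its full configured size on disk."""
    return max(new_oplog_gb - current_oplog_gb, 0) <= free_disk_gb


if __name__ == "__main__":
    # Hypothetical: peak of 18 GB/hour, current oplog 16 GB, 10 GB of free disk.
    size = recommended_oplog_size_gb(18.0)
    print(size)                                # 23
    print(fits_in_free_storage(size, 16, 10))  # True
```

Raising target_window_hours is a way to plan ahead for anticipated bursts of writes; if fits_in_free_storage returns False, that corresponds to the case where the cluster's storage must be increased first.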

Refer to the following for more information on understanding oplog sizing requirements: