MachineSet controller overshoots desired replica count during large scale-ups, causing orphan nodes #1091

@gagan16k

How to categorize this issue?

/area robustness
/kind bug
/priority 3

What happened:
During a large scale-up event, the MachineSet controller creates more machines than the desired replica count, then immediately corrects by deleting the excess machines. This create-then-delete race causes orphan nodes to accumulate in the cluster.

The sequence of events observed:

  • MachineSet scales up from 3 to 131 replicas over several minutes via successive reconciliations.
  • The controller overshoots, creating ~43 additional machines beyond the desired count.
  • It detects the overshoot ("Too many replicas") and immediately marks excess machines for deletion.

For these excess machines, the deletion flow is triggered within seconds of creation. The update that persists the machine's node label and providerID races with the deletion timestamp being set, so the update fails with a conflict error: "the object has been modified; please apply your changes to the latest version and try again".
As a result, the machine is deleted without its node label ever being persisted.
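
To make the race concrete, here is a minimal sketch of the optimistic-concurrency conflict using a toy in-memory store. It is not the Kubernetes API or MCM code: `store`, `machine`, `setNodeLabel`, the `deleting` flag (standing in for a set deletion timestamp), and the `"node"` label key are all hypothetical names for illustration. The retry-with-fresh-read at the end is only one possible mitigation, not a confirmed fix.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict mimics the API server's optimistic-concurrency error.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// machine is a toy stand-in for a Machine object (hypothetical fields).
type machine struct {
	resourceVersion int
	labels          map[string]string
	deleting        bool // stands in for a set deletion timestamp
}

// store holds a single object and rejects writes based on a stale version.
type store struct{ obj machine }

// get returns a deep copy, as a cache or API read would.
func (s *store) get() machine {
	c := s.obj
	c.labels = make(map[string]string, len(s.obj.labels))
	for k, v := range s.obj.labels {
		c.labels[k] = v
	}
	return c
}

// update succeeds only if the caller read the latest version.
func (s *store) update(m machine) error {
	if m.resourceVersion != s.obj.resourceVersion {
		return errConflict
	}
	m.resourceVersion++
	s.obj = m
	return nil
}

// setNodeLabel re-reads the object before every attempt, so a concurrent
// write (such as the deletion timestamp) does not lose the label.
func setNodeLabel(s *store, node string, attempts int) error {
	for i := 0; i < attempts; i++ {
		m := s.get()
		m.labels["node"] = node
		if err := s.update(m); !errors.Is(err, errConflict) {
			return err
		}
	}
	return errConflict
}

func main() {
	s := &store{obj: machine{labels: map[string]string{}}}

	stale := s.get() // creation flow reads the machine
	del := s.get()   // deletion flow reads it too
	del.deleting = true
	_ = s.update(del) // deletion marker lands first

	stale.labels["node"] = "node-1"
	fmt.Println(s.update(stale)) // conflict: stale resourceVersion

	fmt.Println(setNodeLabel(s, "node-1", 3)) // <nil>
	fmt.Println(s.get().labels["node"], s.get().deleting)
}
```

In a real controller the same idea is usually expressed with client-go's `retry.RetryOnConflict`; whether persisting the label on a machine that is already being deleted is the right behaviour for MCM is an open question.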

The underlying VM has already been created on the cloud provider. The kubelet registers the node after the machine has been deleted, so getMachineFromNode cannot find a matching machine (the machine is gone, and no machine carries the node label for that hostname) and returns errNoMachineMatch. The node then becomes an orphan and, after a timeout, is annotated with NotManagedByMCM.
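
The lookup failure can be sketched roughly as follows. This is a simplified stand-in, not the actual getMachineFromNode: the `findMachineForNode` helper, the `machine` type, the `"node"` label key, and the error text are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoMachineMatch is named after the error in the report; the message is made up.
var errNoMachineMatch = errors.New("no machine matches the node")

// machine is a toy stand-in for a Machine object (hypothetical fields).
type machine struct {
	name   string
	labels map[string]string
}

// findMachineForNode returns the machine whose node label equals nodeName.
// A machine deleted before its label was persisted can never match.
func findMachineForNode(machines []machine, nodeName string) (*machine, error) {
	for i := range machines {
		if machines[i].labels["node"] == nodeName {
			return &machines[i], nil
		}
	}
	return nil, errNoMachineMatch
}

func main() {
	machines := []machine{
		{name: "machine-a", labels: map[string]string{"node": "node-a"}},
	}

	m, err := findMachineForNode(machines, "node-a")
	fmt.Println(m.name, err) // machine-a <nil>

	// node-b belongs to an excess machine that was already deleted.
	_, err = findMachineForNode(machines, "node-b")
	fmt.Println(err) // no machine matches the node
}
```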

What you expected to happen:
The MachineSet controller should not create more machines than the desired replica count. Each reconciliation should account for in-flight creates and wait until they are observed.
The expectations mechanism already exists for this purpose but needs to be reviewed.
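
For reference, the general shape of an expectations check looks roughly like the sketch below. It follows the pattern used by the Kubernetes ReplicaSet controller in spirit only; the `expectations` type, its methods, and `machinesToCreate` are hypothetical names, and this is not the MCM implementation. It also omits the expiry timeout that real implementations typically add so that a lost event cannot block reconciliation forever.

```go
package main

import (
	"fmt"
	"sync"
)

// expectations tracks, per MachineSet key, how many creates were issued
// but have not yet been observed through the informer cache.
type expectations struct {
	mu      sync.Mutex
	pending map[string]int
}

func newExpectations() *expectations {
	return &expectations{pending: make(map[string]int)}
}

// ExpectCreations records that n creates are in flight for key.
func (e *expectations) ExpectCreations(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending[key] += n
}

// CreationObserved is called when the cache delivers an add event.
func (e *expectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.pending[key] > 0 {
		e.pending[key]--
	}
}

// Satisfied reports whether every in-flight create has been observed.
func (e *expectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.pending[key] == 0
}

// machinesToCreate returns how many machines one reconcile pass may create.
// While earlier creates are still unobserved it returns 0, because the
// observed count comes from a cache that may not include them yet.
func machinesToCreate(e *expectations, key string, desired, observed int) int {
	if !e.Satisfied(key) {
		return 0
	}
	diff := desired - observed
	if diff <= 0 {
		return 0
	}
	e.ExpectCreations(key, diff)
	return diff
}

func main() {
	e := newExpectations()
	key := "default/machineset-a"

	fmt.Println(machinesToCreate(e, key, 131, 3)) // 128
	// Second reconcile runs before the cache has caught up.
	fmt.Println(machinesToCreate(e, key, 131, 3)) // 0

	for i := 0; i < 128; i++ {
		e.CreationObserved(key)
	}
	fmt.Println(machinesToCreate(e, key, 131, 131)) // 0
}
```

Without the `Satisfied` gate, the second reconcile would compute 131 - 3 again from the stale count and create another 128 machines, which is the kind of overshoot described above.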

How to reproduce it (as minimally and precisely as possible):
Unsure
