Consolidating Milvus Across AZs

I migrated a Milvus (standalone) deployment from a three-AZ Kubernetes setup (us-east-1a/b/c) to a single dedicated node in us-east-1b. I preserved all vector data, reduced infra to one m6a.xlarge node, consolidated all PVCs to 1b, and restored four collections with full integrity while untangling a handful of AWS, Kubernetes, Helm, and etcd knots.

This post documents the end-to-end path: decisions, traps, exact commands, and final checks. No fluff.

Context

Cluster: cm.montai.k8s.local (kops)
Namespace: milvus
Milvus: standalone, deployed via Helm (`zilliztech/milvus`)
Initial pain: PVCs spanned three AZs, forcing nodes in all three to satisfy volume affinity. Standalone Milvus didn’t need multi-AZ.

Goal: One dedicated nodegroup with taints in us-east-1b, with all PVCs in 1b, zero data loss.

Strategy in One Page

Create a dedicated instance group with taints, pinned to us-east-1b.
Add nodeSelectors/tolerations for Milvus, etcd, MinIO via Helm values.
Snapshot the PVCs; restore into 1b (EBS volumes can’t cross AZs; snapshots can).
Repair etcd membership from snapshot using ETCD_FORCE_NEW_CLUSTER=true.
Bring up MinIO (object storage with vector data), then Milvus.
Validate collections and segments; clean up.

Design calls:

Snapshots over cloning to cross AZ boundaries.
Preserve MinIO (data), rebuild etcd (metadata) from snapshot with FORCE_NEW_CLUSTER.
Single AZ for simplicity and cost (dev/test trade-off accepted).

What Went Wrong (and How I Fixed It)

1) AZ mismatch blocking scheduling

Symptom: volume node affinity conflict.
Cause: Node still in 1a; PVCs bound to 1b.
Fix: Move IG to 1b, apply, delete old node(s) to force recreation in 1b.

2) StatefulSet PVCs stuck in old AZs

Reality: A bound PV's zone (its node affinity) is immutable, so an existing PVC can't be moved to another AZ.
Fix: VolumeSnapshot → delete PVC → recreate PVC from snapshot; let CSI bind in 1b.

3) Missing IAM for snapshot restores

Symptom: UnauthorizedOperation on ec2:CreateVolume from snapshot.

Fix: Add:

{ "Effect": "Allow", "Action": "ec2:CreateVolume", "Resource": "arn:aws:ec2:*:*:snapshot/*" }

Restart EBS CSI controller.
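The statement above is only a fragment; IAM expects it inside a policy document with a `Version` and a `Statement` list. A minimal sketch that wraps it and prints the JSON follows. The function name is hypothetical, and how you attach the result (inline policy, managed policy, kops additional policies) depends on your setup.

```python
import json


def snapshot_restore_policy() -> dict:
    """Wrap the CreateVolume-from-snapshot statement in a full IAM policy document."""
    statement = {
        "Effect": "Allow",
        "Action": "ec2:CreateVolume",
        "Resource": "arn:aws:ec2:*:*:snapshot/*",
    }
    # "2012-10-17" is the current IAM policy language version.
    return {"Version": "2012-10-17", "Statement": [statement]}


if __name__ == "__main__":
    print(json.dumps(snapshot_restore_policy(), indent=2))
```

This only covers the snapshot resource from the error; if your controller role is missing other EBS permissions, those surface as separate UnauthorizedOperation errors.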

4) etcd membership deadlock after restore

Symptom: CrashLoop, “No active endpoints in cluster”.
Cause: Restored data contained old member IPs.
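Fix (disaster recovery):
- Restore only the etcd-0 PVC from snapshot.
- Start a single replica with ETCD_FORCE_NEW_CLUSTER=true so it discards the stale member list.
- Once etcd-0 is healthy, scale to 3; etcd-1/2 join fresh.

Start the single replica with: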

kubectl set env statefulset/milvus-release-etcd \
  ETCD_FORCE_NEW_CLUSTER=true ETCD_INITIAL_CLUSTER_STATE=new -n milvus

kubectl scale statefulset milvus-release-etcd -n milvus --replicas=1

5) “Missing collections” scare

Reality: Milvus stores metadata in etcd and vectors in MinIO.
Fix: Once etcd metadata was restored from snapshot, Milvus mapped names→IDs and loaded segments. Data intact.

6) PVC selector immutability

Lesson: Don’t try to patch PVC zone/selector. Use snapshot→recreate. With WaitForFirstConsumer, node placement determines AZ.

Step-By-Step Execution

Phase 1 - Prep

Dedicated nodegroup (1b, tainted):

kops edit ig milvus --state s3://kops-cm-montai-com-state-store
# Set subnets: [us-east-1b], taint: milvus.io/node=cpu:NoSchedule

kops update cluster cm.montai.k8s.local --state s3://... --yes

Helm values with selectors/tolerations (Milvus/etcd/MinIO):

# /tmp/milvus-migration-values.yaml
standalone:
  nodeSelector:
    kops.k8s.io/instancegroup: milvus
  tolerations:
    - key: milvus.io/node
      operator: Equal
      value: cpu
      effect: NoSchedule
etcd:
  nodeSelector:
    kops.k8s.io/instancegroup: milvus
  tolerations:
    - key: milvus.io/node
      operator: Equal
      value: cpu
      effect: NoSchedule
  replicaCount: 3
minio:
  nodeSelector:
    kops.k8s.io/instancegroup: milvus
  tolerations:
    - key: milvus.io/node
      operator: Equal
      value: cpu
      effect: NoSchedule
  replicaCount: 4
  persistence:
    size: 500Gi
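To make the taint/toleration pairing concrete, here is a small sketch of the matching rule Kubernetes applies, written as plain Python. The `Taint` and `Toleration` classes and the `tolerates` function are hypothetical helpers for illustration, not a Kubernetes client API, and the model is simplified (it ignores `tolerationSeconds`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes defaults the operator to Equal
    value: str = ""
    effect: str = ""


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (simplified model)."""
    # An empty effect on the toleration matches every taint effect.
    if tol.effect and tol.effect != taint.effect:
        return False
    # An empty key is only valid with Exists, where it matches every key.
    if tol.key == "":
        return tol.operator == "Exists"
    if tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


if __name__ == "__main__":
    node_taint = Taint("milvus.io/node", "cpu", "NoSchedule")
    from_values = Toleration("milvus.io/node", "Equal", "cpu", "NoSchedule")
    print(tolerates(from_values, node_taint))
```

A toleration only permits scheduling onto the tainted node; it is the nodeSelector on `kops.k8s.io/instancegroup: milvus` that actually pins the pods there, which is why the values file sets both.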

Create VolumeSnapshots:

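# Placeholder file names; substitute your own manifests.
kubectl apply -f <volumesnapshotclass.yaml>   # VolumeSnapshotClass with driver ebs.csi.aws.com

kubectl apply -f <snapshots.yaml>             # VolumeSnapshots for etcd-0, etcd-2, minio-0, minio-1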

kubectl wait volumesnapshot/<name> -n milvus --for=jsonpath='{.status.readyToUse}'=true --timeout=300s
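The snapshot manifests themselves are not shown above, so here is a sketch of what they could look like, generated as JSON that `kubectl apply -f -` accepts. The snapshot and PVC names come from the commands in this post; the class name `ebs-csi-snapclass`, the pairing of snapshot to PVC, and the `Retain` deletion policy are assumptions.

```python
import json

SNAPSHOT_API = "snapshot.storage.k8s.io/v1"
NAMESPACE = "milvus"
SNAPSHOT_CLASS = "ebs-csi-snapclass"  # hypothetical class name

# Snapshot name -> source PVC (assumed pairing based on the names used in this post).
SOURCES = {
    "snapshot-etcd-0": "data-milvus-release-etcd-0",
    "snapshot-etcd-2": "data-milvus-release-etcd-2",
    "snapshot-minio-0": "export-milvus-release-minio-0",
    "snapshot-minio-1": "export-milvus-release-minio-1",
}


def snapshot_class() -> dict:
    """Cluster-scoped VolumeSnapshotClass for the EBS CSI driver."""
    return {
        "apiVersion": SNAPSHOT_API,
        "kind": "VolumeSnapshotClass",
        "metadata": {"name": SNAPSHOT_CLASS},
        "driver": "ebs.csi.aws.com",
        # Assumption: keep the EBS snapshot even if the VolumeSnapshot object is deleted.
        "deletionPolicy": "Retain",
    }


def volume_snapshot(name: str, pvc: str) -> dict:
    """VolumeSnapshot that captures one existing PVC."""
    return {
        "apiVersion": SNAPSHOT_API,
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "spec": {
            "volumeSnapshotClassName": SNAPSHOT_CLASS,
            "source": {"persistentVolumeClaimName": pvc},
        },
    }


def manifest_list() -> dict:
    """Bundle the class and all snapshots into a single List object."""
    items = [snapshot_class()]
    items += [volume_snapshot(name, pvc) for name, pvc in SOURCES.items()]
    return {"apiVersion": "v1", "kind": "List", "items": items}


if __name__ == "__main__":
    print(json.dumps(manifest_list(), indent=2))
```

Saved under a name of your choosing, the output can be piped straight into `kubectl apply -f -`.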

Phase 2 - IAM

Add ec2:CreateVolume on arn:aws:ec2:*:*:snapshot/*; restart EBS CSI controller.

Phase 3 - Scale down & delete old PVCs

kubectl scale sts milvus-release-etcd -n milvus --replicas=0
kubectl scale sts milvus-release-minio -n milvus --replicas=0
kubectl delete deploy milvus-release-standalone -n milvus
kubectl delete pvc data-milvus-release-etcd-{0,2} export-milvus-release-minio-{0,1} -n milvus

Phase 4 - Restore PVCs into 1b

# Recreate PVCs from snapshots (no selector); CSI will place in 1b
kubectl apply -f restored-PVCs.yaml
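The contents of restored-PVCs.yaml are not reproduced here; the sketch below shows the general shape of a PVC that restores from a VolumeSnapshot. The storage class name `ebs-wffc` and the `10Gi` etcd size are placeholders (assumptions), while the 500Gi MinIO size matches the Helm values above. The requested size must be at least the snapshot's restore size.

```python
import json


def pvc_from_snapshot(pvc_name: str, snapshot_name: str, size: str,
                      storage_class: str, namespace: str = "milvus") -> dict:
    """Build a PVC that is provisioned from an existing VolumeSnapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": pvc_name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "accessModes": ["ReadWriteOnce"],
            # No selector and no zone: with WaitForFirstConsumer the volume is
            # created in whichever AZ the consuming pod is scheduled into.
            "dataSource": {
                "apiGroup": "snapshot.storage.k8s.io",
                "kind": "VolumeSnapshot",
                "name": snapshot_name,
            },
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    items = [
        # Placeholder size and storage class; use the values from your own cluster.
        pvc_from_snapshot("data-milvus-release-etcd-0", "snapshot-etcd-0", "10Gi", "ebs-wffc"),
        pvc_from_snapshot("export-milvus-release-minio-0", "snapshot-minio-0", "500Gi", "ebs-wffc"),
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The PVC name has to match what the StatefulSet's volumeClaimTemplate would generate, otherwise the controller creates a fresh empty claim instead of adopting the restored one.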

Phase 5 - Lock IG to 1b & replace nodes

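kops edit ig milvus --state s3://kops-cm-montai-com-state-store  # ensure only us-east-1b
kops update cluster cm.montai.k8s.local --state s3://kops-cm-montai-com-state-store --yes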
kubectl delete node <nodes in 1a/1c>

Phase 6 - Bring up MinIO

kubectl scale sts milvus-release-minio -n milvus --replicas=4
kubectl wait -n milvus -l app.kubernetes.io/name=minio --for=condition=Ready pod --timeout=300s

Phase 7 - etcd recovery

kubectl delete pvc data-milvus-release-etcd-1 -n milvus
kubectl set env sts/milvus-release-etcd ETCD_FORCE_NEW_CLUSTER=true ETCD_INITIAL_CLUSTER_STATE=new -n milvus
kubectl scale sts milvus-release-etcd -n milvus --replicas=1
kubectl wait pod/milvus-release-etcd-0 -n milvus --for=condition=Ready --timeout=120s
kubectl delete pvc data-milvus-release-etcd-2 -n milvus
kubectl scale sts milvus-release-etcd -n milvus --replicas=3

Phase 8 - Start Milvus

helm upgrade milvus-release zilliztech/milvus -n milvus --reuse-values -f /tmp/milvus-migration-values.yaml
kubectl wait -n milvus -l app.kubernetes.io/name=milvus --for=condition=Ready pod --timeout=300s

Phase 9 - Cleanup extra nodes

Verify all pods on the single node; cordon/drain/delete any stragglers.

Validation Checklist

Infra:

kubectl get nodes -l kops.k8s.io/instancegroup=milvus -o custom-columns='NODE:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone,STATUS:.status.conditions[-1].type'
kubectl get pods -n milvus -o wide
for pvc in $(kubectl get pvc -n milvus -o jsonpath='{.items[*].metadata.name}'); do
  vol=$(kubectl get pvc $pvc -n milvus -o jsonpath='{.spec.volumeName}')
  zone=$(kubectl get pv $vol -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
  echo "$pvc | $zone"
done
# Expect all zones = us-east-1b
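The shell loop reads the first match expression of each PV, which works when the zone is the only affinity term. As a slightly more defensive alternative, the sketch below parses `kubectl get pv -o json` output and looks for the zone by key. The helper names are hypothetical; the two topology keys are the ones I would expect from the EBS CSI driver, but check your own PVs.

```python
import json
import sys

# Zone keys to look for in PV node affinity (assumed to cover the EBS CSI driver).
ZONE_KEYS = ("topology.kubernetes.io/zone", "topology.ebs.csi.aws.com/zone")


def pv_zones(pv_list: dict) -> dict:
    """Map each bound claim ("namespace/name") to the set of zones its PV allows."""
    result = {}
    for pv in pv_list.get("items", []):
        spec = pv.get("spec", {})
        claim = spec.get("claimRef")
        if not claim:
            continue  # unbound PV, nothing to report
        terms = spec.get("nodeAffinity", {}).get("required", {}).get("nodeSelectorTerms", [])
        zones = set()
        for term in terms:
            for expr in term.get("matchExpressions", []):
                if expr.get("key") in ZONE_KEYS:
                    zones.update(expr.get("values", []))
        result[f'{claim["namespace"]}/{claim["name"]}'] = zones
    return result


def misplaced(pv_list: dict, namespace: str, expected_zone: str) -> list:
    """Return claims in the namespace whose PV is not pinned to exactly expected_zone."""
    prefix = namespace + "/"
    return sorted(
        name for name, zones in pv_zones(pv_list).items()
        if name.startswith(prefix) and zones != {expected_zone}
    )


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: kubectl get pv -o json > pvs.json && python check_zones.py pvs.json
    with open(sys.argv[1]) as fh:
        print(misplaced(json.load(fh), "milvus", "us-east-1b"))
```

An empty list means every PV bound in the milvus namespace is pinned to us-east-1b.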

Milvus health & data:

kubectl get pods -n milvus
kubectl logs -l app.kubernetes.io/name=milvus -n milvus --tail=200 | grep -i "Auditor loaded segment metadata"
kubectl port-forward -n milvus svc/milvus-release 19530:19530 &
sleep 3  # give the port-forward a moment to establish before connecting
python - <<'EOF'
from pymilvus import connections, utility
connections.connect(host="localhost", port="19530")
print(utility.list_collections())
EOF
pkill -f "port-forward.*milvus"

MinIO contents (sanity):

kubectl exec -it milvus-release-minio-0 -n milvus -- ls /export/milvus-bucket/file/index_files/

Lessons You Can Reuse

Kubernetes

Snapshot first. It’s the only sane way to cross AZs with EBS.
With CSI WaitForFirstConsumer, node placement → AZ. Don’t fight PVC immutability.
Use values files for Helm upgrades; avoid subchart auth traps.

AWS

EBS volumes don’t cross AZs; snapshots do (within region).
IAM for CSI is granular: creating a volume from a snapshot needs explicit rights.

Milvus and etcd

In Milvus: MinIO = data, etcd = metadata. Protect MinIO PVCs; snapshot etcd.
etcd DR: Start one restored member with ETCD_FORCE_NEW_CLUSTER=true, then scale.

Final State

Single node (m6a.xlarge) in us-east-1b, tainted and isolated.
All PVCs consolidated in 1b (2,070 Gi total).
Services: Milvus standalone, etcd (3), MinIO (4).
Data: Four collections, 12 segments, full integrity.

Optional Cleanup & Monitoring

Watch stability and resources:

kubectl top node
kubectl top pods -n milvus
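# --tail=-1 is needed because kubectl logs defaults to the last 10 lines per pod when a selector is used
kubectl logs -l app.kubernetes.io/name=milvus -n milvus --since=24h --tail=-1 | grep -i error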

Remove snapshots if policy allows:

kubectl delete volumesnapshot -n milvus snapshot-etcd-{0,2} snapshot-minio-{0,1}

Appendix - Minimal IAM Addition for Snapshot Restore

{
	"Effect": "Allow",
	"Action": "ec2:CreateVolume",
	"Resource": "arn:aws:ec2:*:*:snapshot/*"
}

Add to your EBS CSI controller role, then restart the controller.

Outcome: single-AZ Milvus, clean scheduling, lower cost, no data loss, reproducible steps.
