Consolidating Milvus Across AZs
I migrated a Milvus (standalone) deployment from a three-AZ Kubernetes setup (us-east-1a/b/c) to a single dedicated node in us-east-1b. I preserved all vector data, reduced infra to one m6a.xlarge node, consolidated all PVCs to 1b, and restored four collections with full integrity while untangling a handful of AWS, Kubernetes, Helm, and etcd knots.
This post documents the end-to-end path: decisions, traps, exact commands, and final checks. No fluff.
Context
- Cluster: `cm.montai.k8s.local` (kops)
- Namespace: `milvus`
- Milvus: standalone, deployed via Helm (`zilliztech/milvus` chart)
- Initial pain: PVCs spanned three AZs, forcing nodes in all three to satisfy volume affinity. Standalone Milvus didn’t need multi-AZ.
- Goal: one dedicated, tainted nodegroup in us-east-1b, all PVCs in 1b, zero data loss.
Strategy in One Page
- Create a dedicated instance group with taints, pinned to us-east-1b.
- Add nodeSelectors/tolerations for Milvus, etcd, MinIO via Helm values.
- Snapshot the PVCs; restore into 1b (EBS volumes can’t cross AZs; snapshots can).
- Repair etcd membership from snapshot using `ETCD_FORCE_NEW_CLUSTER=true`.
- Bring up MinIO (object storage with the vector data), then Milvus.
- Validate collections and segments; clean up.
Design calls:
- Snapshots over cloning to cross AZ boundaries.
- Preserve MinIO (data); rebuild etcd (metadata) from snapshot with `FORCE_NEW_CLUSTER`.
- Single AZ for simplicity and cost (dev/test trade-off accepted).
What Went Wrong (and How I Fixed It)
1) AZ mismatch blocking scheduling
- Symptom: `volume node affinity conflict`.
- Cause: Node still in 1a; PVCs bound to 1b.
- Fix: Move the IG to 1b, apply, delete old node(s) to force recreation in 1b.
2) StatefulSet PVCs stuck in old AZs
- Reality: PVC zone affinity is immutable.
- Fix: VolumeSnapshot → delete PVC → recreate PVC from snapshot; let CSI bind it in 1b.
3) Missing IAM for snapshot restores
- Symptom: `UnauthorizedOperation` on `ec2:CreateVolume` from snapshot.
- Fix: Add the following statement to the EBS CSI controller role, then restart the controller:
{ "Effect": "Allow", "Action": "ec2:CreateVolume", "Resource": "arn:aws:ec2:*:*:snapshot/*" }
4) etcd membership deadlock after restore
- Symptom: CrashLoop, “No active endpoints in cluster”.
- Cause: Restored data contained old member IPs.
- Fix (disaster recovery):
  - Restore only the etcd-0 PVC from snapshot.
  - Start one replica with:
kubectl set env statefulset/milvus-release-etcd \
  ETCD_FORCE_NEW_CLUSTER=true ETCD_INITIAL_CLUSTER_STATE=new -n milvus
kubectl scale statefulset milvus-release-etcd -n milvus --replicas=1
  - Scale to 3; etcd-1/2 join fresh.
5) “Missing collections” scare
- Reality: Milvus stores metadata in etcd and vectors in MinIO.
- Fix: Once etcd metadata was restored from snapshot, Milvus mapped names→IDs and loaded segments. Data intact.
6) PVC selector immutability
- Lesson: Don’t try to patch PVC zone/selector. Use snapshot→recreate. With `WaitForFirstConsumer`, node placement determines the AZ (see the StorageClass sketch below).
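For reference, a minimal StorageClass sketch with delayed binding; the class name and gp3 parameters are illustrative, not necessarily the class used here:
# Hypothetical StorageClass; name/parameters are examples
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer   # PV is provisioned in the AZ of the node the pod lands on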
Step-By-Step Execution
Phase 1 - Prep
Dedicated nodegroup (1b, tainted):
kops edit ig milvus --state s3://kops-cm-montai-com-state-store
# Set subnets: [us-east-1b], taint: milvus.io/node=cpu:NoSchedule
kops update cluster cm.montai.k8s.local --state s3://... --yes

Helm values with selectors/tolerations (Milvus/etcd/MinIO):
# /tmp/milvus-migration-values.yaml
standalone: { nodeSelector: { kops.k8s.io/instancegroup: milvus }, tolerations: [{ key: milvus.io/node, operator: Equal, value: cpu, effect: NoSchedule }] }
etcd: { nodeSelector: { kops.k8s.io/instancegroup: milvus }, tolerations: [{ key: milvus.io/node, operator: Equal, value: cpu, effect: NoSchedule }], replicaCount: 3 }
minio: { nodeSelector: { kops.k8s.io/instancegroup: milvus }, tolerations: [{ key: milvus.io/node, operator: Equal, value: cpu, effect: NoSchedule }], replicaCount: 4, persistence: { size: 500Gi } }
Create VolumeSnapshots:
kubectl apply -f <VolumeSnapshotClass manifest>    # driver: ebs.csi.aws.com
kubectl apply -f <VolumeSnapshot manifests>        # snapshots for etcd-0, etcd-2, minio-0, minio-1
kubectl wait volumesnapshot/<name> -n milvus --for=jsonpath='{.status.readyToUse}'=true --timeout=300s
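For illustration, a minimal sketch of what those manifests look like, assuming the EBS CSI driver; the VolumeSnapshotClass name is an example, not the real file contents:
# Hypothetical manifests; the class name is an example
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: snapshot-etcd-0
  namespace: milvus
spec:
  volumeSnapshotClassName: ebs-snapshot-class
  source:
    persistentVolumeClaimName: data-milvus-release-etcd-0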
Phase 2 - IAM
- Add `ec2:CreateVolume` on `arn:aws:ec2:*:*:snapshot/*`; restart the EBS CSI controller (sketch below).
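A hedged sketch of that step, assuming the driver's controller runs as ebs-csi-controller in kube-system; the role name, policy name, and policy file are placeholders:
# Attach the snapshot-restore statement to the CSI controller's IAM role (names are placeholders)
aws iam put-role-policy \
  --role-name <ebs-csi-controller-role> \
  --policy-name allow-create-volume-from-snapshot \
  --policy-document file://snapshot-restore-policy.json
# Restart the controller so the new permission takes effect
kubectl rollout restart deployment/ebs-csi-controller -n kube-system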
Phase 3 - Scale down & delete old PVCs
kubectl scale sts milvus-release-etcd -n milvus --replicas=0
kubectl scale sts milvus-release-minio -n milvus --replicas=0
kubectl delete deploy milvus-release-standalone -n milvus
kubectl delete pvc data-milvus-release-etcd-{0,2} export-milvus-release-minio-{0,1} -n milvus
Phase 4 - Restore PVCs into 1b
# Recreate PVCs from snapshots (no selector); CSI will place in 1b
kubectl apply -f restored-PVCs.yaml
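A minimal sketch of one restored PVC, assuming the snapshot names above; the storage class and size are examples and should match the original volume:
# Hypothetical restored PVC; storageClassName and size are examples
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-milvus-release-etcd-0
  namespace: milvus
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3               # class must use volumeBindingMode: WaitForFirstConsumer
  resources:
    requests:
      storage: 10Gi                   # example; match the source volume's size
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: snapshot-etcd-0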
Phase 5 - Lock IG to 1b & replace nodes
kops edit ig milvus # ensure only us-east-1b
kops update cluster --yes
kubectl delete node <nodes in 1a/1c>
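To confirm which nodes sit in 1a/1c (and that the replacement landed in 1b), list nodes with their zone and instance-group labels:
kubectl get nodes -L topology.kubernetes.io/zone -L kops.k8s.io/instancegroup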
Phase 6 - Bring up MinIO
kubectl scale sts milvus-release-minio -n milvus --replicas=4
kubectl wait -n milvus -l app.kubernetes.io/name=minio --for=condition=Ready pod --timeout=300s
Phase 7 - etcd recovery
kubectl delete pvc data-milvus-release-etcd-1 -n milvus
kubectl set env sts/milvus-release-etcd ETCD_FORCE_NEW_CLUSTER=true ETCD_INITIAL_CLUSTER_STATE=new -n milvus
kubectl scale sts milvus-release-etcd -n milvus --replicas=1
kubectl wait pod/milvus-release-etcd-0 -n milvus --for=condition=Ready --timeout=120s
kubectl delete pvc data-milvus-release-etcd-2 -n milvus
kubectl scale sts milvus-release-etcd -n milvus --replicas=3
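Not part of the sequence above, but once all three members are healthy it is typical to clear the recovery flags so a later restart doesn’t force a new cluster again; a sketch:
# Trailing '-' unsets an env var with kubectl set env
kubectl set env sts/milvus-release-etcd ETCD_FORCE_NEW_CLUSTER- ETCD_INITIAL_CLUSTER_STATE- -n milvus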
Phase 8 - Start Milvus
helm upgrade milvus-release zilliztech/milvus -n milvus --reuse-values -f /tmp/milvus-migration-values.yaml
kubectl wait -n milvus -l app.kubernetes.io/name=milvus --for=condition=Ready pod --timeout=300s
Phase 9 - Cleanup extra nodes
- Verify all pods on the single node; cordon/drain/delete any stragglers.
Validation Checklist
Infra:
kubectl get nodes -l kops.k8s.io/instancegroup=milvus -o custom-columns='NODE:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone,STATUS:.status.conditions[-1].type'
kubectl get pods -n milvus -o wide
for pvc in $(kubectl get pvc -n milvus -o jsonpath='{.items[*].metadata.name}'); do
vol=$(kubectl get pvc $pvc -n milvus -o jsonpath='{.spec.volumeName}')
zone=$(kubectl get pv $vol -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
echo "$pvc | $zone"
done
# Expect all zones = us-east-1b
Milvus health & data:
kubectl get pods -n milvus
kubectl logs -l app.kubernetes.io/name=milvus -n milvus --tail=200 | grep -i "Auditor loaded segment metadata"
kubectl port-forward -n milvus svc/milvus-release 19530:19530 &
python - <<'EOF'
from pymilvus import connections, utility
connections.connect(host="localhost", port="19530")
print(utility.list_collections())
EOF
pkill -f "port-forward.*milvus"
MinIO contents (sanity):
kubectl exec -it milvus-release-minio-0 -n milvus -- ls /export/milvus-bucket/file/index_files/
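A rough size check on the same path can also flag obviously missing data:
kubectl exec -it milvus-release-minio-0 -n milvus -- du -sh /export/milvus-bucket/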
Lessons You Can Reuse
Kubernetes
- Snapshot first. It’s the only sane way to cross AZs with EBS.
- With CSI `WaitForFirstConsumer`, node placement → AZ. Don’t fight PVC immutability.
- Use values files for Helm upgrades; avoid subchart auth traps.
AWS
- EBS volumes don’t cross AZs; snapshots do (within region).
- IAM for CSI is granular: creating a volume from a snapshot needs explicit rights.
Distributed Milvus
- In Milvus: MinIO = data, etcd = metadata. Protect MinIO PVCs; snapshot etcd.
- etcd DR: Start one restored member with `ETCD_FORCE_NEW_CLUSTER=true`, then scale out.
Final State
- Single node (m6a.xlarge) in us-east-1b, tainted and isolated.
- All PVCs consolidated in 1b (2,070 Gi total).
- Services: Milvus standalone, etcd (3), MinIO (4).
- Data: Four collections, 12 segments, full integrity.
Optional Cleanup & Monitoring
Watch stability and resources:
kubectl top node
kubectl top pods -n milvus
kubectl logs -l app=milvus-release -n milvus --since=24h | grep -i error
Remove snapshots if policy allows:
kubectl delete volumesnapshot -n milvus snapshot-etcd-{0,2} snapshot-minio-{0,1}
Appendix - Minimal IAM Addition for Snapshot Restore
{
"Effect": "Allow",
"Action": "ec2:CreateVolume",
"Resource": "arn:aws:ec2:*:*:snapshot/*"
}
Add to your EBS CSI controller role, then restart the controller.
Outcome: single-AZ Milvus, clean scheduling, lower cost, no data loss, reproducible steps.