At Devs.lv, we've managed 50+ Kubernetes clusters across industries — from fintech to e-commerce. These are the 5 most common mistakes we see again and again, with specific advice on how to fix them.
1. No Resource Limits — The "Noisy Neighbor" Problem
Without CPU/memory limits, one pod can consume the entire node's resources. The result — other pods start getting OOMKilled or CPU throttled, and your production service slows down with no obvious cause.
Solution:
- Always define `resources.requests` and `resources.limits` for every container (see the sketch after this list)
- Start with `requests: cpu: 100m, memory: 128Mi` and adjust based on actual usage data
- Use `LimitRange` and `ResourceQuota` at the namespace level as a safety net
- Set up monitoring with Prometheus + Grafana to see actual resource consumption
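A minimal sketch of what that looks like. All names and values here are illustrative assumptions (the `api` pod, the image path, the `production` namespace, and the limit ceilings), not recommendations for your workload:

```yaml
# Illustrative values: tune requests/limits from observed usage, not guesses.
apiVersion: v1
kind: Pod
metadata:
  name: api                # hypothetical workload name
spec:
  containers:
    - name: api
      image: registry.example.com/api:1.0.0   # placeholder image
      resources:
        requests:
          cpu: 100m        # the starting point suggested above
          memory: 128Mi
        limits:
          cpu: 500m        # assumed ceiling; pick one that fits your app
          memory: 256Mi
---
# Namespace-level safety net: default requests/limits for containers that omit them.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 256Mi
```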
2. Single Replica — The "Hope It Doesn't Crash" Approach
Deployment with replicas: 1 is not production-ready. If that single pod crashes or the node restarts, your service is offline. Even for a few seconds, that can mean lost orders or frustrated customers.
Solution:
- Minimum 3 replicas for every production deployment
- Configure `podAntiAffinity` to spread replicas across different nodes (see the sketch after this list)
- Use a `PodDisruptionBudget` (PDB) to guarantee a minimum replica count during node maintenance
- Consider `topologySpreadConstraints` for distribution across availability zones
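Putting those pieces together, a sketch could look like this. The `app: api` labels, the image, and `minAvailable: 2` are assumptions for illustration:

```yaml
# Sketch: 3 replicas spread across nodes and zones, plus a PDB for node maintenance.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                          # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: api
              topologyKey: kubernetes.io/hostname    # at most one replica per node
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # spread across availability zones
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: registry.example.com/api:1.0.0      # placeholder image
---
# Keep at least 2 replicas up during voluntary disruptions (drains, upgrades).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```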
3. No Health Checks — The "Schrödinger Pod"
Without liveness and readiness probes, Kubernetes can't tell if your application is actually working. A pod can be in "Running" status while the service inside is frozen or not handling requests.
Solution:
- Readiness probe — determines when a pod is ready to receive traffic. Without it, Kubernetes sends traffic to pods that aren't ready yet.
- Liveness probe — determines if a pod is "alive." If not, Kubernetes automatically restarts it.
- Startup probe — use for applications with long startup times (Java, .NET) so the liveness probe doesn't kill slow-starting pods.
- HTTP endpoints are better than TCP checks: `/healthz` or `/ready` can verify database connections and other dependencies (see the example after this list).
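Here is a sketch of all three probes, assuming the application listens on port 8080 and exposes `/healthz` and `/ready` endpoints; the snippet slots into a pod template's `containers` list, and the timing values are starting points rather than prescriptions:

```yaml
containers:
  - name: api
    image: registry.example.com/api:1.0.0   # placeholder image
    ports:
      - containerPort: 8080
    startupProbe:                  # gives slow starters (Java, .NET) up to ~2.5 minutes
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 5
    readinessProbe:                # gate traffic until dependencies are reachable
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 10
    livenessProbe:                 # restart the container if it stops responding
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```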
4. Secrets in YAML Files — "They Really Did Commit Credentials"
Kubernetes Secrets are base64 encoded, not encrypted. If you commit them to a Git repository, anyone with repo access can read them. We've seen database passwords, API keys, and even private certificates in public Git repos.
Solution:
- External Secrets Operator — syncs secrets from AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault directly into your Kubernetes cluster
- Sealed Secrets — encrypts secrets with a public key so they can be safely stored in Git
- SOPS — Mozilla's tool for encrypting YAML/JSON files with KMS or PGP
- Add `.gitignore` entries and pre-commit hooks that block accidental secret commits
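As an illustration of the External Secrets Operator option, a minimal sketch. It assumes the operator is installed and that a `ClusterSecretStore` named `aws-secrets-manager` already points at your backend; every name and key path is a placeholder:

```yaml
# API version may vary with your operator release.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # assumed, pre-existing store
    kind: ClusterSecretStore
  target:
    name: db-credentials           # the Kubernetes Secret that gets created
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/db               # path of the secret in the external store
        property: password
```

Nothing sensitive lives in Git with this approach; only the reference to the external store is committed.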
5. No Network Policies — "Everyone Sees Everything"
By default, all pods can communicate with all other pods in the cluster. This means a compromised pod in your staging namespace can reach the production database. One vulnerability can become a full cluster breach.
Solution:
- Start with a "deny all" default policy for each namespace
- Open only necessary connections — e.g., frontend → backend → database
- Use namespace isolation — production, staging, and development should never communicate with each other
- Consider a service mesh (Istio, Linkerd) if you need mTLS and granular traffic control
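A minimal sketch of the first two bullets: a namespace-wide default deny, followed by a policy that lets the frontend reach the backend. The labels, namespace, and port are assumptions for this example:

```yaml
# Default deny for every pod in the namespace: no ingress or egress until allowed explicitly.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}                  # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Then open only what is needed, e.g. frontend -> backend on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Keep in mind that a default-deny egress policy also blocks DNS, so in practice you will need an additional rule allowing egress to kube-dns.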
Bonus: Lack of Monitoring
Many set up Kubernetes and forget about monitoring. Without metrics and alerts, you'll learn about problems only when a customer reports an error.
Minimum monitoring stack: Prometheus + Grafana + Alertmanager. Add Loki for log analysis and Jaeger or Tempo for distributed tracing.
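As a starting point for alerting, a sketch of a Prometheus rule file; it assumes kube-state-metrics is being scraped, and the thresholds are illustrative rather than SLO-driven:

```yaml
groups:
  - name: kubernetes-basics
    rules:
      - alert: PodRestartingTooOften
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15 minutes"
      - alert: ContainerOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
```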
Conclusion
These mistakes are simple to prevent, yet they cause 80% of the production incidents we see. Most of the fixes are preventive measures that take a few hours to implement but save you from nights spent on incident resolution.
Want a Kubernetes security audit? We can review your cluster in 2-3 days and provide a concrete action plan.
