Cloud link collection
Cloud exit
- Wikipedia: Hype-Zyklus
- Fernando Álvarez, 37signals, 2023-01-13: Our cloud spend in 2022
- Farah Schüller, 37signals, 2023-03-21: De-cloud and de-k8s — bringing our apps back home
For the Operations team at 37signals, the biggest effort in 2023 is removing our dependencies on the cloud and migrating our application stacks back into the data center onto our own hardware. We’ve already made amazing progress in a fairly short time — let’s get into some details!
- Moritz Förster, heise online, 2023-04-08: Drei Fragen und Antworten: Es gibt einen Weg aus der Cloud
Escaping cloud dependency: your own data center can be better not only economically, and the move back is not a nightmare either.
- Stefan Krempl, heise online, 2023-04-09: EU-Cloud-Wettbewerber: Microsofts Preissteigerungen reichen an Erpressung
As of April 1, Microsoft raised the prices of its cloud products by eleven percent. Affected customers and competitors increasingly take strong offense at the price hike.
- Jahir B Navaz, Chhavi Saluja, KPMG UK, 2023-04-13: Why a “Cloud Exit Strategy” is essential to enable the future
- Pete Scott, Percona, 2023-05-09: Vendor Lock-in: What It Is and How To Avoid It
- Daniel Nichter, 2023-05-12: Are Aurora Performance Claims True?
- David Heinemeier Hansson, hey.com, 2023-06-23: We have left the cloud
- David Heinemeier Hansson, 2023-09-15: Our cloud exit has already yielded $1m/year in savings
- Microsoft, 2023-09-22: Azure Database for MariaDB will be retired on 19 September 2025 – Migrate to Azure Database for MySQL Flexible Server
- Microsoft, 2023-09-22: What’s happening to Azure Database for MariaDB?
- Renato Losio, InfoQ, 2023-09-30: Azure Database Drops Support for MariaDB
- David Linthicum, InfoWorld, 2024-02-09: Why companies are leaving the cloud
Cloud is a good fit for modern applications, but most enterprise workloads aren’t exactly modern. Security problems and unmet expectations are sending companies packing.
- Rob Pankow, The New Stack, 2024-11-05: Why Companies Are Ditching the Cloud: The Rise of Cloud Repatriation - Major organizations like 37signals and GEICO highlight the economic and strategic reasons to reconsider cloud infrastructure.
- Nia Teerikorpi, Continuent, 2025-03-13: The Cloud Repatriation Debate: Why Compute Flexibility Is the Real Trend
- Nick Van Wiggeren, PlanetScale, 2025-03-18: The Real Failure Rate of EBS
Virtualization
- Dynatrace: Virtualization’s Impact on Performance Management - Why System Metrics in the Guest Are Not Trustworthy
- Frank Denneman, 2010-12-16: Impact of oversized virtual machines part 1
- Frank Denneman, 2010-12-17: Impact of oversized virtual machines part 2
- Peter Senna Tschudin, 2012: Performance Overhead and Comparative Performance of 4 Virtualization Solutions, [local copy](/computer/cloud-link-collection/lokal copy peters-top4-virtualization-benchmark-1.29.pdf)
- Frank Denneman, 2013-09-18: vCPU configuration. Performance impact between virtual sockets and virtual cores?
- Frank Denneman, 2016-12-12: Decoupling of Cores per Socket from Virtual NUMA Topology in vSphere 6.5
- Seyed Alireza Mustafa on medium.com, 2020-02-02: Virtualization Performance Penalty
Tests showed that performance loss due to virtualization on ESXi varies from almost nothing to 29% depending on the operation type. … You may think that adding more threads (of course less than the number of real cores) may add to your performance. This is simply wrong. An operation run at high thread count, may indeed saturate the CPU caches or the memory bus and lead to an obvious performance loss.
- Gordan Bobic, Shattered Silicon, 2020-03-19: Virtualization Performance – or lack thereof
People always seem very shocked when I suggest that virtualization comes with a very substantial performance penalty even when virtualization hardware extensions are used. Concerningly, this surprise often comes from people who have already either committed their organization’s IT infrastructure to virtualization, or have made firm plans to do so. The only thing I can conclude in these cases, unbelievable as it may appear, is that they haven’t done any performance testing of their own to assess the solution they are planning to adopt.
… The difference is substantial even with the least poorly performing hypervisor. Virtualization performance is over a 5th (21%) down with paravirtualized Xen compared to bare metal, and nearly a quarter (24%) lower than bare metal with VMware ESXi, and even worse with KVM. Or if you prefer to look at it the other way around, bare metal is more than a quarter as fast again (26.32%) as the best performing hypervisor on the same hardware.
- Gordan Bobic, Shattered Silicon, 2020-10-13: Virtualization Performance Overheads – Part 2 – Nehalem and Sandy Bridge
VM is 1.979x slower than bare metal per core-GHz.
VM has 50.52% of performance of bare metal per core-GHz.
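As a reading aid for the two lines above: "per core-GHz" is a normalization of throughput by the compute budget used. A minimal sketch of that arithmetic (Python; the function name and the absolute numbers are illustrative, only the 1.979x / 50.52% pair is from the article):

```python
def per_core_ghz(throughput: float, cores: int, ghz: float) -> float:
    """Normalize a benchmark result by the compute budget it consumed (cores x clock)."""
    return throughput / (cores * ghz)

# Made-up absolute numbers; only the 1.979x ratio comes from the article.
bare_metal = per_core_ghz(throughput=1000.0, cores=4, ghz=3.0)
vm = per_core_ghz(throughput=1000.0 / 1.979, cores=4, ghz=3.0)
print(vm / bare_metal)  # ~0.505, i.e. the quoted "50.52% of bare metal per core-GHz"
```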
- VMware, 2022: Performance Best Practices for VMware vSphere 8.0 or local copy
For a small percentage of workloads, for which CPU virtualization adds overhead and which are CPU-bound, there might be a noticeable degradation in both throughput and latency.
If an ESXi host becomes CPU saturated (that is, the virtual machines and other loads on the host demand all the CPU resources the host has), latency-sensitive workloads might not perform well. In this case you might want to reduce the CPU load, for example by powering off some virtual machines or migrating them to a different host (or allowing Distributed Resource Scheduler (DRS) to migrate them automatically).
For the best performance, try to size your virtual machines to stay within a physical NUMA node.
For example, if you have a host system with six cores per NUMA node, try to size your virtual machines with no more than six vCPUs.
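To apply that sizing rule you need to know how many CPUs each NUMA node on a host actually has. A minimal sketch for a Linux machine (standard sysfs paths; the script and the six-vCPU example figure are illustrative, not from the VMware guide):

```python
# List the CPUs per NUMA node from sysfs and check a planned vCPU count against it.
# Note: cpulist counts logical CPUs (threads); with SMT enabled, divide by 2 to
# compare against physical cores.
import glob
import re

def cpus_per_node() -> dict:
    """Return {'node0': n, 'node1': m, ...} from /sys/devices/system/node/."""
    result = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node*/cpulist")):
        node = re.search(r"node\d+", path).group()
        count = 0
        with open(path) as f:
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                count += int(hi or lo) - int(lo) + 1
        result[node] = count
    return result

if __name__ == "__main__":
    planned_vcpus = 6  # the example figure from the quote above
    for node, cpus in cpus_per_node().items():
        verdict = "fits in" if planned_vcpus <= cpus else "spans more than"
        print(f"{node}: {cpus} logical CPUs; a {planned_vcpus}-vCPU VM {verdict} one node")
```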
- Broadcom, 2025-04-01: Performance Implications of CPU Virtualization
CPU virtualization adds varying amounts of overhead depending on the workload and the type of virtualization used.
An application is CPU-bound if it spends most of its time executing instructions rather than waiting for external events such as user interaction, device input, or data retrieval. For such applications, the CPU virtualization overhead includes the additional instructions that must be executed. This overhead takes CPU processing time that the application itself can use. CPU virtualization overhead usually translates into a reduction in overall performance.
For applications that are not CPU-bound, CPU virtualization likely translates into an increase in CPU use. If spare CPU capacity is available to absorb the overhead, it can still deliver comparable performance in terms of overall throughput.
ESXi supports up to 128 virtual processors (CPUs) for each virtual machine.
Container / Docker / Kubernetes
- plusserver, 2024-10-10: 12 Dos & Don‘ts für Kubernetes-Container
- Don't: ignore updates
… update container and base images as well as the Kubernetes version regularly to close potential security gaps
- Don't: store sensitive data in container images
For sensitive data it is advisable to use dedicated tools such as HashiCorp Vault or similar secret-management tools to store and manage the required credentials. …
- Berk Ulsoy, 2022-11-27: Do’s and Dont’s When Moving to Kubernetes
- Start with a managed service
… avoid owning the control plane and node management because it increases the workload on the team immensely. Even managed service option will still require some level of management that we don’t do with the many other managed services. The team still has to test and execute the worker node upgrades (on top of k8s version upgrades), manage node pools and costs. There is still work to be done for the security of the worker nodes even when it’s given by a cloud provider.
- Avoid persistent volumes
Keeping the cluster stateless, free to destroy and rebuild any part of it anytime you want greatly decreases operational complexity. Once you put persistent volumes to the cluster, you will have to start thinking about the provisioning, migration, backup, resizing, performance, security, monitoring of those volumes in an automated manner. You will need to work on integrating the storage solutions of your provider to the cluster.
Instead, just keep the stateful data outside of kubernetes as long as possible, keep your cluster stateless.
- Conclusion
Creating a new product that will run on kubernetes is an exciting journey. It also opens many doors to business as a direct or indirect enabler of various outcomes. However, greed brings risk of spoiling all if it is the first time for the team to work with it. Adoption of kubernetes goes smoother if we resist adding complexity until they are unavoidable and until we gain enough knowledge and experience on our new platform.
- Pavan Belagatti, 2022-02-14: Kubernetes Mistakes: A Beginner’s Guide to Avoiding Common Pitfalls
- Maryna Cherednychenko: When Don’t You Need Kubernetes?
- Projects that do not need to use Kubernetes
For small and medium-sized companies, containerization in general and Kubernetes in particular may not be needed. Such companies may find the Kubernetes setup too complex and resource intensive.
Also, if you have monolithic architecture, deploying and managing it manually is simpler than using automatic deployment tools.
When it becomes too difficult to maintain the code, you may ‘cut’ the project into microservices with their own interfaces. It is still possible, then, to manage microservices without the orchestrator or with a less complicated orchestrator.
Additionally, if you have a limited budget and are not going to extend your development team, consider using virtual machines. You can manage them with the help of a configuration management system, for example, Ansible. Such machines do not require high qualification of staff that maintains them.
All in all, the types of projects where you do not need Kubernetes are as follows:
- Monolithic applications
- Applications with predictable user traffic
- Low-loaded or mid-loaded applications
- Static websites
- Single-instance applications
- Resource-constrained environments
- Kubernetes cons
- Complexity. Kubernetes is a sophisticated technology that cannot be figured out in one day. You should hire experienced DevOps to use Kubernetes in your project or invest in the education and training of less experienced experts.
- Networking setup. Kubernetes has a higher level of networking abstraction than traditional networking technologies. It treats containers and pods as first-class citizens. Each such citizen has their own IP address. Such an abstraction provides greater flexibility but, at the same time, might be confusing for those used to a more traditional setup.
- Compatibility. If you decide to upgrade or downgrade Kubernetes to a specific version, you may face compatibility problems with existing application components. It is crucial to check the Kubernetes release documentation before the upgrade and make sure that version migration does not ruin the current infrastructure setup.
- Debugging challenges. Kubernetes is used for setting up infrastructure in complex distributed systems. Such systems have numerous interconnected components. When an issue appears, it is challenging to detect the source of the problem and troubleshoot it immediately.
- Security concerns. Misconfigurations in Kubernetes can lead to application vulnerabilities and potential data breaches. Some common mistakes during the Kubernetes configuration are inadequate role-based access control, excessive pod permissions, insecure network policies, unsecured API servers, lack of network isolation, and unchecked admission controllers.
- Vendor lock-in. You may face many challenges if you decide to switch from Kubernetes to another orchestration system. These may include making changes to configuration files and deployment manifests, implementing app refactoring, and revising the integration ecosystem.
- Kubernetes costs
- Kubernetes alternatives
- Herve Khg, 2023-11-12: 3 years managing Kubernetes clusters, my 10 lessons
- Lesson 1: Use Kubernetes in the cloud
Unless there’s an extreme constraint, it’s unnecessary to manage Kubernetes’ underlying infrastructure yourself. You’ll spend your time debugging problems that don’t add value to your business. …
- Lesson 8: Think stateless
Ideally, it’s better to avoid persisting data in your pods. If for some reason it’s not possible otherwise, then prefer mounts on NAS rather than on disks. …
- Lesson 10: Don’t be afraid of change
On average, you should plan for three version upgrades of your cluster per year, about one update every four months. Some updates are transparent, but often there will be changes with impacts. To better prepare for these updates, I recommend reading, re-reading, and revisiting the release notes and the experiences of those who have updated before you.
Feedback from KrisK
We keep seeing, more and more often, customers with large VMs: 16-core VMs, 32-core VMs… (all-virtualized policy).
And when they generate load, we see 1600% CPU usage or 3200% CPU usage.
Do you have any solid documents/links for me saying that large VMs are nonsense? Or, if they are not nonsense, I can live with that…
So far I have not been able to find anything useful:
“This article does not focus on large virtual machines that are correctly configured for their workloads.”
“For a small percentage of workloads, for which CPU virtualization adds overhead and which are CPU-bound, there might be a noticeable degradation in both throughput and latency.”
Large VMs are not nonsense.
I have run Oracle MySQL 8.0 on an AMD EPYC 2nd gen single-socket machine without a my.cnf, at 400,000 QPS without any tuning (that is not a VM, that was bare metal). Current MySQL scales well, and with tuning and configuration you easily get to > 1M QPS.
We have seen "memcached" queries (SELECT value from t where id = …, with the row in memory) at 125 µs query resolution time, through the SQL interface, i.e. not via the X Protocol. But without all the scaling problems and all the persistence problems of memcached.
For large CPUs and large VMs, however, the following applies:
- NUMA is a bitch. The best thing is to have a single-socket box, i.e. a big single-socket AMD EPYC, and if you want more, a second box instead of dual-socket. AMD gives you 128 PCIe lanes with a single socket and 128 PCIe lanes with dual socket, because the second set of 128 PCIe lanes goes to the interconnect. Add NUMA on top of that: the left socket accessing memory on the right socket = additional latency = bad.
- Pin your VMs. On dual-socket that means "numactl -N 0 -m 0 command …" or whatever. One option pins the cores to socket 0, the other pins the memory to socket 0. That way the VM sits right next to its memory, fixed on the same socket.
- Buy a hard CPU. If you use all the cores on a chip, the CPU gets slower because it runs into thermal problems. AMD is more resilient there than Intel, and with Intel the parts are staggered like crap. See for example the frequency table at https://en.wikichip.org/wiki/intel/xeon_gold/6230#Frequencies. The 6230 is (supposedly) a hard CPU, but even that one does 3900 MHz with 1 core busy and only 2800 MHz with all cores busy.
That (crosstalk) is a serious problem for VMs that are used CPU-intensively; databases usually are not, they are I/O- and memory-bound.
- Balance. 1 vCPU = 4 GB RAM; for databases, 1:8 or even 1:16 is also fine. If you have too little RAM, you do I/O, and then you just wait faster and with more cores. There is no point in building an imba box (imbalanced, i.e. with the wrong ratio of cores to memory to network).
- RAM also has to be filled, i.e. data loaded from disk, and data ultimately ends up on disk coming from the network. A balanced box is therefore 1 core, 4 GB RAM and x MBit/s of network; we often calculate with 100 MBit/s or so. You can work this out yourself by asking how long you are willing to wait for 1 GB, 100 GB or 1 TB of data. 200 MByte/s is a spinning disk, and that is 1600 MBit/s, i.e. 1.6 GBit/s. 400 MByte/s is an SSD reading linearly, i.e. 3200 MBit/s or 3.2 GBit/s, which works out to 100 MBit/s per vCPU on a 32 vCPU VM (see the sketch right after this list).
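A back-of-the-envelope version of this arithmetic as a small Python sketch (the throughput figures, the 1 vCPU : 4 GB ratio and the 32-vCPU size are the ones quoted above; the function names are mine):

```python
# Rough sizing arithmetic for a "balanced" box: disk throughput -> network rate
# -> per-vCPU share, plus how long filling RAM from disk takes.
MBIT_PER_MBYTE = 8

def disk_to_network_mbit(mbyte_per_s: float) -> float:
    """Network rate (MBit/s) needed to keep up with a given disk throughput (MByte/s)."""
    return mbyte_per_s * MBIT_PER_MBYTE

def per_vcpu_mbit(total_mbit: float, vcpus: int) -> float:
    """Split a machine-wide network rate evenly across the vCPUs."""
    return total_mbit / vcpus

def load_time_s(data_gbyte: float, mbyte_per_s: float) -> float:
    """Seconds needed to stream a given amount of data at a given throughput."""
    return data_gbyte * 1024 / mbyte_per_s

hdd, ssd, vcpus = 200, 400, 32                           # MByte/s and VM size from the text
print(disk_to_network_mbit(hdd))                         # 1600.0 MBit/s = 1.6 GBit/s
print(disk_to_network_mbit(ssd))                         # 3200.0 MBit/s = 3.2 GBit/s
print(per_vcpu_mbit(disk_to_network_mbit(ssd), vcpus))   # 100.0 MBit/s per vCPU
print(load_time_s(vcpus * 4, ssd))                       # ~328 s to fill 128 GB of RAM at 400 MByte/s
```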
Firma B until 2020: bare-metal blades with dual 4110 (32 vCPU), 128 GB RAM, a 10 GBit/s NIC, and locally 1x or 2x 1.92 TB Micron SSD (800,000 IOPS, 1/20,000 s clat).
Firma B now (so I am told; I am no longer there): 4th gen EPYC boxes with an imba amount of RAM (2 TB, 4 TB) and VMs packed densely on top, any size up to half a machine (they unfortunately bought dual-socket).
The database people are happy about that: lots of RAM, very good.
Apart from that, a "32-core VM" is not a big machine. Those are the bread-and-butter bare-metal blades for MySQL at Firma B; I had 50,000 of them when I was planning data centers at Firma B. They are roughly the smallest machines you can buy cost-effectively. They cost Firma B 120 EUR/month (150 EUR/month with 2x NVMe), over a usage period of 5 years. Of that, 50% of the cost is depreciation of the machine, chassis, the share of network and rack, and data-center costs, and 50% is power/energy.
Firma B and Oracle MySQL sat down together quarterly and discussed things; since about 2015 machines like that have never been a problem for any MySQL version. Otherwise I would unfortunately have had to give the MySQL people there a slap on the neck for sleeping on the job.
In our VM experiments with really big machines (96 cores per socket, 192 vCPU), as I said, only the pinning and NUMA management were a problem, because everything turns ugly when you have a 128 GB buffer pool of which half sits behind the other CPU and is 3x slower.
--membind=nodes, -m nodes
Only allocate memory from nodes. Allocation will fail when there is not enough memory available on these nodes.
nodes may be specified as noted above.
--cpunodebind=nodes, -N nodes
Only execute command on the CPUs of nodes. Note that nodes may consist of multiple CPUs. nodes may be specified as noted above.
OK, so "numactl -N 0 -m 0 blabla" to bind to socket 0, and "-N 1 -m 1" for the other socket.
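To check whether a process started this way actually got its memory from the intended node, you can sum the per-node page counters in /proc/<pid>/numa_maps. A minimal sketch (assumes Linux; the 4 KB page size used for the MB estimate is a simplification, huge pages are counted as single pages):

```python
# Sum the N<node>=<pages> counters over all mappings of a process.
import re
import sys
from collections import Counter

def pages_per_node(pid: int) -> Counter:
    totals = Counter()
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
                totals[int(node)] += int(pages)
    return totals

if __name__ == "__main__":
    pid = int(sys.argv[1])  # e.g. the PID of the pinned mysqld or VM process
    for node, pages in sorted(pages_per_node(pid).items()):
        print(f"node {node}: {pages} pages (~{pages * 4 // 1024} MB at 4 KB pages)")
```

If the pinning worked, essentially all pages should end up on the node passed to -m / --membind.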
Hi Kris, thank you very much for your detailed answer! To summarize: VMware VMs with "many" vCores (16-32) are not a problem even when the CPU is 100% busy in user time (1600%-3200% CPU user time, no I/O because everything is read from the buffer pool), as long as you make sure the VM is pinned to one NUMA node… Otherwise you have to expect about 1/3 of the performance… Did I get that roughly right? Best, Oli
How much less it is varies, because all kinds of other things can happen once you saturate the bridge between the sockets. Things turn ugly quickly then, and it is hard to measure.
But yes, broadly correct. I have been able to test an 8.0 with up to 192 vCores; it is not bad, but you may have to fiddle with it a bit.
Why does everything turn ugly? On most Intels, network and I/O are attached in hardware to socket 0. If you saturate the bridge between the sockets on an Intel, everything except socket 0 starves.
To handle that, device drivers generate interrupts, so IRQ handling is usually forced onto socket 0. In other words, the boxes are not symmetric at the hardware level. Inside a VM that does not matter, but on the host it does.