I've run an OpenStack cloud. Host-local NVMe drives directly attached to VMs are unbeatable for IO performance. All the major clouds offer this. But that storage is ephemeral, and it was when I implemented it in OpenStack too.
There's not enough redundancy. You could RAID1 those NVMe drives before they get attached to a VM, and that helps with drive failures, but you get fewer of them to attach. Even if you RAID them, there's not a good way to move that VM to another host if there's a RAM or CPU or other hardware issue on that host.
These VMs with directly attached NVMe drives basically have to be treated as bare metal servers, and you have to do redundancy at the application layer (like database replication).
But again, all of the major cloud services offer these types of machines if you NEED NVMe IO speed. There are quirks though. For example, in Azure it seems you have to expect the VM to be moved whenever Azure feels like it, and expect the ephemeral data to be wiped when that happens. In OpenStack, by contrast, we would do local block-level migrations if we HAD to move the VM to another host. That block-level migration required the VM to be turned off, but it did copy the local NVMe data to the other host. When this happened it was always planned, and the application in question had app-level redundancy built in, so it was not a problem. If the host crashed, that VM would simply be down until the host was fixed and came back online.
> There's not enough redundancy. You could RAID1 those NVMe drives before they get attached to a VM, and that helps with drive failures, but you get fewer of them to attach. Even if you RAID them, there's not a good way to move that VM to another host if there's a RAM or CPU or other hardware issue on that host.
The trick is building a block storage system that treats the local disk as a write-back cache with async replication to networked storage, like the blog post says they'll be doing.
The async replication has some integrity/recovery concerns for sure, but it's the trick that enables local speeds. And people have been happy with async replication for their databases for a very long time. You just need good observability for the durability delay.
Once you have that, you can do live VM migration if you're careful enough about dirty data. The new node just starts out with an empty cache.
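To make the idea concrete, here's a toy sketch in Python. It is only an illustration of the pattern described above, not anyone's actual implementation: `WriteBackVolume`, `replicate`, `durability_lag` and `migrate` are all made-up names, the "networked storage" is just a dict, and it ignores everything that makes the real thing hard (crash consistency, write ordering, concurrent IO during migration).

```python
from collections import OrderedDict
import time


class WriteBackVolume:
    """Toy block device: a local cache in front of slower replicated storage.

    All names here are hypothetical; `backend` is a plain dict standing in
    for networked storage, and `cache` stands in for the local NVMe drive.
    """

    def __init__(self, backend, clock=time.monotonic):
        self.backend = backend
        self.cache = {}             # block number -> bytes
        self.dirty = OrderedDict()  # block number -> time of oldest unreplicated write
        self.clock = clock

    def write(self, block, data):
        # Acknowledge once the local copy is written; replication happens later.
        self.cache[block] = data
        # Keep the timestamp of the first unreplicated write to this block.
        self.dirty.setdefault(block, self.clock())

    def read(self, block):
        if block not in self.cache:
            # Cache miss: fetch from networked storage and keep a local copy.
            self.cache[block] = self.backend[block]
        return self.cache[block]

    def replicate(self, max_blocks=None):
        """Flush dirty blocks to the backend, oldest first. Returns the count."""
        flushed = 0
        while self.dirty and (max_blocks is None or flushed < max_blocks):
            block, _ = self.dirty.popitem(last=False)
            self.backend[block] = self.cache[block]
            flushed += 1
        return flushed

    def durability_lag(self):
        """Age in seconds of the oldest write that exists only locally."""
        if not self.dirty:
            return 0.0
        oldest = next(iter(self.dirty.values()))
        return self.clock() - oldest


def migrate(volume):
    """Drain dirty data, then attach the backend on a new node with an empty cache."""
    volume.replicate()
    assert not volume.dirty, "must not migrate with unreplicated writes"
    return WriteBackVolume(volume.backend, clock=volume.clock)
```

`durability_lag()` is the number you'd want on a dashboard: it is how much recent work you lose if the host dies right now. `migrate()` shows why the new node can start cold: once the dirty set is drained, the backend holds everything, and reads on the new host simply miss the cache until it warms up.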
It's not exactly trivial, but it's also probably not the biggest challenge if you're genuinely building a brand new cloud and going to compete against the hyperscalers. (Hell, hire me and I can write it for you. It'll take time and CPU hours to get stable, but the magic required is only mildly arcane.)
> Even if you RAID them, there's not a good way to move that VM to another host if there's a RAM or CPU or other hardware issue on that host.
This is the critical point. All hardware fails eventually. The CPU and RAM are, in a real sense, also ephemeral. The only relevant question is what the risk tolerance of the use case is. If restoring from an async backup is sufficient, then embrace ephemerality and keep backups. If you need round-the-clock availability, pick an architecture that lets you fail over gracefully to another machine, and embrace the ephemerality when you inevitably need to do so.
When you're an OpenStack cloud provider, your customers choose.
When you're a customer using Open Source software, your vendors choose.
Using a mixture of directly attached NVMe and network-attached volumes with backup is the sweet spot for me.
I don't need to maintain my own network filesystem (Ceph), and I can put applications that mirror their databases natively on NVMe and everything I don't have much control over on network-attached volumes.
I feel like there's something better not yet made.