Thick Disk v. Thin Disk -- "It Depends!"

It is surprising how often this question comes up: "Should we thin-provision or thick-provision our VMs?" My answer is almost always the same: "It depends." Today I have the same answer, but for very different reasons than in the past. Back then, my answer was based on performance and on saving expensive disk space. These are still very valid concerns, but now I answer based on risk. In recent days I have been working with three different clients that have all experienced outages related to thin-disk provisioning.

In each case the outage was caused by a lack of disk space. A critical storage LUN filled up, and the VMs could no longer log or write to their disks, so they crashed. In two of these incidents alerting was enabled and the disk space was being monitored. In the first, the alert came too late because a runaway process started filling up the disk; the thin-provisioned disk grew to its "allocated" size and filled the LUN. In the second, the alerts fired and were ignored, and then over the weekend the LUN filled up completely. In both of these situations the monitoring wasn't adequately understood and some alerts were ignored. The third... well, they just didn't know until too late.

I am not going to talk about monitoring or how to monitor. That's another subject for another day.

There are huge benefits to thin-disk provisioning. The vendors are right to push it hard, helping corporations get their storage costs down, and corporations are right to accept it. However, and here is where the problem lies, monitoring must trend not just the allocated amounts but the pace of consumption, and there must be a planned response to a space shortage before it becomes an outage.

  • Are you monitoring the LUNs, not just for % of use, but also the trend in growth?
    A 1TB LUN with 150GB free that is consuming 10GB an hour is going to run out of space in about 15 hours, well under a day.
  • Would you catch this?
  • How? Are you sure?
  • Do you test your monitoring?
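
To make the trend question concrete, here is a minimal sketch of a "time until full" check. It is not taken from any monitoring product: the function names, the sample format of (hour, used GB) pairs, and the 24-hour response window are all assumptions for illustration. The growth rate is estimated crudely from the first and last samples only.

```python
# Illustrative sketch only: names, sample format, and thresholds are assumptions.

def growth_rate(samples):
    """Estimate growth in GB/hour from (hour, used_gb) samples.

    Uses only the first and last samples, which is a crude estimate.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 == t0:
        raise ValueError("need samples taken at two different times")
    return (used1 - used0) / (t1 - t0)


def hours_until_full(free_gb, rate_gb_per_hour):
    """Project how many hours remain before the LUN is full."""
    if rate_gb_per_hour <= 0:
        return float("inf")  # not growing, so no projected fill time
    return free_gb / rate_gb_per_hour


def should_alert(free_gb, rate_gb_per_hour, response_hours):
    """Alert when the projected fill time is inside the response window."""
    return hours_until_full(free_gb, rate_gb_per_hour) <= response_hours


if __name__ == "__main__":
    capacity_gb = 1000.0                               # the 1TB LUN from the example
    samples = [(0, 830.0), (1, 840.0), (2, 850.0)]     # hypothetical readings
    rate = growth_rate(samples)                        # 10.0 GB/hour
    free = capacity_gb - samples[-1][1]                # 150.0 GB free
    print(hours_until_full(free, rate))                # 15.0
    print(should_alert(free, rate, response_hours=24.0))  # True
```

Note that a plain percentage threshold would see this LUN at 85% used and might stay quiet, while the projection says there are 15 hours left. The response window should be set to however long it really takes your organization to free up or add space.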

Do you have a plan for disk-space shortages? Planning your response to shortages is critical.

  • Where and how are you going to move the VMs on a LUN that is growing?
  • Who is going to shuffle the VMs around, and how?
  • How will you get more space?
  • How long does the process to get more space take? SAN storage is expensive. In most corporations, it isn't easy to get more storage to solve this problem. Why? Because the storage admins know that if they give it out, they won't get it back. And it may not exist to be given out.

Do you have a response that prevents the scenario above, running out of disk space? If you are doing thin-disk provisioning, then you need one!!! And you need effective monitoring: either people watching or automated systems alerting appropriately. From where I sit, most places don't have effective monitoring on an ongoing (24x7) basis.

Without good preventive systems, when you over-provision storage (that is what thin-disk allows you to do), you will run out of space, and you will have an outage. The question becomes how bad it will be. The answer: "it depends." It depends on what VMs are on the LUN when it happens, on how quickly you respond, on when the outage occurs, and on your response.
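
One way to see how exposed a LUN is to over-provisioning is to compare what has been promised to the VMs against what physically exists. The sketch below is an assumption-laden illustration, not a vendor tool: the function name, the report fields, and the sample numbers are all hypothetical.

```python
# Illustrative sketch only: function name, fields, and numbers are assumptions.

def overcommit_report(lun_capacity_gb, vm_disks):
    """Summarize over-provisioning on one LUN.

    vm_disks is a list of (provisioned_gb, used_gb) pairs, one per thin disk.
    """
    provisioned = sum(p for p, _ in vm_disks)
    used = sum(u for _, u in vm_disks)
    free = lun_capacity_gb - used
    return {
        # Above 1.0 means more space is promised than physically exists.
        "overcommit_ratio": provisioned / lun_capacity_gb,
        "free_gb": free,
        # Space that would be missing if every disk grew to its full size.
        "worst_case_shortfall_gb": max(0.0, provisioned - lun_capacity_gb),
        # True if one disk growing to its allocated size could fill the LUN.
        "single_disk_can_fill": any(p - u >= free for p, u in vm_disks),
    }


if __name__ == "__main__":
    disks = [(500.0, 300.0), (400.0, 250.0), (600.0, 300.0)]  # hypothetical VMs
    print(overcommit_report(1000.0, disks))
```

With these made-up numbers, 1,500GB is promised on a 1,000GB LUN, and any one of the three VMs could fill the remaining 150GB on its own, which is the runaway-process situation described earlier.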

My recommendation today is to do thick-disk provisioning. It is all about risk mitigation: there is less risk of an outage due to no disk space. Yes, the cost is higher on the front end. Yes, it is a very real, quantifiable cost. The cost of the outage you'll suffer, well, "it depends." One of the clients above has now lost customer confidence in their solution and is losing some customers. Another has lost real money to the outage and is now evaluating whether thin-disk is worth the risk. The answer so far is no.

In the past, I would have answered the question purely on performance. Now, performance is a minor factor. I am more concerned about the corporate response and the risks. For production solutions, and wherever risk needs to be minimized, I lean heavily toward thick-disk provisioning. In test environments and non-critical solutions, I lean the other way. It all depends on your risk and your maturity. Most of the time, I recommend thick-disk, and probably you should too!!!