Q: I deleted a bunch of files from one of my virtual machines yesterday. Deduplication happened overnight, but the total disk space in use didn’t go down. That doesn’t make any sense.
Q: I completely evacuated one of the LUNs on my NetApp array, but the NetApp still says that the LUN is almost completely full, even after deduplication. How can that be?
A: To understand what is happening you need to know a little bit about how a file system works. A simple way to explain it is that a file system stores the data in a file as data blocks, and it stores the name of the file (and other data, like access times, etc.) in a directory block. The entry in a directory block points to where all the file’s data blocks are.
Files are found on disk via their directories. When you move a file from one directory to another on the same file system, its entry gets added to the new directory block and removed from the old one. All the data blocks that belong to the file stay put, because it’s only the directory blocks that need to change.
When you rename a file the file system just changes its name in the directory block, and leaves all the data alone.
When you want to delete a file, all the file system has to do is remove the entry in the correct directory block. Once that happens the file is “gone.” However, the file system does nothing to remove the data blocks that were part of that file. They’re still out there on disk, just not visible to the file system.
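To make the move, rename, and delete behavior concrete, here is a minimal sketch of the simplified model described above. The `ToyFileSystem` class and all of its names are made up for illustration. It is not how any real file system is implemented, and it deliberately ignores details such as free-space tracking.

```python
class ToyFileSystem:
    """Toy model: directory blocks map file names to data-block numbers."""

    BLOCK_SIZE = 4  # tiny on purpose, so a short string spans several blocks

    def __init__(self):
        self.data_blocks = {}          # block number -> bytes stored "on disk"
        self.directories = {"/": {}}   # directory -> {file name: [block numbers]}
        self._next_block = 0

    def mkdir(self, directory):
        self.directories[directory] = {}

    def write(self, directory, name, data):
        numbers = []
        for start in range(0, len(data), self.BLOCK_SIZE):
            self.data_blocks[self._next_block] = data[start:start + self.BLOCK_SIZE]
            numbers.append(self._next_block)
            self._next_block += 1
        self.directories[directory][name] = numbers

    def move(self, old_dir, name, new_dir):
        # Only the directory entry moves; the data blocks are untouched.
        self.directories[new_dir][name] = self.directories[old_dir].pop(name)

    def rename(self, directory, old_name, new_name):
        entries = self.directories[directory]
        entries[new_name] = entries.pop(old_name)

    def delete(self, directory, name):
        # Drop the entry; nothing erases the blocks it pointed to.
        del self.directories[directory][name]

    def visible_bytes(self):
        # Bytes reachable by following some directory entry.
        return sum(len(self.data_blocks[n])
                   for entries in self.directories.values()
                   for numbers in entries.values()
                   for n in numbers)

    def bytes_on_disk(self):
        # Bytes physically present in data blocks, reachable or not.
        return sum(len(block) for block in self.data_blocks.values())


fs = ToyFileSystem()
fs.mkdir("/archive")
fs.write("/", "report.txt", b"twelve bytes")  # 12 bytes -> 3 blocks of 4
blocks_before = dict(fs.data_blocks)

fs.move("/", "report.txt", "/archive")
fs.rename("/archive", "report.txt", "old-report.txt")
blocks_after_move = dict(fs.data_blocks)      # same contents as blocks_before

fs.delete("/archive", "old-report.txt")
print(fs.visible_bytes())   # what the file system can still reach
print(fs.bytes_on_disk())   # what is still sitting in data blocks
```

After the delete, no bytes are reachable through a directory entry, yet all twelve bytes are still held in data blocks.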
This is why deduplication doesn’t shrink your disk usage right after a delete. All that data is still out there, just like it was before; your OS simply can’t see it anymore. In the case of VMware you also have to remember that not only do you have filesystems in your VMs, but VMFS is a filesystem, too, with these same properties. That is why it’s possible to have a completely empty VMFS volume while your NetApp array complains that the LUN/Volume/etc. is full.
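Seen from the array’s side, this can be sketched roughly as follows. The sketch assumes a deduplicator that keeps one copy of each distinct fixed-size block, with SHA-256 standing in for whatever fingerprint a real array uses. The 8-block LUN, the `stored_blocks` helper, and the `guest_directory` dictionary are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this illustration


def stored_blocks(lun):
    """Count the distinct blocks a block-level deduplicator would have to keep."""
    return len({hashlib.sha256(block).digest() for block in lun})


# Hypothetical 8-block LUN; every block holds different file data.
lun = [i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(8)]
guest_directory = {"big-file.bin": list(range(8))}  # file name -> block numbers

before_delete = stored_blocks(lun)

# Deleting inside the guest only drops the directory entry...
del guest_directory["big-file.bin"]
# ...so the LUN contents are byte-for-byte what they were before.
after_delete = stored_blocks(lun)

# For comparison: blocks with identical contents collapse to one stored copy.
identical = [bytes(BLOCK_SIZE) for _ in range(8)]
after_identical = stored_blocks(identical)

print(before_delete, after_delete, after_identical)
```

The stale blocks are all still distinct, so the deduplicator has nothing to merge, no matter how empty the guest’s file system looks.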
If you want deduplication to reclaim the space, you have to actually overwrite that old data with something that’s easy to deduplicate, like a huge file full of zeros, which you then delete. On Linux you’d do something like Leo Raikhman’s zero-out script, and on Windows you can use sdelete to do the same.
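The zero-fill idea can be sketched in a few lines. This is not Leo Raikhman’s script and not what sdelete does internally, just a minimal illustration. The `zero_fill` function, its `max_bytes` cap, and the `zero.fill` file name are assumptions made here. The cap exists so the sketch never fills a disk by accident; filling a live volume to the brim can disrupt whatever else is running on it.

```python
import os
import tempfile


def zero_fill(directory, max_bytes, chunk_size=1024 * 1024, name="zero.fill"):
    """Write up to max_bytes of zeros into a scratch file, then delete it.

    Returns the number of bytes actually written.
    """
    path = os.path.join(directory, name)
    chunk = bytes(chunk_size)  # a buffer of zero bytes
    written = 0
    try:
        # Unbuffered, so an out-of-space error surfaces on the write itself.
        with open(path, "wb", buffering=0) as f:
            while written < max_bytes:
                want = min(chunk_size, max_bytes - written)
                try:
                    written += f.write(chunk[:want])
                except OSError:
                    break  # typically "no space left on device"
            os.fsync(f.fileno())  # push the zeros down to the storage layer
    finally:
        # Remove the scratch file; the zeroed blocks stay behind on disk.
        if os.path.exists(path):
            os.remove(path)
    return written


# Small demonstration in a throwaway directory (3 MiB only).
with tempfile.TemporaryDirectory() as scratch:
    demo_written = zero_fill(scratch, 3 * 1024 * 1024)
    demo_leftover = os.path.exists(os.path.join(scratch, "zero.fill"))

print(demo_written, demo_leftover)
```

In real use you would point `directory` at the file system whose free space you want zeroed and pick `max_bytes` a little below the free space reported for it.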
Within a VM you can also use the vmtools shrink option. On a VM hosted on ESX this will not shrink the disk, but it will zero out the file system. It works for Windows VMs; I’m not sure about others. We use it before we do VRanger backups because it reduces the size of the compressed backup.
wharlie, that’s correct, except that a shrink operation cannot be scheduled.
That’s why sdelete works.
I like to use a map analogy for non-techies. I helped my friend recover images from his corrupt camera memory card and he asked how it worked. I told him that the memory card has a map on it, and the pictures are all the places on the map. If the map gets lost, the places don’t disappear, just the route you take to get to them. I don’t think he cared; he was just happy to have his photos back.