I've been running a XenServer 6.2 server for a month or so and recently one of my VMs will not perform snapshots.

I receive the error "The snapshot chain is too long" although I have removed all of the snapshots. I've found similar problems reported for older versions of XenServer but always with a this-has-been-solved-in-6.2 type note.

Here is the end of the many lines in SMlog created when attempting a snapshot:

Apr 28 21:39:02 normandy SM: [10191] lock: closed /var/lock/sm/lvm-d964aab2-8278-2e43-d79b-4cdb394a6646/4cef1b2c-9461-4525-851d-1f908087a8b2
Apr 28 21:39:02 normandy SM: [10191] lock: acquired /var/lock/sm/lvm-d964aab2-8278-2e43-d79b-4cdb394a6646/173a35a9-8aff-42b3-9cbb-a6d05ec3e9dc
Apr 28 21:39:02 normandy SM: [10191] Refcount for lvm-d964aab2-8278-2e43-d79b-4cdb394a6646:173a35a9-8aff-42b3-9cbb-a6d05ec3e9dc (2, 0) + (-1, 0) => (1, 0)
Apr 28 21:39:02 normandy SM: [10191] Refcount for lvm-d964aab2-8278-2e43-d79b-4cdb394a6646:173a35a9-8aff-42b3-9cbb-a6d05ec3e9dc set => (1, 0b)
Apr 28 21:39:02 normandy SM: [10191] lock: released /var/lock/sm/lvm-d964aab2-8278-2e43-d79b-4cdb394a6646/173a35a9-8aff-42b3-9cbb-a6d05ec3e9dc
Apr 28 21:39:02 normandy SM: [10191] lock: closed /var/lock/sm/lvm-d964aab2-8278-2e43-d79b-4cdb394a6646/173a35a9-8aff-42b3-9cbb-a6d05ec3e9dc
Apr 28 21:39:02 normandy SM: [10191] ***** generic exception: vdi_snapshot: EXCEPTION SR.SROSError, The snapshot chain is too long
Apr 28 21:39:02 normandy SM: [10191]   File "/opt/xensource/sm/SRCommand.py", line 106, in run
Apr 28 21:39:02 normandy SM: [10191]     return self._run_locked(sr)
Apr 28 21:39:02 normandy SM: [10191]   File "/opt/xensource/sm/SRCommand.py", line 153, in _run_locked
Apr 28 21:39:02 normandy SM: [10191]     return self._run(sr, target)
Apr 28 21:39:02 normandy SM: [10191]   File "/opt/xensource/sm/SRCommand.py", line 231, in _run
Apr 28 21:39:02 normandy SM: [10191]     return target.snapshot(self.params['sr_uuid'], self.vdi_uuid)
Apr 28 21:39:02 normandy SM: [10191]   File "/opt/xensource/sm/LVMSR", line 1448, in snapshot
Apr 28 21:39:02 normandy SM: [10191]     return self._snapshot(snapType)
Apr 28 21:39:02 normandy SM: [10191]   File "/opt/xensource/sm/LVMSR", line 1546, in _snapshot
Apr 28 21:39:02 normandy SM: [10191]     raise xs_errors.XenError('SnapshotChainTooLong')
Apr 28 21:39:02 normandy SM: [10191]   File "/opt/xensource/sm/xs_errors.py", line 49, in __init__
Apr 28 21:39:02 normandy SM: [10191]     raise SR.SROSError(errorcode, errormessage)
Apr 28 21:39:02 normandy SM: [10191]
Apr 28 21:39:02 normandy SM: [10191] lock: closed /var/lock/sm/d964aab2-8278-2e43-d79b-4cdb394a6646/sr

I am pulling my hair out; I'd really appreciate it if someone could point me in the right direction.

Thank you

Paul Whalley

2 Answers


First make sure that the coalesce process has actually finished. There are several ways to check whether everything went fine.

First of all, SSH into your XenServer main node and run the following:

xe sr-list

Get the UUID of the Storage Repository that holds the VMs you're working on. After that, check whether there are any chained VHD files with vhd-util:

vhd-util scan -f -m "VHD-*" -l "VG_XenStorage-${UUID-Of-Your-SR}" -p

Replace ${UUID-Of-Your-SR} with your SR UUID from the first command.

It will output all VHDs in the SR, and those that are part of a VHD chain will be shown as a tree. If a tree still exists, you can check whether xe is still processing the VHDs. To do that, just type:

xe task-list

And observe the output. If the output is empty, check on every server in your pool whether there is a vhd-util process running. If there is, it should be treated as a problem in the xe toolstack.

Another way to solve the problem is to copy the problematic VM's disk and try to start a new VM with the copy. Because the disk is copied, XenServer will read through the VHD chain and create a single VHD image with all the VHDs coalesced into it.

I know that's a huge problem, but VHDs are the only thing in XenServer that fails to work as expected.

Vinícius Ferrão

I have had this problem from XenServer 5.5 through 6.0.2, and through a total change in hardware. The only sure way to fix it is to copy the VM to a new storage repository and delete the old VM. Our main servers run at about 2% CPU, so waiting for a background process like coalesce to finish is not an issue.

/usr/bin/vhd-util scan -f -a -p -c -m "VHD-*" -l `/usr/sbin/vgdisplay|grep Name|awk '{print $3}'`

gets me a list of all the chains, as Mr. Ferrao indicates above. If you redirect that list to a file, then you will see what I call "good" chains and "bad" chains. A good chain:

vhd=VHD-7c12552c-96fb-413f-8cc7-4cb7a6a1bd88 capacity=8589934592 size=4777312256 hidden=1 parent=none
vhd=VHD-f9a91117-0062-473b-89f9-95030f57b736 capacity=8589934592 size=8615100416 hidden=0 parent=VHD-7c12552c-96fb-413f-8cc7-4cb7a6a1bd88
vhd=VHD-1d070bb9-1dda-4e13-a732-9bbc3e7e0af2 capacity=8589934592 size=4236247040 hidden=1 parent=VHD-7c12552c-96fb-413f-8cc7-4cb7a6a1bd88
  vhd=VHD-6f9b7573-0ef5-44d9-bde9-47587f78fc86 capacity=8589934592 size=8388608 hidden=0 parent=VHD-1d070bb9-1dda-4e13-a732-9bbc3e7e0af2
  vhd=VHD-f15cc2d8-d1ee-4b11-9853-5c84cab81715 capacity=8589934592 size=2646605824 hidden=1 parent=VHD-1d070bb9-1dda-4e13-a732-9bbc3e7e0af2
     vhd=VHD-32266eef-6665-4aac-83c5-5e1ab0c01861 capacity=8589934592 size=8388608 hidden=0 parent=VHD-f15cc2d8-d1ee-4b11-9853-5c84cab81715
     vhd=VHD-a910a28c-a484-48ae-86fb-8a53eab7db65 capacity=8589934592 size=2176843776 hidden=1 parent=VHD-f15cc2d8-d1ee-4b11-9853-5c84cab81715
        vhd=VHD-ecf62cd9-a76f-4a28-a27d-6a1f7b464554 capacity=8589934592 size=8388608 hidden=0 parent=VHD-a910a28c-a484-48ae-86fb-8a53eab7db65
        vhd=VHD-1ec4deff-f04f-4272-9edc-78b0f9fd9cff capacity=8589934592 size=2122317824 hidden=1 parent=VHD-a910a28c-a484-48ae-86fb-8a53eab7db65
           vhd=VHD-026f73b5-8600-47ee-ada1-3628b4a04a19 capacity=8589934592 size=8388608 hidden=0 parent=VHD-1ec4deff-f04f-4272-9edc-78b0f9fd9cff
           vhd=VHD-4659cef9-64a3-4fca-bf54-3bcc23665c36 capacity=8589934592 size=8615100416 hidden=0 parent=VHD-1ec4deff-f04f-4272-9edc-78b0f9fd9cff

I realize that the box is wrapping the lines here, so it is not obvious, but there is normally a hidden line and an unhidden line, then another hidden and unhidden pair (hidden=1 or hidden=0). Only the hidden=0 lines can be seen in XenCenter as snapshots. However, the VMs that are building towards a "chain too long" status look different:

vhd=VHD-970758dc-a396-4503-ae24-ebf093759947 capacity=19864223744 size=19633537024 hidden=1 parent=none
vhd=VHD-9ef661b3-d20e-401a-be01-d4a020960c17 capacity=19864223744 size=1769996288 hidden=1 parent=VHD-970758dc-a396-4503-ae24-ebf093759947
  vhd=VHD-00864374-1fa2-4492-9c1c-0e6fdf89de7a capacity=19864223744 size=3133145088 hidden=1 parent=VHD-9ef661b3-d20e-401a-be01-d4a020960c17
     vhd=VHD-101649bf-13af-4ba2-948d-d7db192ca7ad capacity=19864223744 size=1950351360 hidden=1 parent=VHD-00864374-1fa2-4492-9c1c-0e6fdf89de7a
        vhd=VHD-83dca990-f158-41bc-b32b-69f8f8862c15 capacity=19864223744 size=3233808384 hidden=1 parent=VHD-101649bf-13af-4ba2-948d-d7db192ca7ad
           vhd=VHD-8cb96357-c872-40e2-adb2-aa1bbe613dca capacity=19864223744 size=1610612736 hidden=1 parent=VHD-83dca990-f158-41bc-b32b-69f8f8862c15
              vhd=VHD-84dca005-af4b-4615-88cb-124977b13c8e capacity=19864223744 size=3468689408 hidden=1 parent=VHD-8cb96357-c872-40e2-adb2-aa1bbe613dca
                 vhd=VHD-b0904a6f-c169-4d6b-816d-9d775600535d capacity=19864223744 size=1925185536 hidden=1 parent=VHD-84dca005-af4b-4615-88cb-124977b13c8e
                    vhd=VHD-e268d580-a245-4960-a13f-9a9c252fc9e8 capacity=19864223744 size=3980394496 hidden=1 parent=VHD-b0904a6f-c169-4d6b-816d-9d775600535d
                       vhd=VHD-ac706540-ba7c-4eba-b919-aa88784ae796 capacity=19864223744 size=1933574144 hidden=1 parent=VHD-e268d580-a245-4960-a13f-9a9c252fc9e8
                          vhd=VHD-96a39f51-5c1a-4234-974e-7de91b4e49f2 capacity=19864223744 size=3170893824 hidden=1 parent=VHD-ac706540-ba7c-4eba-b919-aa88784ae796
                             vhd=VHD-32b1d67c-1011-460b-ac5d-5d83ade7e5f2 capacity=19864223744 size=1673527296 hidden=1 parent=VHD-96a39f51-5c1a-4234-974e-7de91b4e49f2
                                vhd=VHD-81f9dda9-e26d-49bb-97f3-72cbb9a4c4bf capacity=19864223744 size=19910361088 hidden=0 parent=VHD-32b1d67c-1011-460b-ac5d-5d83ade7e5f2

Again, I don't know if this will come out without wrapping, but notice that the lines are all hidden, hidden, hidden, etc. instead of hidden, unhidden, hidden, unhidden, etc. as in the first example.
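Because indentation may not survive wrapping or copy and paste, the parent= field on each line can be used instead to measure how deep each visible VHD sits. The following is only a sketch, assuming the scan output has been saved to a file or is piped in; the function name chain_depths is made up for illustration and is not part of XenServer.

```shell
#!/bin/bash
# Print "<number of ancestors> <vhd name>" for every hidden=0 VHD,
# deepest first. Reads scan output from the given files or from stdin.
chain_depths() {
  awk '
    {
      name = ""; par = ""; hid = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^vhd=/)    name = substr($i, 5)
        if ($i ~ /^parent=/) par  = substr($i, 8)
        if ($i ~ /^hidden=/) hid  = substr($i, 8)
      }
      if (name == "") next
      parent[name] = par
      if (hid == "0") leaf[name] = 1
    }
    END {
      for (n in leaf) {
        depth = 0
        # Walk up the parent links until the root (parent=none) is reached.
        # The depth limit only guards against a malformed, looping input.
        for (p = parent[n]; p != "" && p != "none" && depth < 1000; p = parent[p])
          depth++
        print depth, n
      }
    }
  ' "$@" | sort -rn
}
```

For example, chain_depths chains.txt (where chains.txt is a hypothetical file holding the redirected scan output) lists the deepest visible VHDs first.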

I made a set of scripts to add and subtract each set of hidden, unhidden lines, and if the hiddens start to add up beyond 5 or 6, it emails me. I don't know how much trouble it is in your case to run the line above and look at the resulting list of chains twice a week, but I find that a 3 second glance immediately shows me double-stepped (good) chains vs singly indented lines for the "bad" chains. (We run about 35 vms on a pool of 2 machines, so not a big operation.)
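The check described above could look roughly like the sketch below. It is not the original script: it simply reports the longest run of consecutive hidden=1 lines in saved scan output, restarting the count at each root VHD (parent=none). The function name max_hidden_run and the threshold of 5 in the usage note are assumptions for illustration.

```shell
#!/bin/bash
# Print the longest run of consecutive hidden=1 lines in vhd-util scan
# output, read from the given files or from stdin.
max_hidden_run() {
  awk '
    /parent=none/ { run = 0 }                     # a new tree starts here
    /hidden=1/    { run++; if (run > max) max = run; next }
    /hidden=0/    { run = 0 }                     # a visible VHD breaks the run
    END           { print max + 0 }
  ' "$@"
}
```

A cron job could then compare the result against a threshold, for instance [ "$(max_hidden_run chains.txt)" -gt 5 ], and send a mail when it is exceeded.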

How to work back from the "bad" chain to see what server belongs to it: A simple manual way is to copy out the "bad" chain(s) and run a script on them. I run this:

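#!/bin/bash
# Usage: pass the file containing the copied "bad" chain lines as $1
TODAY=$(date +"%m.%d.%y")

# Read the file line by line so each scan line stays intact
while read -r lin; do
  # Only the visible (hidden=0) VHDs are attached to a VM
  if echo "$lin" | grep -q "hidden=0"; then
    # Use the last segment of the VHD UUID as the search string
    matchstr=$(echo "$lin" | awk '{print $1}' | awk -F"-" '{print $6}')
    echo "vhd search string= $matchstr"
    /var/log/namefromchain.sh "$matchstr"
  fi
done < "$1"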

which calls namefromchain.sh, which is:

xe vbd-list|grep -B1 "$1"|grep vm-name-label|awk -F"RO): " '{print $2}'

I can't remember why they are two separate scripts, but I'm not very experienced at this stuff. You will have to take the warts off and adapt to your situation, but the concepts are there.
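As a possible variation, the two steps can be kept as two small functions in one file. This sketch assumes that on an LVM-based SR the text after "VHD-" in the scan output is the full VDI UUID, and that xe vbd-list accepts a vdi-uuid= filter together with params= and --minimal; both should be verified on your own host. The function names are hypothetical.

```shell
#!/bin/bash
# Extract the UUID from the "vhd=VHD-<uuid>" field of one scan line.
vdi_uuid_from_scan_line() {
  echo "$1" | sed -n 's/^[[:space:]]*vhd=VHD-\([0-9a-f-]*\).*/\1/p'
}

# Ask xe which VM has a VBD pointing at that VDI.
# This only works on a XenServer host where the xe CLI is available.
vm_name_for_vdi() {
  xe vbd-list vdi-uuid="$1" params=vm-name-label --minimal
}
```

Usage would be along the lines of vm_name_for_vdi "$(vdi_uuid_from_scan_line "$line")" for each hidden=0 line of a "bad" chain.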
