
Cloud data warehousing has become popular recently, and I am wondering whether there is an inherent contradiction in how these systems are architected:

Teradata, Greenplum, etc. require 'Shared Nothing' architectures to perform well (per the vendor documentation); however, the nature of the cloud is that most things are shared.

When you spin up a VM in whichever vendor's cloud you favour, you are always going to be using shared storage (such is the nature of virtualization).

Surely this opens up the possibility of storage array and/or SAN contention? Can anyone help me understand:

  • how any vendor can reliably ensure storage throughput (which is critical for DW performance) without creating a configuration bottleneck?
  • Why do we still talk about 'shared nothing' when every single cloud supplier uses virtualization and therefore shared storage?
Peter

1 Answer


"how any vendor can reliably ensure storage throughput (which is critical for DW performance) without creating a configuration bottleneck?"

By hiring really smart people to design their back-end systems.

"Why do we still talk about 'shared nothing' when every single cloud supplier uses virtualization and therefore shared storage?"

Shared nothing. To quote Inigo Montoya:

[Image: Inigo Montoya, "You keep using that word. I do not think it means what you think it means."]

When applied to distributed systems, shared nothing typically does not mean that the member nodes have exclusive access to the underlying hardware. Instead, it means that the members of the distributed system do not need access to a common shared resource, such as shared storage.
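To make that definition a little more tangible, here is a minimal sketch of a shared nothing query in Python. The `Node` class, the `owner` hash partitioning function, and the sample `order_id`/`amount` rows are all invented for illustration and are not taken from any particular warehouse product. The point is only that each node scans its own private data and that nothing but small partial results crosses node boundaries.

```python
from zlib import crc32


class Node:
    """One cluster member with its own private storage (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.rows = []  # local storage; no other node reads or writes this list

    def local_sum(self, column):
        # Each node scans only the rows it owns.
        return sum(row[column] for row in self.rows)


def owner(key, nodes):
    # Deterministic hash partitioning: a given key always maps to the same node.
    return nodes[crc32(str(key).encode()) % len(nodes)]


def load(rows, nodes, key_column):
    # Route every row to exactly one node based on its partitioning key.
    for row in rows:
        owner(row[key_column], nodes).rows.append(row)


def distributed_sum(nodes, column):
    # The coordinator merges small partial results; it never touches raw rows.
    return sum(node.local_sum(column) for node in nodes)


nodes = [Node(f"node{i}") for i in range(4)]
rows = [{"order_id": i, "amount": i * 10} for i in range(100)]
load(rows, nodes, "order_id")
total = distributed_sum(nodes, "amount")
```

Whether each `Node` runs on bare metal or on a VM whose disk lives on a shared array makes no difference to this design: the nodes never need to see each other's data, which is what the term describes.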

To give a concrete example: until fairly recently, a VMware vCenter cluster needed shared storage volumes in order to use VMware's live migration technology, vMotion. Each ESXi member host had access to the same back-end storage, where the virtual machine data was actually stored. This is not a shared nothing system, as the hosts did have to share something: storage, in this case.

Fast forward to the current vCenter/ESXi release. Now, member ESXi hosts no longer need to have access to the same shared storage volume. They can migrate VMs between hosts directly, including transferring the VM backing data (vmdk/vmx files, etc.) from one host to the other. This is a shared nothing system.

Going back to your question about cloud vendors, shared resources, and performance assurance: just because a resource is shared does not mean that controls cannot be put in place to ensure a certain level of performance. For instance, in AWS, you can provision an EBS volume with the specific IOPS your application requires. When you do this, AWS commits to consistently delivering the IOPS you specified. I use this type of configuration in AWS extensively, and can vouch for the fact that they do very well at meeting the IOPS settings their customers require.
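As a rough sketch of what provisioning such a volume can look like from Python, the snippet below builds the arguments for boto3's EC2 `create_volume` call. The helper names `provisioned_iops_request` and `create_volume` are my own, and the availability zone, size, IOPS figure, and `io1` volume type are placeholder assumptions rather than recommendations. Actually creating the volume requires boto3 to be installed and AWS credentials to be configured, so that step is kept in a separate function.

```python
def provisioned_iops_request(availability_zone, size_gib, iops, volume_type="io1"):
    """Build keyword arguments for the EC2 CreateVolume call (hypothetical helper)."""
    if iops <= 0 or size_gib <= 0:
        raise ValueError("size and IOPS must be positive")
    return {
        "AvailabilityZone": availability_zone,
        "Size": size_gib,            # volume size in GiB
        "VolumeType": volume_type,   # a Provisioned IOPS SSD type
        "Iops": iops,                # the rate you are asking AWS to provision
    }


def create_volume(request, region_name):
    # Imported lazily so the sketch can be read and tested without boto3 installed.
    import boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    return ec2.create_volume(**request)


# Placeholder values, chosen only for illustration.
request = provisioned_iops_request("us-east-1a", size_gib=500, iops=10000)
```

Calling `create_volume(request, "us-east-1")` would then submit the request. AWS limits the ratio of IOPS to volume size for each volume type, so check the current EBS documentation before settling on real numbers.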

VMware (and I presume Hyper-V) has similar technologies available to restrict and prioritize the storage, network, and CPU usage of virtual machines so that they behave in a predictable manner and do not adversely affect each other.
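I don't know the internals of any vendor's implementation, but the general idea behind this kind of control can be illustrated with a token bucket, a common rate limiting technique. The sketch below is a toy model, not how VMware or AWS actually do it: `IopsLimiter` and `simulate` are hypothetical names, and time is passed in explicitly so the behaviour is deterministic.

```python
class IopsLimiter:
    """Token bucket: admits at most `iops` operations per second on average."""

    def __init__(self, iops, burst=None):
        self.rate = float(iops)
        self.capacity = float(burst if burst is not None else iops)
        self.tokens = self.capacity  # start with a full bucket
        self.last = 0.0

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(limiter, requests_per_second, seconds):
    """Offer evenly spaced requests and count how many are admitted."""
    admitted = 0
    for i in range(requests_per_second * seconds):
        if limiter.allow(i / requests_per_second):
            admitted += 1
    return admitted


# A noisy tenant offering 1000 requests/s against a 100 IOPS limit,
# and a quiet tenant offering 50 requests/s against the same limit.
noisy = simulate(IopsLimiter(iops=100), requests_per_second=1000, seconds=2)
quiet = simulate(IopsLimiter(iops=100), requests_per_second=50, seconds=2)
```

The noisy tenant is held to roughly its initial burst plus 100 operations per second, while the quiet tenant, which stays under its limit, has every request admitted. Giving each tenant its own limiter on the shared device is what keeps one from starving the other.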

EEAA