
Performance SLA

We have covered the metrics; how do we put them to use?

As usual, let's begin with the customer. In this case, it is your CIO or head of Infrastructure, as the scope now is all VMs, and not just one VM.

For performance, the main requirement from your CIO or management is typically around your IaaS system's ability to deliver. They want your IaaS to perform, as their business runs on it. The question is this:

How do you prove that… not a single VM… in the past 1 month… suffered an unacceptable performance hit because of a non-performing IaaS?

That's an innocent, but loaded, question. You need to consider the impact carefully before answering, "That's easy!"

If you have 1000 VMs, you need to answer for 1000 VMs. For each VM, you need to answer for CPU, RAM, disk, and network. That's 4000 metrics. If your management or customer agrees on a 5-minute sampling period, you have 12 samples in 1 hour. In 1 day, you have 288 samples. In 1 month, you have roughly 8750 samples (based on 30.4 days on average). Across all 4000 metrics, this means 4000 x 8750 = 35,000,000 chances for your IaaS to fail in serving the customers! That is 35 million chances in a month, and you need to repeat this performance every month.
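If you want to sanity-check the arithmetic, a few lines of plain Python reproduce it; the 1000 VMs, four metrics per VM, and 5-minute sampling period are the assumptions stated above:

```python
# Back-of-the-envelope check of the SLA reporting scale described above.
vms = 1000
metrics_per_vm = 4                          # CPU, RAM, disk, network
samples_per_hour = 60 // 5                  # 5-minute sampling period -> 12
samples_per_day = samples_per_hour * 24     # 288
samples_per_month = samples_per_day * 30.4  # roughly 8,750 on average

data_points = vms * metrics_per_vm * samples_per_month
print(f"Samples per metric per month: {samples_per_month:,.0f}")
print(f"Data points to answer for:    {data_points:,.0f}")  # about 35 million
```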

You are right! As an IaaS architect, you are actually a magician. The way you achieve that feat is via service tiering. You should always have at least two tiers. The VM owner is used to the physical environment, and she will compare you with the physical world, as that represents the ideal to her. So if you have only one tier, you will have no choice but to deliver a tier that can match a physical server. As a physical server is dedicated, this means you will not be able to overcommit. If you overcommit, you run the risk of contention and, hence, of failing to meet the performance SLA.

Having just a single tier means any mission-critical VM gets the same class of service as a development VM, so you either sacrifice the performance of the mission-critical VMs or inflate the cost of the overall solution. On the other hand, having too many tiers adds operational complexity and hence increases cost.

We use three service tiers as that provides a good balance. The table in the following figure provides an example that is suitable for a large environment:

This table should look logical to you. One thing that you need to pay attention to is the oversubscription ratio. You should emphasize to your customers that these ratios are nothing but rough guidelines. You may go above them if the performance is still acceptable.

In a small environment, you should have just two tiers. Performance and capacity management are much easier to do on a per-cluster basis, and having just two tiers means you can go as low as two clusters.

In a very large environment, with more than 100,000 VMs, you should still have just three tiers. Avoid having four or five tiers, as that complicates operations. Providing too many choices can result in more confusion and frustration. Remember, each tier carries different pricing; I'd make the gap large enough that it is easy to choose.

Do not confuse the SLA you promise with the design you implement to achieve it. Your customers only care, and should only care, about the SLA. The design is your internal matter.

As the performance SLA is driven by the quality of the design, you need to be clear on the key design differentiators. In the preceding table, I've provided my key design differentiators. Your design may differ. That is perfectly fine. The key thing is your customers, not just you, must be able to see the difference. If they cannot, that means your tiers are too close to each other.

Here are a few highlights of my design:

  • Tier 1 is your "physical tier". It matches the performance of a physical server. This is possible as it has no CPU or RAM overcommitment. No VM needs to wait or contend for resources. As a result, reservation is not even applicable. We can guarantee that the value of the CPU contention metric will be near zero, and the value of RAM contention will be zero.
  • All hosts in the Tier 1 cluster are also identical, meaning they share the same specifications. This makes performance predictable. We cannot make such a guarantee in Tier 2 and Tier 3. A cluster may start with four identical nodes but grow over time into 16 nodes. That 16-node cluster is certainly not identical in terms of performance, as the newer nodes will sport faster technology.
  • Tier 2 is where the majority of production VMs live. If the majority of your production VMs are in Tier 1, there is something wrong with your definition of critical. It is not granular enough. Yes, all VMs in production are important. However, some are more important than others.
  • In Tier 1, the virtual disk (VMDK) is thick provisioned, so there is no performance penalty on the first write. We do not provide the same service quality in the lower tiers.

With these guidelines, you have a clear 3-tier IaaS based on performance. Let's now cover the actual performance SLA that you put in the contract with your customers. You need to define the service for each of the four infrastructure components (CPU, RAM, disk, and network). For each one, specify the actual value and metric to track.

The table in the following figure provides an example for server VMs. For VDI VMs, we need a different definition. It is a sample recommendation, not an official guideline from VMware; it is simply based on my experience. The actual numbers you set may not be the same as mine. You should choose numbers that you are comfortable with and that have been agreed upon by your customers (the application team or business units).
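If it helps to see the shape of such a definition, here is a minimal sketch in Python, using the numbers discussed later in this section; the Tier 2 and Tier 3 RAM thresholds are placeholders only, and the keys are illustrative names rather than actual vRealize Operations metric keys:

```python
# A minimal sketch of a per-tier performance SLA definition.
# All thresholds are 5-minute averages per VM.
PERFORMANCE_SLA = {
    "Tier 1": {
        "cpu_contention_pct": 1,   # no CPU overcommit
        "ram_contention_pct": 0,   # memory fully backed by physical RAM
        "disk_latency_ms": 10,
        "dropped_packets": 0,      # the network should not drop packets
    },
    "Tier 2": {
        "cpu_contention_pct": 3,
        "ram_contention_pct": 5,   # placeholder, not a value from this book
        "disk_latency_ms": 20,
        "dropped_packets": 0,
    },
    "Tier 3": {
        "cpu_contention_pct": 13,  # Balanced CPU power management
        "ram_contention_pct": 10,  # placeholder, not a value from this book
        "disk_latency_ms": 30,
        "dropped_packets": 0,
    },
}
```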

If you have no idea what numbers you should set, use vRealize Operations to look at your actual data in the past so that your numbers are backed by fact. It is sufficient to go back a few months. The larger your sample, the shorter the period you need to look back. For example, if you have, say, 100 clusters, you can in fact just take a few clusters and go back 1 month on each:
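If you have exported those historical 5-minute samples (for example, as a CSV from vRealize Operations), a minimal sketch of turning them into a fact-based number could look like the following; the file name, column name, and the choice of the 95th percentile are my own illustrative assumptions, not a rule:

```python
import csv
import statistics

def proposed_sla(csv_path: str, column: str, percentile: int = 95) -> float:
    """Propose an SLA threshold from historical 5-minute samples.

    Assumes a CSV export with one sample per row; the column name and the
    95th-percentile default are illustrative assumptions.
    """
    with open(csv_path, newline="") as f:
        values = [float(row[column]) for row in csv.DictReader(f)]
    # quantiles() with n=100 returns the 1st to 99th percentile cut points.
    return statistics.quantiles(values, n=100)[percentile - 1]

# Example (hypothetical file and column names):
# print(proposed_sla("tier2_cpu_contention.csv", "cpu_contention_pct"))
```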

Notice what's missing from the SLA table? It's something that you normally have if you are doing capacity management using a spreadsheet.

Yup, it's the famous consolidation ratio. In fact, your entire infrastructure is no longer there. It is all about the VM. The values apply to VMs, not to your infrastructure.

It's not there because it's not relevant. Mark Achtemichuk, an expert on VMware performance, explains it in his article at https://blogs.vmware.com/vsphere/2015/11/vcpu-to-pcpu-ratios-are-they-still-relevant.html. As explained in the blog, oversubscription is an incomplete policy. It fails to take into account contention. I've seen this in a global bank, where the higher tier performed worse than the lower tier. Once you oversubscribe, you are no longer able to guarantee consistent performance. Contention can happen, even if your ESXi utilization is not high.

You should have two numbers:

  • One is the SLA your customers agree on
  • The other is an internal number for your own proactive monitoring

The second number is naturally lower. For example, you may set 10 percent as the official SLA and 8 percent as the internal threshold for you to start proactive adjustment. The delta is a buffer you have for proactive troubleshooting.
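To make the two numbers concrete, here is a minimal sketch using the 10 percent and 8 percent figures from the example above; the function is illustrative only:

```python
def classify(contention_pct: float,
             official_sla: float = 10.0,
             internal_threshold: float = 8.0) -> str:
    """Classify a 5-minute contention value against the two numbers.

    The 10/8 percent defaults mirror the example above; use the values
    agreed with your own customers.
    """
    if contention_pct > official_sla:
        return "SLA breach"                       # visible to the customer
    if contention_pct > internal_threshold:
        return "Buffer zone: adjust proactively"  # internal action only
    return "Healthy"

print(classify(8.7))    # Buffer zone: adjust proactively
print(classify(11.2))   # SLA breach
```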

With that, let's look at each component.

CPU SLA

Tier 1 is not 0 percent, as the CPU ready counter in vCenter does not hit zero even if there is no contention. From my experience, a 5-minute average of 0.5 percent is a perfectly achievable SLA when a VM experiences no contention.

Tier 2 is set at 3 percent, a reasonable step down from Tier 1's 1 percent. You can set yours to 5 percent if you want a wider gap or if your customers can tolerate slower performance.

Tier 3 is set at 13 percent, as it is no longer using the maximum CPU power management setting; it is left at the default setting, which is Balanced. This impacts the latency counter, so we have to set a more relaxed SLA. If you set CPU power management to maximum, then you should lower the number to 5-6 percent.

Memory SLA

For CPU and RAM, you may notice that the numbers are not consistent with each other. This is because their natures are different. RAM is not affected by power management, so consistent scaling can be used.

Tier 1 is 0 percent because every single VM has its memory fully backed by physical RAM. There will be no ballooning, swapping, or compression. As a result, the value for contention in Tier 1 will be 0 percent.

Network SLA

Ideally, you use network latency as the metric. Latency, by definition, is the time taken between a source and destination pair. Different pairs can, and generally will, have different latencies.

Network has a fundamentally different nature to compute and storage. Network is an interconnect, not a node. A VM does not "deal with" the network. It communicates with other VMs or physical machines, using the network as the medium. So, the destination is not constant. A VM may experience network latency at 9 in the morning. Say it is a server that serves a thousand users in the same office building. By the time you get the information, that VM may have stopped communicating with those users and is now serving other users. It is hard to troubleshoot the network latency as the pair has changed.

This unique nature of network is elaborated in Chapter 9, Infrastructure Monitoring Using Blue Medora. For the purpose of the SLA, we need to pick a counter that your customers can accept. We pick the dropped packet counter, as it is the best we can do without injecting a lot of extra packets and doing a lot of measurement. Performance monitoring needs to be light; otherwise, the monitoring itself will degrade the performance. I'm sure you have experienced issues where the performance monitoring agent is the one causing the performance issue. Yup, it's typically the collector agent.

VM dropped packets is actually not a good proxy for all cases. A VM may drop packets if its CPU is saturated. It may also drop packets if its network is wrongly configured. As a result, you should complement it with dropped packet monitoring at the ESXi vmnic level.
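Here is a minimal sketch of that cross-check; how you collect the two counters (vCenter statistics, vRealize Operations, and so on) is up to you, and the logic below is only an illustration of the idea:

```python
def diagnose_drops(vm_dropped: int, vmnic_dropped: int) -> str:
    """Cross-check VM-level and ESXi vmnic-level dropped-packet counts.

    Both inputs are counts over the same 5-minute interval.
    """
    if vm_dropped == 0 and vmnic_dropped == 0:
        return "Healthy: no drops at either level"
    if vm_dropped > 0 and vmnic_dropped == 0:
        # Drops seen only inside the VM point to the VM itself, for example
        # a saturated vCPU or a misconfigured guest network, not the IaaS.
        return "Investigate the VM (CPU saturation or misconfiguration)"
    return "Investigate the host network (vmnic-level drops)"
```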

You may notice that I used the same performance SLA across all service tiers. This is because your network should not be dropping packets.

Storage SLA

The performance SLA is set at 10, 20, and 30 milliseconds for the three tiers, respectively. This is reasonable given that it is a 5-minute average. A VM that generates 500 IOPS for 5 minutes issues 500 x 60 x 5 = 150,000 SCSI commands. It is quite possible that one of those commands does not return within the stipulated SLA, yet the 5-minute average still stays within it.
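To see why a 5-minute average tolerates the occasional slow command, here is a small worked example; the 5 ms baseline and the single 200 ms outlier are illustrative numbers, not figures from the tiers above:

```python
commands = 500 * 60 * 5        # 150,000 SCSI commands in 5 minutes
baseline_ms = 5.0              # assume most commands complete in 5 ms
outlier_ms = 200.0             # one command blows way past the 10 ms SLA

average_ms = ((commands - 1) * baseline_ms + outlier_ms) / commands
print(f"{average_ms:.4f} ms")  # about 5.0013 ms, still well within 10 ms
```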

Just a recap: do you know why we do not track the latency at the datastore level?

You are right. It is irrelevant, because the SLA applies to the VM, not to the datastore, which is shared infrastructure.
