
Disks

When choosing the disks to build a Ceph cluster with, there is always the temptation to go with the biggest disks you can, as the figures look great on paper. Unfortunately, in reality, this is often not a great choice. Although disks have dramatically increased in capacity over the past 20 years, their performance hasn't. First, ignore any sequential MBps figures; you will never see them in enterprise workloads, as there is always something making the I/O pattern nonsequential enough that it might as well be random. Second, remember these figures:

7.2k RPM disks = 70-80 4K IOPS

10k RPM disks = 120-150 4K IOPS

15k RPM disks = You should be using SSDs

As a general rule, if you are designing a cluster that will serve active workloads rather than bulk inactive/archive storage, design for the required Input/Output Operations Per Second (IOPS), not capacity. If your cluster will largely contain spinning disks with the intention of providing storage for an active workload, a greater number of smaller-capacity disks is normally preferable to a smaller number of larger disks. With the decreasing cost of SSD capacity, serious thought should be given to using SSDs in your cluster, either as a cache tier or even for a full SSD cluster.
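As a rough illustration of sizing for IOPS rather than capacity, the following sketch compares the two approaches. The workload figures, capacity target, and disk sizes are illustrative assumptions, not recommendations; only the per-disk IOPS figure comes from the table above:

```python
# Rough sizing sketch: how many spinning disks does an IOPS target need
# compared with a pure capacity target? All inputs are illustrative assumptions.

def disks_for_iops(required_iops, iops_per_disk):
    """Minimum number of disks needed to satisfy a random 4K IOPS target."""
    return -(-required_iops // iops_per_disk)  # ceiling division

def disks_for_capacity(required_tb, disk_tb):
    """Minimum number of disks needed to satisfy a raw capacity target."""
    return -(-required_tb // disk_tb)

required_iops = 5000   # assumed active client workload
required_tb = 100      # assumed raw capacity requirement
iops_7k2 = 75          # ~70-80 4K IOPS for a 7.2k RPM disk

print(disks_for_capacity(required_tb, 8))      # 13 x 8 TB disks cover the capacity...
print(disks_for_iops(required_iops, iops_7k2)) # ...but ~67 spinning disks are needed for the IOPS
```

The gap between the two answers is the point: an active workload on large disks runs out of IOPS long before it runs out of space.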

Thought should also be given to the use of SSDs, either as journals with Ceph's filestore or for storing the DB and write-ahead log (WAL) when using BlueStore. Filestore performance is dramatically improved by SSD journals, and running filestore without them is not recommended unless the cluster is designed to hold very cold data.

Also, consider that the default replication level of 3 means that each client write I/O will generate at least 3x the I/O on the backend disks. In reality, due to Ceph's internal mechanisms, this figure will in some instances be nearer six times write amplification. If no SSD journals are used in the cluster, it may be nearer twelve times write amplification in worst-case scenarios.
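To make the effect of that amplification concrete, here is a minimal sketch; the client workload is an assumption, and the amplification factors simply restate the rough figures above:

```python
# Back-of-envelope write amplification sketch. The amplification factors follow
# the rough figures discussed above; the client workload is an assumption.

client_write_iops = 1000      # assumed client write workload
replication = 3               # default Ceph replication level

scenarios = {
    "replication only": replication,        # 3x: one copy per replica
    "typical (internal overheads)": 6,      # nearer 6x in practice
    "worst case, no SSD journals": 12,      # filestore double-write, no SSD journal
}

for name, amplification in scenarios.items():
    backend = client_write_iops * amplification
    print(f"{name}: {backend} backend write IOPS "
          f"(~{backend // 75} x 7.2k RPM disks at ~75 IOPS each)")
```

Even the best case triples the backend load, which is another reason the per-disk IOPS figures matter more than raw capacity.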

Understand that although Ceph enables much more rapid recovery from a failed disk, because every disk in the cluster takes part in the recovery, larger disks still pose a challenge, particularly when you have to recover from a node failure. In a cluster comprising ten 1 TB disks, each 50% full, the remaining disks would, in the event of a disk failure, have to recover 500 GB of data between them, or around 55 GB each. At an average recovery speed of 20 MBps, recovery would be expected to take around 45 minutes. A cluster with a hundred 1 TB disks would still only have to recover 500 GB of data, but this time the task is shared between 99 disks; in theory, the larger cluster would recover from a single disk failure in around four minutes. In reality, these recovery times will be higher, as additional mechanisms at work increase recovery time. In smaller clusters, recovery times should be a key factor when selecting disk capacity.
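The recovery arithmetic above can be reproduced with a short sketch. The disk counts, fill level, and 20 MBps recovery rate mirror the example in the text; as noted, real-world recovery will take longer:

```python
# Recovery-time sketch reproducing the example above: a failed 1 TB disk,
# 50% full, rebuilt in parallel by the surviving disks at ~20 MBps each.

def recovery_minutes(total_disks, disk_tb=1.0, fill=0.5, rate_mbps=20):
    """Idealised time for the surviving disks to re-replicate a failed disk's data."""
    data_mb = disk_tb * 1_000_000 * fill   # data to recover, in MB (500 GB here)
    surviving = total_disks - 1            # disks sharing the recovery work
    per_disk_mb = data_mb / surviving
    return per_disk_mb / rate_mbps / 60

print(f"10-disk cluster:  ~{recovery_minutes(10):.0f} minutes")   # ~46 minutes
print(f"100-disk cluster: ~{recovery_minutes(100):.0f} minutes")  # ~4 minutes
```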
