Chapter 1. Introduction

Imagine you're working as an administrator of a large IT infrastructure. You have just started receiving emails saying that a web application has stopped working. When you try to access the same page, it just doesn't load. What are the possibilities? Is it the router? Or the firewall? Perhaps the machine hosting the page is down? Before you can even start thinking rationally about what to do, your boss calls about the critical situation and demands an explanation. In a panic, you'll probably start plugging and unplugging things on the network, rebooting the machine, and so on, and none of that helps.

After hours of nervously digging into the issue, you finally find the cause: the web server was working properly, but was timing out on communication with the database server. This was because the machine with the database was not getting a correct IP address, as yet another box had run out of memory and its Dynamic Host Configuration Protocol (DHCP) server had stopped working. Imagine how much time it would take to find all that out manually. It would be a nightmare if the database server was in another branch of the company, in a different time zone, and perhaps the people over there were still sleeping.

And what if you had Nagios up and running across your entire company? You would just need to go to the web interface and see that there are no problems with the web server or the machine it is running on. There would also be a list of what's wrong: the machine serving IP addresses to the entire company is not doing its job, and the database is down. If the setup also monitored the DHCP server, you would get a warning email that very little swap memory is available on it, or that too many processes are running. Maybe it would even have an event handler for such cases that simply kills or restarts noncritical processes. Nagios would also try to restart the DHCP server process over the network if it is down.

In the worst case, Nagios would cut hours of investigation down to 10 minutes. In the best case, you would just get an email that there was a problem, followed by another one saying that the problem is already fixed. You would just disable a few services and increase the swap size for the DHCP machine, and solve the problem once and for all. And nobody would even notice there was a problem.
