
The Two-Part, 10 Step Implementation Guide

So now we know our risks and have an understanding of how to mitigate them. We know that we need a Disaster Recovery plan, and a Business Continuity Plan to go with it.

The following is a 10 step walk-through for implementing a good Disaster Recovery plan in an organization. The first part, which describes the steps that should be taken during any disaster recovery implementation, is generic. The second part is oriented toward Active Directory. For a quick overview, here is a listing of the steps:

General Steps

  1. Calculate and analyze
  2. Create a Business Continuity Plan
  3. Make a presentation to the Management (for Part 1 and Part 2 of this implementation guide)
  4. Define roles and responsibilities
  5. Train the staff
  6. Test the Disaster Recovery plan frequently

Active Directory oriented Steps

  7. Writing is not all
  8. Ensure that everyone is aware of locations of the DRP
  9. Define the order of restoration for different systems (root first in hub site, then add one server, and so on)
  10. Go back to "Presentation to the Management"

Part One: The Steps for General Implementation

In this part, we go through the steps that are needed to implement pretty much any Disaster Recovery plan, for any purpose. The first few steps are always the same, since every DRP is similar. Only certain sections, numbers, and risks differ.

Calculate and Analyze

The first step is to calculate the risk associated with losing the Active Directory infrastructure, either partly or wholly. These numbers must be well-calculated and cannot be pulled out of thin air. It is best to include people from other business units into making this calculation. To calculate the numbers, always remember that:

  • Threat is the possibility of a certain destructive event.
  • Risk is the likelihood of a threat occurring.
  • Cost is the cost in real money terms, including the time of IT staff, hardware and so on.
  • Vulnerability is a weakness in a system that can be exploited.

(Please refer to http://www.microsoft.com/technet/security/guidance/complianceandpolicies/secrisk/srsgch01.mspx for more terms regarding risk management.)

The calculation and analysis should also take into account that if a site has two DCs and one fails, the other one will take the load and there won't be a break in service while the IT staff repairs the failed DC.
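As a rough illustration of how such numbers might be put together, the sketch below multiplies the probability of an outage by its cost, and shows how a second DC changes the probability of a complete site outage. The function names and all figures are hypothetical, and the model assumes that DC failures are independent of each other, which may not hold if both DCs share power, network, or a server room.

```python
def expected_loss(cost_per_outage: float, outage_probability: float) -> float:
    """Expected loss for a period: cost of one outage times its probability."""
    return cost_per_outage * outage_probability


def site_outage_probability(dc_failure_probability: float, dc_count: int) -> float:
    """Probability that every DC in a site is down at the same time.

    Assumes failures are independent, so the individual probabilities multiply.
    """
    if not 0.0 <= dc_failure_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if dc_count < 1:
        raise ValueError("a site needs at least one DC")
    return dc_failure_probability ** dc_count


if __name__ == "__main__":
    # Hypothetical figures: replace with numbers agreed with the business units.
    p_single = 0.05   # chance that one DC is down during the period
    cost = 20_000.0   # cost of one site-wide outage, staff time included
    for dcs in (1, 2):
        p_site = site_outage_probability(p_single, dcs)
        print(f"{dcs} DC(s): outage probability {p_site:.4f}, "
              f"expected loss {expected_loss(cost, p_site):.2f}")
```

The point of the comparison is only that redundancy changes the risk figure substantially; the real inputs have to come from the business units involved.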

Create a Business Continuity Plan

Business Continuity Plans are, as mentioned earlier, high-level documents and procedures. These should always accompany Disaster Recovery guides. A BCP can be created for Active Directory as well, and the sample in the Appendix can help us get started. But in order to create one, we need to have a clear view of our infrastructure and of the impact any outage has on our business. The key thing that needs to be done is to define the acceptable downtime and recovery time.

The communications department should also be involved in this process so that the right communications channels and responsibilities are used and defined. Communications, within the company and with external entities, can be crucial in the event of a disaster if an organization has responsibilities to investors or is in collaboration with partners. Setting and defining the right channels and processes for company personnel helps to mitigate the outage because users will then know that there is an issue and that the IT department is working on it. They won't bombard you with phone calls complaining that they cannot work properly.

The second important thing, though no less critical, is to define a call tree. We need to have a complete contact list and an escalation path clearly defined in our BCP. The communications department also needs to be involved in this.

The call tree is a diagram with different levels of escalation, with the responsible person and phone number listed. With this, it is easy for someone to follow the chain of command and understand who needs to give the go-ahead for a certain action.
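A call tree can also be kept in a machine-readable form alongside the diagram. The following is a minimal sketch, assuming each contact escalates to exactly one person above them; the `Contact` class, the roles, and the phone numbers are placeholders invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Contact:
    role: str
    phone: str
    escalates_to: Optional[str] = None  # role of the next person up; None at the top


def escalation_path(tree: Dict[str, Contact], start_role: str) -> List[Contact]:
    """Walk from start_role up to the top of the call tree."""
    path: List[Contact] = []
    seen = set()
    role: Optional[str] = start_role
    while role is not None:
        if role in seen:
            raise ValueError(f"call tree loops back to {role!r}")
        seen.add(role)
        contact = tree[role]  # a KeyError here means the tree has a gap
        path.append(contact)
        role = contact.escalates_to
    return path


# Hypothetical entries; the phone numbers are placeholders.
CALL_TREE = {
    c.role: c
    for c in (
        Contact("On-call technician", "555-0100", "Regional IT Manager"),
        Contact("Regional IT Manager", "555-0101", "Global IT Manager"),
        Contact("Global IT Manager", "555-0102", "Chief Information Officer"),
        Contact("Chief Information Officer", "555-0103"),
    )
}

if __name__ == "__main__":
    for step, contact in enumerate(escalation_path(CALL_TREE, "On-call technician"), 1):
        print(f"{step}. {contact.role}: {contact.phone}")
```

Walking the path from any starting role gives the chain of command for that person, and a missing or circular entry is reported instead of silently ignored.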

The following diagram shows the call tree for NailCorp as an example:

[Diagram: NailCorp call tree]

During an outage or disaster, the communications department should take responsibility for communicating the issue to the entire workforce, and not just the technical staff. For example, the information bulletin could state that the IT department is aware of the problem and is working on solving it, and also give a rough estimate of the time within which the problem is expected to be fixed and normal operations resumed.

The BCP needs to be clearly understandable and well written, because in the event of a disaster, confusing instructions can hardly be helpful. Once the final draft is ready, it would be best to have the communications department or technical writer(s) go over it to ensure an easily-readable yet professional-looking BCP.

Present it to the Management (Part 1 and 2)

This is a step that should be done by someone who has good presentation skills and an in-depth knowledge of the BCP that was designed. It is also a "two-part step" because the project has to get started before the final draft can be approved. In order to clear this process with the management, the importance and the consequences of the BCP have to be communicated to them in a non-threatening manner.

Often, people who were deeply involved in the design of the BCP and the DRP failed in making it official due to their lack of presentation skills and "social connectivity". Explaining in detail what we are trying to achieve and why it is crucial for the organization is essential. Once the process has been cleared and has received the go-ahead for creation of the BCP, we must proceed to the next step, and then come back to this step later.

Ultimately, it is in the best interest of the organization to have a proper DRP. Obtaining management clearance, and therefore being able to make the BCP and DRP an official standard in the organization, can open a lot of doors for you in the acceptance department. Whenever you hear complaints regarding the implementation, or disagreements in terms of content or testing, you can point to the directive and say: "take your complaints up to the next level". Nine times out of ten, the discussion ends at that point.

Define Roles and Responsibilities

This step is an important one because the people who have been delegated responsibilities are also accountable for them. This might not be what some people want, so the roles and responsibilities have to be discussed with the staff to ensure that they understand the implications of them.

A clear list of contacts and their roles in the BCP and DRG should be drawn up. This is not a step to be rushed. Make sure that everyone involved, including the managers, knows what they are supposed to be doing when push comes to shove.

Also important here is the on-call role. Someone from the IT department should always be contactable. The rotation of this role, as well as adequate compensation for this duty, needs to be clearly defined. The on-call person needs to have a clear understanding of what steps to take when something happens, and how he or she can determine whether this needs to be escalated or not.
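A rotation can be as simple as cycling through a list of names week by week. The sketch below assumes a weekly hand-over and uses placeholder names and dates; an actual schedule would also need to account for holidays, swaps, and the compensation rules agreed with staff.

```python
from datetime import date, timedelta
from itertools import cycle, islice
from typing import List, Tuple


def on_call_rotation(staff: List[str], first_day: date, weeks: int) -> List[Tuple[date, str]]:
    """Assign one person per week, cycling through the staff list in order."""
    if not staff:
        raise ValueError("at least one person must be on the rotation")
    people = islice(cycle(staff), weeks)
    return [(first_day + timedelta(weeks=i), person) for i, person in enumerate(people)]


if __name__ == "__main__":
    # Placeholder names and start date.
    for start, person in on_call_rotation(["Alex", "Sam", "Kim"], date(2024, 1, 1), 5):
        print(f"Week of {start.isoformat()}: {person}")
```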

Once everyone is on board and clear about their responsibilities, we need to put this into a visual form, a call tree. Many people, especially technical staff, complain about presenting things visually. But a lot of professionals agree that a visual representation helps immensely in understanding a process. When you then read the text that accompanies that representation, you will most likely understand and memorize the process steps more easily.

To get a clear picture of what roles and responsibilities should be included in the BCP of NailCorp, see the following table. This example gives an overview of who should be included.

| Role/Title                            | Name | Email | Telephone                         |
|---------------------------------------|------|-------|-----------------------------------|
| Chief Information Officer             |      |       | Office phone and emergency number |
| Global IT Manager(s)                  |      |       | Office phone and emergency number |
| Regional IT Manager(s)                |      |       | Office phone and emergency number |
| Regional Technicians or Specialist(s) |      |       | Office phone and emergency number |
| Global System Specialist(s)           |      |       | Office phone and emergency number |
| Regional System Specialist(s)         |      |       | Office phone and emergency number |
| Internal Communications               |      |       | Office phone and emergency number |
| External Communications               |      |       | Office phone and emergency number |

Train the Staff for DR

Training, it seems, is one of the things that is cut very frequently in an IT budget. In this case, you need to spend money on training your IT staff in disaster recovery. This can and should be part of the internal training sessions meant for your systems and your DRP. The IT staff, such as the help desk or first level support, must be trained in the procedures that involve them. This includes verification of a DC that no longer responds, or answering the phone and letting people know that things are already being investigated. The technical staff has to be aware of the correct steps for recovery, and the chain of command. Even if the work culture is very open and relaxed, it is very important to take the hierarchy and responsibility very seriously.

Testing a DRP for Active Directory is a fairly straightforward task. Today's virtualization products, such as VMware Server or Microsoft's Virtual Server, are invaluable for testing any kind of DRP. For AD in particular, they make it possible to test the recovery process in an isolated environment using the production backup.

Testing full recoveries, or even partial ones, involves having at least two machines acting as DCs to test replication, and at least one machine for each of the application servers that you may have, to communicate with the AD. Also needed are a couple of clients that have accounts within your AD, which you can use to test connectivity and recovery of computer accounts.

An example test would involve the following steps to test recovery and verify whether the recovery was successful. We are starting from a DC that is completely clean and not yet a fully-functioning DC.

Prerequisites:

  • A virtual server with enough RAM and disk space to run multiple virtual machines.
  • A virtual client machine that has a computer account in your current AD. Many organizations create a separate production OU specifically for machines like this and apply a generic GPO, which includes some specific settings, to it.
  • A few blank virtual Windows 2003 server installations or pre-installation images. (The resealing option in the sysprep utility in your support tools can save a lot of time and hassle here.)
  • Application server images to suit the needs of your organization.
  • A current backup of the AD.

Steps that Need to be Completed During Testing:
  • Promote the DC to make it the first DC in the domain with the same DNS name as your AD.
  • Restore the DNS database to the DC or to a separate server, if that is how your organization has set it up.
  • Restore the AD database from a backup onto the DC.
  • Verify that all the data in the AD is present, and that all the GPOs are linked to their respective OUs as they should be.
  • Verify that the DNS is working properly, and that your DC is registered in it.
  • Start one of the client VMs and make sure that it gets GPOs applied and can communicate with the DCs, say by accessing some arbitrary shares, or even just authenticating users on the machine.
  • Start the application servers, and verify whether the applications can access or communicate with the AD.
  • Verify that all these machines work perfectly with the isolated AD and can perform all the functions that they need to do.

The testing process should be documented, and only the DRP and its included documents should be used as guidance material. Any problems encountered on the way should be noted, and your DRP adjusted accordingly after the test.
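One lightweight way to document a test run is to record the outcome of each step as it is carried out. The sketch below is only an illustration: the `RecoveryTestLog` class is hypothetical, and the step names are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepResult:
    step: str
    passed: bool
    notes: str = ""


@dataclass
class RecoveryTestLog:
    results: List[StepResult] = field(default_factory=list)

    def record(self, step: str, passed: bool, notes: str = "") -> None:
        """Store the outcome of one test step, with optional notes."""
        self.results.append(StepResult(step, passed, notes))

    def problems(self) -> List[StepResult]:
        """Steps that failed or carry a note to feed back into the DRP."""
        return [r for r in self.results if not r.passed or r.notes]

    def succeeded(self) -> bool:
        """True only if at least one step was recorded and all of them passed."""
        return bool(self.results) and all(r.passed for r in self.results)


if __name__ == "__main__":
    log = RecoveryTestLog()
    log.record("Promote the first DC", True)
    log.record("Restore the DNS database", True, "DRP does not say which server holds DNS")
    log.record("Restore the AD database from backup", False, "backup media label unclear")
    for item in log.problems():
        print(f"{'OK  ' if item.passed else 'FAIL'} {item.step}: {item.notes}")
    print("Recovery test succeeded:", log.succeeded())
```

The list returned by `problems()` is the input for adjusting the DRP after the test.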

Discipline during the whole process, and a clear understanding of the preventive measures to be applied when the technical staff start working overtime, will ensure a safe and speedy recovery without major inconvenience for other workers. The whole process of disaster recovery has to be as transparent to the user as possible, and proper training and implementation are the key to success.

Test Your DRP Frequently

Many organizations write their Disaster Recovery Plan and get everything approved, yet over the course of a few years it becomes so outdated that they have to start from scratch again. The important thing is to test, test, and re-test the DRP frequently.

This doesn't mean that it should be tested on a weekly or monthly basis, but at least a bi-annual test should be in your plans. The DRP test should be based on real-life scenarios. So you create a "disaster" at a testing facility, and start the whole process all the way to the recovery of a functional system. All this should be written down and then analyzed. You will be amazed at how much you can learn and improve during your first two tests.
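A simple reminder check can help keep tests from slipping. The sketch below assumes that "bi-annual" means roughly every six months and uses a default interval of 183 days; both the interval and the function names are assumptions to be adjusted to your own schedule.

```python
from datetime import date, timedelta


def next_test_due(last_test: date, interval_days: int = 183) -> date:
    """Date by which the next DRP test should have taken place."""
    return last_test + timedelta(days=interval_days)


def is_test_overdue(last_test: date, today: date, interval_days: int = 183) -> bool:
    """True if more than interval_days have passed since the last test."""
    return today > next_test_due(last_test, interval_days)


if __name__ == "__main__":
    last = date(2024, 1, 15)  # placeholder date of the last test
    print("Next test due by:", next_test_due(last).isoformat())
    print("Overdue today:", is_test_overdue(last, date.today()))
```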

Excuses for non-participation in the test from anyone involved in the process should be unacceptable because in real life you wouldn't be able to say: "Well you know, this emergency doesn't fit in with my weekend plans…". However, tests should be announced well in advance to everyone involved. A coordinated schedule is recommended. Some organizations have a bi-yearly test for which the date is published well in advance so that all parties involved can prepare themselves for participation.

Part Two: Implementing a Disaster Recovery Plan for AD

These steps are Active Directory-oriented and geared for successful implementation to suit your AD environment. Going back to the management presentation in the last step means that you have to repeat the presentation in order to get a final "OK" from the top management and, in effect, make your DRP a standard in your organization.

Writing is Not All

The flow of information to all parties involved, especially your staff, is very important. For example, they need to be aware of fundamental changes in the domain structure, new important corporate policies, and the steps required for system recovery.

The implementation, meaning the approval and standardization of this task in your organization, is the hardest part. Once it becomes a standard and is implemented, getting approval for things necessary to keep the DRP and BCP up-to-date and well-tested should be easier.

In case of a DRP for AD, the recovery process will really have to be communicated, trained, and tested well. This is a critical service for your organization and the better you have your process working, the faster the recovery of the service will be. This will ensure that the users feel the least impact of a disaster, or at least experience the shortest downtime possible.

Ensure that Everyone is Aware of Locations of the DRP

This has happened twice in companies that I worked with. They had invested a lot of money into a DRP process and tested it once. They passed with flying colors, but the man in charge subsequently left the company. The DRP was put on ice because no one took the responsibility and even worse, the whole plan got "lost".

When I asked for the BCP and DRP, I got a blank face saying: "Well, we have it somewhere". Eventually, someone dug up a draft version from their archived inbox. After 2 weeks of searching, I found the actual plan in an obscure and forgotten place on their intranet. Not really a good thing.

Note

Please make sure that the location of the DRP is well known. Make a section in your IT pages in your intranet, print it out, and hand it to everyone, and always mail the latest version to the people involved. An off-site, updated, copy of the DRP and all its related documents, along with copies of software that is running in your organization, is absolutely critical. The process of keeping the DRP off-site in printed form and possibly also in electronic form is likely going to be an enormous time and money saver. This way, many copies will be around in case of an emergency.
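When several electronic copies are kept, it is worth checking now and then that they all match the current version. The following is a minimal sketch using file checksums; the file names are placeholders, and it assumes that one copy is treated as the master.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_copies(master: Path, copies: List[Path]) -> List[Path]:
    """Return the copies that are missing or differ from the master document."""
    expected = file_digest(master)
    return [c for c in copies if not c.is_file() or file_digest(c) != expected]


if __name__ == "__main__":
    # Demonstration with temporary files standing in for the real locations.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        master = root / "drp_master.txt"
        intranet = root / "drp_intranet.txt"
        offsite = root / "drp_offsite.txt"
        master.write_text("DRP version 2")
        intranet.write_text("DRP version 2")
        offsite.write_text("DRP version 1")
        for copy in stale_copies(master, [intranet, offsite]):
            print("Needs updating:", copy.name)
```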

Define the Order of Restoration for Different Systems (Root First in Hub Site, then Add One Server etc.)

The contents to be recovered and their order of recovery should be clearly defined in the DRP and the BCP. (This means first the root DC in the hub site, then the first Domain Controller, then the second, then one at a regional site, and so on.) Also, if Active Directory Application Mode (ADAM) is deployed in any way, the point at which it needs to be recovered should also be considered. If you have deployed an application specific to a single department that relies on ADAM, you need to make sure that ADAM gets restored properly before the application is recovered or re-installed. For more information on ADAM, please see http://www.microsoft.com/windowsserver2003/techinfo/overview/adam.mspx.
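A restoration order of this kind can be written down as a list of dependencies and then sorted. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the system names and the dependencies between them are hypothetical and only mirror the example above.

```python
from graphlib import TopologicalSorter  # available from Python 3.9
from typing import Dict, List, Set

# Each system maps to the systems that must be restored before it.
# The names and dependencies are hypothetical.
RESTORE_DEPENDENCIES: Dict[str, Set[str]] = {
    "root DC (hub site)": set(),
    "second DC (hub site)": {"root DC (hub site)"},
    "regional DC": {"second DC (hub site)"},
    "ADAM instance": {"root DC (hub site)"},
    "department application": {"ADAM instance"},
}


def restoration_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one valid restore order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, system in enumerate(restoration_order(RESTORE_DEPENDENCIES), 1):
        print(f"{position}. {system}")
```

More than one order can satisfy the same dependencies, so the DRP should still state the one order that the team is expected to follow.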

This is probably also one of the most crucial steps, because mix-ups, replication errors, and bad timing can leave you in the same situation as you were in before you started recovery, or in an even worse one if there is no coordination and communication.

Go back to "Presentation to Management"

This is the final step. Once everything is implemented, documented, and tested, go back to the management and tell them that the task is complete. Show them numbers for recovery times, pie charts of possibilities, and maximum outage numbers. Once they are convinced that money was not wasted, get it all approved and standardized.
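The figures for such a presentation can come straight from the recorded test runs. The sketch below summarizes a list of measured recovery times against an acceptable downtime target; the function name and the sample values are made up for illustration.

```python
from statistics import mean
from typing import Dict, List


def summarize_recovery_times(minutes: List[float], acceptable_downtime: float) -> Dict[str, float]:
    """Summarize measured recovery times (in minutes) against a downtime target."""
    if not minutes:
        raise ValueError("no test results to summarize")
    return {
        "tests": len(minutes),
        "average_minutes": mean(minutes),
        "maximum_minutes": max(minutes),
        "within_target": sum(1 for m in minutes if m <= acceptable_downtime),
    }


if __name__ == "__main__":
    # Placeholder measurements from three test runs, with a two-hour target.
    summary = summarize_recovery_times([90, 120, 150], acceptable_downtime=120)
    for name, value in summary.items():
        print(f"{name}: {value}")
```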

You should be well known by then as "the man" for disaster recovery and your job, in case of an emergency, just got much, much easier.
