Think about your own IT environment for a second: what do you use for DR?
You just thought of a product name, didn’t you? Maybe VMware Site Recovery Manager (SRM), Veeam Backup and Replication, Hyper-V Replica or Zerto. Whilst all of those products are fantastic in their own right, they are NOT “DR”; they are primarily replication products.
Replication is an integral part of DR! But if that’s all you’re hanging your hat on, you’re going to have a rough time when it comes to the “R” in “DR”, which relies on your ability to restore your applications: workloads could be inaccessible to staff in head office, or external websites could be down.
I speak daily with customers, partners, large enterprises and small two-person operations about Cloud, Networks, Backup and DR. I am seeing a common theme of customers equating Disaster Recovery with a single product, and I’m concerned that a lot is being left on the table.
The risk here is that your IT department, integrator or MSP isn’t able to give you a complete solution. In the (hopefully unlikely) situation that you need to invoke DR, do you really want anything holding up a quick return to service? The top brass in the company will be watching your every move if a DR event happens! You had better make damn sure it goes smoothly and that you are in control of the situation and all its moving parts.
So, what does make up DR then?
The way I see it, there are 5 areas that make up a successful DR solution:
- A discrete physical location to run the DR applications/workloads
- Networking and associated routing, switching and collocated devices
- Replication software
- Automation/middleware to stitch DR all together
- Management in DR of people and processes
Let’s drill deeper and articulate why we need to care about all 5: the risks we run if we don’t give each area the attention it deserves and, more importantly, what we can do to mitigate those risks and remove them as a concern.
1) A discrete physical location to run the DR applications/workloads
A disaster could mean hundreds of things. It could fall anywhere on the sliding scale from a tsunami wiping out half of the country to a tradesman accidentally putting his shovel through the fibre to the data centre. Yes, a very large number of events you would classify as disasters can be remedied quickly without requiring another physical presence. But a great many DO require compute, storage and networking to be ready at a moment’s notice somewhere else.
Does your business already have a secondary data centre with another SAN, hosts and switches? If you can hand on heart say it can sufficiently run the production workloads that your stakeholders and users expect, then great, use that!
If not, then look to use something else. You know you might need something else if you answer yes to any of the following:
- We don’t have a secondary DC
- We have compute in another DC but it’s no longer under warranty
- Our secondary SAN is much smaller and can’t really handle production workloads
- Our secondary DC is still semi-reliant on the primary DC
- We can’t really afford the CAPEX required to buy DR equipment
- We don’t really have the skills to set up and manage the secondary environment
If any of the above mirrors your business, then I suggest going to a cloud provider: one you can trust to act as a DR site for you and run the workloads locally.
Ensuring that users do not feel any performance degradation typically means you need a trusted, local provider, as latency can be a killer for the internal applications users rely on.
Companies like Zettagrid have built their name and reputation on being able to run production workloads in the cloud locally for customers. I encourage you to leverage those capabilities where appropriate for your business.
2) Networking and associated routing, switching and collocated devices
In terms of potential for visible impact and the proverbial “egg on your face”, this is the big one. Appropriate networking, with genuine segregation from the production site, is often done poorly.
These are the top 7 network-related issues I discover when I’m called in to improve an existing DR solution that wasn’t up to scratch. Note that none of them are due to the wrong product being chosen!
- The DR site networking still relies on the production core routing/switching stack.
- Not enough bandwidth to replicate the required applications to meet the desired RPO.
- The edge device in the DR site cannot perform the advanced routing/firewall rules/load balancer/certificate tasks that production can do.
- Physical devices such as tape libraries or firewalls that needed to be collocated in the DR site were not included in the design process.
- External DNS was given no thought, so mail servers and other services were inaccessible externally, and BGP was never set up to advertise the required ranges.
- The test networks have either too much or not enough visibility, meaning they either interfere with production or are so locked down that you can’t exercise many of the scenarios in your testing plans.
- The firewall devices between production and the DR site are incompatible, making a reliable connection between the two problematic and/or flaky.
As you can see, each of the above 7 points is a blog post in and of itself. I won’t go into detail on each here, but the top things to consider are:
- Make sure the DR site networks are not reliant on production.
- Don’t guess; use a bandwidth sizing tool like the one here to ensure you provision enough bandwidth.
- Use an edge device whose firewall, load balancing and certificate features meet your business needs, like the NSX Edge firewall Zettagrid uses.
- Make sure that if you need to collocate equipment that the cloud provider or data centre has adequate space and the ability to accommodate this.
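On the bandwidth point above, even a quick sanity check beats guessing. Here is a rough back-of-the-envelope sketch of the arithmetic: the link must be able to ship the data changed during your busiest RPO interval before the next interval elapses. The function name, the 30% overhead factor and the sample figures are my own illustrative assumptions, not numbers from any vendor sizing tool.

```python
# Back-of-the-envelope replication bandwidth estimate.
# Assumption: all change in one RPO window must be replicated
# before the next window ends; overhead pads for protocol/retries.

def required_mbps(change_gb_in_rpo: float, rpo_minutes: float,
                  overhead: float = 1.3) -> float:
    """Link speed (Mbit/s) needed to replicate the data changed
    during one RPO interval within that interval."""
    gbits = change_gb_in_rpo * 8              # GB changed -> gigabits on the wire
    seconds = rpo_minutes * 60                # the window we have to send it in
    return gbits / seconds * 1000 * overhead  # Gbit/s -> Mbit/s, plus overhead

# Example: 5 GB of change in the busiest 15-minute window.
print(round(required_mbps(5, 15), 1))  # roughly 57.8 Mbit/s
```

Size against your busiest window, not the daily average; replication traffic is rarely spread evenly, which is why a proper sizing tool that samples real change rates is still the better answer.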
Tomorrow in part 2 we will dive into Replication, Automation and Management.